Discriminative Syntax-Based Word Ordering - IA de Investigación especializada en el MIT

Discriminative Syntax-Based Word Ordering
for Text Generation

Yue Zhang∗
Singapore University of Technology
y Diseño

Stephen Clark∗∗
University of Cambridge

Word ordering is a fundamental problem in text generation. In this article, we study word
ordering using a syntax-based approach and a discriminative model. Two grammar formalisms
are considered: Combinatory Categorial Grammar (CCG) and dependency grammar. Given the
search for a likely string and syntactic analysis, the search space is massive, making discrimi-
native training challenging. We develop a learning-guided search framework, based on best-ﬁrst
buscar, and investigate several alternative training algorithms.

The framework we present is ﬂexible in that it allows constraints to be imposed on output
word orders. To demonstrate this ﬂexibility, a variety of input conditions are considered. Primero, nosotros
investigate a “pure” word-ordering task in which the input is a multi-set of words, and the task is
to order them into a grammatical and ﬂuent sentence. This task has been tackled previously, y
we report improved performance over existing systems on a standard Wall Street Journal test set.
Segundo, we tackle the same reordering problem, but with a variety of input conditions, desde el
bare case with no dependencies or POS tags speciﬁed, to the extreme case where all POS tags and
unordered, unlabeled dependencies are provided as input (and various conditions in between).
When applied to the NLG 2011 shared task, our system gives competitive results compared with
the best-performing systems, which provide a further demonstration of the practical utility of our
sistema.

1. Introducción

Word ordering is a fundamental problem in natural language generation (NLG, Reiter
and Dale 1997). In this article we focus on text generation: Starting with a bag of words,
or lemmas, as input, the task is to generate a ﬂuent and grammatical sentence using

∗ Singapore University of Technology and Design. 20 Dover Drive, Singapur.

Correo electrónico: yue zhang@sutd.edu.sg.

∗∗ University of Cambridge Computer Laboratory, William Gates Building, 15 JJ Thomson Avenue,

Cambridge, Reino Unido. Correo electrónico: stephen.clark@cl.cam.ac.uk.

Envío recibido: 17 Abril 2013; versión revisada recibida: 23 Abril 2014; accepted for publication:
22 Junio 2014.

doi:10.1162/COLI a 00229

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

those words. Additional annotation may also be provided with the input—for example,
part-of-speech (POS) tags or syntactic dependencies. Applications that can beneﬁt from
better text generation algorithms include machine translation (Koehn 2010), abstractive
text summarization (Barzilay and McKeown 2005), and grammar correction (Lee and
Seneff 2006). Typically, statistical machine translation (SMT) sistemas (Chiang 2007;
Koehn 2010) perform generation into the target language as part of an integrated
sistema, which avoids the high computational complexity of independent word order-
En g. Por otro lado, performing word ordering separately in a pipeline has many
potential advantages. For SMT, it offers better modularity between adequacy (transla-
ción) and ﬂuency (linearization), and can potentially improve target grammaticality for
syntactically different languages (p.ej., Chinese and English). More importantly, a stand-
alone word ordering component can in principle be applied to a wide range of text
generation tasks, including transfer-based machine translation (Chang and Toutanova
2007).

Most word ordering systems use an n-gram language model, which is effective
at controling local ﬂuency. Syntax-based language models, in particular dependency
language models (Xu, Chelba, y jelinek 2002), are sometimes used in an attempt
to improve global ﬂuency through the capturing of long-range dependencies. En esto
artículo, we take a syntax-based approach and consider two grammar formalisms: Com-
binatory Categorial Grammar (CCG) and dependency grammar. Our system also em-
ploys a discriminative model. Coupled with heuristic search, a strength of the model
is that arbitrary features can be deﬁned to capture complex syntactic patterns in
output hypotheses. The discriminative model is trained using syntactically annotated
datos.

From the perspective of search, word ordering is a computationally difﬁcult prob-
lem. Finding the best permutation for a set of words according to a bigram language
modelo, Por ejemplo, is NP-hard, which can be proved by linear reduction from the trav-
eling salesman problem. En la práctica, exploring the whole search space of permutations
is often prevented by adding constraints. In phrase-based machine translation (Koehn,
Och, and Marcu 2003; Koehn et al. 2007), a distortion limit is used to constrain the posi-
tion of output phrases. In syntax-based machine translation systems such as Wu (1997)
and Chiang (2007), synchronous grammars limit the search space so that polynomial-
time inference is feasible. In ﬂuency improvement (Blackwood, de Gispert, and Byrne
2010), parts of translation hypotheses identiﬁed as having high local conﬁdence are held
ﬁxed, so that word ordering elsewhere is strictly local.

In this article we begin by proposing a general system to solve the word ordering
problema, which does not rely on constraints (which are typically task-speciﬁc). In partic-
ular, we treat syntax-based word ordering as a structured prediction problem, para cual
the input is a multi-set (bag) of words and the output is an ordered sentence, together
with its syntactic analysis (either CCG derivation or dependency tree, depending on the
grammar formalism being used). Given an input, our system searches for the highest-
scored output, according to a syntax-based discriminative model. One advantage of this
formulation of the reordering problem, which can perhaps be thought of as a “pure”
text realization task, is that systems for solving it are easily evaluated, because all
that is required is a set of sentences for reordering and a standard evaluation metric
such as BLEU (Papineni et al. 2002). Sin embargo, one potential criticism of the “pure”
problem is that it is unclear how it relates to real realization tasks, since in practice
(p.ej., in statistical machine translation systems) the input does provide constraints
on the possible output orderings. Our general formulation still allows task-speciﬁc
contraints to be added if appropriate. Hence as a test of the ﬂexibility of our system,

504

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

and a demonstration of the applicability of the system to more realistic text gener-
ation scenarios, we consider two further tasks for the dependency-based realization
sistema.

The ﬁrst task considers a variety of input conditions for the dependency-based sys-
tema, determined by two parameters. The ﬁrst is whether POS information is provided
for each word in the input multi-set. The second is whether syntactic dependencies be-
tween the words are provided. The extreme case is when all dependencies are provided,
in which case the problem reduces to the tree linearization problem (Filippova and
Strube 2009; He et al. 2009). Sin embargo, the input can also lie between the two extremes
of no- and full-dependency information.

The second task is the NLG 2011 shared task, which provides a further demonstra-
tion of the practical utility of our system. The shared task is closer to a real realization
scenario, in that lemmas, rather than inﬂected words, are provided as input. Por eso
some modiﬁcations are required to our system in order that it can perform some word
inﬂection, as well as deciding on the ordering. The shared task data also uses labeled,
rather than unlabeled, syntactic dependencies, and so the system was modiﬁed to
incorporate labels. The ﬁnal result is that our system gives competitive BLEU scores,
compared to the best-performing systems on the shared task.

The structured prediction problem we solve is a very hard problem. Due to the
use of syntax, and the search for a sentence together with a single CCG derivation
or dependency tree, the search space is exponentially larger than the n-gram word
permutation problem. No efﬁcient algorithm exists for ﬁnding the optimal solution.
kay (1996) recognized the computational difﬁculty of chart-based generation, cual
has many similarities to the problem we address in his seminal paper. We tackle the
high complexity by using learning-guided best-ﬁrst search, exploring a small path in
the whole search space, which contains the most likely structures according to the dis-
criminative model. One of the contributions of this article is to introduce, and provide
a discriminative solution to, this difﬁcult structured prediction problem, which is an
interesting machine learning problem in its own right.

This article is based on, and signiﬁcantly extends, three conference papers (zhang
and Clark 2011; zhang, Blackwood, and Clark 2012; zhang 2013). It includes a more
detailed description and discussion of our guided-search approach to syntax-based
word ordering, bringing together the CCG- and dependency-based systems under one
uniﬁed framework. Además, we discuss the limitations of our previous work, y
show that a better model can be developed through scaling of the feature vectors. El
resulting model allows fair comparison of constituents of different sizes, and enables
the learning algorithms to expand negative examples during training, which leads to
signiﬁcantly improved results over our previous work. The competitive results on the
NLG 2011 shared task data are new for this article, and demonstrate the applicability of
our system to more realistic text realization scenarios.

The contributions of this article can be summarized as follows:

We address the problem of syntax-based word ordering, dibujo
attention to this challenging language modeling task and offering
a general solution that does not rely on constraints to limit the search
espacio.

We present a novel method for solving the word ordering problem
that gives the best reported accuracies to date on the standard Wall Street
Journal data.

505

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

We show how our system can be used with two different grammar
formalisms: Combinatory Categorial Grammar and dependency
gramática.

We show how syntactic constraints can be easily incorporated into the
sistema, presenting results for the dependency-based system with a range
of input conditions.

We demonstrate the applicability of the system to more realistic text
realization scenarios by obtaining competitive results on the NLG 2011
shared task data, including performing some word inﬂection as part of a
joint system that also performs word reordering.

More generally, we propose a learning-guided, best-ﬁrst search algorithm
for application of discriminative models to extremely large search spaces
containing structures of varying sizes. This method could be applied to
other complex structured prediction tasks in NLP and machine
aprendiendo.

2. Overview of the Search and Training Algorithms

En esta sección, the CCG-based system is used to describe the search and training
algoritmos. Sin embargo, the same approach can be used for the dependency-based system,
as described in Section 4: Instead of building hypotheses by applying CCG rules in a
bottom-up manner, the dependency-based system creates dependency links between
palabras.

Given a bag of words, the goal is to put them into an ordered sentence that has
a plausible CCG derivation. The search space of the decoding problem consists of all
possible CCG derivations for all possible word permutations, and the search goal is to
ﬁnd the highest-scored derivation in the search space. This is an NP-hard problem, como
mentioned in the Introduction. We apply learning-guided search to address the high
complejidad. The intuition is that, because the whole search space cannot be exhausted
in order to ﬁnd the optimal solution, we choose to explore a small area in the search
espacio. A statistical model is used to guide the search, so that only a small portion of the
search space containing the most plausible hypotheses is explored.

One natural choice for the decoding algorithm is best-ﬁrst search, which uses an
agenda to order hypotheses, and expands the highest-scored hypothesis on the agenda
at each step. The resulting hypotheses after each hypothesis expansion are put back
on the agenda, and the process repeats until a goal hypothesis (a full sentence) es
found. This search process is guided by the current scores of the hypotheses, y el
search path will contain the most plausible hypotheses if they are scored higher than
∗
implausible ones. An alternative to best-ﬁrst search is A
buscar, which makes use
∗
can potentially be more efﬁcient
of a heuristic function to estimate future scores. A
given an effective heuristic function; sin embargo, it is not straightforward to deﬁne an
admissible and accurate estimate of future scores for our problem, and we leave this
research question to future work.

In our formulation of the word ordering problem, a hypothesis is a phrase or
sentence together with its CCG derivation. Hypotheses are constructed bottom–up:
starting from single words, smaller phrases are combined into larger ones according
to CCG rules. To allow the combination of hypotheses, we use an additional struc-
ture to store a set of hypotheses that have been expanded, which we call accepted

506

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

hypotheses. When a hypothesis from the agenda is expanded, it is combined with
all accepted hypotheses in all possible ways to produce new hypotheses. Los datos
structure for accepted hypotheses is similar to that used for best-ﬁrst parsing (Caraballo
and Charniak 1998), and we adopt the term chart for this structure. Sin embargo, nota
there are important differences to the parsing problem. Primero, the parsing problem has a
ﬁxed word order and is considerably simpler than the word ordering problem we are
tackling. Segundo, although we use the term chart, the structure for accepted hypotheses
is not a dynamic programming chart in the same way as for the parsing problem. En
our previous papers (Zhang and Clark 2011; zhang, Blackwood, and Clark 2012), nosotros
applied a set of beams to this structure, which makes it similar to the data structure
used for phrase-based MT decoding (Koehn 2010). Sin embargo, we will show later that
this structure is unnecessary when the model has more discriminative power, y un
conceptually simpler single beam can be used. We will also investigate the possibility
of applying dynamic-programming-style pruning to the chart.

We now give an overview of the training algorithm, which is crucial to both the
speed and accuracy of the resulting decoder. CCGBank (Hockenmaier and Steedman
2007) is used to train the model. For each training sentence, the corresponding CCGBank
derivation together with all its sub-derivations are treated as gold-standard hypotheses.
All other hypotheses that can be constructed from the same bag of words are non-gold
hypotheses. From the generation perspective this assumption is too strong, porque
sentences can have multiple orderings (with multiple derivations) that are both gram-
matical and ﬂuent. Sin embargo, it is the most feasible choice given the training data
disponible.

The efﬁciency of the decoding algorithm is dependent on the training algorithm
because the agenda is ordered according to the hypothesis scores. Por eso, a better model
will lead to a goal hypothesis being found more quickly. In the ideal situation, all gold-
standard hypotheses are scored higher than all non-gold hypotheses, and therefore only
gold-standard hypotheses are expanded before the gold-standard goal hypothesis is
found. En este caso, the minimum number of hypotheses is expanded and the output is
correcto. The best-ﬁrst search decoder is optimal not only with respect to accuracy but
also speed. This ideal situation can hardly be met in practice, but it determines the goal
of the training algorithm: to ﬁnd a model that scores gold-standard hypotheses higher
than non-gold ones.

Learning-guided search places more challenges on the training of a discriminative
model than standard structured prediction problems, Por ejemplo, CKY parsing for
CCG (Clark and Curran 2007b). If we take gold-standard hypotheses as positive training
examples, and non-gold hypotheses as negative examples, then the training goal is
to ﬁnd a large separating margin between the scores of all positive examples and all
negative examples. For CKY parsing, the highest-scored negative example can be found
via optimal Viterbi decoding, according to the current model, and this negative example
can be used in place of all negative examples during the updating of parameters. En
contrast, our best-ﬁrst search algorithm cannot ﬁnd an output in reasonable time unless
a good model has already been trained, and therefore we cannot run the decoding
algorithm in the standard way during training. In our previous papers (Zhang and
clark 2011; zhang, Blackwood, and Clark 2012), we proposed an approximate online
training algorithm, which forces positive examples to be kept in the hypothesis space
without being discarded, and prevents the expansion of negative examples during the
training process (so that the hypothesis space does not get too large). This algorithm
ensures training efﬁciency, but greatly limits the space of negative examples that is
explored during training (and hence fails to replicate the conditions experienced at test

507

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

tiempo). In this article, we will show that, with an improved scoring model, it is possible
to expand negative examples, which leads to improved performance.

A second and more subtle challenge for our training problem is that we need
a stronger model for learning-guided search than for dynamic programming (DP)–
based search, such as CKY decoding. For CKY decoding, the model is used to compare
hypotheses within each chart cell, which cover the same input words. A diferencia de, para
the best-ﬁrst search decoder, the model is used to order hypotheses on the agenda,
which can cover different numbers of words. It needs much stronger discriminating
fuerza, so that it can determine whether a single-word phrase is better than, decir, a
40-word sentence. In this article we use scaling of the hypothesis scores by size, entonces
that hypotheses of different sizes can be fairly compared. We also ﬁnd that, with this
new approach, negative examples can be expanded during training and a single beam
applied to the chart, resulting in a conceptually simpler and more effective training
algorithm and decoder.

3. CCG-Based Word Ordering

3.1 The CCG Grammar

We were motivated to use CCG as one of the grammar formalisms for our syntax-based
realization system because of its successful application to a number of related tasks,
such as wide-coverage parsing (Hockenmaier 2003; Clark and Curran 2007b; Auli and
López 2011), semantic parsing (Zettlemoyer and Collins 2005), wide-coverage semantic
análisis (Bos et al. 2004), and generation itself (Espinosa, Blanco, and Mehay 2008).
The grammar formalism has been described in detail in those papers, and so here we
provide only a short description.

CCG (Steedman 2000) is a lexicalized grammar formalism that associates words
with lexical categories. Lexical categories are detailed grammatical labels, típicamente
expressing subcategorization information. During CCG parsing, and during our search
procedimiento, categories are combined using CCG’s combinatory rules. Por ejemplo, a
verb phrase in English (S\NP) can combine with an NP to its left, in this case using
the combinatory rule of (backward) function application:

NP S\NP ⇒ S

In addition to binary rule instances, such as this one, there are also unary rules that
operate on a single category in order to change its type. Por ejemplo, forward type-
raising can change a subject NP into a complex category looking to the right for a verb
phrase:

NP ⇒ S/(S\NP)

Such a type-raised category can then combine with a transitive verb type using the rule
of forward composition:

S/(S\NP) (S\NP)/NP ⇒ S/NP

Following Fowler and Penn (2010), we extract the grammar by reading rule in-
stances directly from the derivations in CCGbank (Hockenmaier and Steedman 2007),
rather than deﬁning the combinatory rule schema manually as in Clark and Curran

508

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

(2007b). Hence the grammar we use can be thought of as a context-free approximation
to the mildly content sensitive grammar arising from the use of generalized composition
normas (Weir 1988). Hockenmaier (2003) contains a detailed description of the grammar
that is obtained in this way, including the various unary type-changing rules, así como
additional rules needed to deal with naturally occurring text, such as punctuation rules.

3.2 The Edge Data Structure

For the rest of this article, the term edge is used to refer to a hypothesis in the decoding
algoritmo. An edge corresponds to a sentence or phrase with a CCG derivation. Edges
are built bottom–up, starting from leaf edges, which are constructed by assigning
possible lexical categories to input words. Each leaf edge corresponds to an input word
with a particular lexical category. Two existing edges can be combined if there exists a
CCG rule (extracted from CCGbank, as described earlier) that combines their category
labels, and if they do not contain the same input word more times than its total count
in the input. The resulting edge is assigned a category label according to the CCG
regla, and covers the concatenated surface strings of the two sub-edges in their order of
combination. New edges can also be built by applying unary rules to a single existing
borde. We deﬁne a goal edge as an edge that covers all input words.

Two edges are equivalent if they have the same surface string and identical CCG
derivations. Edge equivalence is used for comparison with gold-standard edges. Two
edges are DP-equivalent when they have the same DP-signature. Based on the feature
templates in Table 1, we deﬁne the DP-signature of an edge as the CCG category at the

Mesa 1
Feature template deﬁnitions, with example instances based on Figure 2.

condición

all edges

#words > 1

binario
bordes

feature

instancia

categoría + tamaño
categoría + head word
categoría + tamaño + head word
categoría + head POS

categoría + leftmost word
categoría + rightmost word
categoría + leftmost POS bigram
categoría + rightmost POS bigram
categoría + lmost POS + rmost POS

the binary rule
the binary rule + head word
regla + head word + non-head word
bigrams resulting from combination
POS bigrams resulting from combination
word trigrams resulting from combination
POS trigrams resulting from combination
resulting lexical category trigrams
resulting word + POS bigrams
resulting POS + word bigrams
resulting POS + palabra + POS trigrams

(S, 3)
(S\NP, bought)
(notario público, 1, Lotus)
(S\NP, VBD)

(S, IBM)
(S\NP, Lotus)
(S, (NNP, VBD))
(S, (VBD, NNP))
(S, (NNP, NNP))
notario público (S\NP) ⇒ S
(notario público (S\NP) ⇒ S, bought)
(notario público (S\NP) ⇒ S, bought, IBM)
(IBM, bought)
(NNP, VBD)
(IBM, bought, Lotus)
(NNP, VBD, NNP)
(notario público, (S\NP)/notario público, notario público)
(IBM, VBD)
(VBD, Lotus)
(NNP, bought, NNP)

unary
bordes

unary rule
unary rule + head word

509

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

root of its derivation, the head word associated with the root category, and the multi-
set of words it contains, together with the word and POS bigrams on either side of its
surface string.

3.3 The Scoring of Edges

Edges are built bottom–up from input words or existing edges. If we treat the assign-
ment of lexical categories to input words and the application of unary and binary CCG
rules to existing edges as edge-building actions, the structure of an edge can be deﬁned
recursively as the sub-structure resulting from its top action plus the structure of its
sub-edges (if any), as shown in Figure 1. Here the top action of an edge refers to the
most recent action that has been applied to build the edge.

In our previous papers we used a global linear model to score edges, donde el

score of an edge e is deﬁned as:

F (mi) = (cid:8)(mi) · ⃗θ

(cid:8)(mi) represents the feature vector of e and ⃗θ is the parameter vector of the model.
Similar to the structure of e, the feature vector (cid:8)(mi) can be deﬁned recursively:

( ∑

)

(cid:8)(mi) =

(cid:8)(es)

+ ϕ(mi)

∈e
es
( ∑

∈re

es
s

)

ϕ(es
s)

+ ϕ(mi)

∈ e represents a sub-edge of e. Leaf edges do not have any sub-
In this equation, es
bordes. Unary-branching edges have one sub-edge, and binary-branching edges have
∈r e represents a (strictly) recursive sub-edge of e. The feature vector
two sub-edges. es
s
ϕ(mi) represents the structure of the top action of e; it is extracted according to the
feature templates in Table 1. Example instances of the feature templates are given
according to the example string and CCG derivation in Figure 2. For leaf edges, ϕ(mi)
includes information about the lexical category label; for unary-branching edges, ϕ(mi)
includes information from the unary rule; for binary-branching edges, ϕ(mi) includes
information from the binary rule, and additionally the token, POS, and lexical cat-
egory bigrams and trigrams that result from the surface string concatenation of its
sub-edges.

Cifra 1
The structure of edges shown recursively.

510

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

Cifra 2
Example string with its CCG derivation, used to give example features in Table 1.

By the given deﬁnition of (cid:8)(mi), F (mi), the score of edge e, can be computed recursively

as e is built during the decoding process:

F (mi) = (cid:8)(mi) · ⃗θ
(( ∑

)
(cid:8)(es)

∈e
es
( ∑

(cid:8)(es) · ⃗θ

)

+ ϕ(mi)

· ⃗θ

)

+ ϕ(mi) · ⃗θ

∈e
es
( ∑

∈e

)

F (es)

+ ϕ(mi) · ⃗θ

When the top action is applied, the score of f (mi) is computed as the sum of

F (es) (for all es

∈ e) plus ϕ(mi) · ⃗θ.

An important aspect of the scoring model is that it is used to compare edges with
different sizes during decoding. The size of an edge can be measured in terms of the
number of words it contains, or the number of syntax rules in its structure. We deﬁne
the size of an edge as the number of recursive sub-edges in the edge plus one (p.ej., el
size of a leaf edge is 1), which is equivalent to the number of actions (es decir., lexical category
assignment for leaf edges, and rule application for unary/binary edges) that have been
applied to construct the edge. Edges with different sizes can have signiﬁcantly different
numbers of features, which can make the training of a discriminative linear model more
difﬁcult. Note that it is common in structured prediction problems for feature vectors
to have slightly different sizes because of variant feature instantiation conditions. En
CKY parsing, Por ejemplo, constituents with different numbers of unary rules can be
kept in the same chart cell and compared with each other, provided that they cover
the same span in the input. In our case, sin embargo, the sizes of two feature vectors
under comparison can be very different indeed, since a leaf edge with one word can
be compared with an edge over the entire input sentence.

In our previous papers we observed empirical convergence of online learning using
this linear model, and obtained competitive results. Sin embargo, as explained in Section 2,
only positive examples were expanded during training, and the expansion of negative
examples led to non-convergence and made online training infeasible. In this article, en
order to increase the discriminating power of the model and to make use of negative
examples during training, we apply length normalization to the scoring function, entonces
that the score of an edge is independent of its size. To achieve this, we scale the original

511

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

linear model score by the number of recursive sub-edges in the edge plus one. Para
given edge e, the new score ˆf (mi) is deﬁned as:

ˆf (mi) =

F (mi)
|mi|
(cid:8)(mi) · ⃗θ
|mi|

∑
es
s

∈re ϕ(es
|{es
s

s) · ⃗θ + ϕ(mi) · ⃗θ
∈r e} + 1|

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

In the equation, |mi| represents the size of e, which is equal to the number of actions
that have been applied when e is constructed. By dividing the score f (mi) by the size of e,
∈r e) and ϕ(mi) · ⃗θ, averaged by
the score ˆf (mi) represents an averaged value of ϕ(es
the number of recursive sub-edges plus one (es decir., the total actions), and is independent
of the size of e. Given normalized feature vectors, the training of the parameter vector ⃗θ
needs to be adjusted correspondingly, which will be discussed subsequently.

s) · ⃗θ (es

3.4 The Decoding Algorithm

The decoding algorithm takes a multi-set of input words, turns them into a set of leaf
bordes, and searches for a goal edge by repeated expansion of existing edges. For best-
ﬁrst decoding, an agenda and a chart are used. The agenda is a priority queue on which
edges to be expanded are ordered according to their current scores. The chart is a ﬁxed-
size beam used to record a limited number of accepted edges. During initialization,
leaf edges are generated by assigning all possible lexical categories to each input word,
before they are put on the agenda. During each step in the decoding process, the highest-
scored edge on the agenda is popped off and expanded. If it is a goal edge, it is returned
as the output, and the decoding ﬁnishes. Otherwise it is extended with unary rules, y
combined with existing edges in the chart, using binary rules to produce new edges.
The resulting edges are scored and put on the agenda, and the original edge is put into
the chart. The process repeats until a goal edge is found, or a timeout limit is reached.

For the timeout case, a default output is produced by greedily combining existing
edges in the chart in descending order of size. En particular, edges in the chart are sorted
by size, and the largest is taken as the current default output. Then the sorted list is
traversed, with an attempt to greedily concatenate the current edges in the list to the
right of the current default output. If the combination is not allowed (es decir., the two edges
contain some input words more times than its count in the input), the current edge is
descartado. De lo contrario, the current default output is updated.

In our previous papers we used a set of beams for the chart, each storing a certain
number of highest-scored edges that cover a particular number of words. This structure
is similar to the chart used for phrase-based SMT decoding. The main reason for the
multiple beams is the non-comparability of edges in different beams, which can have
feature vectors of signiﬁcantly different sizes. In this article, sin embargo, our chart is a sin-
gle beam structure containing the top-scored accepted edges. This simple data structure
is enabled by the use of the scaled linear model, and leads to comparable accuracies
to the multiple-beam chart. In addition to its simplicity, it also ﬁts well with the use of
agenda-based search, because edges of different sizes will ultimately be compared with
each other on the agenda.

512

Zhang and Clark

Discriminative Syntax-Based Word Ordering

We apply DP-style pruning to the chart, keeping only the highest-scored edge
among those that have the same DP-signature. During decoding, before a newly con-
structued edge e is put into the chart, the chart is examined to check whether it contains
an existing edge e0 with the same DP-signature as e. If such an edge exists, it is popped
off the chart and compared with the newly constructed edge e, with the higher scored
edge ˜e being put into the chart and the lower scored edge e
being discarded. If the
newly constructed edge e is not discarded, then we expand e to generate new edges.

′

It is worth noting that, en este caso, a new edge that results from the expansion of
e can have DP-equivalent edges in the agenda or the chart, which had been generated
′ = e0. Putting such new edges on the
by expansion of its DP-equivalent predecessor e
agenda will result in the system keeping multiple edges with the same signature. Cómo-
alguna vez, because applying DP-style pruning to the agenda requires updating the whole
agenda, and is computationally expensive, we choose to tolerate such DP-equivalent
duplications in the agenda.

Pseudocode for the decoder is shown as Algorithm 1. INITAGENDA returns an
initialized agenda with all leaf edges. INITCHART returns a cleared chart. TIMEOUT

Algoritmo 1 The decoding algorithm.

a ← INITAGENDA( )
c ← INITCHART( )
while not TIMEOUT( ) hacer

new ← []
e ← POPBEST(a)
if GOALTEST(mi) entonces

return e

′

end if
, ˜e) ← DPCHARTPRUNE(C, mi)
(mi
′
if e

is e then

continue

end if
for e

′ ∈ UNARY(mi, gramática) hacer
′
APPEND(nuevo, mi
)

end for
for ˜e ∈ c do

if CANCOMBINE(mi, ˜e, gramática) entonces
′ ← BINARY(mi, ˜e, gramática)
mi
′
APPEND(nuevo, mi

)

end if
if CANCOMBINE(˜e, mi, gramática) entonces
′ ← BINARY(˜e, mi, gramática)
mi
′
APPEND(nuevo, mi

)

end if

end for
for e

′ ∈ new do
ADD(a, mi

)

′

end for
ADD(C, mi)

end while

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

513

Ligüística computacional

Volumen 41, Número 3

returns true if the timeout limit has been reached, and false otherwise. POPBEST pops
the top edge from the agenda and returns the edge. GOALTEST takes an edge and
returns true if and only if the edge is a goal edge. DPCHARTPRUNE takes an edge e
and checks whether there exists in the chart an edge e0 that is DP-equivalent to e. In case
e0 exists, it is popped off the chart and compared with e, with the lower scored edge
′
being discarded, and the higher scored edge ˜e being put into the chart. La función
mi
′
and ˜e. CANCOMBINE checks whether two edges can be combined
returns the pair e
in a given order. Two edges can be combined if they do not contain an overlapping
palabra (es decir., they do not contain a word more times than its count in the input), and their
categories can be combined according to the CCG grammar. ADD inserts an edge into
the agenda or the chart. In the former case, it is placed into the priority queue according
to its score, y, in the latter case, the lowest scored edge in the beam is pruned when
the chart is full.

3.5 The Learning Algorithm

We begin by introducing the training algorithm of our previous papers, shown in
Algoritmo 2, which has the same fundamental structure as the training algorithm of
this article but is simpler. The algorithm is based on the decoder, where an agenda is
used as a priority queue of edges to be expanded, and a set of accepted edges is kept
in a ﬁxed-size chart. The functions INITAGENDA, INITCHART, TIMEOUT, POPBEST,
GOALTEST, DPCHARTPRUNE, UNARY, CANCOMBINE, and BINARY are identical to
those used in the decoding algorithm. GOLDSTANDARD takes an edge and returns true
if and only if it is a gold-standard edge. MINGOLD returns the lowest scored gold-
standard edge in the agenda. UPDATEPARAMETERS represents the parameter update
algoritmo. RECOMPUTESCORES updates the scores of edges in the agenda and chart
after the model is updated.

Similar to the decoding algorithm, the agenda is intialized using all possible leaf
bordes. During each step, the edge e on top of the agenda is popped off. If it is a gold-
standard edge, it is expanded in exactly the same way as in the decoder, with the newly
generated edges being put on the agenda, and e being inserted into the chart. If e is not
a gold-standard edge, we take it as a negative example e−, and take the lowest scored
gold-standard edge on the agenda e+ as a positive example, in order to make an update
to the parameter vector ⃗θ. Note that there must exist a gold-standard edge in the agenda,
which can be proved by contradiction.1

The two edges e+ and e− used to perform a model update can be radically different.
Por ejemplo, they may not cover the same words, or even the same number of words.
This is different from online training for CKY parsing, for which both positive and
negative examples used to adjust parameter vectors reside in the same chart cell, y
cover the same sequence of words. The training goal of a typical CKY parser (clark
and Curran 2007a, 2007b) is to ﬁnd a large separation margin between feature vectors
of different derivations of the same sentence, which have comparable sizes. Our goal
is to score all gold-standard edges higher than all non-gold edges regardless of their
tamaño, which is a more challenging goal. After updating the parameters, the scores of the
agenda edges above and including e−, together with all chart edges, are updated, y
e− is discarded before the start of the next processing step.

1 An example proof can be based on induction, where the basis is that the agenda contains all gold leaf

edges at ﬁrst, and the induction step is based on edge combination.

514

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

Algoritmo 2 The training algorithm of our previous papers.
1: a ← INITAGENDA( )
2: c ← INITCHART( )
3: while not TIMEOUT( ) hacer
4:

new ← []
e ← POPBEST(a)
if GOLDSTANDARD(mi) and GOALTEST(mi) entonces

5:
6:

7:
8:

9:
10:

11:

12:
13:

14:
15:

16:
17:

18:
19:

20:
21:

22:
23:

24:
25:

26:
27:

28:
29:

30:
31:

32:
33:

34:
35:

return e

end if
if not GOLDSTANDARD(mi) entonces

e− ← e
e+ ← MINGOLD(a)
UPDATEPARAMETERS(e+, e−)
RECOMPUTESCORES(a, C)
continue

end if
′
(mi
if e

, ˜e) ← DPCHARTPRUNE(C, mi)
′

is e then

continue

end if
for e

′ ∈ UNARY(mi, gramática) hacer
′
APPEND(nuevo, mi
)

end for
for ˜e ∈ c do

if CANCOMBINE(mi, ˜e, gramática) entonces
′ ← BINARY(mi, ˜e, gramática)
mi
′
APPEND(nuevo, mi

)

end if
if CANCOMBINE(˜e, mi, gramática) entonces
′ ← BINARY(˜e, mi, gramática)
mi
′
APPEND(nuevo, mi

)

end if

end for
for e

′ ∈ new do
ADD(a, mi

)

′

end for
ADD(C, mi)

36:
37: end while

One way of viewing the training process is that it pushes gold-standard edges to-
wards the top of the agenda, y, crucialmente, pushes them above non-gold edges (zhang
and Clark 2011). Given a positive example e+ and a negative example e−, a perceptron-
style update is used to penalize the score for ϕ(e−) and reward the score of ϕ(e+):

⃗θ ← ⃗θ0 + ϕ(e+) − ϕ(e−)

Here ⃗θ0 and ⃗θ denote the parameter vectors before and after the update, respectivamente.
This method proved effective empirically (Zhang and Clark 2011), but it did not

515

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

converge well when an n-gram language model was integrated into the system (zhang,
Blackwood, and Clark 2012).

Hence we applied an alternative method for score updates that proved more effec-
tive than the perceptron update and enabled the incorporation of a large-scale language
modelo (zhang, Blackwood, and Clark 2012). This method treats parameter update
as ﬁnding a separation between gold-standard and non-gold edges. Given a positive
example e+ and a negative example e−, we make a minimum update to the parameters
so that the score of e+ is higher than that of e− by a margin of 1:

⃗θ ← arg min

⃗(cid:18)′

∥ ⃗θ′ − ⃗θ0

∥, such that (cid:8)(e+)⃗θ′ − (cid:8)(e−)⃗θ′ ≥ 1

The update is similar to the parameter update of online large-margin learning algo-
rithms, such as 1-best MIRA (Crammer et al. 2006), and has a closed-form solution:

⃗θ ← ⃗θ0 +

F (e−) − f (e+) + 1
∥ (cid:8)(e+) − (cid:8)(e−) ∥2

)
(
(cid:8)(e+) − (cid:8)(e−)

This online learning method proved more effective than the perceptron algorithm
empirically, but still has an important shortcoming in that it did not provide competitive
results when allowing the expansion of negative examples during training, which can
potentially improve the discriminative model (since expanding negative examples can
result in a more representative sample of the search space). We address this issue
by introducing a scaled linear model in this article, cual, when combined with the
expansion of negative examples, signiﬁcantly improves performance. Aplicamos el
same online large-margin training principle; sin embargo, the parameter update has to be
adjusted for the scaled linear model. En particular, the new goal is to ﬁnd a separation
between ˆf (e+) and ˆf (e−) instead of f (e+) yf (e−), for which the optimization corre-
sponding to the parameter update becomes:

⃗θ ← arg min

⃗(cid:18)′

∥ ⃗θ′ − ⃗θ0

∥,

such that

(cid:8)(e+)⃗θ′
|e+|

− (cid:8)(e−)⃗θ′
|e−|

≥ 1

where ⃗θ0 and ⃗θ represent the parameter vectors before and after the update, respetar-
activamente. The equation has a closed-form solution:

⃗θ ← ⃗θ0 +

ˆf (e−) − ˆf (e+) + 1
|e+| − (cid:8)(e− )
∥ (cid:8)(e+ )
|e−| ∥2

( (cid:8)(e+)
|e+|

)

− (cid:8)(e−)
|e−|

Pseudocode for the new training algorithm of this article is shown in Algorithm 3,
where MAXNONGOLD returns the highest-scored non-gold edge in the chart. In ad-
dition to the aforementioned difference in parameter updates, new code is added to
perform additional updates when gold-standard edges are removed from the chart.
In our previous work, parameter updates happen only when the top edge from the
agenda is not a gold-standard edge. In this article, the expansion of negative training
examples will lead to negative examples being put into the chart during training, y
hence the possibility of gold-standard edges being removed from the chart. Hay
two situations when this can happen. Primero, if a non-gold edge is inserted into the chart,

516

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

Algoritmo 3 The training algorithm of this article.
1: a ← INITAGENDA( )
2: c ← INITCHART( )
3: while not TIMEOUT( ) hacer
4:

new ← []
e ← POPBEST(a)
if GOLDSTANDARD(mi) and GOALTEST(mi) entonces

5:
6:

7:
8:

9:
10:

11:

12:
13:

14:
15:

16:
17:

18:
19:

20:
21:

22:
23:

24:
25:

26:
27:

28:
29:

30:
31:

32:
33:

34:
35:

36:
37:

38:
39:

40:

41:
42:

43:
44:

return e

end if
if not GOLDSTANDARD(mi) entonces

e− ← e
e+ ← MINGOLD(a)
UPDATEPARAMETERS(e+, e−)
RECOMPUTESCORES(a, C)

end if
′
(mi
if GOLDSTANDARD(mi

, ˜e) ← DPCHARTPRUNE(C, mi)

) entonces

′

′
UPDATEPARAMETERS(mi
REMOVE(C, ˜e)
′
ADD(C, mi

)

, ˜e)

else

′
if e

is e then

continue

end if

end if
for e

′ ∈ UNARY(mi, gramática) hacer
′
APPEND(nuevo, mi
)

end for
for ˜e ∈ c do

if CANCOMBINE(mi, ˜e, gramática) entonces
′ ← BINARY(mi, ˜e, gramática)
mi
′
APPEND(nuevo, mi

)

end if
if CANCOMBINE(˜e, mi, gramática) entonces
′ ← BINARY(˜e, mi, gramática)
mi
′
APPEND(nuevo, mi

)

end if

end for
for e

′ ∈ new do
ADD(a, mi

)

′

end for
′ ← ADD(C, mi)
mi
if GOLDSTANDARD(mi

′

) entonces

˜e ← MAXNONGOLD(C)
′
UPDATEPARAMETERS(mi
′
ADD(C, mi

)

, ˜e)

45:
end if
46:
47: end while

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

517

Ligüística computacional

Volumen 41, Número 3

and there exists a gold-standard edge in the chart with the same DP-signature but a
lower score, the gold-standard edge will be removed from the chart because of DP-style
pruning (since only the highest-scored edge with the same DP-signature is kept in the
chart).

Segundo, if the chart is full when a non-gold edge is put into the chart (recall that the
chart is a ﬁxed-size beam), then the lowest scored edge on the chart will be removed.
This edge can be a gold-standard edge. In both the ﬁrst and second case, a gold-standard
edge is pruned as the result of the expansion of a negative example. Por otro lado,
in order for the gold-standard goal edge to be constructed, all gold-standard edges
that have been expanded must remain in the chart. Como resultado, our training algorithm
triggers a parameter update whenever a gold-standard edge is removed from the chart,
the scores of all chart edges are updated, and the original pruned gold edge is returned
to the chart. The original pruned gold-standard edge is treated as the positive example
for the update. For the ﬁrst situation, the newly inserted non-gold edge with the same
DP-signature is taken as the negative example, and will be discarded after the parameter
update (with a new score that is lower than the new score of the corresponding gold-
standard). In the second situation, the highest-scored non-gold edge in the chart is taken
as the negative example, and removed from the chart after the update.

En resumen, there are two main differences between Algorithms 2 y 3. Primero,
line 14 in Algorithm 2, which skips the expansion of negative examples, is removed in
Algoritmo 3. Segundo, lines 16–20 and 42–46 are added in Algorithm 3, which correspond
to the updating of parameters when a gold-standard edge is removed from the chart.
Además, the deﬁnitions of UPDATEPARAMETERS are different for the perceptron
training algorithm (Zhang and Clark 2011), the large-margin training algorithm (zhang,
Blackwood, and Clark 2012), and the large-margin algorithm of this article, as explained
earlier.

4. Dependency-Based Word Ordering and Tree Linearization

As well as CCG, the same approach can be applied to the word ordering problem
using other grammar formalisms. En esta sección, we present a dependency-based word
ordering system, where the input is again a multi-set of words with gold-standard POS,
and the output is an ordered sentence together with its dependency parse. Except for
necessary changes to the edge data structure and edge expansion, the same algorithm
can be applied to this task.

In addition to abstract word ordering, our framework can be used to solve a
more informed, dependency-based word ordering task: tree linearization (Filippova
and Strube 2009; He et al. 2009), a task that is very similar to abstract word ordering
from a computational perspective. Both tasks involve the permutation of a set of input
palabras, and are NP-hard. The only difference is that, for tree linearization, full unordered
dependency trees are given as input. Como resultado, the output word permutations are more
constrained (under the projectivity assumption), and more information is available for
search disambiguation.

Tree linearization can be treated as a special case of word ordering, where a gram-
mar constraint is applied such that the output sentence has to be consistent with the
input tree. There is a spectrum of grammar constraints between abstract word ordering
(no constraints) and tree linearization (full tree constraints). Por ejemplo, one constraint
might consist of a set of dependency relations between input words, but which do
not form a complete unordered spanning tree. We call this word ordering task the
partial-tree linearization problem, a task that is perhaps closer to NLP applications

518

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Zhang and Clark

Discriminative Syntax-Based Word Ordering

than both the abstract word ordering task and the full tree linearization problem, en el
sense that NLG pipelines might provide some syntactic relations between words for the
linearization step, but not the full spanning tree.

The main content of this section is based on a conference paper (zhang 2013), cual
we extend by using the technique of expanding negative training examples (one of the
overall contributions of this article).

4.1 Lleno- and Partial-Tree Linearization

Given a multi-set of input words W and a set of head-dependent relations H between
the words in W, the task is to ﬁnd an ordered sentence consisting of all the words in W
and a dependency tree that contains all the relations in H. If each word in W is given
a POS tag and H covers all words in W, then the task is (full-)tree linearization; if not
then the task is partial-tree linearization. For partial-tree linearization, a subset of W is
given ﬁxed POS tags. En todos los casos, a word either has exactly one (oro) POS tag, or no
POS tags.

4.2 The Edge Data Structure

Similar to the CCG case, edge refers to the data structure for a hypothesis in the
decoding algorithm. Here a leaf edge refers to an input word with a POS tag, y un
non-leaf edge refers to a phrase or sentence with its dependency tree. Edges are con-
structed bottom–up, by recursively joining two existing edges and adding an unlabeled
dependency link between their head words.

As for the CCG system, edges are scored by a global linear model:

F (mi) =

(cid:8)(mi) · ⃗θ
|mi|

dónde (cid:8)(mi) represents the feature vector of e and ⃗θ is the parameter vector of the model.
Mesa 2 shows the feature templates we use, which are inspired by the rich feature
templates used for dependency parsing (Koo and Collins 2010; Zhang and Nivre 2011).
In the table, h, metro, s, hl, hr, ml, and mr are the indices of words in the newly constructed
borde, where h and m refer to the head and dependent of the newly constructed arc, s
refers to the nearest sibling of m (on the same side of h), and hl, hr, ml, and mr refer
to the left and rightmost dependents of h and m, respectivamente. WORD, POS, LVAL, y
RVAL are maps from indices to word forms, POS, left valencies, and right valencies of
palabras, respectivamente. Example feature instances extracted from the sentence in Figure 3
are shown in the example column. Because of the non-local nature of some of the feature
templates we deﬁne, we do not apply DP-style pruning for dependency-based tree-
linearization.

4.3 The Decoding Algorithm

The decoding algorithm is similar to that of the CCG system, where an agenda is a
priority queue for edges to expand, and chart is a ﬁxed-size beam for a list of accepted
bordes. During initialization, input words are assigned possible POS tags, Resultando en
a set of leaf edges that are put onto the agenda. For words with POS constraints, solo

519

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

4
1
3
5
0
3
1
8
0
6
5
3
0
/
C
oh

yo
i

_
a
_
0
0
2
2
9
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 41, Número 3

Mesa 2
Feature templates. Indices on the surface string: h = head on newly added arc; m = dependent on
arc; s = nearest sibling of m; b = any index between h and m; hl, hr = left/rightmost dependent of
h; ml, mr = left/rightmost dependent of m; s2 = nearest sibling of s towards h; B = boundary
between the conjoined phrases (index of the ﬁrst word of the right phrase). Variables: dir =
direction of the arc, normalized by NORM; dist = distance (h-m), normalized; size = number of
words in the dependency tree. Functions: WORD = word at index; POS = POS at index; NORM =
normalize absolute value into 1, 2, 3, 4, 5, (5, 10], (10, 20], (20, 40], 40+.

dependency syntax
WORD(h) · POS(h) · NORM(tamaño) ,
WORD(h) · NORM(tamaño), POS(h) · NORM(tamaño) ,
WORD(h) · POS(h) · dir, WORD(h) · dir, POS(h) · dir ,
WORD(metro) · POS(metro) · dir, WORD(metro) · dir, POS(metro) · dir ,
WORD(h) · POS(h) · dist, WORD(h) · dist, POS(h) · dist ,
WORD(metro) · POS(metro) · dist, WORD(metro) · dist, POS(metro) · dist ,
WORD(h) · POS(h) · WORD(metro) · POS(metro) · dir,
WORD(h) · POS(h) · WORD(metro) · POS(metro) · dist ,
WORD(h) · POS(h) · POS(metro) · dir,
WORD(h) · POS(h) · POS(metro) · dist ,
POS(h) · WORD(metro) · POS(metro) · dir,
POS(h) · WORD(metro) · POS(metro) · dist,
POS(h) · POS(metro) · dir, POS(h) · POS(metro) · dist,
POS(h) · POS(metro) · POS(b) · dir,
POS(h) · POS(h − 1) · POS(metro) · POS(metro + 1) · dir (h > m),
POS(h) · POS(h + 1) · POS(metro) · POS(m − 1) · dir (h < m), WORD(h) · POS(m) · POS(ml ) · dir, WORD(h) · POS(m) · POS(mr ) · dir, POS(h) · POS(m) · POS(ml ) · dir, POS(h) · POS(m) · POS(mr ) · dir, POS(h) · POS(m) · POS(s) · dir, POS(h) · POS(s) · dir, POS(m) · POS(s) · dir, WORD(h) · WORD(s) · dir, WORD(m) · WORD(s) · dir, POS(h) · WORD(s) · dir, POS(m) · WORD(s) · dir, WORD(h) · POS(s) · dir, WORD(m) · POS(s) · dir, WORD(h) · POS(m) · POS(s) · POS(s2 ) · dir, POS(h) · POS(m) · POS(s) · POS(s2 ) · dir, dependency syntax for completed words WORD(h) · POS(h) · WORD(hl ) · POS(hl ), POS(h) · POS(hl ), WORD(h) · POS(h) · POS(hl ), POS(h) · WORD(hl ) · POS(hl ), WORD(h) · POS(h) · WORD(hr ) · POS(hr ), POS(h) · POS(hr ), WORD(h) · POS(h) · POS(hr ), POS(h) · WORD(hr ) · POS(hr ), WORD(h) · POS(h) · LVAL(h), WORD(h) · POS(h) · RVAL(h), WORD(h) · POS(h) · LVAL(h) · RVAL(h), POS(h) · LVAL(h), POS(h) · RVAL(h), POS(h) · LVAL(h) · RVAL(h), surface string patterns WORD(B − 1) · WORD(B), POS(B − 1) · POS(B), WORD(B − 1) · POS(B), POS(B − 1) · WORD(B), WORD(B − 1) · WORD(B) · WORD(B + 1), WORD(B − 2) · WORD(B − 1) · WORD(B), POS(B − 1) · POS(B) · POS(B + 1), POS(B − 2) · POS(B − 1) · POS(B), POS(B − 1) · WORD(B) · POS(B + 1), POS(B − 2) · WORD(B − 1) · POS(B), 520 example (bought, VBD, 4) (VBD, NN, NNP, right) (VBD, NNP, NN, – END – , right) (VBD, NN, NNP, right) (VBD, 1, 2) (NNP, bought, NNP) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Table 2 (continued) dependency syntax example surface string patterns for complete sentences WORD(0), WORD(0) · WORD(1), WORD(size − 1), WORD(size − 1) · WORD(size − 2), POS(0), POS(0) · POS(1), POS(0) · POS(1) · POS(2), POS(size − 1), POS(size − 1) · POS(size − 2), POS(size − 1) · POS(size − 2) · POS(size − 3), (VBD, NNP, NN) the allowed POS tag is assigned. For unconstrained words, we assign all possible POS tags according to a tag dictionary compiled from the training data, following standard practice for POS-tagging (Ratnaparkhi 1996). When an edge is expanded, it is combined with all edges in the chart in all possible ways to generate new edges. Two edges can be combined by concatenation of the surface strings in both orders and, in each case, constructing a dependency link between their heads in two ways (corresponding to the two options for the head of the new link). When there is a head constraint on the dependent word, a dependency link can be constructed only if it is consistent with the constraint. This algorithm implements abstract word ordering, partial-tree linearization, and full tree linearization—all gener- alized word ordering tasks—in a uniﬁed method. Pseudocode for the decoder is shown as Algorithm 4. Many of the functions have the same deﬁnition as for Algorithm 1: INITAGENDA, INITCHART, TIMEOUT, POPBEST, GOALTEST, ADD. CANCOMBINE checks whether two edges do not contain an overlapping word (i.e., they do not contain a word more times than its count in the input); unlike the CCG case, all pairs of words are allowed to combine according to the dependency model. COMBINE creates a dependency link between two words, with the word order determined by the order in which the arguments are supplied to the function, and the head coming from either the ﬁrst (HeadLeft) or second (HeadRight) argument (so there are four combinations considered and COMBINE is called four times in Algorithm 4). 4.4 The Learning Algorithm As for the CCG system, an online large-margin learning algorithm based on the decod- ing process is used to train the model. At each step, the expanded edge e is compared with the gold standard. If it is a gold edge, decoding continues; otherwise e is taken as a negative example e− and the lowest-scored gold edge in the agenda is taken Figure 3 Feature template example. 521 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Algorithm 4 The decoding algorithm for partial-tree linearization. a ← INITAGENDA( ) c ← INITCHART( ) while not TIMEOUT( ) do new ← [] e ← POPBEST(a) if GOALTEST(e) then return e end if for ˜e ∈ c do if CANCOMBINE(e, ˜e) then ′ ← COMBINE(e, ˜e, HeadLeft) e ′ APPEND(new, e ) ′ ← COMBINE(e, ˜e, HeadRight) e ′ APPEND(new, e ) end if if CANCOMBINE(˜e, e) then ′ ← COMBINE(˜e, e, HeadLeft) e ′ APPEND(new, e ) ′ ← COMBINE(˜e, e, HeadRight) e ′ APPEND(new, e ) end if end for for e ′ ∈ new do ′ ADD(a, e ) end for ADD(c, e) end while as a positive example e+, and a parameter update is executed (repeated here from Section 3.4): ⃗θ ← ⃗θ0 + ˆf (e−) − ˆf (e+) + 1 |e+| − (cid:8)(e− ) ∥ (cid:8)(e+ ) |e−| ∥2 ( (cid:8)(e+) |e+| ) − (cid:8)(e−) |e−| The training process is essentially the same as in Algorithm 3, but with the CCG grammar and model replaced with the dependency-based grammar and model. In our conference paper describing the earlier version of the dependency-based system (Zhang 2013), the decoding step is ﬁnished immediately after the parameter update; in this article we expand the negative example, as in Algorithm3, putting it onto the chart and thereby exploring a larger part of the search space (in particular that part containing negative examples). Our later experiments show that this method yields improved results, consistent with the CCG system. 522 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Table 3 Training, development, and test data from the Penn Treebank. Sections Sentences Words Training Development Test 2–21 22 23 39,832 1,700 2,416 950,028 40,117 56,684 5. Experiments We use CCGBank (Hockenmaier and Steedman 2007) and the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) for CCG and dependency data, respectively. CCGbank is the CCG version of the Penn Treebank. Standard splits were used for both: Sections 02–21 for training, Section 00 for development, and Section 23 for the ﬁnal test. Table 3 gives statistics for the Penn Treebank. For the CCG experiments, original sentences from CCGBank are transformed into bags of words, with sequence information removed, and passed to our system as input data. The system outputs are compared to the original sentences for evaluation. Following Wan et al. (2009), we use the BLEU metric (Papineni et al. 2002) for string comparison. Although BLEU is not the perfect measure of ﬂuency or grammaticality, being based on n-gram precision, it is currently widely used for automatic evaluation and allows us to compare directly with existing work (Wan et al. 2009). Note also that one criticism of BLEU for evaluating machine translation systems (i.e., that it can only register exact matches between the same words in the system and reference translation), does not apply here, because the system output always contains the same words as the original reference sentence. For the dependency-based experiments, gold-standard dependency trees were derived from bracketed sentences in the treebank using the Penn2Malt tool.2 For fair comparison with Wan et al. (2009), we keep base NPs as atomic units when preparing the input. Wan et al. used base NPs from the Penn Treebank annotation, and we follow this practice for the dependency-based experiments. For the CCG experi- ments we extract base NPs from CCGBbank by taking as base NPs those NPs that do not recursively contain other NPs. These base NPs mostly correspond to the base NPs from the Penn Treebank: In the training data, there are 242,813 Penn Treebank base NPs with an average size of 1.09, and 216,670 CCGBank base NPs with an average size of 1.19. 5.1 Convergence of Training The plots in Figure 4 show the development test scores of three CCG models by the number of training iterations. The three curves represent the scaled model of this article, the online large-margin model from Zhang, Blackwood, and Clark (2012), and the perceptron model from Zhang and Clark (2011), respectively. For each curve, the BLEU score generally increases as the number of training iterations increases, until it reaches its maximum at a particular iteration. We use the number of training iterations 2 http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html. 523 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Figure 4 BLEU scores of the perceptron, large-margin, and scaled large-margin CCG models by the number of training iterations. that gives the best development test scores for the training of our model when testing on the test data. Another way to observe the convergence of training is to measure the training times for each iteration at different numbers of iterations. The per-iteration training times for the large-margin and the scaled CCG models are shown in Figure 5. For each model, the training time for each iteration decreases as the number of training iterations increases, reﬂecting the convergence of learning-guided search. When the model gets better, fewer non-gold hypotheses are expanded before gold hypotheses, and hence it takes less time for the decoder to ﬁnd the gold goal edge. Figure 6 shows the corresponding curve for dependency-based word ordering, with similar observations. Because of the expanding of negative examples, the systems of this article took more time to train than those of our previous conference papers. However, the convergence Figure 5 Training times of the large-margin model and the scaled CCG models by the number of training iterations. 524 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Figure 6 Training times of the large-margin and scaled dependency models by the number of training iterations. rate is also faster when negative training examples are expanded, as demonstrated by the rate of speed improvement as the number of training iterations increases. The train- ing times of the perceptron algorithm are close to those of the large-margin algorithm, and hence are omitted from Figures 5 and 6. The new model gives the best development test scores, as shown in Figure 4. The next section investigates the effects of two of the innovations of this article: use of negative examples during training and the scaling of the model by hypothesis size. 5.2 The Effect of the Scaled Model and Negative Examples Table 4 shows a set of CCG development experiments to measure the effect of the scaled model and the expansion of negative examples during training. With the standard linear model (Zhang, Blackwood, and Clark 2012) and no expansions of negative examples, our system obtained a BLEU score of 39.04. The scaled model improved the BLEU score by 1.41 BLEU points to 40.45, and the expansion of negative examples gave a further improvement of 3.02 BLEU points. These CCG development experiments show that the expansion of negative exam- ples during training is an important factor in achieving good performance. When no negative examples are expanded, the higher score of the scaled linear model demon- strates the effectiveness of fair comparison between edges with different sizes. However, Table 4 The effect of the scaled model and expansion of negative examples during training for the CCG system. no expansion of negative examples expansion of negative examples 39.04 no convergence 40.45 43.47 standard linear model scaled linear model 525 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Figure 7 The effect of search time for the CCG system on the development test data. it is a more important advantage of the scaled linear model that it allows the expansion of negative examples during training, which was not possible with the standard linear model. In the latter case, training failed to converge when negative examples were expanded, reﬂecting the limitations of the standard linear model in separating the training data. Similar results were found for dependency-based word ordering, where the best development BLEU score improved from 44.71 (Zhang 2013) to 46.44 with the expansion of negative training examples. 5.3 The Effect of Search Time Figure 7 shows the BLEU scores for the CCG system on the development data when the timeout limit for decoding a single sentence is set to 5 sec, 10 sec, 15 sec, 20 sec, 30 sec, 40 sec, 50 sec, and 60 sec, respectively. The timeout was applied during decoding at test time. The scaled model with negative training examples was used for this set of experiments, and the same model was used for all timeout settings. The results demonstrate that better outputs can be recovered given more search time, which is expected for a time-constrained best-ﬁrst search framework. Recall that output is created greedily by combining the largest available edges, when the system times out. Similar results were obtained with the dependency-based system of Zhang (2013), where the development BLEU scores improved from 42.89 to 43.42, 43.58, and 43.72 when the timeout limit increased from 5 sec to 10 sec, 30 sec, and 60 sec, respectively. The scaled dependency-based model without expansion of negative examples was used in this set of experiments. 5.4 Example Outputs Example output for sentences in the development set is shown in Tables 5 and 6, grouped by sentence length. The CCG systems of our previous conference papers and this article are compared, all with the timeout value set to 5 sec. All three systems perform relatively better with smaller sentences. For longer sentences, the ﬂuency of 526 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 43.4 43.6 43.8 44 44.2 44.4 44.6 44.8 10 20 30 40 50 60BLEUDecoding time (seconds) Zhang and Clark Discriminative Syntax-Based Word Ordering Table 5 Example development output for the CCG-based systems and sentences with fewer than 20 words. Zhang and Clark (2011) Zhang, Blackwood, and Clark (2012) this article reference . A Lorillard spokewoman said This is an old story, A Lorillard spokewoman said, This is an old story. A Lorillard spokewoman said, This is an old story. A Lorillard spokewoman said, “ This is an old story. It today has no bearing on our work force. It today has no bearing on our work force. It has no bearing on our work force today. It has no bearing on our work force today. No price for the new shares has been set. No price for the new shares has been set. No price has been set for the new shares. No price for the new shares has been set. Previously he was vice president of Eastern Edison. Previously he was vice president of Eastern Edison. Previously the was vice president of Eastern Edison. Previously he was vice president of Eastern Edison. Big mainframe computers for had for years been around business. had been Big mainframe computers for business for years around. Big mainframe computers had for years been around for business. Big mainframe computers for business had been around for years. The ﬁeld of reserves has 21 million barrels. The ﬁeld has 21 million barrels of reserves. The ﬁeld has 21 million barrels of reserves. The ﬁeld has reserves of 21 million barrels. is some heavy-duty competition Behind all the hoopla. all the hoopla Behind is some heavy-duty competition. some heavy-duty competition is Behind all the hoopla. Behind all the hoopla is some heavy-duty competition. Pamela Sebastian contributed to New York in this article. Pamela Sebastian in New York contributed to this article. Pamela Sebastian in New York contributed to this article. Pamela Sebastian in New York contributed to this article. painted oils in by Lighthouse II was the playwright in 1901 ... Lighthouse II was the playwright in 1901 ... in oils oils in Lighthouse II was painted by the playwright in 1901 ... Lighthouse II was painted in oils by the playwright in 1901 ... Example output for the CCG-based systems and sentences with fewer than ten words. Zhang and Clark (2011) Zhang, Blackwood, and Clark (2012) this article reference Mr. Vinken is Elsevier N.V., chairman of the Dutch publishing group. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. asbestos-related diseases have Four, including three of the ﬁve surviving workers with recently diagnosed cancer. have Four of asbestos-related diseases with the ﬁve surviving workers, including recently diagnosed cancer. three Four with the ﬁve surviving workers, including three of asbestos-related diseases have recently diagnosed cancer. Four of the ﬁve surviving workers have asbestos-related diseases, including three with recently diagnosed cancer. , interest rates expect to slide amid further declines in Yields on money-market mutual funds that portfolio managers continued. signs further declines expect signs that Yields continued to slide in interest rates on money-market mutual funds, amid portfolio managers. Yields expect further declines in interest rates on money-market mutual funds, amid signs that portfolio managers continued to slide. Yields on money-market mutual funds continued to slide, amid signs that portfolio managers expect further declines in interest rates. The thrift holding company expects it said obtain regulatory approval by year-end and to complete the transaction. The thrift holding company said it expects regulatory approval by to complete the transaction and obtain year-end. The thrift holding company said it expects the transaction to complete by year-end and obtain regulatory approval. The thrift holding company said it expects to obtain regulatory approval and complete the transaction by year-end. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Example output for the CCG-based systems and sentences between 11 and 20 words. 527 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Table 6 Example development output for the CCG-based systems and sentences with more than 20 words. Zhang and Clark (2011) Dr. Talcott from Boston University and a team led researchers and of the National Cancer Institute of the medical schools Harvard University Zhang, Blackwood, and Clark (2012) the medical schools from Dr. Talcott and Harvard University of the National Cancer Institute and a team of led Boston University this article reference researchers led Dr. Talcott of Harvard University and the National Cancer Institute and a team of the medical schools from Boston University. Dr. Talcott led a team of researchers from the National Cancer Institute and the medical schools of Harvard University and Boston University. because they indicate portfolio managers declining interest rates permit to retain a longer period for Longer maturities are. to relatively higher rates thought because declining interest rates thought Longer maturities permit relatively higher rates for to retain to indicate portfolio managers are a longer period. they declining interest rates permit portfolio managers to retain relatively higher rates for Longer maturities because they are thought to indicate a longer period. Longer maturities are thought to indicate declining interest rates because they permit portfolio managers to retain relatively higher rates for a longer period. Royal Trustco Ltd. said Paciﬁc First Financial Corp. approved $ 27 or a share, for its acquisition of shareholders by Toronto. $ 212 million by Toronto shareholders said Paciﬁc First Financial Corp. approved its acquisition of Royal Trustco Ltd. for $ 212 million, or $ 27 a share. Toronto said Paciﬁc First Financial Corp. approved its acquisition of Royal Trustco Ltd. for shareholders by $ 212 million, or $ 27 a share. Paciﬁc First Financial Corp. said shareholders approved its acquisition by Royal Trustco Ltd. of Toronto for $ 27 a share, or $ 212 million. Example output for the CCG-based systems and sentences between 21 and 40 words. this article reference Zhang and Clark (2011) European history, an educational research organization, the test and only French history questions, Education, a European history class, John Cannell, an Albuquerque, N.M., psychiatrist and founder their kids standardized testing Zhang, Blackwood, and Clark (2012) European history, only French history questions, an educational research organization, John Cannell, Education, France and a European history class when they decided, standardized testing has studied an Albuquerque, N.M., psychiatrist of their kids and founder. says the test when an Albuquerque, N.M., psychiatrist, say an educational research organization decided to give students and a European history class, which has studied Friends for their kids and founder of only French history questions, European history the test standardized testing John Cannell which than at about # 15 billion over United Kingdom institutional funds (Still, Richard Barﬁeld, said, Much less {index-arbitrage activity} manages $ 23.72 billion.) Standard Life Assurance Co. chief investment manager the U.S. Standard Life Assurance Co. in United Kingdom institutional funds, chief investment manager here is done at about # 15 billion than (, Still, Richard Barﬁeld said Much less {index-arbitrage activity} manages $ 23.72 billion in the U.S..) over which done in Much less {index-arbitrage activity} in the U.S., Richard Barﬁeld (United Kingdom institutional funds, which manages $ 23.72 billion), chief investment manager at Standard Life Assurance Co. about # 15 billion It’s as if France decided to give only French history questions to students in a European history class, and when everybody aces the test, they say their kids are good in European history, says John Cannell, an Albuquerque, N.M., psychiatrist and founder of an educational research organization, Friends for Education, which has studied standardized testing. Still, Much less {index-arbitrage activity} is done over here than in the U.S. said Richard Barﬁeld, chief investment manager at Standard Life Assurance Co., which manages about # 15 billion ($ 23.72 billion) in United Kingdom institutional funds. Example output for the CCG-based systems and sentences with more than 40 words. the output is signiﬁcantly reduced. One source of errors is confusion between differ- ent noun phrases, and where they should be positioned, which becomes more severe with increased sentence length and adds to the difﬁculty in reading the outputs. The system of this article gave observably improved outputs compared with the two other systems. 528 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Table 7 Development BLEU scores for partial-tree linearization, with different proportions of input POS and dependency information randomly selected from full gold-standard trees. no POS no dep 50% POS no dep all POS no dep no POS 50% dep 50% POS 50% dep all POS 50% dep no POS all dep 50% POS all dep all POS all dep 42.89 43.41 44.71 50.46 51.40 52.18 73.34 74.68 76.28 5.5 Partial-Tree Linearization In the previous section, the same input settings were used for both training and testing, and the assumption was made that the input to the system would be a bag of words, with no constraints on the output structure. This somewhat artiﬁcial assumption allows a standardized evaluation but, as discussed previously, text generation applications are unlikely to satisfy this assumption and, in practice, the realization problem is likely to be easier compared with our previous set-up. In this section, we simulate practical situations in dependency-based pipelines by measuring the performance of our sys- tem using randomly chosen input POS tags and dependency relations. For maximum ﬂexibility, so that the same system can be applied to different input scenarios, our system is trained without input POS tags or dependencies. However, if POS tags and dependencies are made available during testing, they will be used to provide hard constraints on the output (i.e., the output sentence with POS tags and dependencies must contain those in the input). From the perspective of search, input POS tags and dependencies greatly constrain the search space and lead to an easier search problem, with correspondingly improved outputs. Table 7 shows a set of development results with varying amounts of POS and dependency information in the input. For each test, we randomly sampled a percentage of words for which the gold-standard POS tags or dependencies are given in the input. As can be seen from the table, increased amounts of POS and dependency information in the input lead to higher BLEU scores, and dependencies were more effective than POS tags in determining the word order in the output. When all POS tags and dependencies are given, our constraint-enabled system gave a BLEU score of 76.28.3 Table 8 shows the output of our system for the ﬁrst nine development test sentences with different input settings. These examples illustrate the positive effect of input dependencies in specifying the outputs. Consider the second sentence as an example. When only input words are given, the output of the system is largely grammatical but nonsensical. With increasing amounts of dependency relations, the output begins to look more ﬂuent, sometimes with the system reproducing the original sentence when all dependencies are given. 5.6 Final Results Table 9 shows the test results of various systems. For the system of this article, we take the optimal setting from the development tests, using the scaled linear model and 3 When all POS tags and dependencies are also provided during training, the BLEU score is reduced to 74.79, showing the value in the system, which can adapt to varying amounts of POS and dependency information in the input at test time. 529 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Table 8 Partial-tree linearization outputs for the ﬁrst nine development test sentences with various input information. no POS, no dep 50% unlabeled dep, no POS gold unlabeled dep, 50% POS reference the board as a nonexecutive director will join Pierre Vinken, 61 years old, Nov. 29. Pierre Vinken, 61 years old, will join the board Nov. 29 as a nonexecutive director. Pierre Vinken, 61 years old, will join the board Nov. 29 as a nonexecutive director. Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken, the Dutch publishing group of Elsevier N.V. is chairman. chairman of Elsevier N.V., the Dutch publishing group is Mr. Vinken. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. and Consolidated Gold Fields PLC, former chairman of Rudolph Agnew, 55 years old was named a nonexecutive director of this British industrial conglomerate. , Rudolph Agnew, was 55 years old and former chairman of Consolidated Gold Fields PLC named a nonexecutive director of this British industrial conglomerate. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate. it reported A form of asbestos, a high percentage of Kent cigarette ﬁlters used to make once exposed to researchers among workers a group of cancer deaths has caused more than 30 years ago. , reported has A form asbestos once used to make Kent cigarette ﬁlters caused a high percentage of cancer deaths among a group of workers exposed to researchers more than 30 years ago. , researchers reported A form of asbestos once used to make Kent cigarette ﬁlters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago. A form of asbestos once used to make Kent cigarette ﬁlters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported. The asbestos ﬁber with the lungs once it is causing even brief exposures to show researchers that enters up decades later. it enters The asbestos ﬁber causing it later show up with the lungs that is unusually resilient to researchers even brief exposures. once it enters the lungs crocidolite, is unusually resilient, with symptoms that show up decades later causing even brief exposures to it, researchers said. The asbestos ﬁber, crocidolite, is unusually resilient once it enters the lungs, with even brief exposures to it causing symptoms that show up decades later, researchers said. crocidolite stopped its Micronite cigarette ﬁlters in Kent cigarettes that makes Lorillard Inc. in 1956, the unit of New York-based Loews Corp., using. new attention Although a forum more than a year ago, the problem were likely to appear to bring preliminary ﬁndings of Medicine in today’s New England Journal, the latest results. Lorillard Inc. stopped using the unit of New York-based Loews Corp., that makes Kent cigarettes, in 1956. Lorillard Inc. the unit of New York-based Loews Corp., that makes Kent cigarettes, stopped using crocidolite in its Micronite cigarette ﬁlters in 1956. Lorillard Inc., the unit of New York-based Loews Corp. that makes Kent cigarettes, stopped using crocidolite in its Micronite cigarette ﬁlters in 1956. of Medicine the problem bring new attention to preliminary ﬁndings in a forum, more than a year the latest results were reported to today’s New England Journal. Although preliminary ﬁndings were reported more than a year ago, the latest results appear in today’s New England Journal of Medicine, a forum likely to bring new attention to the problem. Although preliminary ﬁndings were reported more than a year ago, the latest results appear in today’s New England Journal of Medicine, a forum likely to bring new attention to the problem. A Lorillard spokewoman said, “This is an old story. A Lorillard spokewoman said, “This is an old story. A Lorillard spokewoman said, “This is an old story. A Lorillard spokewoman said, “This is an old story. We’re anyone before talking about asbestos of having any questionable properties heard years ago. We’re talking about anyone having heard before any questionable properties of asbestos years ago. We’re talking about years ago before anyone heard of having asbestos any questionable properties. We’re talking about years ago before anyone heard of asbestos having any questionable properties. 530 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Table 9 Final test results on the standard word ordering task. System Wan et al. (2009) (dependency) Zhang and Clark (2011) (CCG) Zhang, Blackwood, and Clark (2012) (CCG) Zhang, Blackwood, and Clark (2012) +LM (CCG) Zhang (2013) (dependency) This article (CCG) This article (dependency) BLEU 33.7 40.1 42.5 43.8 46.8 46.5 48.7 expansion of negative examples during training. For direct comparison with previous work, the timeout threshold was set to 5 sec. Our new system of this article signiﬁcantly outperforms all previous systems and achieves the best published BLEU score on this task. It is worth noting that our systems without a language model outperform the system of our 2012 paper using a large-scale language model. Interestingly, the dependency-based systems performed better than the CCG systems of this article. One of the main reasons is that the CCG systems generated shorter outputs by not ﬁnding full spanning derivations for a larger proportion of inputs. Because of the rigidity in combinatory rules, not all hypotheses in the chart can be combined with the hypothesis being expanded, leading to an increased likelihood of full spanning derivations being unreachable. Overall, the CCG system recovered 93.98% of the input words in the test set, and the dependency system recovered 97.71%. 5.7 Shared Task Evaluation The previous sections report evaluations on the task of word ordering, an abstract yet fundamental problem in text generation. One question that is not addressed by these experiments is how the abstract task can be utilized to beneﬁt full text generation, for which more considerations need to be taken into account in addition to word ordering. We investigate this question using the 2011 Generation Challenge shared task data, which provide a common-ground for the evaluation of text generation systems (Belz et al. 2011). The data are based on the CoNLL 2008 shared task data (Surdeanu et al. 2008), which consist of selected sections of the Penn WSJ Treebank, converted to syntactic dependencies via the LTH tool (Johansson and Nugues 2007). Sections 2–21 are used for training, Section 24 for development, and Section 23 for testing. A small number of sentences from the original WSJ sections are not included in this set. The input format of the shared task is an unordered syntactic dependency tree, with nodes being lemmas, and dependency relations on the arcs. Named entities and hyphenated words are broken into individual nodes, and special dependency links are used to mark them. Information on coarse-grained POS, number, tense, and participle features is given to each node where applicable. The output is a fully ordered and inﬂected sentence. We developed a full-text generation system according to this task speciﬁcation, with the core component being the dependency-based word ordering system of Section 4. In addition to minor engineering details that were required to adapt the system to this new 531 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 task, one additional task that the generation system needs to carry out is morphological generation—ﬁnding the appropriate inﬂected form for each input lemma. Our approach is to perform joint word ordering and inﬂection using the learning-guided search framework, letting one statistical model decide the best order as well as the inﬂections of ambiguous lemmas. For a lemma, we generate one or more candidate inﬂections by using a lexicon and a set of inﬂection rules. Candidate inﬂections for an input lemma are generated according to the lemma itself and its input attributes, such as the number and tense. Some lemmas are unambiguous, which are inﬂected before being passed to the word ordering system. For the other lemmas, more than one candidate’s inﬂections are passed as input words to the word ordering system. To ensure that each lemma occurs only once in the output, a unique ID is given to all the inﬂections of the same lemma, making them mutually exclusive. Four types of lemmas need morphological generation, including nouns, verbs, adjectives, and miscellaneous cases. The last category includes a (a or an) and not (not or n‘t), for which the best inﬂection can be decided only when n-gram information is available. For these lemmas, we pass all possible inﬂections to the search module. For nouns and adjectives, the inﬂection is relatively straightforward, since the number (e.g., singular, plural) of a lemma is given as an attribute of the input node, and comparative and superlative adjectives have speciﬁc parts of speech. For those cases where the necessary information is not available from the input, all possible inﬂections are handed over to the search module for further disambiguation. The most ambiguous lemma types are verbs, which can be further divided into be and other verbs. The uniqueness of be is that the inﬂections for the ﬁrst and second person can be different. All verb in- ﬂections are disambiguated according to the tense and participle attributes of the input node. In addition, for verbs in the present tense, the subject needs to be determined in order to differentiate between third-person singular verbs and others. This can be straightforward when the subject is a noun or pronoun, but can be ambiguous when the subject is a wh-pronoun, in which case the real subject might not be directly identiﬁable from the dependency tree. We leave all possible inﬂections of be and other verbs to the word ordering system whenever the ambiguity is not directly solvable from the subject dependency link. Overall, the pre-processing step generates 1.15 inﬂections for each lemma on average. For word ordering, the search procedure of Algorithm 4 is applied directly, and the feature templates of Table 2 are used with additional labeled dependency features described subsequently. The main reason that the dependency-based word ordering algorithm can perform joint morphological disambiguation is that it uses rich syntac- tic and n-gram features to score candidate hypotheses, which can also differentiate between correct and incorrect inﬂections under particular contexts. For example, an honest person and a honest person can be differentiated by n-gram features, while Tom and Sally is and Tom and Sally are can be differentiated by higher-order dependency features. In addition to lemma-formed inputs, one other difference between the shared task and the word ordering problem solved by Algorithm 4 is that the former uses labeled dependencies whereas Algorithm 4 constructs unlabeled dependency trees. We address this issue by assigning dependency labels in the construction of dependency links, and applying an extra set of features. The new features are deﬁned by making a duplicate of all the features from Table 2 that contain dir information, and associating each feature in the new copy with a dependency label. The training of the word ordering system requires fully ordered dependency trees, while references in the shared task data are raw sentences. We perform a pre-processing 532 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Table 10 Results and comparison with the top-performing systems on the shared task data System BLEU STUMABA DCU this article 89.1 85.8 89.6 step to obtain gold-standard training data by matching the input lemmas to the ref- erence sentence in order to obtain their gold-standard order. More speciﬁcally, given a training instance, we generate all candidate inﬂections for each lemma, resulting in an exponential set of possible mappings between the input tree and the reference sentence. We then prune these mappings bottom–up, assuming that the dependency tree is projective, and therefore that each word dominates a continuous span in the reference. After such pruning, only one correct mapping is found for the majority of the cases. For the cases where more than one mapping is found, we randomly choose one as the gold-standard. There are also instances for which no correct ordering can be found, and these are mostly due to non-projectivity in the shared task data, with a few cases being due to conﬂicts between our morphological generation system and the shared task data, or inconsistency in the data itself. Out of the 39K training instances, 2.8K con- ﬂicting instances are discarded, resulting in 36.2K gold-standard ordered dependency trees. Table 10 shows the results of our system and the top two participating systems of the shared task. Our system outperforms the STUMABA system by 0.5 BLEU points, and the DCU system by 3.8 BLEU points. More evaluation of the system was published in Song et al. (2014). 6. Related Work There is a recent line of research on text-to-text generation, which studies the lineariza- tion of dependency structures (Barzilay and McKeown 2005; Filippova and Strube 2007, 2009; He et al. 2009; Bohnet et al. 2010; Guo, Hogan, and van Genabith 2011). On the other hand, Wan et al. (2009) study the ordering of a bag of words without any depen- dency information given. We generalize the word ordering problem, and formulate it as a task of ordering a multi-set of words, regardless of input syntactic constraints. Our bottom–up, chart-based generation algorithm is inspired by the line of work on chart-based realization (Kay 1996; Carroll et al. 1999; White 2004, 2006; Carroll and Oepen 2005). Kay (1996) ﬁrst proposed the concept of chart realization, drawing analo- gies between realization and parsing of free order languages. He discussed efﬁciency issues and provided solutions to speciﬁc problems. For the task of realization, efﬁciency improvement has been further investigated (Carroll et al. 1999; Carroll and Oepen 2005). The inputs to these systems are logical forms, which form natural constraints on the interaction between edges. In our case, one constraint that has been leveraged in the dependency system is a projectivity assumption—we assume that the dependents of a word must all have been attached before the word is attached to its head word, and that spans do not cross during combination. In addition, we assume that the right dependents of a word must have been attached before a left dependent of the word is attached. This constraint avoids spurious ambiguities. The projectivity assumption 533 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 is an important basis for the feasibility of the dependency system; it is similar to the chunking constraints of White (2006) for CCG-based realization. White (2004) describes a system that performs CCG realization using best-ﬁrst search. The search process of our algorithm is similar to that work, but the input is different: logical forms in the case of White (2004) and bags of words in our case. Further along this line, Espinosa, White, and Mehay (2008) also describe a CCG-based realization system, applying “hypertagging”—a form of supertagging—to logical forms in order to make use of CCG lexical categories in the realization process. White and Rajkumar (2009) further use perceptron reranking on n-best realization output to improve the quality. The use of perceptron learning to improve search has been proposed in guided learning for easy-ﬁrst search (Shen, Satta, and Joshi 2007) and LaSO (Daum´e and Marcu 2005). LaSO is a general framework for various search strategies. Our learning algorithm is similar to LaSO with best-ﬁrst inference, but the parameter updates are different. In particular, LaSO updates parameters when all correct hypotheses are lost, but our algorithm makes an update as soon as the top item from the agenda is incorrect. Our algorithm updates the parameters using a stronger precondition, because of the large search space. Given an incorrect hypothesis, LaSO ﬁnds the corresponding gold hypothesis for a perceptron update by constructing its correct sibling. In contrast, our algorithm takes the lowest scored gold hypothesis currently in the agenda to avoid updating parameters for hypotheses that may have not been constructed. Our parameter update strategy is closer to the guided learning mechanism for the easy-ﬁrst algorithm of Shen, Satta, and Joshi (2007), which maintains a queue of hypotheses during search, and performs learning to ensure that the highest-scored hypothesis in the queue is correct. However, in easy-ﬁrst search, hypotheses from the queue are ranked by the score of their next action, rather than the hypothesis score. Moreover, Shen, Satta, and Joshi use aggressive learning and regenerate the queue after each update, but we perform non-aggressive learning, which is faster and is more fea- sible for our large and complex search space. Similar methods to Shen, Satta, and Joshi (2007) have also been used in Shen and Joshi (2008) and Goldberg and Elhadad (2010). Another framework that closely integrates learning and search is SEARN (Daum´e, Langford, and Marcu 2009), which addresses structured prediction problems that can be transformed into a series of simple classiﬁcation tasks. The transformation is akin to greedy search in the sense that the complex structure is constructed by sequential classiﬁcation decisions. The key problem that SEARN addresses is how to learn the tth decision based on the previous t − 1 decisions, so that the overall loss in the resulting structure is minimized. Similar to our framework, SEARN allows arbitrary features. However, SEARN is more oriented to greedy search, optimizing local decisions. In contrast, our framework is oriented to best-ﬁrst search, optimizing global structures. Learning and search also interact with each other in a global discriminative learning and beam-search framework for incremental structured prediction (Zhang and Clark 2011). In this framework, an output is constructed incrementally by a sequence of tran- sitions, while a beam is used to record the highest scored structures at each step. Online training is performed based on the search process, with the objective function being the margin between correct and incorrect structures. The method involves an early- update strategy, which stops search and updates parameters immediately when the gold structure falls out of the beam during training. It was ﬁrst proposed by Collins and Roark (2004) for incremental parsing, and later gained popularity in the investigations of many NLP tasks, including POS-tagging (Zhang and Clark 2010), transition-based dependency parsing (Zhang and Clark 2008; Huang and Sagae 2010), and machine 534 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering translation (Liu 2013). Huang, Fayong, and Guo (2012) propose a theoretical analysis to the early-update training strategy, pointing out that it is a type of training method that ﬁxes score violations in inexact search. When the score of a gold-standard structure is lower than that of a non-gold structure, a violation exists. Our parameter udpate strategy in this article can also be treated as a mechanism for violation ﬁxing. 7. Conclusion We investigated the general task of syntax-based word ordering, which is a fundamental problem for text generation, and a computationally very expensive search task. We provide a principled solution to this problem using learning-guided search, a frame- work that is applicable to other NLP problems with complex search spaces. We com- pared different methods for parameter updates, and showed that a scaled linear model gave the best results by allowing better comparisons between phrases of different sizes, increasing the separability of hypotheses and enabling the expansion of negative examples during training. We formulate abstract word ordering as a spectrum of tasks with varying input speciﬁcity, from “pure” word ordering without any syntactic information to fully- informed word ordering with a complete unordered dependency tree given. Experi- ments show that our proposed method can effectively use available input constraints in generating output sentences. Evaluation on the NLG 2011 shared task data shows that our system can be suc- cessfully applied to a more realistic application scenario, in particular one where some dependency constraints are provided in the input and word inﬂection is required as well as word ordering. Additional tasks that may be required in a practical text generation scenario include word selection, including the determination of content words and generation of function words. The joint modeling solution that we have proposed for word ordering and inﬂection could also be adopted for word selection, although the search space is greatly increased when the words themselves need deciding, particularly content words. Acknowledgments This work was carried out partly while Yue Zhang was a postdoctoral research associate at the University of Cambridge Computer Laboratory, where he and Stephen Clark were supported by the European Union Seventh Framework Programme (FP7-ICT-2009-4) under grant agreement no. 247762, and partly after Yue Zhang joined Singapore University of Technology and Design, where he was supported by the MOE grant T2-MOE-2013-01. We thank Bill Byrne, Marcus Tomalin, Adri`a de Gispert, and Graeme Blackwood for numerous discussions; Anja Belz and Mike White for kindly providing the NLG 2011 shared task data; Kai Song for helping with tables and ﬁgures in the draft; Yijia Liu for helping with the bibliography; and the anonymous reviewers for the many constructive comments that have greatly improved this article since the ﬁrst draft. References Auli, Michael and Adam Lopez. 2011. Training a log-linear parser with loss functions via softmax-margin. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 333–343, Edinburgh. Barzilay, Regina and Kathleen McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328. Belz, Anja, Michael White, Dominic Espinosa, Eric Kow, Deirdre Hogan, and Amanda Stent. 2011. The ﬁrst surface realisation shared task: Overview and evaluation results. In Proceedings of the 13th European Workshop on Natural Language 535 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Generation, ENLG ’11, pages 217–226, Stroudsburg, PA. Blackwood, Graeme, Adri`a de Gispert, and William Byrne. 2010. Fluency constraints for minimum Bayes-risk decoding of statistical machine translation lattices. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 71–79, Beijing. Bohnet, Bernd, Leo Wanner, Simon Mill, and Alicia Burga. 2010. Broad coverage multilingual deep sentence generation with a stochastic multi-level realizer. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 98–106, Beijing. Bos, Johan, Stephen Clark, Mark Steedman, James R. Curran, and Julia Hockenmaier. 2004. Wide-coverage semantic representations from a CCG parser. In Proceedings of COLING-04, pages 1240–1246, Geneva. Caraballo, Sharon A. and Eugene Charniak. 1998. New ﬁgures of merit for best-ﬁrst probabilistic chart parsing. Computational Linguistics, 24:275–298. Carroll, John, Ann Copestake, Dan Flickinger, and Victor Poznanski. 1999. An efﬁcient chart generator for (semi-) lexicalist grammars. In Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG99), pages 86–95, Toulouse. Carroll, John and Stephan Oepen. 2005. High efﬁciency realization for a wide-coverage uniﬁcation grammar. In Proceedings of the Second International Joint Conference on Natural Language Processing, IJCNLP’05, pages 165–176, Berlin. Chang, Pi-Chuan and Kristina Toutanova. 2007. A discriminative syntactic word order model for machine translation. In Proceedings of ACL, pages 9–16, Prague. Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228. Clark, Stephen and James R. Curran. 2007a. Perceptron training for a wide-coverage lexicalized-grammar parser. In Proceedings of the ACL 2007 Workshop on Deep Linguistic Processing, pages 9–16, Prague. Clark, Stephen and James R. Curran. 2007b. Wide-coverage efﬁcient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552. Collins, Michael and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of ACL, pages 111–118, Barcelona. 536 Crammer, Koby, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585. Daum´e, Hal, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325. Daum´e, Hal and Daniel Marcu. 2005. Learning as search optimization: approximate large margin methods for structured prediction. In ICML, pages 169–176, Bonn. Espinosa, Dominic, Michael White, and Dennis Mehay. 2008. Hypertagging: Supertagging for surface realization with CCG. In Proceedings of ACL-08: HLT, pages 183–191, Columbus, OH. Filippova, Katja and Michael Strube. 2007. Generating constituent order in German clauses. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 320–327, Prague. Filippova, Katja and Michael Strube. 2009. Tree linearization in English: Improving language model based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 225–228, Boulder, CO. Fowler, Timothy A. D. and Gerald Penn. 2010. Accurate context-free parsing with combinatory categorial grammar. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 335–344, Uppsala. Goldberg, Yoav and Michael Elhadad. 2010. An efﬁcient algorithm for easy-ﬁrst non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750, Los Angeles, CA. Guo, Yuqing, Deirdre Hogan, and Josef van Genabith. 2011. DCU at generation challenges 2011 surface realisation track. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 227–229, Nancy. He, Wei, Haifeng Wang, Yuqing Guo, and Ting Liu. 2009. Dependency based Chinese sentence realization. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Zhang and Clark Discriminative Syntax-Based Word Ordering Conference on Natural Language Processing of the AFNLP, pages 809–816, Suntec. Hockenmaier, Julia. 2003. Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, School of Informatics, University of Edinburgh. Hockenmaier, Julia and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396. Huang, Liang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151, Montr´eal. Huang, Liang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of ACL, pages 1077–1086, Uppsala. Liu, Yang. 2013. A shift-reduce parsing algorithm for phrase-based string-to-dependency translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10, Soﬁa. Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA. Ratnaparkhi, Adwait. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of EMNLP, pages 133–142, Somerset, NJ. Johansson, Richard and Pierre Nugues. 2007. Reiter, Ehud and Robert Dale. 1997. Building Extended constituent-to-dependency conversion for English. In 16th Nordic Conference of Computational Linguistics, pages 105–112, Tartu. Kay, Martin. 1996. Chart generation. In Proceedings of ACL, pages 200–204, Santa Cruz, CA. Koehn, Phillip. 2010. Statistical Machine Translation. Cambridge University Press. Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague. Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 48–54, Edmonton, Canada. Koo, Terry and Michael Collins. 2010. Efﬁcient third-order dependency parsers. In Proceedings of ACL, pages 1–11, Uppsala. Lee, J. and S. Seneff. 2006. Automatic grammar correction for second-language learners. In Proceedings of Interspeech, pages 1978–1981, Pittsburgh, PA. applied natural language generation systems. Natural Language Engineering, 3(1):57–87. Shen, Libin and Aravind Joshi. 2008. LTAG dependency parsing with bidirectional incremental construction. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 495–504, Honolulu. HI. Shen, Libin, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classiﬁcation. In Proceedings of ACL, pages 760–767, Prague. Song, Linfeng, Yue Zhang, Kai Song, and Qun Liu. 2014. Joint morphological generation and syntactic linearization. In Proceedings of the Twenty-Eighth AAAI Conference, Quebec. Steedman, Mark. 2000. The Syntactic Process. The MIT Press, Cambridge, MA. Surdeanu, Mihai, Richard Johansson, Adam Meyers, Llu´ıs M`arquez, and Joakim Nivre. 2008. The CONLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177, Manchester. Wan, Stephen, Mark Dras, Robert Dale, and C´ecile Paris. 2009. Improving grammaticality in statistical sentence generation: Introducing a dependency spanning tree algorithm with an argument satisfaction model. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 852–860, Athens. 537 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 41, Number 3 Weir, David. 1988. Characterizing Mildly Context-Sensitive Grammar Formalisms. Ph.D. thesis, University of Pennsylviania. White, Michael. 2004. Reining in CCG chart realization. In Proceedings of INLG-04, pages 182–191, Brockenhurst. White, Michael. 2006. Efﬁcient Realization of Coordinate Structures in Combinatory Categorial Grammar. Research on Language & Computation, 4(1):39–75. White, Michael and Rajakrishnan Rajkumar. 2009. Perceptron reranking for CCG realization. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 410–419, Singapore. Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403. Xu, Peng, Ciprian Chelba, and Frederick Jelinek. 2002. A study on richer syntactic dependencies for structured language modeling. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 191–198, Philadelphia, PA. Zhang, Yue. 2013. Partial-tree linearization: Generalized word ordering for text synthesis. In Proceedings of IJCAI, pages 2232–2238, Beijing. Zhang, Yue, Graeme Blackwood, and Stephen Clark. 2012. Syntax-based word ordering incorporating a large-scale language model. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 736–746, Avignon. Zhang, Yue and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888–896, Columbus, OH. Zhang, Yue and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 843–852, Cambridge, MA. Zhang, Yue and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151. Zettlemoyer, Luke S. and Michael Collins. Zhang, Yue and Joakim Nivre. 2011. 2005. Learning to map sentences to logical form: Structured classiﬁcation with probabilistic categorial grammars. In Proceedings of the 21st Conference on Uncertainty in Artiﬁcial Intelligence, pages 658–666, Edinburgh. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, OR. 538 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 1 3 5 0 3 1 8 0 6 5 3 0 / c o l i _ a _ 0 0 2 2 9 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3
Descargar PDF