Incremental, Predictive Parsing with
Psycholinguistically Motivated
Tree-Adjoining Grammar

Vera Demberg∗
Saarland University

Frank Keller∗∗
University of Edinburgh

Alexander Koller†
University of Potsdam

Psycholinguistic research shows that key properties of the human sentence processor are incre-
mentality, connectedness (partial structures contain no unattached nodes), and prediction (up-
coming syntactic structure is anticipated). There is currently no broad-coverage parsing model
with these properties, however. In this article, we present the first broad-coverage probabilistic
parser for PLTAG, a variant of TAG that supports all three requirements. We train our parser
on a TAG-transformed version of the Penn Treebank and show that it achieves performance
comparable to existing TAG parsers that are incremental but not predictive. We also use our
PLTAG model to predict human reading times, demonstrating a better fit on the Dundee eye-
tracking corpus than a standard surprisal model.

1. Introduction

Evidence from psycholinguistic research suggests that human language comprehension
is incremental. Comprehenders do not wait until the end of the sentence before they
build a syntactic representation for the sentence; rather, they construct a sequence of
partial representations for sentence prefixes. Experimental results indicate that each
new word that is read or heard triggers an update of the representation constructed
so far (Tanenhaus et al. 1995; Konieczny 2000).

There is also evidence for connectedness in human language processing (Sturt
and Lombardo 2005). Connectedness means that all input words are attached to the

∗ Cluster of Excellence Multimodal Computing and Interaction (MMCI), Postfach 151150, 66041 Saarbrücken, Germany. E-mail: vera@coli.uni-saarland.de.

∗∗ Institute for Language, Cognition, and Computation, School of Informatics, 10 Crichton Street, Edinburgh EH8 9AB, UK. E-mail: keller@inf.ed.ac.uk.

† Department of Linguistics, Karl-Liebknecht-Straße 24–25, 14476 Potsdam, Germany. E-mail: koller@ling.uni-potsdam.de.

Submission received: 4 July 2011; revised submission received: 22 December 2012; accepted for publication:
22 January 2013.

doi:10.1162/COLI_a_00160

© 2013 Association for Computational Linguistics


same syntactic structure (though connected structures can be constructed in parallel);
comprehenders build no unconnected tree fragments, even for the sentence prefixes
that arise during incremental processing.

Furthermore, a range of studies show that comprehenders make predictions about
upcoming material on the basis of sentence prefixes. There is experimental evidence
that listeners predict complements of verbs based on their selectional restrictions
(Altmann and Kamide 1999); readers predict a phrase introduced by or on encoun-
tering the word either (Staub and Clifton 2006); also the subcategorization frame of a
verb can be used for prediction (Arai and Keller 2013). These studies find processing
facilitation if predictions can be verified successfully, compared with sentences where
predictions cannot be made or turn out to be incorrect. Presumably, the human sen-
tence processor uses prediction mechanisms to enable efficient comprehension in real
time.

The three concepts of incrementality, connectedness, and prediction are fundamen-
tally interrelated: Maintaining connected partial analyses is only nontrivial if the pars-
ing process is incremental, and prediction means that a connected analysis is required
also for words the parser has not yet seen. In this article, we exploit the interrelat-
edness of incrementality, connectedness, and prediction to develop a parsing model
for psycholinguistically motivated TAG (PLTAG; Demberg and Keller 2008b). This for-
malism augments standard tree-adjoining grammar (TAG; Joshi, Levy, and Takahashi
1975) with a predictive lexicon and a verification operation for validating predicted
structures. As we show in Section 2, these operations are motivated by psycholinguistic
findings.

We argue that our PLTAG parser can form the basis for a new model of human
sentence processing. We successfully evaluate the predictions of this model against
reading time data from an eye-tracking corpus, showing that it provides a better fit
with the psycholinguistic data than the standard surprisal model of human sentence
processing. Crucially, this evaluation relies on the broad-coverage nature of our PLTAG
parser, that is, the fact that it achieves high coverage and good parsing accuracy on
corpus data. Only a broad-coverage parser can be used to model naturalistic data
such as reading times from an eye-tracking corpus; this sets our approach apart from
most other psycholinguistic models, for which only small-scale implementations for
restricted data sets are available.

On the technical side, our key contribution is a novel parsing algorithm for prob-
abilistic PLTAG. Incremental fully connected parsing is fundamentally more difficult
than non-incremental parsing or parsing without connectedness: Explicit hypotheses
about how the words in a sentence are connected have to be made before all of the
relevant evidence has been encountered in the input. The number of connected analyses
grows quickly with the length of the sentence, and this problem gets worse in the
presence of predicted structure. Our parsing algorithm addresses this by grouping
equivalent analyses together by only considering the fringes of trees, and by controlling
the prediction process via supertagging. We evaluate our parser on a TAG-converted
version of the Penn Treebank, achieving a coverage of 98.09% and an F-score of 79.41.
These results approach the performance of previous (non-predictive) incremental TAG
parsers.

We present a formalization of PLTAG in Section 3, introduce the PLTAG parsing
algorithm and probability model in Section 4, show how a PLTAG lexicon can be
induced from an augmented version of the Penn Treebank in Section 5, test parsing
performance in Section 6, and finally provide a psycholinguistic evaluation on an eye-
tracking corpus in Section 7.


2. Background and Related Work

This section situates the current work with respect to the experimental literature on
human parsing, and with respect to prior work on incremental parsing.

2.1 Prediction, Incrementality, and Connectedness in Human Parsing

We start with a short review of the experimental evidence for incremental, predictive,
and connected processing in human parsing. In a classic study, Altmann and Kamide
(1999) showed that listeners can predict the complement of a verb based on its selec-
tional restrictions. Participants heard sentences such as:

(1) a. The boy will eat the cake.
    b. The boy will move the cake.

while viewing images that depicted sets of relevant objects—in this example, a cake, a
train set, a ball, and a model car. Altmann and Kamide (1999) monitored participants’
eye-movements while they heard the sentences and found an increased number of looks
to the cake during the word eat compared with the control condition, that is, during the
word move (only the cake is edible, but all depicted objects are movable). This indicates
that selectional preference information provided by the verb is not only used as soon as
it is available (i.e., incremental processing takes place), but this information also triggers
the prediction of upcoming arguments of the verb. Subsequent work has generalized
this effect, demonstrating that syntactic information such as case marking is also used
for prediction (Kamide, Scheepers, and Altmann 2003).

More recently, Arai and Keller (2013) used the same experimental paradigm to
show that verb subcategorization information is used for prediction. They compared
transitive and intransitive verbs in sentences such as:

(2) a. The inmate offended the judge.
    b. The inmate frowned at the judge.

Participants’ eye-movements indicate which subcategorization frame they assume
when they process the verb. While hearing offended, listeners predict upcoming patient
information and look at the judge. While hearing frowned, no such prediction is possible,
and there is no increase of looks at the judge (this increase is observable later, during
at). This shows that the human parser uses the subcategorization frame of the verb to
anticipate upcoming syntactic structure, working out whether this structure contains a
noun phrase argument or not.

Selectional restrictions, case marking, and subcategorization arguably are all en-
coded as part of lexical items, which raises the question whether the prediction of larger
structural units is also possible. This was addressed by a study of Staub and Clifton
(2006), who investigated prediction in coordinate structures. They compared sentences
such as:

(3) a. Peter read either a book or an essay in the school magazine.
    b. Peter read a book or an essay in the school magazine.

By monitoring eye-movements during reading, they found that the presence of either
leads to shorter reading times on or and on the noun phrase that follows it in Exam-
ple (3a), compared to the control condition Example (3b). This suggests that the word
either makes it possible to anticipate an upcoming noun phrase conjunction, ruling out


verb phrase conjunction (which remains possible in Example (3b)). This result can be
taken as evidence for structural prediction, that is, prediction that goes beyond the
lexical information (case marking, subcategorization, etc.) encoded in the word either.

Let us now turn to the evidence for connectedness in human parsing. Connected-
ness means that the parser only generates syntactic trees that cover all of the input re-
ceived so far. The claim is that comprehenders do not build unconnected tree fragments,
even when the syntactic heads needed to build a connected structure are not available
yet during incremental processing. Evidence for this claim comes from an experiment
by Sturt and Lombardo (2005), who investigated the binding of pronouns in sentences
such as:

(4) a. The pilot embarrassed Mary and put herself in an awkward situation.
    b. The pilot embarrassed Mary and put her in an awkward situation.

They found increased reading times on the word herself in Example (4a), but not on
her in Example (4b). They attribute this to a gender mismatch between herself and its
antecedent pilot, whose stereotypical gender is masculine. No such mismatch occurs
in Example (4b), as the antecedent of her is Mary.

Crucially, this gender mismatch can only be detected if the anaphor is c-commanded
by its antecedent. The c-command relationship can only be established if the parser
builds a fully connected structure, which includes a path from the anaphor to its an-
tecedent. A parser that operates on unconnected sentence fragments therefore is unable
to predict the contrast in Example (4); Sturt and Lombardo (2005) use this to argue for
TAG as the basis for a model of human sentence processing, and against formalisms
with a weaker notion of connectedness, such as Combinatory Categorial Grammar
(CCG; Steedman 2000). Subsequent work has provided evidence for connectedness in
a range of other phenomena, including sluicing and ellipsis (Aoshima, Yoshida, and
Phillips 2009; Yoshida, Walsh-Dickey, and Sturt 2013).

2.2 Incremental Parsing Models

In the previous section, we identified incrementality, connectedness, and prediction as
key desiderata for computational models of human parsing. In what follows, we will
review work on parsing in computational linguistics in the light of these desiderata.

Incremental parsers for a range of grammatical formalisms have been proposed
in the literature. An example is the work of Shen and Joshi (2005), who propose an
efficient incremental parser for a variant of TAG, spinal TAG. This approach, however,
allows multiple unconnected subtrees for a sentence prefix and uses a look-ahead of two
words, that is, it does not build connected structures. An example of a TAG parser that
is both incremental and builds connected structures is the work of Kato, Matsubara, and
Inagaki (2004). This comes at the price of strong simplifying assumptions with respect
to the TAG formalism, such as not distinguishing modifiers and arguments. (We will
return to a discussion of other TAG parsers in Section 6.1.)

An example of an incremental parser based on context-free grammars is the one
proposed by Roark (2001). That parser uses a top–down algorithm to build fully con-
nected structures; it is also able to compute probabilities for sentence prefixes, which
makes it attractive for psycholinguistic modeling, where prefix probabilities are often
used to predict human processing difficulty (see Section 7 for details). The Roark parser
has been shown to successfully model psycholinguistic data from eye-tracking corpora
(Demberg and Keller 2008a; Frank 2009) and other reading time data (Roark et al. 2009).
It therefore is a good candidate for a broad-coverage model of human parsing, and


will serve as a standard of comparison for the model proposed in the current article
in Section 7. The Roark parser has been extended with discriminative training (Collins
and Roark 2004), resulting in a boost in parsing accuracy. Prefix probabilities cannot
be computed straightforwardly in a discriminative framework, however, making this
approach less interesting from a psycholinguistic modeling point of view.
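
For reference, the prefix probabilities mentioned above are usually turned into a word-by-word difficulty measure by taking the log-ratio of consecutive prefix probabilities, that is, roughly, surprisal(wi+1) = − log P(w1 . . . wi+1) + log P(w1 . . . wi); a parser therefore only needs to expose the probability of each sentence prefix for this kind of modeling (see Section 7 for how such a measure is used to predict reading times).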

Wu et al. (2010) propose another approach based on prefix probabilities over
context-free structures. These are generated in their approach using a bottom–up pars-
ing algorithm based on hierarchical hidden Markov models (HMMs) (Schuler et al.
2010). They show that prefix probabilities, as well as a new measure based on the
embedding depth of the HMM, successfully predict human reading time data.

In the dependency parsing literature, Nivre (2004) proposes a parser that builds
dependency structures word-by-word, based on a shift-reduce algorithm. This
approach is highly efficient, but has two disadvantages from a psycholinguistic point
of view: Firstly, it cannot guarantee that only connected structures are built, as the
stack potentially contains unconnected words (though Nivre [2004] shows that 68.9%
of all parse configurations contain only connected components, rising to 87.1% if only
valid dependency graphs are considered). Secondly, Nivre (2004) uses a discriminative
probability model over parser actions, which means that prefix probabilities cannot be
computed directly. It is, however, possible to predict reading times using probabilities
over parser actions rather than prefix probabilities, as Boston et al. (2008) have shown
by using the Nivre (2004) parser to model reading times in an eye-tracking corpus of
German sentences.

An interesting alternative to Nivre’s approach has been proposed by Beuck, Köhn,
and Menzel (2011), who introduce an incremental version of Weighted Constraint
Dependency Grammar (WCDG). The proposed parsing framework is able to produce
structures that are both connected and predictive; this is achieved by the introduction
of virtual nodes in the dependency tree, an idea akin to our use of prediction trees
in TAG (detailed subsequently). WCDG parsing is non-monotonic, that is, it uses a
mechanism by which the current analysis can be revised if it becomes incompatible
with the input. This contrasts with the fully monotonic approach we use in the present
article. In terms of evaluation, Beuck, Köhn, and Menzel (2011) present a comparison of
their incremental WCDG parser with the model of Nivre (2004) for parsing German.

What is common to all of these approaches is that they lack an explicit prediction
and verification mechanism (WCDG includes prediction, but not verification), which
means that they cannot be used to model psycholinguistic results that involve verifica-
tion cost.1 A simple form of prediction can be achieved in a chart parser (incomplete
edges in the chart can be seen as predictive), but in order to maintain psycholinguistic
plausibility, an arc-eager left-corner parsing strategy needs to be used. Other parsing
strategies fail to predict human processing difficulty that arises in certain cases, such
as for center embedding (Thompson, Dixon, and Lamping 1991; Resnik 1992a). This
is an argument against using a top–down parser such as Roark’s for psycholinguistic
modeling. Furthermore, it is important to emphasize that a full model of human parsing
needs to not only model prediction, but also account for processing difficulty associated
with the verification of predictions (we will return to this point in Section 7). None of
the existing incremental parsing models includes an explicit verification component.

1 As Demberg and Keller (2009) show, some psycholinguistic results can be accounted for by a model
without verification, such as the either . . . or finding, whereas other results, such as the relative clause
asymmetry, require a verification component; see Section 7.4 for more discussion.


In this article, we propose the first parser that instantiates the properties of incre-
mentality, connectedness, and prediction in a psycholinguistically motivated way. We
achieve this by exploiting the fact that these three concepts are closely related: In order
to guarantee that the syntactic structure of a sentence prefix is fully connected, it may be
necessary to build phrases whose lexical anchors (the words that they relate to) have not
been encountered yet. In other words, the parser needs to predict upcoming syntactic
structure in order to ensure connectedness. This prediction scheme is complemented by
an explicit verification mechanism in our approach. Furthermore, unlike most existing
psycholinguistic models (see Keller 2010 for an overview), our model achieves broad
coverage and acceptable parsing performance on a standard test corpus. This property
is essential for testing psycholinguistic models on realistic data, including eye-tracking
corpora.

The PLTAG formalism was first proposed by Demberg-Winterfors (2010), who also
presents an earlier version of the parsing algorithm, probability model, implementation,
and evaluation described in the current article.

3. The PLTAG Formalism

We start by introducing the PLTAG formalism, which we will use throughout the article.

3.1 Incremental TAG Parsing

Tree Adjoining Grammar (TAG) is a grammar formalism based on combining trees. In
what follows we will focus on lexicalized TAG (LTAG; Joshi and Schabes 1992), which
is the most widely used version of TAG. In this formalism, a TAG lexicon consists of
a finite set of elementary trees whose nodes are labeled with nonterminal or terminal
symbols. Each elementary tree contains an anchor, a leaf node labeled with a terminal
symbol. At most one other leaf—the foot node—may carry a label of the form A∗, where
A is a nonterminal symbol. All other leaves are substitution nodes and labeled with
symbols of the form A↓. Elementary trees that contain a foot node are called auxiliary
trees; those that contain no foot nodes are initial trees. We will generally call leaves of
trees that are labeled with words lexical leaves. An example TAG lexicon is shown in
Figure 1a–c.

TAG builds grammatical derivations out of these elementary trees using two tree-
combining operations, substitution and adjunction. Figure 2 shows a derivation of the

Figure 1
Elementary trees in an example PLTAG lexicon. The predictive status of nodes in a prediction
tree (see Section 3.2) is marked with upper and lower markers, written as superscript and subscript k.


Figure 2
Example of a (non-incremental) TAG derivation of Peter often sleeps using the elementary trees
from Figure 1.

string Peter often sleeps. In the first step, this derivation substitutes the elementary tree (a)
from Figure 1 into the substitution node of (b). This is allowed because (a) is an initial
tree and the nonterminal at the root of (a) matches the nonterminal on the substitution
node; the substitution operation then replaces this substitution node by (a). Second,
the derivation adjoins the auxiliary tree (c) into the VP node of the tree for Peter sleeps; we
call the VP node the adjunction site. The resulting tree is like the original from the root
down to the VP node; then it continues with (c), with the foot node replaced by what was
below the VP node before. The result is a tree whose leaves are all labeled with terminal
symbols, which read Peter often sleeps from left to right. We call such a tree a derived tree
for this string. We generically call a substitution or adjunction operation an integration,
and the (inner or substitution) node at which it is applied the integration site.
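
To make these operations concrete, the following minimal sketch (our own illustration in Python, not the representation used by the parser described later) implements substitution and adjunction destructively on simple node objects; the names Node, substitute, and adjoin are ours:

    class Node:
        def __init__(self, label, children=None, kind="internal"):
            # kind: "internal", "anchor" (lexical leaf), "subst" (A-down), or "foot" (A-star)
            self.label = label
            self.children = children or []
            self.kind = kind

    def substitute(subst_node, initial_root):
        # Replace a substitution node by an initial tree whose root
        # carries the same nonterminal.
        assert subst_node.kind == "subst" and subst_node.label == initial_root.label
        subst_node.kind = initial_root.kind
        subst_node.children = initial_root.children

    def adjoin(site, aux_root, foot):
        # Splice an auxiliary tree into an internal node with the same label:
        # what was below the adjunction site ends up below the foot node.
        assert site.label == aux_root.label == foot.label and foot.kind == "foot"
        foot.kind = "internal"
        foot.children = site.children
        site.children = aux_root.children

In the derivation of Figure 2, substitution plugs the tree for Peter into the NP substitution node of the sleeps tree, and adjunction splices the often tree in at the VP node, exactly as sketched here.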

Notice that one can think of the adjunction operation as cutting the adjunction site
in two halves. The upper half is identified with the root of the auxiliary tree; if one thinks
of root nodes as being only the lower half of a node, these two halves recombine into a
complete node. The lower half of the adjunction site is identified with the foot node; we
think of the foot node as only having an upper half, which again makes a whole node.
We assume that lexical leaves only have an upper half, too; this makes no difference,
as no substitution or adjunction can be performed on those nodes anyway. The process
is illustrated in Figure 3, which shows the recombination of node halves from different
elementary trees in the adjunction step of Figure 2: Black node halves come from the
elementary tree for sleeps, gray node halves from Peter, and white ones from often. The
idea of distinguishing upper and lower node halves that are pushed apart by adjunction
comes from FTAG (Vijay-Shanker and Joshi 1988), which equips each node half with a
separate feature structure; at the end of the derivation process, the upper and lower
feature structures of each node are unified with each other. Node halves will also play
a crucial role in PLTAG.

3.2 Prediction Trees

We have argued earlier that a psycholinguistic model of sentence processing should be
incremental. In the context of TAG and related formalisms, this means that a derivation

Figure 3
Fine structure of adjunction. The semicircles represent node halves; all node halves from the
same elementary tree are drawn in the same color.


Figure 4
Example of an (incremental) PLTAG derivation of Peter often sleeps using the elementary trees
from Figure 1.

starts with a lexicalized elementary tree for the first word of the sentence, then combines
it with an elementary tree for the second word, and so on. The derivation in Figure 2 is
not incremental in this sense, because it combines a tree for the first word directly with a
tree for the third word in the sentence, and only then adjoins a tree for the second word.
In fact, it is not possible to produce an incremental derivation of Peter often sleeps with
the TAG grammar in Figure 1a–c (or any other linguistically motivated TAG grammar
for English), because the tree for often must adjoin into a VP node, which is not yet
available after processing Peter.

The PLTAG formalism (for psycholinguistically motivated TAG) solves this prob-
lem by introducing prediction trees in addition to the usual initial and auxiliary trees
of TAG (which we will call canonical trees). Prediction trees are elementary trees that
may or may not contain a lexical anchor (see the right part of Figure 1 for an example).
PLTAG derivations can use them to predict syntactic structure that will be introduced
by later words in the incremental derivation, in order to keep the syntactic structure
connected and provide adjunction sites. Each node in a prediction tree carries one or
two markers to indicate that this node has only been predicted, and is not actually
part of the syntactic structure introduced by any word that has been processed so far.
These markers can be thought of as decorating the upper and lower half of the node,
as described earlier; therefore internal nodes always have an upper and lower marker,
root nodes only have lower markers, and substitution and foot nodes only have upper
markers. Note that unlike canonical trees, prediction trees can have leaves that are not
substitution nodes, foot nodes, or lexical leaves, and that therefore have two halves (and,
therefore, two markers); the VP leaf of the tree in Figure 1d is an example. A discussion
of how the prediction tree lexicon is acquired, and how the configurations of prediction
trees relate to the canonical elementary trees, is provided in Section 5.1.

When a prediction tree is used in a PLTAG derivation, all of its markers are first
instantiated with fresh symbols, so we can always tell apart the markers introduced by
different prediction trees. In the example PLTAG derivation of Figure 4, all the k-markers
from the lexicon entry are instantiated with 1-markers. The operations of substitution
and adjunction are then applied to prediction tree instances in exactly the same way
as to canonical trees. In particular, adjoining into a node that carries markers pushes
the two markers apart. The upper marker becomes the upper marker of the root of the
auxiliary tree, whereas the lower marker becomes the lower marker of the foot node (see
the second step of Figure 4). Note that if a prediction tree is adjoined into a node that
already carries markers, this may create nodes that have an upper and lower marker
with different values.
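
A sketch of the marker instantiation step (ours; it assumes each node object stores optional upper_marker and lower_marker fields, and an all_nodes traversal helper):

    import itertools

    fresh_ids = itertools.count(1)

    def instantiate(prediction_tree):
        # Replace the generic lexicon markers by one fresh integer per use,
        # so that markers introduced by different prediction trees stay distinct.
        k = next(fresh_ids)
        for node in all_nodes(prediction_tree):   # hypothetical traversal helper
            if node.upper_marker is not None:
                node.upper_marker = k
            if node.lower_marker is not None:
                node.lower_marker = k
        return prediction_tree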

The use of prediction trees in PLTAG is conceptually similar to the use of type
raising in incremental derivations in CCG (Steedman 2000). For example, the prediction
tree in Figure 1d effectively raises the NP in Figure 1a to type (S/(S\NP)) so that it can
compose with the adverb in Figure 1c. Prediction trees, however, are more powerful
in terms of the incremental derivations they support: Some psycholinguistically crucial


constructions (such as object relative clauses) are handled easily by PLTAG, but are not
incrementally derivable in standard CCG (Demberg 2012). According to Demberg, this
problem can be overcome by generalizing the CCG categories involved (in the case of
object relative clauses, the category of the relative pronoun needs to be changed).

3.3 Verification

Markers are eliminated from a partial derived tree through a new operation called
verification. Recall that markers indicate nodes that were predicted during the deriva-
tion, without having been introduced by a word that was actually observed so far. The
verification operation removes these markers by matching them with the nodes of the
canonical elementary tree for a word in the sentence. An example is shown in the last
step of Figure 4. This is a verification step for the marker 1, using the canonical tree for
sleeps as the verification tree τv.

Informally speaking, the job of the verification operation for the marker i is to (a)
check that the i-marked node halves in the prefix tree correspond to nodes in the veri-
fication tree with respect to their labels and their tree structure; (b) remove the markers
from these node halves in the prefix tree; and (c) add nodes to the prefix tree that are
only present in the verification tree, but not the prediction tree. The formal definition of
the verification operation is complicated by the fact that the nodes that were introduced
by one prediction tree may have been pushed apart through subsequent adjunctions.
This means that the i-marked node halves whose labels and relative positions have to
be matched to the verification tree in (a) may be scattered throughout the prefix tree.
Consider the example in Figure 5. In the middle prefix tree (for Peter often), the black
node halves were contributed by the same prediction tree (with marker 1), but they were
pushed apart by the adjunction of a canonical tree for often. These node halves are still
in their original configuration in the verification tree in the box (for sleeps); furthermore,
the canonical tree contains additional nodes that were not present in the prediction tree.
The verification operation solves this problem by establishing a correspondence
between node halves in the prefix tree τ and the verification tree τv. Intuitively, a
mapping between nodes can only be called a correspondence if the marked node halves
in τ are arranged in the same configuration as the corresponding node halves in τv.

Figure 5
Fine structure of the verification operation. The dotted arrows indicate the correspondence f
between the 1-marked node halves and the node halves of the sleeps tree from Figure 1.


All node halves with a given marker in the prefix tree must have corresponding node
halves in τv. Conversely, τv may contain node halves without correspondents in τ. These
unmatched nodes are added to τ.

Technically, a mapping f of node halves in the tree τ to node halves in the verification tree τv is called a correspondence if it has the following properties:

- for each node half h, the node labels of h and f(h) are the same;
- for each lower (upper) node half h, f(h) is also a lower (upper) node half;
- the mapping is injective, that is, h1 ≠ h2 entails f(h1) ≠ f(h2) for all node halves h1 and h2;
- for any two node halves h1 and h2, if h1 is above h2 in τ, then the node half f(h1) is above the node half f(h2) in τv;
- for any two node halves h1 and h2, if h1 precedes h2, then f(h1) precedes f(h2).

A node half h′ of τv matches the node half h in τ if f(h) = h′. Because f need not be
surjective, not every node half in τv necessarily matches something. We will say for
short that an entire node u in τv matches a node half h if it is clear from the context
which half of u matches h.

Not all correspondences are useful when performing a verification. In particular,
we must require that the region of τv that matches node halves in τ is contiguous and
starts at the root of τv; and if some node in τv has unmatched children, then these must
be the rightmost children of their parent. A correspondence is called admissible if the
following holds:

- if a node in τv has two halves and one half matches something, then the other half also matches something;
- if the upper half of the i + 1-st child of some node u of τv matches anything, then the upper half of the i-th child of u must also match something (for any i ≥ 1);
- if the upper half of any child of some node u of τv matches something, then the lower half of u must match something as well.

The verification operation for the marker i can now be performed on the prefix tree
τ with a verification tree τv if there is an admissible correspondence f that maps all
i-marked node halves of τ to node halves of τv. The effect of the operation is to add
subtrees to certain nodes of τ: If some node u in τv with children u1, . . . , un (from left to
right) matches a (lower) node half h, and u1, . . . , uk but not uk+1, . . . , un match (upper)
node halves in τ (i.e., upper node halves of children of h), then the subtrees of τv below
uk+1, . . . , un are added to τ as the k + 1-st to n-th child of h. Furthermore, the marker i is
removed from all node halves in τ; these predicted nodes have now been verified.

We illustrate the definition with the example in Figure 5. The verification rule first
establishes a correspondence between the four 1-marked node halves of the tree for Peter
often and the node halves of the canonical tree for sleeps below, drawn as dotted arrows
in the figure. This correspondence maps one lower half to the root, one upper half to
the substitution node, and one upper and one lower half to the VP node, and the lexical
anchor node sleeps does not match any node half. This correspondence is admissible.


We can therefore apply the verification rule. It removes all 1-markers from the tree.
Furthermore, because the node corresponding to the lower black VP node half has an
unmatched child (the sleeps node), this node is added to the tree by the verification
operation. In addition to the new lexical anchor, the verification tree could also contain
further nodes below and to the right of predicted nodes; all of these would then be
added.

3.4 Derivations in PLTAG

A PLTAG derivation starts with an elementary tree for the first input word, and
then applies substitution, adjunction, and verification operations. We require a PLTAG
derivation to be incremental, in the sense that after each derivation step, the first i leaves
of the partial derived tree are unmarked and labeled with the words w1 . . . wi, for some
i, and there are no other unmarked lexical leaves. Because of this property, the partial
derived trees are prefix trees. In the case of substitution, adjunction, or verification steps
with canonical trees, we need to always add an elementary tree with anchor wi+1 to a
prefix tree for the first i words; this constraint does not apply to unlexicalized prediction
trees. We call a derivation of a sentence w1 . . . wn complete if i = n, the prefix tree
contains no more substitution nodes, foot nodes, or prediction markers, and the root
symbol of the prefix tree is S. The string language of a PLTAG grammar is the set of
string yields of its complete prefix trees.
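
The incrementality condition on prefix trees can be checked directly from the leaf sequence; the following is a small sketch of ours (it assumes each leaf object exposes is_lexical, marked, and word attributes):

    def is_prefix_tree(leaves, words):
        # The first i leaves must be unmarked lexical leaves spelling out
        # words[0:i], and no unmarked lexical leaf may occur after them.
        i = 0
        while i < len(leaves) and leaves[i].is_lexical and not leaves[i].marked:
            i += 1
        if [leaf.word for leaf in leaves[:i]] != list(words[:i]):
            return False
        return not any(leaf.is_lexical and not leaf.marked for leaf in leaves[i:])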

As explained in Section 2, incrementality and connectedness are key desiderata for
a psycholinguistically motivated parsing model. In order to implement these properties
in PLTAG we need to ensure that the parser always constructs well-formed prefix trees.
This entails that PLTAG parsing differs in crucial ways from standard non-incremental
TAG parsing. The reason for this is that not all operations that could be performed
in a non-incremental parsing regime lead to a well-formed prefix tree. The simplest
example involves a prefix tree that contains two substitution nodes: A derivation which
substitutes a canonical tree into the rightmost substitution node will produce a derived
tree in which a substitution node comes before a lexical leaf, and which therefore
violates the incrementality requirement. Similarly, if a verification tree contains any
unmatched nodes to the left of its spine (the spine is the path from the root to the lexical
anchor), the result of the verification operation will also contain leaves that are to the
left of the rightmost leaf that is labeled with a word. Such derivation steps can never be
used in an incremental PLTAG derivation.

To conclude the exposition of the definition of PLTAG, let us compare the expressive
capacities of PLTAG and TAG. First, PLTAG cannot be more expressive than TAG.
In any derivation of some PLTAG grammar G, the nodes that each prediction tree
contributes to the derived tree are replaced by the nodes of some (canonical) verification
tree. Therefore the resulting derived tree can be built (non-incrementally) from only
canonical trees, using substitution and adjunction. Thus if we build a TAG grammar
G′ from G by removing all prediction trees, G and G′ describe the same set of derived
trees.

Conversely, for any TAG grammar G we can also build a PLTAG grammar G′ that
describes the same derived trees. The basic idea is that G′ contains all elementary
trees of G; for each elementary tree α of G, G′ furthermore contains an unlexicalized
version of α as a prediction tree. Given some derived tree of G, we can then build an
incremental derivation of G′ which first builds an unlexicalized version of the derived
tree using substitution and adjunction of prediction trees, and then verifies this derived


tree from left to right. In other words, anything that TAG can derive, PLTAG can derive
incrementally.

3.5 Heads

We will now describe a probability model for PLTAG derivations. Having such a model
is crucial for two reasons. First, our psycholinguistic model depends on prefix proba-
bilities to predict surprisal and reading times (Section 7). Second, the probability model
will allow our parser to perform a more focused search for an incremental derivation
(Section 4).

The PLTAG probability model is a bilexical probability model; that is, the prob-
ability for an operation combining elementary and prefix trees depends both on the
(lexicalized) elementary tree and on the lexical head of the integration site. We use
two different perspectives on lexical heads. First, we keep track of what (possibly
lexicalized) elementary tree τ contributed each node in the prefix tree. When we perform
a substitution or adjunction, we can condition the probability of the operation on the
identity of τ.

Second, we consider a notion of heads based on the phrase structure of the (prefix
or elementary) tree. We assume that this phrase structure is linguistically valid, in that
the syntactic categories represent various types of phrases, and we can recognize from
the node labels of a parent and its children which of the children is the head child. For
instance, in the final derived tree in Figure 4, the head daughter of the S node is the VP,
whereas the head daughter of the upper VP is the lower VP. If u is any node in the tree,
we can follow the head daughters down to a leaf; we call this leaf the head of u. The
head is often a lexical leaf (in the final derived tree in Figure 4, the head of the S node
is sleeps and the head of the NP node is Peter), but could also be the non-lexical leaf of
a prediction tree (the head of the upper VP node in the third prefix tree is the lower VP
node). The head of any node on the spine of a canonical elementary tree is always the
lexical anchor.
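
Finding the head of a node thus amounts to repeatedly descending into head daughters; a minimal sketch of ours, assuming a head_child helper that encodes the head rules:

    def head_of(node):
        # Follow head daughters down to a leaf. The result may be a lexical
        # leaf (e.g., sleeps for the S node in Figure 4) or the non-lexical
        # leaf of a prediction tree (e.g., the lower VP node).
        while node.children:
            node = head_child(node)   # hypothetical head-percolation helper
        return node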

3.6 Probability Model

We are now ready to define the probability model for PLTAG. This model allows us to
define a probability distribution over the derivations of any given PLTAG grammar. It
makes the same independence assumptions as standard models for probabilistic TAG
(Resnik 1992b; Chiang 2000): Any two applications of derivation rules are statistically
independent events. We deviate from these models, however, with regard to what these
events are. Earlier approaches always modeled the probability of substituting or adjoin-
ing the lower elementary tree, given the upper elementary tree and the integration site.
This is inconsistent with the incremental perspective we take here, which assumes that
the prefix tree is given, and we must decide how to integrate an elementary tree for
the next word with it. We therefore model the probability of substituting, adjoining, or
verifying the elementary tree, given the prefix tree.

Because a substitution or adjunction step may either integrate the elementary tree
into the prefix tree or vice versa, we must distinguish the direction in which the
operation is applied. We do this by conditioning the probability of integrating the
elementary tree on the prefix tree, as well as the lower tree’s root node and the upper
tree’s integration site. We refer to these nodes as ue for the root or integration site node
in the elementary tree, and up for the prefix tree root or integration site node, depending
on which out of the prefix and elementary tree is the upper tree in the operation. We


obtain probability distributions PS and PA for substitution and adjunction, as well as
PI for the initial operation that introduces the first tree, as follows. τe stands for the
(possibly lexicalized) elementary tree and τp for the prefix tree.

Initial:        Σ_τe PI(τe) = 1                                                  (5)

Substitution:   Σ_τe PS(τe | τp, ue, up) = 1                                     (6)

Adjunction:     Σ_τe PA(τe | τp, ue, up) + PA(NONE | τp, up) = 1                 (7)

Each sum is over all elementary trees τe in the lexicon, that is, all canonical and predic-
tion trees. For the adjunction probabilities PA, we must also consider the possibility that
the derivation process chooses not to adjoin anything at a given node up of the prefix
tree; this is modeled as the probability of NONE.

For a verification step, we condition on the original integration site of the prediction
tree upredict and the prediction tree τpredict that is matched by the verification tree; areas
of the prefix tree τp outside of τpredict are ignored. The probability distribution PV for
verification operations is defined as follows:

Verification:   Σ_τv PV(τv | τpredict, upredict) = 1                             (8)

Here the sum is over all canonical trees τv in the lexicon. The probability of an entire
derivation is then the product of the probabilities of the individual operations in the
derivation.

Because we are unlikely to ever have seen the exact prefix tree before, we will
approximate its probability using only certain features of the trees and nodes involved.
We will discuss the features we use in Section 5.3. Furthermore, we circumvent the usual
sparse data problem involved in probability models of lexicalized TAG by assuming
that the choice of the unlexicalized elementary tree and the lexical anchor are indepen-
dent. For substitution and adjunction probabilities, we assume that:

P(τe | τp, ue, up) = P(unlex(τe) | τp, ue, up) · P(anchor(τe) | unlex(τe), head(ur))     (9)

where ur is the root of the elementary tree that introduced up in the prefix tree,
unlex(τe) is the elementary tree τe without its lexical anchor (i.e., the tree template),
anchor(τe) is the lexical anchor of the elementary tree τe, and head(u) is the head of the
node u.

If the elementary tree τe is substituted or adjoined into the prefix tree, then up is
the integration site at which τe is added to the prefix tree τp, and thus head(ur) is the
anchor of the elementary tree that introduced the integration site. On the other hand, if
the prefix tree is substituted or adjoined into the elementary tree, then up is the root of
the prefix tree τp, and head(ur) amounts to the head of the entire prefix tree that we have
seen so far; this may be a word (Peter in the first tree of Figure 4), or a syntactic category
(e.g., the lower VP node in the second tree of Figure 4). Note that if τe is an unlexicalized
prediction tree, anchor(τe) is undefined; in these cases, τe = unlex(τe).
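
Schematically, the factorization in Equation (9) can be read as the following lookup (a sketch of ours; tree_template_prob and anchor_prob stand in for the two estimated distributions, and introducing_root(u_p) returns the root of the elementary tree that introduced u_p):

    def integration_prob(elem_tree, prefix_tree, u_e, u_p):
        # P(tau_e | tau_p, u_e, u_p) is estimated as the probability of the
        # unlexicalized tree template times the probability of its lexical
        # anchor, conditioned on the head of the tree that introduced the
        # integration site.
        template = unlex(elem_tree)
        p = tree_template_prob(template, prefix_tree, u_e, u_p)
        if anchor(elem_tree) is not None:   # undefined for unlexicalized prediction trees
            p *= anchor_prob(anchor(elem_tree), template, head(introducing_root(u_p)))
        return p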


For verification, we condition the choice of the lexical anchor of the verification
tree on the head of the node upredict, that is, the root or integration site of the tree that
was combined with τpredict when τpredict was added in the derivation. For instance, in
the last step of Figure 4, head(upredict) is Peter, because the prediction tree was added to
the derivation by substituting the elementary tree for Peter into it. This replicates the
situation in a non-incremental, top–down TAG derivation, in which the verification tree
would have been directly integrated with upredict without the intermediate prediction
step. Thus we factorize the bilexical probability as follows:

P(τv | τpredict, upredict) = P(unlex(τv) | τpredict) · P(anchor(τv) | unlex(τv), head(upredict))     (10)

In order to calculate the probability of a derived tree using this probability model, each
node is visited twice during the derivation, once from the left side and once from the
right side. At each step, a substitution,2 an adjunction, or no adjunction can happen,
and the probability of the event is factored into the derivation probability.

4. The Parsing Algorithm

Now that we have defined PLTAG, we can address the question of how PLTAG can be
parsed efficiently. Parsing, in the context of PLTAG, means finding the most probable
incremental derivation whose yield is the given input string.

A naive parser might proceed as follows. It starts with a canonical tree for the first
word as the initial prefix tree. Then it nondeterministically adds an arbitrary number
of prediction trees (or possibly none at all) using substitution and adjunction. Third,
it nondeterministically adds a canonical tree for the second word using substitution,
adjunction, or verification. It repeats the second and third step until it either gets stuck
(no suitable canonical tree for the next word is available) or it has found a complete
derivation for the input sentence.

An actual implementation of this algorithm must perform a search that spells out
all the nondeterministic choices that the algorithm makes. An unconstrained implemen-
tation is therefore prohibitively slow. The naive algorithm shares this problem with
all other incremental parsing algorithms that maintain connected partial structures.
In particular, parsing algorithms based on bottom–up dynamic programming are not
applicable because they necessarily compute constituents for substrings that are not
prefixes. A parser for PLTAG faces the additional challenge that the use of predictive
trees adds a large amount of non-determinism to the parsing task, as most predictive
trees are not lexically anchored in the current input, and their use is therefore less
constrained.

We will now describe how the naive parser can be refined into an incremental parser
for PLTAG that is efficient enough for practical use. We first introduce the concept
of fringes of trees in Section 4.1, and describe their use in an efficient schema for
tabulating parse items (Section 4.2). Then we will show how to use supertagging to
control the use of prediction trees (Section 4.3). We will focus on the parsing algorithm
from an implementation perspective here. The rule schema that our parser implements
is formalized in Appendix A.

2 Substitution can only happen if the node being considered is a substitution node.


Figure 6
Substitution in an incremental derivation. Substitution into the second substitution node (b)
leads to an invalid prefix tree. The dashed line indicates the current fringe.

4.1 Fringes

The crucial insight for efficient PLTAG parsing is that an incremental derivation is
highly constrained in the nodes at which it may integrate a canonical tree with a prefix
tree. Consider, for instance, the situation of Figure 6. The prefix tree has two substitution
nodes, for B and C. Only substitution into B, however, leads to a valid prefix tree; if we
substitute into C, we obtain the tree in Figure 6(b), which does not represent a prefix
of the input string. A similar situation occurs for adjunction, as in Figure 7. We may
adjoin an auxiliary tree β1 whose foot node is the leftmost leaf into the left A node;
the result is shown in (a). Alternatively, we may adjoin an auxiliary tree β2 whose foot
node is the rightmost leaf into the right A node, to obtain (b). If we try to adjoin β1 into
the right A node, we get (c), which is not a prefix tree. If we try to adjoin β2 into the
left A node, the result is (d); this is a tree whose first two leaves are lexical, but we have
processed the second word before the first in the derivation. Thus, only the steps (a) and
(b) are valid.

We can generalize this observation using the concept of fringes. Imagine a depth-
first, left-to-right traversal of an (elementary or prefix) tree; this produces a sequence
σ of nodes of the tree in which each node occurs exactly twice, once when it is first
visited and once when it is left after its entire subtree has been processed. If the node
is a leaf, its “downward” and “upward” occurrences are adjacent in σ. We can cut σ
into components at these points. A fringe is a maximal subsequence of σ that starts at
the downward visit of the root or the upward visit of a leaf, and ends at the upward
visit of the root or the downward visit of a leaf. Consider, for example, the second prefix
tree in Figure 4. Abusing node labels to indicate nodes to simplify the presentation, the

traversal sequence σ is S+ NP+ Peter+ Peter− NP− VP+ VP− S−, where + indicates down-
ward visits and − upward visits (see Figure 8). This sequence consists of three fringes,
namely, S+NP+Peter+ (= root to first leaf), Peter−NP−VP+ (= first leaf to second leaf),
and VP−S− (= second leaf to root).

Figure 7
Adjunction in an incremental derivation. Only adjunction of auxiliary trees with leftmost foot
nodes into upward visits of the current fringe (a) or with rightmost foot nodes into downward
visits (b) result in valid prefix trees; the other two combinations are invalid (c, d).


Figure 8
A depth-first, left-to-right traversal of a tree consisting of three fringes. The second fringe is
current.

All substitution and adjunction operations that are applied to a (prefix or elemen-
tary) tree τ in a successful incremental derivation must use a node on the current fringe
of τ. This is the fringe that starts with the upward visit of the rightmost lexical leaf that
has no markers. Consider the examples in Figure 6; the current fringe of the original
prefix tree is indicated by a dashed line. Here the first substitution node is at the end
of the current fringe, and is therefore available for substitution. The second substitution
node is not on the current fringe, and can therefore not be used at this point. For the
adjunctions in Figure 7, we must distinguish further between the “upward” and the
“downward” part of the current fringe. This distinction is important because the current
fringe contains both an upward visit of the left A, which we may right-adjoin into (see
Figure 7(a)), and a downward visit of the right A, at which we may left-adjoin (see
Figure 7(b)). Right-adjoining into a downward visit, as in Figure 7(c), or left-adjoining
into an upward visit, as in Figure 7(d), is not allowed, and the downward visit to the
left A and the upward visit to the right A are not on the current fringe.
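
For concreteness, here is a small sketch of ours that computes the fringes of a tree from its depth-first traversal (node objects are assumed to expose a children list):

    def traversal(node):
        # Depth-first, left-to-right; each node is visited once downward
        # and once upward.
        yield (node, "down")
        for child in node.children:
            yield from traversal(child)
        yield (node, "up")

    def fringes(root):
        # A fringe ends at the downward visit of a leaf (or the upward visit
        # of the root); the next fringe then starts at that leaf's upward visit.
        result, current = [], []
        for node, direction in traversal(root):
            current.append((node, direction))
            if direction == "down" and not node.children:
                result.append(current)
                current = []
        if current:
            result.append(current)
        return result

Applied to the second prefix tree of Figure 4, this yields the three fringes listed earlier, of which the one starting at the upward visit of Peter (the rightmost unmarked lexical leaf) is current.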

Substitution and adjunction of prediction trees is not constrained in the same way.
A derivation step that substitutes an unlexicalized prediction tree into the C substitution
node of Figure 6 would produce a prefix tree, because the derivation step does not
contain lexical leaves that the original prefix tree did not. The prediction tree might
even contain a lexical leaf for some word wk: Because all nodes in a prediction tree carry
markers, the resulting tree would be a valid prefix tree, and eventually the marker on
the wk leaf would be removed by a verification step at the right time. Nevertheless, we
may still assume that prediction trees are only integrated with the prefix tree at a time
when the integration site is on the current fringe, because an integration to the right of
the current fringe does not affect the operations that can be performed to the current
fringe. For the purposes of the following parsing algorithm, we therefore assume that
prediction trees are only integrated at the current fringe as well.

Finally, verification is a more global operation whose applicability cannot be deter-
mined just from the current fringe, because the marked nodes of the matched prediction
tree may be dispersed throughout the prefix tree. A verification step for the marker i
will only be successful, however, if the lower half of the last node on the current fringe
is marked with i. Thus the current fringe can at least provide a necessary condition for
filtering out obviously useless verification operations.

4.2 Chart Parsing for PLTAG

Our parser exploits fringes to tabulate intermediate results. It manipulates a chart-like
data structure with two dimensions: the index of the word up to which we have parsed
the sentence, and the current fringe. The cell for (i, f ) contains all the prefix trees whose
first i leaves are the first i words (without markers) and whose current fringe is f .
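
The organization of this data structure can be pictured as follows (our own schematic, with hypothetical helpers lexicon.canonical_trees, current_fringe_of, candidate_elementary_trees, and integrate; fringes are assumed to be hashable):

    from collections import defaultdict

    # chart[(i, fringe)] = set of prefix trees spanning words[0:i]
    # whose current fringe is `fringe`
    chart = defaultdict(set)

    def initialize(chart, lexicon, first_word):
        for tree in lexicon.canonical_trees(first_word):
            chart[(1, current_fringe_of(tree))].add(tree)

    def extend_column(chart, i, next_word, lexicon):
        # Combine every prefix tree parsed up to word i with elementary trees
        # licensed by its current fringe, and store the results under their
        # new fringes in column i + 1.
        for (j, fringe) in [key for key in chart if key[0] == i]:
            for elem in candidate_elementary_trees(fringe, next_word, lexicon):
                for prefix in chart[(i, fringe)]:
                    for result in integrate(prefix, elem, fringe):
                        chart[(i + 1, current_fringe_of(result))].add(result)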

The parser successively computes all chart entries for i from 1 to the sentence length.
It starts by initializing the chart column for 1 with all canonical trees for the first word.


The current fringe of each of these trees is the sequence from the first to the second leaf,
so each of these trees is stored under that fringe.

To extend the prefix trees for i to prefix trees for i + 1, the parser then retrieves all
current fringes f such that the chart has entries in the cells (i, f ). For each such fringe,
it determines the canonical trees that can be combined with f using substitution or
adjunction, based on the heuristics outlined earlier. The prefix tree is integrated into
the canonical tree if f ends in the root; otherwise the canonical tree is integrated into
the prefix tree. In this way, the test whether a substitution or adjunction is possible can
be done efficiently, for all prefix trees in (i, f ) at once, simply by comparing f with the
current fringes of the canonical trees in the lexicon. For those canonical trees that can
be combined with f , the parser computes the resulting prefix trees, determines their
probabilities, and enters each resulting prefix tree with current fringe f′ into the cell
(i + 1, f′).

Determining and performing the applicable ways of combining the entries of (i, f )
with prediction trees proceeds in exactly the same way, except that the parser is re-
stricted to never combine the prefix tree with two prediction trees in a row. Strictly
speaking, this makes it incomplete for the problem of computing PLTAG derivations:
There can be PLTAG derivations that can only be obtained by combining a prefix tree
with two or more prediction trees in a row. This restriction keeps the parser from
guessing prediction trees in a completely uncontrolled fashion; unlike canonical trees,
prediction trees are not necessarily licensed by a word in the input string and could
therefore be added arbitrarily. In practice, the restriction to single prediction steps is
counterbalanced by our extraction of “pre-combined” prediction trees from the treebank
(see Section 5.1), which can contain nodes from multiple canonical trees at once.

Finally, the parser also attempts a verification step for all prefix trees in (i, f ) for
which the last node on the current fringe carries some marker k. It looks up the pre-
diction tree that introduced the marked node and determines all verification trees that
match this prediction tree and whose lexical anchor is the i + 1-st word of the sentence.
These verifications are then applied to the prefix tree, and the results are added to the
appropriate cells as before.

In order to increase parsing speed, the parser uses beam search, which only retains
the most likely prefix trees (see Section 6) by pruning both unlikely chart entries and
unlikely analyses within a chart entry.

4.3 Supertagging for Prediction Trees

A major remaining source of complexity is the fact that (unlexicalized) prediction trees
can be combined freely with prefix trees, which creates a great number of new prefix
trees: At each prediction step, thousands of prediction trees can potentially be combined
with all prefix trees; this is computationally not feasible. Non-incremental parsers,
which do not use the unlexicalized prediction trees, have to deal with the much lower
level of ambiguity among canonical trees (about 50 trees per word on average if using a
lexicon the size of our canonical lexicon).

In our parser implementation, we use supertagging to select only the best pre-
diction trees in each step, which reduces the search space considerably. Supertagging
(Bangalore and Joshi 1999) is a common approach used in the context of TAG and CCG
parsing; the idea is to limit the elementary trees for each word to those that are evaluated
highly by some shallow statistical model. We only use supertagging for prediction trees;
for canonical trees, we use all (lexicalized) trees that the grammar contains for the word
(rare words are replaced by “UNK”).

Because our parser must run incrementally, the supertagger should not be allowed
to have any look-ahead. We found, however, that not having any look-ahead has a
detrimental impact on supertagging quality, and subsequently on parsing accuracy. We
therefore allow the supertagger a restricted look-ahead. The supertagger estimates the
usefulness of a prediction tree τpredict based on a current fringe of the prefix tree fp and
the POS tag of the next word twi+1. Note that the parser has at that point already read
and fully processed word wi, and its task is to choose possible prediction trees for the
prediction step just before processing word wi+1. Knowing the POS tag of that next
word wi+1 does not compromise incrementality if the POS tag is determined without
any further look-ahead.3 It does, however, make the interpretation of prediction weaker:
Predictions for maintaining a fully connected structure are only made once the identity
of the next word is known and its POS tag has been estimated.

The probability model of the supertagger is parametrized as follows:

\sum_{\tau_{predict}} P(\tau_{predict} \mid f_p, t_{w_{i+1}}) = 1                    (11)

where   P(\tau_{predict} \mid f_p, t_{w_{i+1}}) = P(\tau_{predict} \mid f_{predict}, sl_{predict}) \, P(f_{predict}, sl_{predict} \mid f_p, t_{w_{i+1}})

In order to reduce data sparsity, we independently estimate the probability of a
particular prediction tree given its first fringe fpredict and category of the leaf node on the
spine slpredict, and the probability of some tree with first fringe fpredict and category of the
leaf node on the spine slpredict given a prefix tree with current fringe fp and estimated
POS tag of the next word twi+1. A further simplification is that we represent the current
fringes fpredict and fp as alphabetically ordered sets of the categories occurring on them.
The reasoning behind this decision is that the order of nodes is less important than the
identity of the nodes as possible integration sites. The supertagging model is smoothed
with the procedure described by Brants (2000), as it yielded better results than Witten-
Bell smoothing (which suffers from data sparsity in the supertagging task). We use
one level of back-off where we estimate P(f_{predict}, sl_{predict} \mid f_p, t_{w_{i+1}}) based only
on the most likely integration site n_p instead of the whole fringe f_p:

\max_{n_p} P(f_{predict}, sl_{predict} \mid n_p, t_{w_{i+1}})                    (12)

The reason for backing off to the most probable integration site is that a fringe
with more unique categories should not have a lower probability of a particular tree
adjoining into it than a fringe containing the same category, but fewer other categories.
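The factored estimate in Equations (11) and (12) can be sketched as follows; the plain
dictionaries stand in for the count-based probability tables, and all names are our own
assumptions rather than the authors' implementation:

    def supertag_score(template, f_pred, sl_pred, fp, next_pos,
                       p_tree_given_shape, p_shape_given_fringe, p_shape_given_site):
        # Score a prediction tree given the current fringe of the prefix tree (fp) and
        # the POS tag of the next word, following the factorization in Equation (11).
        shape = (f_pred, sl_pred)                     # first fringe + spine-leaf category
        fringe_key = (tuple(sorted(fp)), next_pos)    # fringes reduced to sorted category sets
        if fringe_key in p_shape_given_fringe:
            p_context = p_shape_given_fringe[fringe_key].get(shape, 0.0)
        else:
            # Back-off of Equation (12): use the most likely single integration site
            # on the fringe instead of the whole fringe.
            p_context = max((p_shape_given_site.get((site, next_pos), {}).get(shape, 0.0)
                             for site in fp), default=0.0)
        return p_tree_given_shape.get(shape, {}).get(template, 0.0) * p_context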

5. Treebank Conversion and Lexicon Acquisition

For lexicon induction and parameter estimation, our parsing model requires training
data, which we obtain by converting a standard treebank for English into TAG format.
The procedures used are based on well-established approaches in the TAG literature
(in particular Xia, Palmer, and Joshi 2000); we will therefore only give a brief overview
of treebank conversion and lexicon induction here; the reader is referred to
Demberg-Winterfors (2010) for full details.

3 In the case of the PLTAG parser, we first retrieve the elementary trees for the upcoming lexeme. If the
word occurred with more than one POS tag, we choose the POS tag with highest conditional probability
given the previous two POS tags.

Our PLTAG lexicon (both canonical trees and prediction trees) is derived from the
Wall Street Journal section of the Penn Treebank, complemented by noun phrase anno-
tation (Vadas and Curran 2007), and Propbank (Palmer, Gildea, and Kingsbury 2003),
as well as a slightly modified version of the head percolation table of Magerman (1994).
These additional resources are used to determine the elementary trees for a TAG lexicon,
following the procedures proposed by Xia, Palmer, and Joshi (2000). This involves first
adding noun phrase annotation to the Penn Treebank, and then determining heads
with the head percolation table, augmented with more detailed heuristics for noun
phrases.4 As a next step, information from Propbank is used to establish argument
and modifier status and to determine which lexical items should be encoded in the
same elementary tree (currently, this is restricted to particle verbs like show up and some
hand-coded constructions in which the first part is predictive of the second part, such
as either . . . or or both . . . and). Then we remove quotation marks, brackets, sentence-
final punctuation, and some of the traces from the Treebank. Finally, we heuristically
insert more structure into flat quantifier phrases, adding explicit right branching and
additional nodes wherever adjunction is possible. Furthermore, auxiliaries are assigned
a special POS tag in order to enable the lexicon induction algorithm to extract them as
auxiliary trees. Deviating from the standard TAG analysis, copula verbs are treated in
PLTAG as subcategorizing for two NPs and are therefore marked as copula verbs during
treebank conversion.

Given the converted trees, which are less flat and contain information about head-
edness and the argument/modifier distinction, we extract the canonical lexicon by
traversing the converted tree from each leaf up towards the root, as long as the top
node is the head child of its parent. If a subtree is not the head child of its parent, we
extract it as an elementary tree and proceed in this way for each word of the converted
tree. Given the argument/modifier distinction, we then create substitution nodes in the
parent tree or a root and foot node in the child tree.
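A simplified sketch of this extraction step is given below (it assumes converted-treebank nodes
that know their parent and head child; the attribute names are ours, not the authors' code):

    def extract_elementary_root(leaf):
        # Climb from a lexical leaf towards the root for as long as the current node
        # is the head child of its parent; the node where this stops is the root of
        # the elementary tree anchored by this leaf.  Non-head children encountered
        # along the way become substitution nodes (arguments) or are split off as
        # auxiliary trees (modifiers), as described in the text.
        node = leaf
        while node.parent is not None and node.parent.head_child is node:
            node = node.parent
        return node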

5.1 Creating the Prediction Tree Lexicon

After converting the treebank and extracting the canonical lexicon as described previ-
ously, the next step is to generate a prediction tree lexicon. The prediction tree lexicon
contains the prediction trees needed to derive the treebank sentences using PLTAG.
Remember that prediction trees are usually not lexicalized, and can in principle have
any (tree) shape. Only those prediction trees that are verifiable with a canonical tree can
yield valid PLTAG derivations, however.

As we explained in Section 3.4, verifiable prediction trees are prediction trees that
are identical to canonical TAG trees from the lexicon, except that subtrees below and to
the right of nodes on the spine may be missing. Other prediction trees are technically
allowed in the lexicon, but they can never be used in successful derivations. When
building prediction trees from canonical trees by removing subtrees at the bottom and
to the right, we must make a choice about what and how many subtrees are removed.
This design choice has implications for the granularity of prediction in the parser: For
example, if substitution sites to the right of the spine are included in a verbal prediction
tree, then this amounts to predicting the subcategorization frame of a verb we have not
yet seen. As there is not enough psycholinguistic evidence to determine the correct level
of granularity of prediction, we decided to take a conservative view, that is, we do not
predict any substitution nodes to the right of the spine. Similarly, there is only
psycholinguistic evidence for lexical prediction in very constrained contexts; we therefore
do not usually predict unary nodes at the bottom of the spine, and only predict lexical
anchors for a limited set of constructions, for example, either . . . or, which has been
explicitly implicated in prediction in experimental studies (see Section 2). For future
work, we plan to extend the prediction of lexical anchors to predictable parts of
collocations and idioms.

4 The head percolation table and the code for the NP heuristics are available at
http://www.coli.uni-saarland.de/∼vera/page.php?id=corpora. The PLTAG-converted version of
the Penn Treebank is also available from this URL.

Prediction trees are learned directly from the converted version of the Penn Tree-
bank by calculating the connection path (Mazzei, Lombardo, and Sturt 2007) at each
word in a tree. A connection path for words w1 . . . wn is the minimal amount of structure
that is needed to connect all words w1 . . . wn into the same syntactic tree. We use the
connection paths and the knowledge of how the sentence tree can be decomposed into
the canonical trees to determine which parts of the structure need to be included in
the connection path for w1 . . . wn, but are not part of any of the elementary trees with
anchors w1 . . . wn. These remaining nodes need to be contained in a prediction tree (see
Figure 9, in which prediction trees are necessary for connectedness at words 2 and 4).
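The following self-contained sketch (a toy tree, not the grammar of Figure 9) illustrates how
a connection path can be computed from parent pointers; nodes on the connection path that are
not contributed by any elementary tree anchored in w1 . . . wk are exactly the ones that must
come from prediction trees:

    # Minimal sketch: the tree is given by a parent dictionary, and the connection
    # path for a set of leaves is the union of the paths from each leaf up to their
    # lowest common ancestor.

    def ancestors(node, parent):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def connection_path(leaves, parent):
        paths = [ancestors(leaf, parent) for leaf in leaves]
        common = set(paths[0]).intersection(*map(set, paths[1:])) if len(paths) > 1 else set(paths[0])
        lca = next(n for n in paths[0] if n in common)   # lowest common ancestor
        nodes = set()
        for p in paths:
            for n in p:
                nodes.add(n)
                if n == lca:
                    break
        return nodes

    # Toy example loosely inspired by "the cunning fox ..." (invented structure):
    parent = {"the": "NP", "cunning": "ADJ", "ADJ": "NP", "fox": "NP", "NP": "S", "VP": "S"}
    print(sorted(connection_path(["the", "cunning"], parent)))   # ['ADJ', 'NP', 'cunning', 'the']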

The nodes necessary to achieve connectedness can belong to one or more ele-
mentary trees with anchors beyond wn. Because we restricted our parser to only use
one prediction tree per step, we need to generate a “pre-combined” prediction tree
whenever nodes originating from more than one elementary tree are needed to achieve
connectedness. For example, if we changed the sentence from Figure 9 to the cunning
fox very often lures rabbits, we would have to predict nodes from both the adverb tree
and the verbal tree when processing very, resulting in the extraction of a pre-combined
prediction tree, as shown in Figure 10. Nodes originating from different elementary
trees have distinct markers to indicate which of them should be verified in a single step.
In particular, the node at the integration site of two prediction trees has two different
markers, as its upper and lower halves originate from different elementary trees. In our
example, the two VP nodes in Figure 10 have different top and bottom indices because
the auxiliary prediction tree based on the auxiliary tree for often adjoined into the VP node
of the prediction tree, which is generated from the lures elementary tree.

Figure 9
Illustration of connection paths and prediction trees generated from the syntactic tree.

Figure 10
Lexicon entry for very and the precombined prediction tree that would be needed in order to
integrate the word very with the prefix the cunning fox.

Figure 11
Left recursion: Prefix tree (after integrating Peter) needed to parse the sentence Sarah loves Peter’s
father’s . . . books.

There are cases where more than two prediction trees will have to be combined,
and in principle, due to left recursion, the number of prediction trees that are needed
to form a precombined prediction tree is unbounded. Consider the example of parsing
the sentences Sarah loves Peter’s books, Sarah loves Peter’s father’s books, Sarah loves Peter’s
father’s neighbor’s books, and so on; a schematic example is shown in Figure 11. Before
integrating Peter, we would need to predict a precombined prediction tree for every
possible embedding depth of Peter, thus requiring an infinitely large prediction tree
lexicon.5 We therefore restrict our prediction tree lexicon to precombined prediction
trees that we have observed during training. This bounds the embedding depth that
the induced grammar can use. The grammar can still reconstruct all derivations in the
training corpus incrementally, however.

5 Note that such left-recursion examples could be handled in alternative ways, for example, by changing
them to right-recursive structures (as suggested in Mazzei 2005). Using right-recursive structures to
avoid problems with left-recursion, however, leads to elementary trees that are not well motivated
linguistically.

Note also that the Penn Treebank does not contain any sentence for which more
than five prediction trees need to be pre-combined in order to allow for incremental
processing—and there were just four instances of such large pre-combined prediction
trees; more than 99% of prediction trees extracted from the Penn Treebank are based on
three or fewer elementary trees.

5.2 Lexicon Extraction Statistics

Our conversion algorithm extracted 6,700 tree templates from Sections 2–21 of the Penn
Treebank.6 The grammars extracted by Chen (2001) and Xia, Palmer, and Joshi (2000)
are of similar size: They extracted 3,000–8,600 tree templates, depending on parameters
such as number of categories, treatment of traces/null elements, punctuation, head
percolation rules, and modifier–argument distinction. Our lexicon is fairly large as
it contains traces, null elements, and sentence-internal punctuation. Furthermore, it
contains some trees with several anchors, to account for constructions such as either
. . . or or particle verbs like show . . . up. In such trees with more than one anchor, all but
the leftmost anchor are predicted (see Demberg-Winterfors [2010] and Kaeshammer and
Demberg [2012] for more details). The size of the prediction tree lexicon we extracted is
2,595 trees.

A smaller grammar was extracted by Chiang (2000) (2,104 tree templates). The
smaller lexicon can be attributed to the decision to use sister adjunction instead of
normal adjunction. Chiang furthermore removes traces and null elements, and does
not extract multi-anchored trees. Smaller lexicons lead to less data sparsity and faster
parsing performance, as there is less ambiguity between the elementary trees.

There were some sentences in the Penn Treebank that we could not automatically
convert to PLTAG format. This affected about 1.8% of sentences. Failure to convert sen-
tences was due to incorrect annotation in the original treebank, as well as fragmentary
or ungrammatical sentences. In 0.4% of cases, conversion failure was due to a modifier
occuring between two arguments. Similar rates of unsuccessful conversion have been
reported, for example, by Hockenmaier and Steedman (2007).

5.3 Features and Parameter Estimation

As explained in Section 3.6, we approximate the probability distributions for substitu-
tion, adjunction, and verification using features of the prefix tree, the elementary tree,
and the integration site. These features are shown in Table 1.

The presentation of the features relies on the notation from Sections 3.5 and 3.6. The
node up is the node in the prefix tree τp at which the elementary tree τe was integrated;
ur is the root of the elementary tree that introduced up, and head(ur) is the head of
that node. There are features that use the lexeme at the node head(ur) directly, if it is a
lexical leaf; other features use the syntactic category of the parent of head(ur), which will
typically be the part-of-speech tag of that word. If up was introduced by an unlexicalized
prediction tree, head(ur) may not be a lexical leaf (an example is the second tree in
Figure 4); in this case, we use the syntactic category of head(ur) both for the lexeme
and the category (in the example, this is “VP”).

6 A tree template is an elementary tree without its lexical anchor (e.g., [DT the] and [DT a] belong to the
same template).

Table 1
Features used in the PLTAG parser with back-off levels.

Features for P(unlex(τ)|τp, ue, up) (used in substitution and adjunction), with back-off levels l1–l6:
- tree template that contributed up
- lexeme of head(ur)
- syntactic category of head(ur)
- position of the integration site within its elementary tree
- syntactic category of the integration site
- is the beginning or end of the current fringe a trace?
- category of the leftmost leaf below up
- category of the rightmost leaf to the left of up
- if adjunction: position of the integration site among alternatives

Features for P(unlex(τv)|τpredict) (used in verification), with back-off levels l1–l3:
- tree template of the matched prediction tree τpredict
- is the beginning or end of the current fringe a trace?

Features for P(anchor(τ)|unlex(τ), head(u)), with back-off levels l1–l4:
- tree template, unlex(τ)
- syntactic category of τ’s anchor node
- lexeme of head(u)
- syntactic category of head(u)

Two of the features traverse the entire prefix tree to find the leftmost leaf below up
and the leaf directly to its left. If up dominates the first leaf of the prefix tree, then there
is no leaf directly to the left; in this case, the second feature takes a null value. In the case
of adjunctions, one feature inspects the position of the integration site among all other
nodes in the tree that could have been used as adjunction sites for the same auxiliary
tree. This allows us to distinguish high and low attachment.

The probability models are now obtained via maximum likelihood estimation from
the training data. Many of the substitution and adjunction events are seen rarely or not
at all with their full contexts, which indicates the need for smoothing. We use back-off
with deleted interpolation, as detailed in Table 1. The weight for each of these contexts
is automatically determined by a variant of Witten-Bell smoothing, which calculates
a weight for each of the back-off levels for each context (Witten and Bell 1991). We
implemented the version described by Collins (2003). For the verification operation,
data sparsity for the probability of the tree template τv is less of an issue because the
probability of a tree template verifying a prediction tree is conditioned only on the
identity of the prediction tree and the trace feature.
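As an illustration of the smoothing scheme, the sketch below follows the general shape of the
Witten-Bell variant described by Collins (2003); the constant and the table layout are our own
assumptions, not the exact implementation:

    def witten_bell_weight(context_count, distinct_outcomes, c=5.0):
        # Confidence in the current back-off level, computed from how often its
        # context was observed and how many distinct outcomes it produced.
        if context_count == 0:
            return 0.0
        return context_count / (context_count + c * distinct_outcomes)

    def interpolated_probability(levels):
        # levels = [(p_ml, context_count, distinct_outcomes), ...],
        # ordered from the most specific back-off level (l1) to the least specific.
        prob, remaining = 0.0, 1.0
        for p_ml, count, distinct in levels:
            lam = witten_bell_weight(count, distinct)
            prob += remaining * lam * p_ml
            remaining *= 1.0 - lam
        return prob

    # Example: an event seen rarely at l1 but very often at l3 relies mostly on l3.
    print(interpolated_probability([(0.0, 2, 2), (0.1, 20, 10), (0.3, 5000, 40)]))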

6. Evaluation

In order to compare the PLTAG parser to other probabilistic parsers, we evaluated
parsing accuracy on the Penn Treebank (PTB). We first converted the PTB into a PLTAG
treebank as described in Section 5. We then trained the parser on Sections 2–21 of
the Penn Treebank and evaluated it on Section 23; only sentences of length 40 or less
were used for evaluation. It is important to note that because the parser is trained and
evaluated on a converted treebank, its accuracy is not directly comparable to parsers
that run directly on the PTB. We will discuss this point in more detail in Section 6.1.

For the results reported here, the beam width of the parser was set such that all
analyses whose log probability is less than the log probability of the best analysis minus
8 are removed (our implementation uses the natural logarithm). The beam width was
set using the development set (Section 0 of the PTB). Furthermore, only the 250 best
analyses in a chart entry are maintained. The supertagger was set to select the best
20 prediction trees at each step.

We found that parsing accuracy is higher when using a probability model that
does not factor in the NONE-adjunction events (parsing accuracy decreases by about
1.5 percentage points in a model that takes these events into account). We believe
that this decrease in parsing performance is due to the low probability of adjunction
operations relative to substitution operations, which is a result of the normalization with
NONE events. The high occurrence of NONE adjunctions in our parser is a consequence
of the PLTAG conversion procedure, which leads to trees with more nodes overall, and
hence many nodes at which no adjunction happens. In what follows we will present
results for parsing with the probability model that does not include NONE-adjunction.

Coverage. We first evaluated the coverage of our PLTAG parser on the test set. The parser
found valid parses for 98.09% of the sentences in Section 23 within reasonable memory
usage (2 GB RAM). The reason for failing to parse a sentence can be that all valid parses
have fallen out of the beam, that an essential prediction tree was not selected by the
supertagger, or that no parse can be derived given the PLTAG lexicon acquired during
training.

A way of dealing with out-of-coverage sentences is to return a flat structure in
which all words are attached directly to the root node (Roark 2001). This way a coverage
of 100% is obtained, which facilitates the comparison between parsers. We also give
results for such a “full coverage” version of our parser in the following.

Parsing Accuracy. Table 2 gives the parsing results for the variants of the PLTAG model
that we evaluated. The full PLTAG probability model as described in Section 3.6
achieved an F-score of 79.41 with Witten-Bell smoothing, given the gold-standard POS
tags. When gold-standard POS tags are given, the algorithm only retrieves elementary
trees for a word which includes the correct POS tag, whereas it retrieves all elementary
trees for a word, and hence has a larger search space, when no gold-standard POS tags
are given. Without gold-standard POS tags, parsing performance drops to an F-score of
77.41. Table 2 also gives the F-scores for the full coverage version of the parser, which
are about 0.8 points lower.

Table 2
Parsing results for the PLTAG parser with gold standard POS tags. Prediction tree oracle =
correct prediction tree provided; No gold POS = PLTAG parser with no gold-standard
POS tags provided; full cov = flat structure returned for out-of-coverage sentences.

Model                     Precision   Recall   F-score   Coverage   F-score (full cov)
PLTAG parser                  79.43    79.39     79.41      98.09                78.65
No gold POS                   77.57    77.24     77.41      98.09                76.64
Prediction tree oracle        81.15    81.13     81.14      96.18                80.08

Error Analysis. An aspect in which our parser fundamentally differs from other parsers is
its use of prediction trees. We evaluated the impact of using prediction trees on parsing
performance, by running the parser in an oracle scenario where only the correct pre-
diction trees were given (the parser is, however, not forced to use them; the probability
model can still assign a low probability to an analysis with the correct prediction tree).
Table 2 includes the result for the prediction tree oracle condition. In this condition,
we observed an increase of 1.7 points in F-score; the modest improvement indicates
that our supertagger selects good prediction trees most of the time. The full coverage
version of the oracle parser performs about one point worse than the standard version
due to lower coverage in the oracle condition (the lower coverage in the oracle condition
is caused by the fact that no prediction trees other than the correct ones can be used). We
also tested the coverage of the prediction tree lexicon on our test data: In order to
obtain the correct parse, a prediction tree that the parser has observed less than five
times during training is needed in 1.6% of all instances where some prediction tree
is needed.

It is important to note that not using the supertagger slows down the parser con-
siderably. In fact, it is not possible to run the parser without the supertagger on a 16 GB
RAM machine without making the beam so small that it is no longer useful. In order to
nevertheless evaluate the loss in accuracy through supertagging of prediction trees, we
ran the parser with and without supertagger on WSJ10, the portion of the test set that
includes only sentences of 10 words or less. The F-score on WSJ10 was 85.82 without the
supertagger, and 85.18 with the supertagger, indicating that supertagging for prediction
trees incurs a small loss in parsing accuracy compared to full parsing.

We then compared for each sentence of the test set the probability of the best
analysis with the probability of the correct analysis and found that in 55% of the cases,
the probability model assigns a higher probability to an incorrect analysis, meaning
that a large proportion of the errors can be attributed to the probability model. In the
remaining 45% of cases, the best analysis found by the parser has a lower probability
than the correct one. This could be due to an incomplete lexicon, supertagging errors,
or the beam being too narrow. We first quantified the loss in parsing accuracy arising
from missing lexicon entries: We compared parser performance for a version of the
lexicon obtained from the training data and a version of the lexicon that additionally
included all trees needed to parse a test sentence. This only improved parsing F-score
by 0.5 points.

Supertagging errors are the next potential cause for the parser returning an analysis
that has a lower probability than the correct one. Our analysis, however, based on a
prediction tree oracle showed that these errors are a fairly minor problem, leading
to a decrease of 1.7 points in overall F-score. Therefore, it seems likely that a large
proportion of errors is due to the correct analysis falling out of the beam during parsing.
Although our beam width is similar to the beam widths used in other parsers, it may
be insufficient for a PLTAG parser. The reason for this lies in the strict incrementality:
The role of the prediction trees is essentially to list all the possible ways in which words
can be combined, whereas parsers that keep unconnected treelets can wait for more
evidence before connecting all parts; the space of analyses for a non-incremental parser
hence grows more slowly.

To further investigate the performance of the probability model, we evaluated the
probability model in terms of attachment decisions. We provided the parser with the
correct elementary trees, which amounts to assuming perfect supertagging for all trees
(not just for prediction trees). The F-score given perfect supertagging is 93.7%, indicating
that the best analysis often does not get the largest probability: in 41% of sentences,
an incorrect analysis gets assigned a higher probability than the correct analysis.7

Figure 12
Aggregating different analyses by their current fringe significantly reduces the number of tree
combination operations that need to be performed at each word.

Effectiveness of Chart Parsing. As described in Section 4.2, partial analyses are aggregated
by their current fringes. This significantly reduces the number of tree combinations
that have to be calculated at each step, as shown in Figure 12. Aggregation only
becomes relevant after the first few words, but then the benefit of aggregating all
analyses with the same current fringe becomes substantial, reducing the number of tree
combination calculations by 80% on average with the beam search settings described
here. The benefit of this strategy can be expected to be even larger with less rigorous
pruning.

6.1 Comparison to Other Parsers

The PLTAG parser is trained and evaluated on a version of the Penn Treebank that
was first converted to a PLTAG format. This means that our results are not directly
comparable to parsers that reproduce the Penn Treebank bracketing, as our parser
produces deeper tree structures informed by Propbank and the noun phrase annotation
of Vadas and Curran (2007). It is not trivial to convert PLTAG structures back to PTB
structures. In lieu of a full conversion, we flattened both our PLTAG structures and the
original PTB structures such that a node cannot have a child with the same non-terminal
symbol. For an example, see Figure 9, where an NP is a child of another NP, and a VP is
the child of another VP. In the flattened representation, the NP under S has the children
DT, ADJ, and NN, and the VP under S has the children ADVP, VB, and NP instead of
ADVP and VP. We then evaluated the flattened output of our parser against the flattened
version of the original PTB test set (rather than using the PLTAG version of that test set).
For comparison, we applied the same procedure to the output of the Charniak parser
(chosen to represent a typical PTB parser), that is, we flattened its output and evaluated
it against the flattened version of the PTB test set. We found that F-score decreased by
two points in both cases.

7 Note that this figure is slightly lower than the 45% reported earlier, as the experiment uses perfect
lexical trees for parsing, rather than the actual lexicon derived from the training set.

In the following, we compare to other TAG parsers only; but even these compar-
isons should be taken with a grain of salt because these still differ in which variant
of the formalism they use (Lexicalized TAG, Spinal TAG, Incremental TAG). Table 3
gives the F-scores of existing TAG parsers and compares them along the hierarchy
of psycholinguistic requirements (incrementality, connectedness, prediction; see Sec-
tion 1). The formalism that comes closest to ours in terms of psycholinguistic properties
is that of Mazzei, Lombardo, and Sturt (2007), which has not been implemented or
evaluated on corpus data, however. Numerically, our results are comparable to those of
Kato, Matsubara, and Inagaki (2004). Their parser is incremental and builds connected
structures, but makes strong simplifying assumptions, such as failing to distinguish
modifiers and arguments. It uses gold POS tags as input and is not lexicalized. The
parsers of Chiang (2000) and Shen and Joshi (2005) achieve higher F-scores, but at
the cost of not being incremental (Chiang 2000) or not building connected structures
(Shen and Joshi 2005). Furthermore, Shen and Joshi use a look-ahead of two words,
which significantly weakens their incrementality claim. Note also that their F-score is
measured on dependencies rather than labeled bracketing.

6.2 Discussion

Differences in performance with other TAG parsers are likely due to the incrementality
restriction (incremental parsers generally have slightly lower performance), not doing
any supertagging for canonical trees, a large lexicon, and data sparsity in the probability
model. The latter is due to the large lexicon and the additional parser operation of
verification. A further effect of having separate prediction and verification steps is that
many bilexical dependencies are lost when prediction trees are integrated. Because
prediction trees are not necessarily lexicalized, the statistical model cannot condition
them on lexemes. To make up for this, it would be necessary to take into account
bilexical dependencies at verification time, but this is not part of the current probability
model. An improvement in parsing performance is likely to result from addressing this
shortcoming.

Table 3
Comparison of this work with other TAG parsers. impl = implemented model; incr =
incrementality; con = connectedness; pred = prediction; F = F-score; Shen and Joshi (2005)
evaluate on dependencies and use a look-ahead of two words.

Model                                    incr   con   pred   impl       F
Mazzei, Lombardo, and Sturt (2007)        +      +     +               n/a
This work (gold POS)                      +      +     +      +      78.65
Kato, Matsubara, and Inagaki (2004)       +      +            +      79.65
Shen and Joshi (2005)                    (+)                  +     (87.4)
Chiang (2000)                                                 +       86.7

7. Psycholinguistic Evaluation

Although we have focused on the computational issues of PLTAG parsing so far, a key
motivation behind our incremental parser is to develop a more realistic model of human
language processing. A treebank-based evaluation as in the previous section does not
directly provide evidence of psycholinguistic validity; however, a parser with good
coverage and high parsing accuracy is a prerequisite for an evaluation on eye-tracking
corpora, which Keller (2010) argues are the benchmark for models of human sentence
processing. In what follows, we report an evaluation study that uses our PLTAG parser
to predict human reading times, and compares its performance on this task to a standard
model based on surprisal (Hale 2001).

Surprisal assumes that processing difficulty is associated with expectations built
up by the sentence processor: A word that is unexpected given its preceding context is
harder to process. Mathematically, the amount of surprisal at word wi can be formalized
as the negative logarithm of the conditional probability of wi given the preceding words
in the sentence w1 . . . wi−1:

\mathrm{Surprisal}_{w_i} = -\log P(w_i \mid w_1 \ldots w_{i-1})                                        (13)

                         = -\log \frac{P(w_1 \ldots w_i)}{P(w_1 \ldots w_{i-1})}

                         = -\log \sum_{\tau_{p_{w_1 \ldots w_i}}} P(\tau_{p_{w_1 \ldots w_i}}) \; + \; \log \sum_{\tau_{p_{w_1 \ldots w_{i-1}}}} P(\tau_{p_{w_1 \ldots w_{i-1}}})

Here, P(τpw1…wi ) is the probability of the prefix tree τpw1…wi that spans the words
w1 . . . wi. If the surprisal at word wi is high, then wi should be difficult to process.
This manifests itself in elevated reading times, for example, in eye-tracking data (Boston
et al. 2008; Demberg and Keller 2008a; Frank 2009).
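In terms of the parser output, Equation (13) only needs the summed probabilities of the prefix
trees before and after reading wi (in practice, the prefix trees that survive in the beam); a
minimal sketch:

    import math

    def prefix_log_prob(prefix_tree_log_probs):
        # Log of the summed probability of all prefix analyses spanning w_1 .. w_i.
        return math.log(sum(math.exp(lp) for lp in prefix_tree_log_probs))

    def surprisal(log_probs_up_to_i, log_probs_up_to_i_minus_1):
        # Equation (13): the drop in prefix probability caused by reading w_i.
        return -(prefix_log_prob(log_probs_up_to_i) - prefix_log_prob(log_probs_up_to_i_minus_1))

    print(surprisal([-10.2, -11.5], [-7.1, -8.0]))   # higher value = more surprising word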

Surprisal can be estimated in two different ways: as lexical surprisal and structural
surprisal, following Demberg and Keller (2008a). We calculated lexical surprisal using
the prefix probabilities returned by the incremental probabilistic parser of Roark (2001).
Lexical surprisal takes into account the lexical items that make up a sentence prefix, and
is thus influenced by word frequency and by the probability of a word being assigned a
specific part of speech. This quantity will be referred to as the factor LEXICALSURPRISAL
in the following. This can be contrasted with structural surprisal (factor STRUCTURAL-
SURPRISAL), which uses unlexicalized parses, and is based only on the probability of
the syntactic structures assigned to a sentence prefix. Following Demberg and Keller
(2008a), we replaced each word in the training corpus with its part-of-speech tag and
then trained the Roark parser on this version of the corpus. The unlexicalized parser
obtained this way was run on the Dundee corpus and prefix probabilities were
obtained in the usual way to compute unlexicalized surprisal scores.

7.1 Estimating Processing Difficulty using PLTAG

Prediction Theory was proposed by Demberg and Keller (2008b, 2009) as a model
of processing difficulty in human parsing based on PLTAG. Prediction Theory, like
surprisal, returns word-by-word difficulty scores that can be tested against human
reading times. Unlike surprisal, however, Prediction Theory aims to formalize how
predictions about upcoming linguistic structure are made by the sentence processor.

To achieve this, it incorporates an explicit prediction and verification process, as well
as memory decay.

Prediction Theory estimates of processing difficulty can be obtained straight-
forwardly from the PLTAG parser presented in this article. The Prediction Theory
framework therefore directly addresses the psycholinguistic desiderata of incremental,
connected, and predictive processing that we argued for in Section 2. In addition, Pre-
diction Theory incorporates a linking theory that relates model quantities to behavioral
data (Demberg and Keller 2009). Specifically, there are two components that account
for processing difficulty in Prediction Theory. Firstly, surprisal is used to quantify
difficulty in terms of updates to the parser’s representation of possible analyses as the
sentence unfolds. Surprisal is calculated from the probability distribution over prefix
trees spanning the input as defined in Equation (13). This component of the model is
equivalent to Hale’s surprisal model, except that prefix tree probabilities are computed
using the probabilistic PLTAG parser proposed in this article, rather than a PCFG parser.
The second component of processing difficulty in Prediction Theory is verification
cost, which arises when previously predicted structures are checked against what is
actually encountered in the input. Verification cost depends on the probability of the
original prediction tree τpredict that is to be verified, as well as on how much this
prediction has decayed in memory:

\mathrm{VerificationCost}_{w_i} = -\log \sum_{\tau_{predict}} P(\tau_{predict})^{\left(1 - d^{\,t - t_{\tau_{predict}}}\right)}                    (14)

Here, t is the time at which the verification takes place and tτpredict is the time at which
τpredict was last accessed. For simplicity, we assume that t is the index associated with
wi, that is, t = i (more sophisticated approaches that take into account word length are
possible). By analogy, tτpredict is the index of the word that was processed when τpredict
was retrieved from the prediction tree lexicon and integrated into the prefix tree. The
decay constant d is a parameter of the model, which we set to 0.9 (the results reported
subsequently are stable across a large range of values of d).
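Under these assumptions, verification cost can be computed directly from the probabilities and
retrieval times of the verified prediction trees; a minimal sketch, with the decay term read as
an exponent on the prediction-tree probability as in Equation (14):

    import math

    def verification_cost(verified, t, d=0.9):
        # verified: list of (probability, t_retrieved) pairs for the prediction trees
        # that are verified at word index t; d is the decay constant.
        total = sum(p ** (1 - d ** (t - t_tau)) for p, t_tau in verified)
        return -math.log(total) if verified else 0.0

    # A prediction made long ago (strongly decayed) costs close to its full -log P;
    # a prediction made at the previous word is verified almost for free.
    print(verification_cost([(0.01, 1)], t=30))    # close to -log(0.01), about 4.6
    print(verification_cost([(0.01, 29)], t=30))   # much smaller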

The motivation for the verification cost component of Prediction Theory comes from
a large body of psycholinguistic evidence that shows that human sentence processing
incurs processing difficulty when new material has to be integrated into an existing
representation (for a review of this evidence, see Gibson [2000]; Lewis and Vasishth
[2005]). Integration cost occurs when the processor integrates a syntactic argument with
the head that the argument depends on (e.g., an NP argument is integrated with a verbal
head that has already been processed). The cost of this integration (e.g., measured as
elevated reading time) has been shown to depend on the type of integration and on the
distance between the argument and its head (see Gibson [1998], and much subsequent
work). Prediction Theory verification cost models this in terms of prediction difficulty
(i.e., the probability of the predicted element) and memory decay (i.e., distance between
prediction and verification). In the following analyses, Prediction Theory scores are
represented by the factor PREDICTIONTHEORY, and are computed as the sum of the
surprisal in Equation (13) and the verification cost in Equation (14).

To summarize, the central aim of Prediction Theory is to unify two types of pro-
cessing difficulty: The cost of updating syntactic representations (surprisal) and the cost
of integrating predicted structure (verification cost). These two components have so far
been observed and modeled separately in the psycholinguistic literature (Demberg and
Keller 2008a; Staub 2010). Given its unifying nature, Prediction Theory can be expected
to capture a wider range of experimental results than surprisal alone. In the following,
we will test this claim by evaluating Prediction Theory on the reading times in the
Dundee eye-tracking corpus.

7.2 Method

In order to test whether Prediction Theory or surprisal scores correlate with reading
times, we use linear mixed effects (LME) models (Pinheiro and Bates 2000). These
models can be thought of as a generalization of linear regression that allows the in-
clusion of random factors (such as participants or items) as well as fixed factors (e.g.,
word frequency or surprisal). The fixed factors can be discrete (such as whether the
previous word was fixated) or continuous (such as word frequency). When reporting
mixed models, we normally give the estimates of the coefficients β of the fixed factors
included in the model; these can be interpreted as the weights of the factors in the
model (though only coefficients of factors on the same scale can be compared directly).
In addition, each coefficient is associated with a standard error (SE), which expresses
the amount of variation in the estimate of that coefficient, and a t-value, which indicates
whether the coefficient is significantly different from zero. For the model as a whole,
we can measure the log-likelihood, which is an indicator of how well the model fits the
data. Two models can be compared by testing whether their log-likelihood values are
significantly different.

All predictors in our LME models were centered (i.e., the mean value of the pre-
dictor was subtracted from each instance) to reduce collinearity. We treat participants
as a random factor, which means that our models contain an intercept term for each
participant, representing the individual differences in the rates at which they read.
Furthermore, we include a random slope for the predictor of interest (Prediction Theory
and surprisal difficulty, respectively), essentially accounting for idiosyncrasies of a par-
ticipant with respect to the predictor of interest, such that only the part of the variance
that is common to all participants is attributed to that predictor.8

We built the models by initially including all predictors as fixed factors, as well as
all binary interactions between them. To obtain a minimal model, we then performed
stepwise removal of fixed factors that did not significantly improve model fit (based on a
χ2-test comparing model log-likelihoods), starting with the factor that improved model
fit least. This procedure ensures that a model is obtained that achieves the best fit to the
data with the minimum number of predictors. In the following, results tables give the
coefficients and significance levels for those predictors that remained in the minimal
models. No model selection was performed on the random effect structure, which
always included PARTICIPANT as random intercept, as well as random slopes under
PARTICIPANT for factors of theoretical interest (LEXICALSURPRISAL, STRUCTURAL-
SURPRISAL, PREDICTIONTHEORY).9

Before fitting LME models, we applied outlier removal: We computed the mean
reading time (over all items and participants), and then removed all data points that
deviated more than two standard deviations from the mean. This led to a loss of roughly
4% of data points. Outliers can strongly influence the results of analyses on the Dundee
corpus, as Roland et al. (2012) show. Furthermore, this way of trimming the data also
reduces the long tail of the reading time distribution, resulting in a distribution that is
closer to normal.

8 Other random factors that are appropriate for our analyses are word and sentence; due to the large
number of instances for these factors, however, we were not able to include them: The model fitting
algorithm we used (implemented in the R package LME4) does not converge for such large models.

9 Model selection on random effects has been shown in simulation studies to be anti-conservative
(Barr et al. 2013). Note that the models reported here are updated in this respect compared to the
results in Demberg-Winterfors (2010).

Data. For our psycholinguistic evaluation, we used the English portion of the Dundee
eye-tracking corpus (Kennedy and Pynte 2005), which contains 20 texts taken from The
Independent newspaper. The corpus consists of 51,502 tokens and 9,776 types in total. It is
annotated with the eye-movement records of 10 English native speakers, who each read
the whole corpus. The eye-tracking data was preprocessed following the methodology
described by Demberg and Keller (2008a). From this data, a range of reading time
measures can be computed for each word in the corpus. Here, we will only discuss first-
pass times (other measures give similar results). First-pass times are defined as the sum
of the duration of fixations from first entering the word from the left to leaving it (only
cases where no later words have been fixated are counted). Our statistical analyses are
based on actual reading times, and so we only included words that were not skipped.
Furthermore, all data points for which one of the values used in the regression is missing
(e.g., launch distance is not available for data points that are the first fixation on a line or
after track loss; Prediction Theory scores are not available for sentences that could not
be parsed, about 1.5% of data) are discarded for regression analysis. The resulting data
set for first pass times consisted of 159,378 data points.

7.3 Results and Discussion

The simplest evaluation is to fit an LME model in which the measure of interest is
included as a single predictor for observed reading times (including in the model
also a random intercept per subject and a random slope for LEXICALSURPRISAL or
PREDICTIONTHEORY, respectively, under subject). Under this approach, we find that
both LEXICALSURPRISAL and PREDICTIONTHEORY are significant positive predictors
of first pass reading times (β = 4.9; t > 9; p < 0.001 and β = 7.8; t > 9; and p < 0.001, respectively). PREDICTIONTHEORY provides better fit than LEXICALSURPRISAL, as evi- denced by the significantly larger log likelihood of the model. STRUCTURALSURPRISAL is not a significant predictor of reading times in such a model (and the log likelihood of a model including STRUCTURALSURPRISAL is even lower). It is possible, however, that the effect of a single predictor of interest can be ex- plained away by simpler predictors that are not of theoretical relevance, but are highly correlated with reading times. Therefore, we fitted full models that included predictors that relate to properties of the word, like word length in characters (WORDLENGTH), word frequency in log scale (WORDFREQUENCY), as well as word position in the sentence (WORDNOINSENTENCE). In order to account for spill-over effects, where processing difficulty on one word influences processing difficulty on the next word, we also include the frequency of the previous word (PREVWORDFREQ) and a binary flag indicating whether the previous word was fixated (PREVWORDFIXATED). Furthermore, we include predictors that relate to the reading process, such as the launch distance of the eye-movement in characters (LAUNCHDISTANCE), the squared relative landing po- sition of a fixation on a word in characters (LANDINGPOSITION), and bigram probability (BIGRAMPROB), which can be regarded as a simple version of surprisal. In addition, we included the binary interactions between these predictors. We found that the interaction between word length and word frequency (WORDLENGTH:FREQUENCY), as well as 1055 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 9 4 1 0 2 5 1 8 0 3 0 4 4 / c o l i _ a _ 0 0 1 6 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 39, Number 4 between word length and landing position (WORDLENGTH:LANDINGPOS) significantly improved model fit. Some of these predictors are correlated with the measures of interest, leading to collinearity in the models which cannot be removed by centering. Specifically PREDIC- TIONTHEORY is correlated with WORDLENGTH, WORDFREQUENCY, BIGRAMPROB, and PREVWORDFREQ after centering. We therefore residualize PREDICTIONTHEORY against these predictors. The residualized predictor RESIDPREDICTIONTHEORY then only accounts for that part of the prediction theory score that cannot be explained in terms of word frequency, bigram probability, and so forth. In analogy with this, we obtained RESIDLEXICALSURPRISAL by residualizing lexical surprisal against WORDLENGTH, WORDFREQUENCY, and BIGRAMPROB. RESIDBIGRAMPROB was resid- ualized with respect to WORDFREQUENCY, and RESIDSTRUCTURALSURPRISAL was residualized against BIGRAMPROB. All remaining correlations are smaller than 0.06, indicating an acceptably low level of collinearity. All models reported in Table 4 include random slopes for the predictor of interest. Whenever possible (no failure to reach convergence or indicated otherwise through increased correlation in fixed effects which points to an overspecification in random slopes), we also added a random slope for WORDLENGTH. Table 4 gives the final LME models for first pass times for the predictors of interest: RESIDPREDICTTHEORY (residualized prediction theory score), RESIDLEXICAL- SURPRISAL (residualized Roark lexical surprisal), and RESIDSTRUCTSURPRISAL (residualized Roark structural surprisal). 
We also ran separate models with the two components of Prediction Theory:
RESIDPLTAGSURPRISAL (residualized PLTAG surprisal) and PLTAGVERIFICATION
(PLTAG verification cost; no residualization necessary). The relevant scores were obtained
by computing only the surprisal scores or only the verification cost scores as defined by
Prediction Theory, according to Equations (13) and (14).

The results in Table 4 indicate that only residualized Prediction Theory is a significant
positive predictor of reading times. Both types of Roark surprisal fail to reach
significance.10 Model comparison reveals that all models reported in Table 4, with the
exception of the model including Roark lexicalized surprisal, significantly improve model
fit over a baseline model that does not include them.11

In the analysis of the two components of Prediction Theory, PLTAG surprisal and PLTAG
verification cost, we can see that the surprisal component is responsible for the effect of
Prediction Theory. PLTAG verification cost is not a significant predictor on its own, which
can be attributed to the fact that most verification cost values are zero, as no verification
takes place at most words. Effectively, therefore, verification cost does not make
predictions for the bulk of the words in the corpus, which explains why no significant
effect is observed overall. A similar observation has previously been reported for
integration cost (Gibson 1998; conceptually related to verification cost) on the Dundee
data (Demberg and Keller 2008a). It is important to note, though, that adding verification
cost to the baseline LME model increases model fit significantly, which provides some
evidence for the effectiveness of the verification cost component.

10 The result for Roark structural surprisal differs from that reported by Demberg and Keller (2008a) and
Demberg-Winterfors (2010). This can be attributed to the different outlier removal and more conservative
treatment of random effects in the present article.

11 Surprisal has subsequently been reported to be a significant predictor of Dundee reading time by
Fossum and Levy (2012), who used a context-free grammar induced using the state-split model of Petrov
and Klein (2007) in combination with a standard probabilistic Earley parser to compute surprisal estimates.
Predictor              Prediction Theory            PLTAG Surprisal     PLTAG Verif. Cost
                       β        SE       t          β        t          β        t
INTERCEPT              230.62   6.038    38.20***   230.76   37.95***   230.65   37.98***
WORDLENGTH               3.91   0.798     4.73***     3.95   32.94***     3.77   33.31***
WORDFREQUENCY           −9.54   0.214   −44.57***    −9.59  −44.64***    −9.59  −44.60***
PREVWORDFREQ            −3.81   0.151   −25.16***    −3.86  −25.39***    −3.87  −25.45***
PREVWORDFIXATED        −19.32   0.532   −36.26***   −19.35  −36.22***   −19.34  −36.19***
LAUNCHDISTANCE          −2.32   0.063   −36.63***    −2.38  −37.45***    −2.38  −37.38***
LANDINGPOSITION        −28.20   0.662   −42.59***   −28.23  −42.58***   −28.23  −42.58***
WORDNOINSENTENCE        −0.12   0.017    −6.60***    −0.11   −6.88***    −0.12   −6.90***
RESIDBIGRAMPROB         −2.86   0.187   −15.32***    −2.86  −15.25***    −2.82  −15.02***
WORDLEN:WORDFREQ        −0.82   0.066   −12.39***    −0.74  −11.19***    −0.68  −10.56***
WORDLEN:LANDPOS        −13.03   0.255   −50.99***   −13.05  −51.04***   −13.05  −51.06***
RESIDPREDICTTHEORY       0.34   0.155     2.22*
RESIDPLTAGSURPRISAL                                   0.37    2.27*
PLTAGVERIFICATION                                                        −2.87   −1.61

*p < 0.05, **p < 0.01, ***p < 0.001.

Predictor                 Lexical Surprisal           Structural Surprisal
                          β        SE       t         β        SE       t
INTERCEPT                 230.77   6.078    37.96***  230.73   6.075    37.98***
WORDLENGTH                  3.95   0.118    33.30***    3.95   0.118    33.32***
WORDFREQUENCY              −9.54   0.214   −44.46***   −9.57   0.214   −44.60***
PREVWORDFREQ               −3.87   0.152   −25.46***   −3.82   0.153   −24.88***
PREVWORDFIXATED           −19.33   0.534   −36.18***  −19.35   0.534   −36.22***
LAUNCHDISTANCE             −2.38   0.063   −37.42***   −2.38   0.063   −37.43***
LANDINGPOSITION           −28.24   0.663   −42.58***  −28.23   0.663   −42.58***
WORDNOINSENTENCE           −0.11   0.017    −6.69***   −0.11   0.017    −6.68***
RESIDBIGRAMPROB            −2.83   0.187   −15.08***   −2.79   0.189   −14.75***
WORDLEN:WORDFREQ           −0.67   0.065   −10.34***   −0.69   0.064   −10.81***
WORDLEN:LANDPOS           −13.05   0.255   −51.06***  −13.05   0.255   −51.04***
RESIDLEXICALSURPRISAL‡     −0.12   0.075    −1.64
RESIDSTRUCTSURPRISAL‡                                   0.25   0.209     1.24

*p < 0.05, **p < 0.01, ***p < 0.001. ‡ Inclusion does not significantly improve model fit.

To summarize, our implementation of the PLTAG incremental parsing algorithm allowed us to carry out a crucial test of Prediction Theory, that is, an evaluation using the naturally occurring reading data that the Dundee eye-tracking corpus provides. Due to the good coverage of the PLTAG parser, Prediction Theory estimates could be calculated for all but 1.5% of data points. Based on these estimates, we were able to show that Prediction Theory provides a better fit to the Dundee reading time data than a rival theory, namely, lexical and structural surprisal, as estimated using Roark’s incremental parser. As we have argued in Section 2, we cannot easily compare Prediction Theory to sentence processing models other than surprisal, as these are typically formalized but not implemented (e.g., Mazzei, Lombardo, and Sturt 2007), or are available only as small-scale implementations based on restricted data sets (e.g., McRae, Spivey-Knowlton, and Tanenhaus 1998; Lewis and Vasishth 2005).

7.4 Other Experimental Results

In previous work, Prediction Theory difficulty scores estimated using PLTAG have been shown to successfully account for individual experimental results in psycholinguistics.
Demberg and Keller (2009) showed that Prediction Theory can account for the relative clause asymmetry (Gibson 1998), that is, the fact that subject relative clauses are easier to process than object relative clauses. Prediction Theory provides an explanation for this fact in terms of higher verification cost for object relative clauses. Demberg and Keller (2009) also showed that Prediction Theory successfully models the either ... or effect (Staub and Clifton 2006): Coordinate structures involving either are easier to process than ones involving just or. Prediction Theory explains this in terms of the lexicon entry for either, which introduces a prediction tree for the whole coordinate structure.

8. Conclusion

This article presented a probabilistic parser for PLTAG, a psycholinguistically motivated version of TAG. Our parser is incremental, builds fully connected structures (partial structures contain no unattached nodes), and models prediction, that is, the anticipation of upcoming syntactic material, which is an important feature of human sentence processing. We proposed an efficient implementation of our parser based on fringes—data structures that indicate which nodes in a partial tree are available for the application of parsing rules. We trained our parser on a TAG-transformed version of the Penn Treebank and showed that it achieves broad coverage (98.09%) and an F-score of 79.41 (with gold-standard POS tags) for TAG structure recovery.

We argued that broad coverage and competitive parsing accuracy are essential properties when it comes to testing a psycholinguistic model on realistic data, including the eye-tracking corpora that have recently become the gold standard for psycholinguistic evaluation (e.g., Boston et al. 2008; Demberg and Keller 2008a; Frank 2009; Mitchell et al. 2010). We showed how our PLTAG parser underpins a theory of human sentence processing, Prediction Theory, and used Prediction Theory to derive estimates of human reading times. An evaluation on the Dundee eye-tracking corpus showed that Prediction Theory estimates achieve a better fit with human reading times than standard surprisal estimates computed using Roark’s incremental parser.

Appendix A: A Parsing Schema for PLTAG

In this appendix, we supplement the presentation of the parsing algorithm in Section 4 with a parsing schema that spells out in detail what parsing rules the parser applies. We present this schema in two steps. First, we explain a simplified version of the parsing schema in Section A.1. This simplified schema handles the different ways in which substitution and adjunction can be applied, but treats prediction trees like any other elementary tree. We then extend the parsing schema in Section A.2 to handle prediction trees correctly by adding markers to nodes when they are introduced, and removing the markers through applications of the Verification rule.

A.1 Parsing without Prediction Markers

The parsing schema manipulates parsing items that consist of a depth-first, left-to-right traversal of (prefix or elementary) trees, written as a sequence of downward or upward visits to the nodes of the tree, as explained in Section 4.1.
The beginning of the current fringe—that is, the position just before the downward visit of the root or just before the upward visit of a leaf—is marked in the parsing items with a dot, •. For instance, we represent the prefix tree on the left of Figure 6 by the parsing item S+ a+ • a− B↓+ B↓− C↓+ C↓− S−. Because the parser proceeds incrementally, we know that all fringes of a prefix tree before the current one can no longer be modified by processing later words. This means that the past fringes, that is, the part of the node sequence before the dot, will remain fixed until parsing finishes.

We use this notation to write the rules of the PLTAG parser as parsing schemata (Shieber, Schabes, and Pereira 1995) that manipulate dotted strings of downward or upward node visits. We will first set aside the management of markers and focus on substitution and adjunction. A parsing schema for these operations is shown in Figure A.1.

[Figure A.1: Provisional rules for incremental parsing with PLTAG (without markers); the figure states the Init, Scan, SubstDown, SubstUp, AdjDownR, AdjDownL, and AdjUp rules as inference rules over dotted node-visit sequences, each with a side condition restricting the form of canonical elementary trees.]

We write f+ (f−) to indicate that f must be a possibly empty string of downward (upward) node visits; a stands for a visit to a node labeled with the terminal symbol a; and A↓ and A∗ represent visits to substitution and foot nodes. Notice that parse items are sequences of upward or downward node visits. Any parse starts with an application of the Init rule, which initializes the prefix tree with a single elementary tree (represented by its node sequence f). Because we have not yet read any input, we place the dot before f. At any point at which the end of the current fringe (i.e., the next leaf of the prefix tree after the dot) is a terminal symbol a (usually right after a Subst or Adj rule has been applied), we can use the Scan rule to move the dot to just after the downward visit of a. The parse terminates successfully once we derive an item of the form w • f− that contains all words in the input string; that is, an item in which all that remains to do for the depth-first search is to return to the root along the right frontier of the prefix tree.

The main work of the parsing algorithm is carried out by the five parsing rules for substitution and adjunction, which are illustrated in Figure A.2. Each of these rules has two premises; the left premise is the prefix tree and the right premise is an elementary tree with which it is to be combined.
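To make the dotted node-visit representation concrete before walking through the individual rules, the following sketch encodes items as sequences of signed node visits plus a dot index and implements the Init and Scan rules described above. This is a simplified illustration in Python under our own encoding assumptions, not the authors' implementation.

    from dataclasses import dataclass
    from typing import List, Tuple

    # A node visit is a (label, direction) pair; direction is "+" for a downward
    # visit and "-" for an upward visit.  Substitution and foot nodes are written
    # with "↓" and "*" appended to the label, mirroring the A↓ / A* notation.
    Visit = Tuple[str, str]

    @dataclass
    class Item:
        visits: List[Visit]   # depth-first, left-to-right traversal of the prefix tree
        dot: int              # index of the first visit of the current fringe

    def init(elementary: List[Visit]) -> Item:
        """Init: start the parse with a single elementary tree; the dot precedes
        its entire node-visit sequence because no input has been read yet."""
        return Item(list(elementary), 0)

    def scan(item: Item, word: str) -> Item:
        """Scan: if the next leaf after the dot is the terminal `word`, move the
        dot past its downward visit, i.e., to just before its upward visit."""
        i = item.dot
        # skip f1-: the upward visits between the dot and the next downward path
        while i < len(item.visits) and item.visits[i][1] == "-":
            i += 1
        # skip f2+: downward visits to inner nodes, until the next leaf is reached
        # (a leaf's downward visit is immediately followed by an upward visit)
        while (i + 1 < len(item.visits) and item.visits[i][1] == "+"
               and item.visits[i + 1][1] != "-"):
            i += 1
        if i < len(item.visits) and item.visits[i] == (word, "+"):
            return Item(item.visits, i + 1)
        raise ValueError("Scan not applicable: next leaf is not " + repr(word))

    # The prefix tree of Figure 6, S+ a+ • a- B↓+ B↓- C↓+ C↓- S-, corresponds to
    # Item([("S","+"), ("a","+"), ("a","-"), ("B↓","+"), ("B↓","-"),
    #       ("C↓","+"), ("C↓","-"), ("S","-")], dot=2).

In this encoding, applying scan after a substitution or adjunction step simply moves the dot to just before the upward visit of the newly read word, exactly as the Scan rule does.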
Both substitution and adjunction rules are available in an Up and a Down version: Down is the case where the elementary tree is substituted or adjoined into the prefix tree, and Up the reverse. There are two versions of AdjDown, depending on whether the foot node of the auxiliary tree is the leftmost or the rightmost leaf.

Consider the SubstDown operation, which substitutes the elementary tree into some substitution node u with label A of the prefix tree (again, see Figure A.2 for an illustration). Because we parse incrementally, this operation can only be applied if the substitution node is the next leaf of the prefix tree after the dot: Once we substitute into a node, we will never have another chance to substitute or adjoin into a node to its left. The fact that u is the next leaf is reflected in SubstDown by requiring that f1 must consist only of upward traversals of edges of the prefix tree, and f2 only of downward traversals. Furthermore, if the elementary tree is canonical, then we require that its first leaf is a lexical anchor by insisting that g1 (the initial sequence of downward traversals to the first leaf) ends in a. Under these conditions, SubstDown manipulates the string of node visits as PLTAG’s substitution operation requires.

The other parsing rules enforce and exploit the incremental derivation process in analogous ways. The AdjDown rules exist in two versions because auxiliary trees with a rightmost foot node must be read before the adjunction site is first visited, that is, before the dot is moved past its downward occurrence. On the other hand, auxiliary trees with a leftmost foot node must be read after the first visit to the adjunction site. As a general rule, after applying a Subst or Adj rule whose right-hand premise is a canonical tree, the string immediately after the dot will be of the form f1− f2+ a+. The Scan rule will move the dot over this substring, in preparation for the reading of the next word.

[Figure A.2: Illustration of the parsing rules in Figure A.1.]

A.2 Complete Parsing Rules

The rules in Figure A.1 gloss over the fact that elementary trees in PLTAG may be prediction trees, whose upper and lower node halves may carry markers. The two halves of a node in a prefix tree may therefore carry markers as well, possibly from two different prediction trees. In general, node labels may be of the form A^k_l, with an upper marker k and a lower marker l. Figure A.3 shows an extended version of these rules that takes the markers into account.

We extend the notation for node visits in the following way. A visit that carries a superscript, for example, A^k, must be to a node whose upper half carries the marker k.
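A minimal way to picture this bookkeeping is to let each visit carry optional upper and lower markers; removing a marker k from a visit sequence is then a single pass. The following is a sketch under our own assumptions, not the authors' data structure.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass(frozen=True)
    class MarkedVisit:
        label: str                           # node label; "A↓" / "A*" for substitution and foot nodes
        direction: str                       # "+" for a downward visit, "-" for an upward visit
        upper_marker: Optional[int] = None   # prediction marker on the node's upper half (superscript)
        lower_marker: Optional[int] = None   # prediction marker on the node's lower half (subscript)

    def strip_marker(visits: List[MarkedVisit], k: int) -> List[MarkedVisit]:
        """Return a copy of a visit sequence with every occurrence of marker k
        removed; this is the effect that a successful Verify(k) step has on the
        prefix tree."""
        return [MarkedVisit(v.label, v.direction,
                            None if v.upper_marker == k else v.upper_marker,
                            None if v.lower_marker == k else v.lower_marker)
                for v in visits]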
[Figure A.3: Incremental parsing rules for PLTAG; the rules of Figure A.1 extended with prediction markers on the node visits, plus the Verify(k) rule, which rewrites the prefix tree using the functions verify and extend defined below.]

To generalize over cases with and without markers, we may write the superscript in parentheses, for example, A(k), to match a visit to a node that may or may not have a marker on the upper half. If any such node visit in the premise of the rule does have a marker, then all other uses of the superscript (k) in the premises and conclusion of the rule must also carry this marker. The use of the subscripts A_l and A_(l) for lower markers is analogous.

The Subst and Adj rules pass markers on to the result item appropriately; note that the upper and lower markers of a node into which another tree is adjoined are split over two nodes in the result. The Scan rule does not allow a to carry markers. This is intentional: Trees with two lexical anchors (such as the one for either ... or) may contain predicted lexical nodes, which must not be treated by Scan as read before they have been verified.

Furthermore, the revised rule schema includes the rule Verify(k), which removes all occurrences of the marker k from the prefix tree by performing a verification operation. Verify(k) can be applied if the current fringe ends in a visit to some node u with lower marker k. It cuts the node visit sequence for the prefix tree τ into two parts: the string up to the downward visit to u, and the string starting at the upward visit. The first part consists of the region of the prefix tree to the left of the node, plus the downward visits to the spine of u, that is, the nodes on the path from the root to u. We call this part F1 for brevity. The second part F2 consists of the nodes to the right of u and the upward visits to the spine. Verify(k) chooses a verification tree τv that contains a visit to a node u′ that has the same label as u, and splits it into the part G1 = g1 A+ up to the downward visit of u′, the part G2 = A− g3 starting at the upward visit of u′, and the part g2 between these two visits.
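The cutting step can be sketched as simple list slicing over the two visit sequences. This is a simplified illustration; the index arguments are hypothetical helpers, whereas the actual rule locates u at the end of the current fringe and u′ by its label.

    def split_for_verification(tau, down_u, up_u, tau_v, down_u_prime, up_u_prime):
        """Cut the prefix tree tau at node u and the verification tree tau_v at
        the corresponding node u', following the description in the text.

        tau, tau_v               : node-visit sequences (lists of visits)
        down_u, up_u             : indices of the downward and upward visits of u in tau
        down_u_prime, up_u_prime : indices of the downward and upward visits of u' in tau_v
        """
        F1 = tau[:down_u + 1]                    # left context plus the spine down to u
        F2 = tau[up_u:]                          # material to the right of u plus the upward spine
        G1 = tau_v[:down_u_prime + 1]            # g1 A+ : up to the downward visit of u'
        g2 = tau_v[down_u_prime + 1:up_u_prime]  # the nodes below u'
        G2 = tau_v[up_u_prime:]                  # A- g3 : from the upward visit of u'
        return F1, F2, G1, g2, G2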
Verify(k) then attempts to compute an admissible correspondence and performs a verification step by calling the functions verify and extend. The function verify tries to establish a correspondence between the k-marked node visits in F1 and the node visits in G1. If such a correspondence exists, it is unique and can be found in linear time by matching the tree structure of the k-marked nodes in τ against the nodes in τv. If such a correspondence cannot be found, verify fails; otherwise, it returns F1 with all k-markers removed. The function extend performs the same task for F2 and G2, that is, the “right” parts of τ and τv, except that G2 may now contain complete subtrees that are not present in F2. Just like verify, extend looks for the unique correspondence between the k-marked node visits in F2 and the node visits in G2 in linear time. If such a correspondence exists, extend adds the unmatched subtrees in G2 to the correspondents of their parent nodes in F2 and removes all k-markers from F2. Verify(k) then obtains the node visit sequence for the result of the verification operation by concatenating verify(F1, G1) with g2 (i.e., the nodes below u′) and extend(F2, G2).

Acknowledgments
Portions of this work have benefited from presentations at the 2008 Architectures and Mechanisms for Sentence Processing Conference and at the 2008 CUNY Sentence Processing Conference. We are grateful to Aravind Joshi, Roger Levy, Mark Steedman, Patrick Sturt, and the four anonymous reviewers for feedback. This research was supported by EPSRC research grant EP/C546830/1: Prediction in Human Parsing.

References
Altmann, Gerry T. M. and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73:247–264.
Aoshima, Sachiko, Masaya Yoshida, and Colin Phillips. 2009. Incremental processing of coreference and binding in Japanese. Syntax, 12:93–134.
Arai, Manabu and Frank Keller. 2013. The use of verb-specific information for prediction in sentence processing. Language and Cognitive Processes, 28:525–560.
Bangalore, Srinivas and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25:237–265.
Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278.
Beuck, Niels, Arne Köhn, and Wolfgang Menzel. 2011. Incremental parsing and the evaluation of partial dependency analyses. In Proceedings of the 1st International Conference on Dependency Linguistics, pages 290–299, Barcelona.
Boston, Marisa Ferrara, John Hale, Reinhold Kliegl, Umesh Patil, and Shravan Vasishth. 2008. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2(1):1–12.
Brants, Thorsten. 2000. TnT—A statistical part-of-speech tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing, pages 224–231, Seattle, WA.
Chen, John. 2001. Towards Efficient Statistical Parsing Using Lexicalized Grammatical Information. Ph.D. thesis, University of Delaware.
Chiang, David. 2000. Statistical parsing with an automatically-extracted tree adjoining grammar. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 456–463, Hong Kong.
Collins, Michael. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
Collins, Michael and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 111–120, Barcelona.
Demberg, Vera. 2012. Incremental derivations in CCG. In Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11), pages 198–206, Paris.
Demberg, Vera and Frank Keller. 2008a. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193–210.
Demberg, Vera and Frank Keller. 2008b. A psycholinguistically motivated version of TAG. In Proceedings of the 9th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+9), 8 pages, Tübingen.
Demberg, Vera and Frank Keller. 2009. A computational model of prediction in human parsing: Unifying locality and surprisal effects. In Proceedings of the 29th Annual Meeting of the Cognitive Science Society, pages 1,888–1,893, Amsterdam.
Demberg-Winterfors, Vera. 2010. A Broad-Coverage Model of Prediction in Human Sentence Processing. Ph.D. thesis, University of Edinburgh.
Fossum, Victoria and Roger Levy. 2012. Syntactic vs. hierarchical models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics, pages 61–69, Montreal.
Frank, Stefan L. 2009. Surprisal-based comparison between a symbolic and a connectionist model of sentence processing. In Niels Taatgen and Hedderik van Rijn, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1,139–1,144, Amsterdam.
Gibson, Edward. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68:1–76.
Gibson, Edward. 2000. Dependency locality theory: A distance-based theory of linguistic complexity. In Alec Marantz, Yasushi Miyashita, and Wayne O’Neil, editors, Image, Language, Brain: Papers from the First Mind Articulation Project Symposium. MIT Press, Cambridge, MA, pages 95–126.
Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics, volume 2, pages 159–166, Pittsburgh, PA.
Hockenmaier, Julia and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396.
Joshi, Aravind K., Leon Levy, and M. Takahashi. 1975. Tree adjunct grammars. Journal of Computer and System Sciences, 10(1):136–163.
Joshi, Aravind K. and Yves Schabes. 1992. Tree adjoining grammars and lexicalized grammars. In Maurice Nivat and Andreas Podelski, editors, Tree Automata and Languages. North-Holland, Amsterdam, pages 409–432.
Kamide, Yuki, Christoph Scheepers, and Gerry T. M. Altmann. 2003. Integration of syntactic and semantic information in predictive processing: Cross-linguistic evidence from German and English. Journal of Psycholinguistic Research, 32:37–55.
Kashammer, Miriam and Vera Demberg. 2012. German and English treebanks and lexica for tree-adjoining grammars. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 1,880–1,887, Istanbul.
Kato, Yoshihide, Shigeki Matsubara, and Yasuyoshi Inagaki. 2004. Stochastically evaluating the validity of partial parse trees in incremental parsing. In Proceedings of the ACL Workshop Incremental Parsing, pages 9–15, Barcelona.
Keller, Frank. 2010. Cognitively plausible models of human language processing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 60–67, Uppsala.
Kennedy, Alan and Joel Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45:153–168.
Konieczny, Lars. 2000. Locality and parsing complexity. Journal of Psycholinguistic Research, 29(6):627–645.
Lewis, Richard L. and Shravan Vasishth. 2005. An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29:1–45.
Magerman, David M. 1994. Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University.
Mazzei, Alessandro. 2005. Formal and Empirical Issues of Applying Dynamics to Tree Adjoining Grammars. Ph.D. thesis, Università di Torino.
Mazzei, Alessandro, Vincenzo Lombardo, and Patrick Sturt. 2007. Dynamic TAG and lexical dependencies. Research on Language and Computation, 5:309–332.
McRae, Ken, Michael J. Spivey-Knowlton, and Michael K. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38(3):283–312.
Mitchell, Jeff, Mirella Lapata, Vera Demberg, and Frank Keller. 2010. Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 196–206, Uppsala.
Nivre, Joakim. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the ACL Workshop on Incremental Parsing, pages 50–57, Barcelona.
Palmer, Martha, Dan Gildea, and Paul Kingsbury. 2003. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Petrov, Slav and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 404–411, Rochester, NY.
Pinheiro, José C. and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. Springer, New York.
Resnik, Philip. 1992a. Left-corner parsing and psychological plausibility. In Proceedings of the 15th International Conference on Computational Linguistics, pages 191–197, Nantes.
Resnik, Philip. 1992b. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings of the 15th International Conference on Computational Linguistics, pages 418–424, Nantes.
Roark, Brian. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276.
Roark, Brian, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 324–333, Singapore.
Roland, Douglas, Gail Mauner, Carolyn O’Meara, and Hongoak Yun. 2012. Discourse expectations and relative clause processing. Journal of Memory and Language, 66(3):479–508.
Schuler, William, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2010. Broad-coverage parsing using human-like memory constraints. Computational Linguistics, 36(1):1–30.
Shen, Libin and Aravind K. Joshi. 2005. Incremental LTAG parsing. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 811–818, Vancouver.
Shieber, Stuart M., Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1–2):3–36.
Staub, Adrian. 2010. Eye movements and processing difficulty in object relative clauses. Cognition, 116:71–86.
Staub, Adrian and Charles Clifton. 2006. Syntactic prediction in language comprehension: Evidence from either ... or. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32:425–436.
Steedman, Mark. 2000. The Syntactic Process. MIT Press, Cambridge, MA.
Sturt, Patrick and Vincenzo Lombardo. 2005. Processing coordinate structures: Incrementality and connectedness. Cognitive Science, 29:291–305.
Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268:1,632–1,634.
Thompson, Henry S., Mike Dixon, and John Lamping. 1991. Compose-reduce parsing. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 87–97, Berkeley, CA.
Vadas, David and James Curran. 2007. Adding noun phrase structure to the Penn Treebank. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 240–247, Prague.
Vijay-Shanker, K. and Aravind K. Joshi. 1988. Feature structures based tree adjoining grammars. In Proceedings of the 12th International Conference on Computational Linguistics, pages 714–719, Morristown, NJ.
Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1,085–1,094.
Wu, Stephen, Asaf Bachrach, Carlos Cardenas, and William Schuler. 2010. Complexity metrics in an incremental right-corner parser. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1,189–1,198, Uppsala.
Xia, Fei, Martha Palmer, and Aravind Joshi. 2000. A uniform method of grammar extraction and its applications. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 53–62, Hong Kong.
Yoshida, Masaya, Michael Walsh-Dickey, and Patrick Sturt. 2013. Predictive processing of syntactic structure: Sluicing and ellipsis in real-time sentence processing. Language and Cognitive Processes, 28:272–302.