Un modelo generativo de puntuación en árboles de dependencia
Xiang Lisa Li⇤ and Dingquan Wang⇤ and Jason Eisner
Departamento de Ciencias de la Computación, Universidad Johns Hopkins
xli150@jhu.edu, {wdd,jason}@cs.jhu.edu
Abstracto
Treebanks traditionally treat punctuation
marks as ordinary words, but linguists have
suggested that a tree’s “true” punctuation
marks are not observed (Nünberg, 1990).
These latent “underlying” marks serve to
delimit or separate constituents in the syn-
tax tree. When the tree’s yield is rendered as
a written sentence, a string rewriting mech-
anism transduces the underlying marks into
“surface” marks, which are part of the ob-
served (surface) string but should not be re-
garded as part of the tree. We formalize
this idea in a generative model of punc-
tuation that admits efficient dynamic pro-
gramming. We train it without observing
the underlying marks, by locally maximiz-
ing the incomplete data likelihood (simi-
larly to the EM algorithm). When we use
the trained model to reconstruct the tree’s
underlying punctuation, the results appear
plausible across 5 idiomas, and in par-
ticular are consistent with Nunberg’s anal-
ysis of English. We show that our gener-
ative model can be used to beat baselines
on punctuation restoration. También, our recon-
struction of a sentence’s underlying punctu-
ation lets us appropriately render the surface
punctuation (via our trained underlying-to-
surface mechanism) when we syntactically
transform the sentence.
1
Introducción
Punctuation enriches the expressiveness of writ-
ten language. When converting from spoken to
written language, punctuation indicates pauses or
pitches; expresses propositional attitude; and is
conventionally associated with certain syntactic
constructions such as apposition, parenthesis, quo-
tation, and conjunction.
en este documento, we present a latent-variable
model of punctuation usage, inspired by the rule-
based approach to English punctuation of Nun-
berg (1990). Training our model on English data
⇤Equal contribution.
learns rules that are consistent with Nunberg’s
hand-crafted rules. Our system is automatic, so we
use it to obtain rules for Arabic, Chino, Español,
and Hindi as well.
Además, our rules are stochastic, which al-
lows us to reason probabilistically about ambigu-
ous or missing punctuation. Across the 5 lan-
calibres, our model predicts surface punctuation
better than baselines, as measured both by per-
plejidad (§4) and by accuracy on a punctuation
restoration task (§ 6.1). We also use our model
to correct the punctuation of non-native writers
of English (§ 6.2), and to maintain natural punc-
tuation style when syntactically transforming En-
glish sentences (§ 6.3).
En principio, our model
could also be used within a generative parser, Alabama-
lowing the parser to evaluate whether a candidate
tree truly explains the punctuation observed in the
input sentence (§8).
Punctuation is interesting In The Linguistics of
Puntuación, Nünberg (1990) argues that punctu-
ación (en Inglés) is more than a visual counter-
part of spoken-language prosody, but forms a lin-
guistic system that involves “interactions of point
indicators (es decir. commas, semicolons, colons, pe-
riods and dashes).” He proposes that much as in
phonology (Chomsky y Halle, 1968), a gram-
mar generates underlying punctuation which then
transforms into the observed surface punctuation.
Consider generating a sentence from a syntactic
grammar as follows:
Hail the king [, Arthur Pendragon ,]
[, who wields [ “ Excalibur ” ] ,] .
Although the full tree is not depicted here, algunos de
the constituents are indicated with brackets. En esto
underlying generated tree, each appositive NP is
surrounded by commas. On the surface, sin embargo,
the two adjacent commas after Pendragon will
now be collapsed into one, and the final comma
will be absorbed into the adjacent period. Fur-
thermore, in American English, the typographic
357
Transacciones de la Asociación de Lingüística Computacional, volumen. 7, páginas. 357–373, 2019. Editor de acciones: Trevor Cohn.
Lote de envío: 10/2018; Lote de revisión: 1/2019; Publicado 7/2019.
C(cid:2) 2019 Asociación de Lingüística Computacional. Distribuido bajo CC-BY 4.0 licencia.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
convention is to move the final punctuation inside
the quotation marks. Thus a reader sees only this
modified surface form of the sentence:
Hail the king, Arthur Pendragon,
who wields “Excalibur.”
Note that these modifications are string transfor-
mations that do not see or change the tree. El
resulting surface punctuation marks may be clues
to the parse tree, pero (contrary to NLP convention)
they should not be included as nodes in the parse
árbol. Only the underlying marks play that role.
Punctuation is meaningful Pang et al. (2002)
use question and exclamation marks as clues to
sentiment. Similarmente, quotation marks may be
used to mark titles, quotations, reported speech,
or dubious terminology (University of Chicago,
2010). Because of examples like this, methods for
determining the similarity or meaning of syntax
árboles, such as a tree kernel (Agarwal et al., 2011)
or a recursive neural network (Tai et al., 2015),
should ideally be able to consider where the un-
derlying punctuation marks attach.
Punctuation is helpful Surface punctuation re-
mains correlated with syntactic phrase structure.
NLP systems for generating or editing text must be
able to deploy surface punctuation as human writ-
ers do. Parsers and grammar induction systems
benefit from the presence of surface punctuation
marks (jones, 1994; Spitkovsky et al., 2011). Es
plausible that they could do better with a linguisti-
cally informed model that explains exactly why the
surface punctuation appears where it does. Pat-
terns of punctuation usage can also help identify
the writer’s native language (Markov et al., 2018).
Punctuation is neglected Work on syntax and
parsing tends to treat punctuation as an af-
terthought rather than a phenomenon governed by
its own linguistic principles. Treebank annota-
tion guidelines for punctuation tend to adopt sim-
ple heuristics like “attach to the highest possi-
ble node that preserves projectivity” (Bies et al.,
1995; Nivre et al., 2018).1 Many dependency
parsing works exclude punctuation from evalua-
ción (Nivre et al., 2007b; Koo and Collins, 2010;
Chen and Manning, 2014; Lei et al., 2014; Kiper-
wasser and Goldberg, 2016), although some others
retain punctuation (Nivre et al., 2007a; Goldberg
and Elhadad, 2010; Dozat and Manning, 2017).
Unpunctuated Tree: t
ATTACH
árbol:
t 0
Underlying sequence: tu
oración: ¯u
NOISYCHANNEL
Valle
means river valley
nsubj
raíz
dobj
raíz.
"
“
nsubj
u0
u1
“ Dale ”
“
dobj
"
u3
u4
means “ river valley ” .
u2
Surface
oración: ¯x
secuencia: X
“ Dale ”
x0
x1
means “ river valley . "
x4
x3
x2
Cifra 1: The generative story of a sentence. Given
an unpunctuated tree T at top, at each node w
2
t , the ATTACH process stochastically attaches a left
puncteme l and a right puncteme r, which may be
empty. The resulting tree T 0 has underlying punctua-
tion u. Each slot’s punctuation ui 2
u is rewritten to
x by NOISYCHANNEL.
xi 2
In tasks such as word embedding induction
(Mikolov et al., 2013; Pennington et al., 2014) y
machine translation (Zens et al., 2002), punctua-
tion marks are usually either removed or treated as
ordinary words ( ˇReh˚uˇrek and Sojka, 2010).
Yet to us, building a parse tree on a surface
sentence seems as inappropriate as morphologi-
cally segmenting a surface word. In both cases,
one should instead analyze the latent underlying
forma, jointly with recovering that form. Para examen-
por ejemplo, the proper segmentation of English hoping
is not hop-ing but hope-ing (with underlying
mi), and the proper segmentation of stopping
is neither stopp-ing nor stop-ping but
stop-ing (with only one underlying p). Cot-
terell et al. (2015, 2016) get this right for morphol-
ogia. We attempt to do the same for punctuation.
2 Formal Model
We propose a probabilistic generative model of
oraciones (Cifra 1):
(1)
¯u(t 0))
pag(¯x) =
psyn(t )
t,t 0
p✓(t 0 |
·
t )
·
pag(¯x
|
PAG
Primero, an unpunctuated dependency tree T is
stochastically generated by some recursive pro-
cess psyn (p.ej., Eisner, 1996, Model C).2 Segundo,
each constituent (es decir., dependency subtree) sprouts
optional underlying punctuation at its left and right
bordes, according to a probability distribution p✓
that depends on the constituent’s syntactic role
(p.ej., dobj for “direct object”). This punctuated
tree T 0 yields the underlying string ¯u = ¯u(t 0),
which is edited by a finite-state noisy channel p
to arrive at the surface sentence ¯x.
1http://universaldependencies.org/u/
2Our model could be easily adapted to work on con-
dep/punct.html
stituency trees instead.
358
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
This third step may alter the sequence of punc-
tuation tokens at each slot between words—for ex-
amplio, in §1, collapsing the double comma , ,
between Pendragon and who. u and x denote
just the punctuation at the slots of ¯u and ¯x respec-
activamente, with ui and xi denoting the punctuation to-
ken sequences at the ith slot. De este modo, the transfor-
mation at the ith slot is ui 7!
Since this model is generative, we could train
it without any supervision to explain the observed
surface string ¯x: maximize the likelihood p(¯x) en
(1), marginalizing out the possible T, t 0 valores.
xi.
In the present paper, sin embargo, we exploit known
T values (as observed in the “depunctuated” ver-
sion of a treebank). Because T is observed, podemos
jointly train ✓, to maximize just
pag(X
|
t ) =
XT 0
p✓(t 0
t )
·
|
pag(X
|
tu(t 0))
(2)
the psyn model that generated T be-
Eso es,
comes irrelevant, but we still try to predict what
surface punctuation will be added to T . Nosotros
still marginalize over the underlying punctuation
marks u. These are never observed, but they must
explain the surface punctuation marks x (§ 2.2),
and they must be explained in turn by the syntax
tree T (§ 2.1). The trained generative model then
lets us restore or correct punctuation in new trees
t (§6).
2.1 Generating Underlying Punctuation
The ATTACH model characterizes the probability
of an underlying punctuated tree T 0 given its cor-
responding unpunctuated tree T , which is given by
p✓(t 0
|
t ) =
t
Yw
2
p✓(lw, rw |
w)
(3)
where lw, rw 2V
are the left and right punctemes
that T 0 attaches to the tree node w. Each puncteme
is a string of 0 o
(Krahn, 2014) in the finite set
more underlying punctuation tokens.3 The proba-
bility p✓(yo, r
w) is given by a log-linear model
V
|
3Multi-token punctemes are occasionally useful. por ejemplo-
amplio, the puncteme … might consist of either 1 o 3 a-
kens, depending on how the tokenizer works; similarmente, el
puncteme ?! might consist of 1 o 2 tokens. También, if a sin-
gle constituent of T gets surrounded by both parentheses and
quotation marks, this gives rise to punctemes (“ and ”).
(A better treatment would add the parentheses as a separate
puncteme pair at a unary node above the quotation marks, pero
that would have required T 0 to introduce this extra node.)
359
7!
, ,.
1. Point Absorption
. -,
„
-;
.
2. Quote Transposition 4. Bracket Absorptions
(
",
3. Period Absorption
.!
7!
abbv
?
7!
abbv.
; ;.
– .?
."
,"
".
7!
7!
7!
7!
7!
!
7!
7!
,)
,"
7!
7!
) -)
” “,
) (,
“
7!
7!
7!
Mesa 1: Some of Nunberg’s punctuation interaction
rules in English, in priority order. The absorption rules
ensure that when there are two adjacent tokens, el
“weaker” one is deleted (where the strength ordering
> – > ,), excepto
es
that bracketing tokens such as () and “” do not absorb
tokens outside the material they bracket.
?, !, (, ), “, "
{
;, :
}
{
> . >
}
p✓(yo, r
w)
|
/ (
exp ✓>f (yo, r, w) si (yo, r)
0
de lo contrario
2W d(w)
(4)
V
y
is the finite set of possible punctemes and
dónde
V
2 gives the possible puncteme pairs for a
Wd ✓V
node w that has dependency relation d = d(w) a
its parent.
Wd are estimated heuristically
from the tokenized surface data (§4). F (yo, r, w) es
a sparse binary feature vector, and ✓ is the cor-
responding parameter vector of feature weights.
The feature templates in Appendix A4 consider the
symmetry between l and r, and their compatibility
con (a) the POS tag of w’s head word, (b) the de-
pendency paths connecting w to its children and
the root of T , (C) the POS tags of the words flank-
ing the slots containing l and r, (d) surface punc-
tuation already added to w’s subconstituents.
2.2 From Underlying to Surface
From the tree T 0, we can read off the sequence
of underlying punctuation tokens ui at each slot i
between words. Namely, ui concatenates the right
punctemes of all constituents ending at i with the
left punctemes of all constituents starting at i (como
illustrated by the examples in §1 and Figure 1).
The NOISYCHANNEL model then transduces ui to
a surface token sequence xi, for each i = 0, . . . , norte
independientemente (where n is the sentence length).
Nunberg’s formalism Much like Chomsky and
Halle’s (1968) phonological grammar of English,
Nünberg (1990) descriptive English punctuation
gramática (Mesa 1) can be viewed computationally
as a priority string rewriting system, or Markov
algoritmo (Markov, 1960; Caracciolo di Forino,
1968). The system begins with a token string u.
4The appendices (supplementary material) are available at
https://arxiv.org/abs/1906.11298.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
abcde
abcde
a bde
a dbe
a d e
. ab
. bc
. bd
. ser
ab
b
db
mi
7!
7!
7!
7!
7!
ade with a sliding win-
Cifra 2: Editing abcde
dow. (When an absorption rule maps 2 tokens to 1, nuestro
diagram leaves blank space that is not part of the out-
put string.) At each step, the left-to-right process has
already committed to the green tokens as output; tiene
not yet looked at the blue input tokens; and is currently
considering how to (further) rewrite the black tokens.
The right column shows the chosen edit.
At each step it selects the highest-priority local
rewrite rule that can apply, and applies it as far
left as possible. When no more rules can apply,
the final state of the string is returned as x.
Simplifying the formalism Markov algorithms
are Turing complete. Fortunately, Johnson (1972)
noted that in practice, phonological u
x maps
described in this formalism can usually be imple-
mented with finite-state transducers (FSTs).
7!
For computational simplicity, we will formu-
late our punctuation model as a probabilistic FST
(PFST)—a locally normalized left-to-right rewrite
modelo (Cotterell et al., 2014). The probabilities
for each language must be learned, using gradient
descent. Normally we expect most probabilities to
be near 0 o 1, making the PFST nearly determin-
istic (es decir., close to a subsequential FST). Sin embargo,
permitting low-probability choices remains useful
to account for typographical errors, dialectal dif-
ferences, and free variation in the training corpus.
Our PFST generates a surface string, pero el
invertibility of FSTs will allow us to work back-
wards when analyzing a surface string (§3).
Instead of having rule
A sliding-window model
priorities, we apply Nunberg-style rules within a
2-token window that slides over u in a single left-
to-right pass (Cifra 2). Conditioned on the cur-
rent window contents ab, a single edit is selected
b
stochastically: either ab
(left absorption), ab
a (right absorption), o
ab
ba (transposition). Then the window slides
rightward to cover the next input token, together
with the token that is (now) to its left. a and b are
always real tokens, never boundary symbols.
specifies the conditional edit probabilities.5
ab (no change), ab
7!
7!
7!
7!
5Rather than learn a separate edit probability distribution
for each bigram ab, one could share parameters across bi-
gramos. Por ejemplo, Table 1’s caption says that “stronger”
tokens tend to absorb “weaker” ones. A model that incor-
360
These specific edit rules (like Nunberg’s) poder-
not insert new symbols, nor can they delete all of
the underlying symbols. De este modo, surface xi is a good
clue to ui: all of its tokens must appear underly-
ingly, and if xi = ✏ (the empty string) then ui = ✏.
The model can be directly implemented as
a PFST (Appendix D4) using Cotterell et al.’s
(2014) more general PFST construction.
Our single-pass formalism is less expressive
than Nunberg’s. It greedily makes decisions based
on at most one token of right context (“label
bias”). It cannot rewrite ’”.
."
because the . is encountered too late to percolate
leftward; luckily, aunque, we can handle such En-
glish examples by sliding the window right-to-left
instead of left-to-right. We treat the sliding direc-
tion as a language-specific parameter.6
.’” or ”,.
7!
7!
2.3 Training Objective
Building on equation (2), we train ✓, to lo-
cally maximize the regularized conditional log-
likelihood
&
· ||
2
✓
||
(5)
iniciar sesión p(X
t )
|
⇠
· E
t 0
[C(t 0)]2
·
⌘
pag(
⇣ Xx,t
where the sum is over a training treebank.7
] is over T 0 ⇠
The expectation E[
|
· · ·
t, X). This generalized expectation term pro-
vides posterior regularization (Mann and McCal-
lum, 2010; Ganchev et al., 2010), by encourag-
ing parameters that reconstruct trees T 0 that use
symmetric punctuation marks in a “typical” way.
The function c(t 0) counts the nodes in T 0 cuyo
punctemes contain “unmatched” symmetric punc-
tuation tokens: Por ejemplo, ) is “matched” only
when it appears in a right puncteme with ( en el
comparable position in the same constituent’s left
puncteme. The precise definition is given in Ap-
pendix B.4
|
|
⌃
2) separate
porated this insight would not have to learn O(
)
absorption probabilities (two per bigram ab), but only O(
|
fortalezas (one per unigram a, which may be regarded as a
1-dimensional embedding of the punctuation token a). Nosotros
figured that the punctuation vocabulary ⌃ was small enough
(Mesa 2) that we could manage without the additional com-
plexity of embeddings or other featurization, although this
does presumably hurt our generalization to rare bigrams.
⌃
|
En g
6We could have handled all languages uniformly by mak-
2 passes of the sliding window (via a composition of
2 PFSTs), with at least one pass in each direction.
7In retrospect, there was no good reason to square the
ET 0 [C(t 0)] term. Sin embargo, when we started redoing the ex-
perimentos, we found the results essentially unchanged.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
In our development experiments on English, el
posterior regularization term was necessary to dis-
cover an aesthetically appealing theory of under-
lying punctuation. When we dropped this term
(⇠ = 0) and simply maximized the ordinary regu-
larized likelihood, we found that the optimization
problem was underconstrained: different training
runs would arrive at different, rather arbitrary un-
derlying punctemes. Por ejemplo, one training run
learned an ATTACH model that used underlying
“. to terminate sentences, along with a NOISY-
CHANNEL model that absorbed the left quotation
mark into the period. By encouraging the under-
lying punctuation to be symmetric, we broke the
corbatas. We also tried making this a hard constraint
(⇠ =
), but then the model was unable to explain
some of the training sentences at all, giving them
probability of 0. Por ejemplo, I went to the
“ special place ” cannot be explained, ser-
cause special place is not a constituent.8
1
3
Inference
En principio, working with the model (1) es
straightforward, thanks to the closure properties
of formal languages. Provided that psyn can be en-
coded as a weighted CFG, it can be composed with
the weighted tree transducer p✓ and the weighted
FST p to yield a new weighted CFG (similarly to
Bar-Hillel et al., 1961; Nederhof and Satta, 2003).
Under this new grammar, one can recover the opti-
mal T, t 0 for ¯x by dynamic programming, or sum
over T, t 0 by the inside algorithm to get the likeli-
hood p(¯x). A similar approach was used by Levy
(2008) with a different FST noisy channel.
In this paper we assume that T is observed, Alabama-
lowing us to work with equation (2). This cuts the
computation time from O(n3) to O(norte).9 Whereas
the inside algorithm for (1) must consider O(n2)
possible constituents of ¯x and O(norte) ways of build-
ing each, our algorithm for (2) only needs to iterate
over the O(norte) true constituents of T and the 1 true
way of building each. Sin embargo, it must still con-
puncteme pairs for each constituent.
sider the
|Wd|
8Recall that the NOISYCHANNEL model family (§ 2.2)
requires the surface “ before special to appear under-
lyingly, and also requires the surface ✏ after special to
be empty underlyingly. These hard constraints clash with
the ⇠ =
hard constraint that the punctuation around
special must be balanced. The surface ” after place
causes a similar problem: no edge can generate the match-
ing underlying “.
1
9We do O(norte) multiplications of N
N matrices where
⇥
361
Algoritmo 1 The algorithm for scoring a given
(t, X) pair. The code in blue is used during train-
ing to get the posterior regularization term in (5).
Input: t , X
. Training pair (omits T 0, tu)
Output: pag(X
t ), mi[C(t 0)]
|
1: procedure TOTALSCORE(t , X)
2:
for i = 1 to n do
compute WFSA (Mi, i, ⇢i)
0
mi
procedure IN(w)
. exp.. count of unmatched punctemes
t
. w
2
slots at left, right of w constit
slot at right of w headword
i, k
j
Mleft
(
w02
Mright
>j (
q
M0
1
Mleft ·
q
0
METRO
para (yo, r)
pag
METRO
mi
return M
2W d(w) hacer
w)
p✓(yo, r
|
METRO + pag
mi + pag
·
·
Mroot
EN(raíz(t ))
return >0 Mroot⇢n, mi
leftkids(w) EN(w0))⇢j
1
rightkids(w) EN(w0))
Nj
Nj ⇥
w02
Mright . R
·
1, R
. R
⇥
Ni⇥
Nk
1
Mi(yo)M0Mk(r)
yo,r have unmatched punc
. R
Ni⇥
Nk
. R, R
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
3.1 Algorithms
i
xi|
ui|
is unbounded whenever
|
For each slot 0
Given an input sentence ¯x of length n, our job is
to sum over possible trees T 0 that are consistent
with T and ¯x, or to find the best such T 0. Este
is roughly a lattice parsing problem—made easier
by knowing T . Sin embargo, the possible ¯u values
are characterized not by a lattice but by a cyclic
> 0).
WFSA (como
|
norte, transduce the sur-
face punctuation string xi by the inverted PFST
for p to obtain a weighted finite-state automa-
tonelada (WFSA) that describes all possible underly-
ing strings ui.10 This WFSA accepts each pos-
sible ui with weight p(xi
If it has Ni
estados, we can represent it (Berstel and Reutenauer,
1988) with a family of sparse weight matrices
En, whose element at row s and
Mi()
column t is the weight of the s
t arc labeled
!
con , o 0 if there is no such arc. Additional
RNi specify the initial and final
vectors i, ⇢i 2
weights. (i is one-hot if the PFST has a single
RNi⇥
ui).
2
|
·
N = O(# of punc types
máximo # of punc tokens per slot).
10Constructively, compose the u-to-x PFST (from the end
of § 2.2) with a straight-line FSA accepting only xi, y
project the resulting WFST to its input tape (Pereira and Ri-
ley, 1996), as explained at the end of Appendix D.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
initial state, of weight 1.)
For any puncteme l (or r) en
Mi(yo
, nosotros definimos
V
Mi(yo) = Mi(l1)Mi(l2)
), a product
yo
|
over the 0 or more tokens in l. This gives the total
!⇤ t WFSA paths labeled with l.
weight of all s
The subprocedure in Algorithm 1 essentially
· · ·
|
extends this to obtain a new matrix IN(w)
2
RNi⇥
Nk , where the subtree rooted at w stretches
Its element IN(w)st gives
from slot i to slot k.
the total weight of all extended paths in the ¯u
WFSA from state s at slot i to state t at slot k. Un
extended path is defined by a choice of underly-
ing punctemes at w and all its descendants. Estos
punctemes determine an s-to-final path at i, entonces
initial-to-final paths at i + 1 through k
1, then an
initial-to-t path at k. The weight of the extended
path is the product of all the WFSA weights on
these paths (which correspond to transition prob-
abilities in p PFST) times the probability of the
choice of punctemes (from p✓).
Este
inside algorithm computes quantities
needed for training (§ 2.3). Useful variants arise
via well-known methods for weighted derivation
forests (Berstel and Reutenauer, 1988; Buen hombre,
1999; Li and Eisner, 2009; Eisner, 2016).
Específicamente, to modify Algorithm 1 to maximize
over T 0 valores (§§ 6.2–6.3) instead of summing
over them, we switch to the derivation semiring
(Buen hombre, 1999), como sigue. Whereas IN(w)st
used to store the total weight of all extended paths
from state s at slot i to state t at slot j, now it will
store the weight of the best such extended path. Él
will also store that extended path’s choice of un-
derlying punctemes, in the form of a puncteme-
annotated version of the subtree of T that is rooted
at w. This is a potential subtree of T 0.
De este modo, each element of IN(w) has the form
R and D is a tree. We define
(r, D) where r
addition and multiplication over such pairs:
2
(r, D) + (r0, D0) =
(r, D)
(r0, D0)
(
(r0, D0) = (rr0, DD0)
if r > r0
de lo contrario
(6)
(7)
(r, D)
·
where DD0 denotes an ordered combination of
two trees. Matrix products UV and scalar-matrix
products p
V are defined in terms of element ad-
dition and multiplication as usual:
·
(UV)st =
(pag
V)st = p
·
PAG
·
Vrt
rUsr ·
Vst
(8)
(9)
362
What is DD0? For presentational purposes, es
convenient to represent a punctuated dependency
tree as a bracketed string. Por ejemplo, the under-
lying tree T 0 En figura 1 would be [ [“ Dale ”]
medio [“ [ river ] valley ”] ] dónde
the words correspond to nodes of T . En este caso,
we can represent every D as a partial bracketed
string and define DD0 by string concatenation.
that multiplication
This presentation ensures
(7) is a complete and associative (though not
commutative) operación, as in any semiring. Como
base cases, each real-valued element of Mi(yo)
or Mk(r) is now paired with the string [l or r]
respectivamente,11 and the real number 1 at line 10 es
paired with the string w. The real-valued elements
of the i and ⇢i vectors and the 0 matrix at line 11
are paired with the empty string ✏, as is the real
number p at line 13.
En la práctica, the D strings that appear within the
matrix M of Algorithm 1 will always represent
complete punctuated trees. De este modo, they can actu-
ally be represented in memory as such, and differ-
ent trees may share subtrees for efficiency (usando
pointers). The product in line 10 constructs a ma-
trix of trees with root w and differing sequences
of left/right children, while the product in line 14
annotates those trees with punctemes l, r.
To sample a possible T 0 from the derivation for-
est in proportion to its probability (§ 6.1), we use
the same algorithm but replace equation (6) con
(r, D) + (r0, D0) =
(r + r0, D)
(r + r0, D0)
(
if u < r
r+r0
otherwise
with u
⇠
Uniform(0, 1) being a random number.
3.2 Optimization
Having computed the objective (5), we find the
gradient via automatic differentiation, and opti-
mize ✓, via Adam (Kingma and Ba, 2014)—a
variant of stochastic gradient decent—with learn-
ing rate 0.07, batchsize 5, sentence per epoch
400, and L2 regularization. (These hyperparam-
eters, along with the regularization coefficients &
and ⇠ from equation (5), were tuned on dev data
(§4) for each language respectively.) We train
11We still construct the real matrix Mi(l) by ordinary ma-
trix multiplication before pairing its elements with strings.
This involves summation of real numbers: each element of
the resulting real matrix is a marginal probability, which sums
over possible PFST paths (edit sequences) that could map the
underlying puncteme l to a certain substring of the surface
slot xi. Similarly for Mk(r).
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
the punctuation model for 30 epochs. The initial
NOISYCHANNEL parameters () are drawn from
(0, 1), and the initial ATTACH parameters (✓)
(0, 1) (with one minor excep-
N
are drawn from
tion described in Appendix A).
N
4
Intrinsic Evaluation of the Model
Data. Throughout §§4–6, we will examine the
punctuation model on a subset of the Univer-
sal Dependencies (UD) version 1.4 (Nivre et al.,
2016)—a collection of dependency treebanks
across 47 languages with unified POS-tag and de-
pendency label sets. Each treebank has designated
training, development, and test portions. We ex-
periment on Arabic, English, Chinese, Hindi, and
Spanish (Table 2)—languages with diverse punc-
tuation vocabularies and punctuation interaction
rules, not to mention script directionality. For each
treebank, we use the tokenization provided by UD,
and take the punctuation tokens (which may be
multi-character, such as ...) to be the tokens with
the PUNCT tag. We replace each straight dou-
ble quotation mark " with either “ or ” as appro-
priate, and similarly for single quotation marks.12
We split each non-punctuation token that ends in
. (such as etc.) into a shorter non-punctuation
token (etc) followed by a special punctuation to-
ken called the “abbreviation dot” (which is distinct
from a period). We prepend a special punctuation
mark ˆ to every sentence ¯x, which can serve to
absorb an initial comma, for example.13 We then
replace each token with the special symbol UNK if
its type appeared fewer than 5 times in the training
portion. This gives the surface sentences.
V
To estimate the vocabulary
of underlying
punctemes, we simply collect all surface token se-
quences xi that appear at any slot in the training
portion of the processed treebank. This is a gener-
ous estimate. Similarly, we estimate
Wd (§ 2.1) as
all pairs (l, r)
2 that flank any d constituent.
Recall that our model generates surface punctu-
ation given an unpunctuated dependency tree. We
train it on each of the 5 languages independently.
We evaluate on conditional perplexity, which will
be low if the trained model successfully assigns a
high probability to the actual surface punctuation
in a held-out corpus of the same language.
2V
12For en and en_esl, “ and ” are distinguished by
language-specific part-of-speech tags. For the other 4 lan-
guages, we identify two " dependents of the same head word,
363
Language Treebank #Token %Punct #Omit #Type
282K
Arabic
123K
Chinese
255K
97.7K
352K
es_ancora 560K
ar
zh
en
en_esl
hi
7.9
13.8
11.7
9.8
6.7
11.7
255
3
40
2
21
25
18
23
35
16
15
16
Hindi
Spanish
English
Table 2: Statistics of our datasets. “Treebank” is the
UD treebank identifier, “#Token” is the number of to-
kens, “%Punct” is the percentage of punctuation to-
kens, “#Omit” is the small number of sentences con-
taining non-leaf punctuation tokens (see footnote 19),
and “#Type” is the number of punctuation types after
preprocessing. (Recall from §4 that preprocessing dis-
tinguishes between left and right quotation mark types,
and between abbreviation dot and period dot types.)
Baselines. We compare our model against three
baselines to show that its complexity is necessary.
Our first baseline is an ablation study that does not
use latent underlying punctuation, but generates
the surface punctuation directly from the tree. (To
implement this, we fix the parameters of the noisy
channel so that the surface punctuation equals the
underlying with probability 1.) If our full model
performs significantly better, it will demonstrate
the importance of a distinct underlying layer.
Our other two baselines ignore the tree struc-
ture, so if our full model performs significantly
better, it will demonstrate that conditioning on ex-
plicit syntactic structure is useful. These baselines
are based on previously published approaches that
reduce the problem to tagging: Xu et al. (2016)
use a BiLSTM-CRF tagger with bigram topology;
Tilk and Alumäe (2016) use a BiGRU tagger with
attention. In both approaches, the model is trained
to tag each slot i with the correct string xi 2V ⇤
(possibly ✏ or ˆ). These are discriminative proba-
bilistic models (in contrast to our generative one).
Each gives a probability distribution over the tag-
gings (conditioned on the unpunctuated sentence),
so we can evaluate their perplexity.14
Results. As shown in Table 3, our full model
beats the baselines in perplexity in all 5 languages.
Also,
in 4 of 5 languages, allowing a trained
NOISYCHANNEL (rather than the identity map)
replacing the left one with “ and the right one with ”.
13For symmetry, we should also have added a final mark.
14These methods learn word embeddings that optimize
log-likelihood on the punctuation restoration
conditional
training data. They might do better if these embeddings were
shared with other tasks, as multi-task learning might lead
them to discover syntactic categories of words.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Attn.
CRF ATTACH +NC DIR
1.4676 1.3016 1.2230 1.1526 L
Arabic
Chinese 1.6850 1.4436 1.1921 1.1464 L
English 1.5737 1.5247 1.5636 1.4276 R
1.1201 1.1032 1.0630 1.0598 L
Hindi
Spanish 1.4397 1.3198 1.2364 1.2103 R
Table 3: Results of the conditional perplexity experi-
ment (§4), reported as perplexity per punctuation slot,
where an unpunctuated sentence of n words has n + 1
slots. Column “Attn.” is the BiGRU tagger with atten-
tion, and “CRF” stands for the BiLSTM-CRF tagger.
“ATTACH” is the ablated version of our model where
surface punctuation is directly attached to the nodes.
Our full model “+NC” adds NOISYCHANNEL to trans-
duce the attached punctuation into surface punctuation.
DIR is the learned direction (§ 2.2) of our full model’s
noisy channel PFST: Left-to-right or Right-to-left. Our
models are given oracle parse trees T . The best per-
plexity is boldfaced, along with all results that are not
significantly worse (paired permutation test, p < 0.05).
significantly improves the perplexity.
5 Analysis of the Learned Grammar
5.1 Rules Learned from the Noisy Channel
7!
7!
7!
7!
b, ab
a, ab
We study our learned probability distribution over
noisy channel rules (ab
ab,
ba) for English. The probability distributions
ab
corresponding to six of Nunberg’s English rules
are shown in Figure 3. By comparing the orange
and blue bars, observe that the model trained on
the en_cesl treebank learned different quotation
rules from the one trained on the en treebank. This
is because en_cesl follows British style, whereas
en has American-style quote transposition.15
We now focus on the model learned from the
en treebank. Nunberg’s rules are deterministic,
and our noisy channel indeed learned low-entropy
rules, in the sense that for an input ab with un-
25,16 at least one of the possi-
derlying count
ble outputs (a, b, ab or ba) always has probability
> 0.75. The one exception is ”.
.” for which
0.5, porque
the argmax output has probability
writers do not apply this quote transposition rule
consistently. As shown by the blue bars in Fig-
ura 3, the high-probability transduction rules are
7!
⇡
15American style places commas and periods inside the
quotation marks, even if they are not logically in the quote.
British style (more sensibly) places unquoted periods and
commas in their logical place, sometimes outside the quo-
tation marks if they are not part of the quote.
16For rarer underlying pairs ab, the estimated distributions
sometimes have higher entropy due to undertraining.
Cifra 3: Rewrite probabilities learned for English,
averaged over the last 4 epochs on en treebank (azul
barras) or en_esl treebank (orange bars). The header
above each figure is the underlying punctuation string
(input to NOISYCHANNEL). The two counts in the fig-
ure headers are the number of occurrences of the under-
lying punctuation strings in the 1-best reconstruction of
underlying punctuation sequences (by Algorithm 1) re-
spectively in the en and en_esl treebank. Each bar
represents one surface punctuation string (output of
NOISYCHANNEL), its height giving the probability.
consistent with Nunberg’s hand-crafted determin-
istic grammar in Table 1.
Our system has high precision when we look at
the confident rules. Del 24 learned edits with
conditional probability > 0.75, Nunberg lists 20.
Our system also has good recall. Nünberg
hand-crafted schemata consider 16 punctuation
types and generate a total of 192 edit rules, en-
cluding the specimens in Table 1. Eso es, del
162 = 256 possible underlying punctuation bi-
grams ab, 3
4 are supposed to undergo absorption
or transposition. Our method achieves fairly high
recordar, in the sense that when Nunberg proposes
ab) usually ranks highly
ab
ab). 75
among all probabilities of the form p(0 |
of Nunberg’s rules got rank 1, 48 got rank 2, y
the remaining 69 got rank > 2. The mean recipro-
cal rank was 0.621. Recall is quite high when we
restrict to those Nunberg rules ab
para cual
our model is confident how to rewrite ab, en el
sense that some p(0
(This tends
to eliminate rare ab: see footnote 5.) De estos 55
Nunberg rules, 38 rules got rank 1, 15 got rank 2,
y solo 2 got rank worse than 2. The mean recip-
rocal rank was 0.836.
, our learned p(
ab) > 0.5.
7!
7!
|
|
¿What about Spanish? Spanish uses inverted
question marks ¿ and exclamation marks ¡, cual
form symmetric pairs with the regular question
marks and exclamation marks.
If we try to ex-
trapolate to Spanish from Nunberg’s English for-
364
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
malization, the English mark most analogous to ¿
es (. Our learned noisy channel for Spanish (no
graphed here) includes the high-probability rules
,¿
¿ which match
:¿ and ¿,
7!
Nunberg’s treatment of ( en Inglés.
,¿ and :¿
7!
7!
parataxis appos
2.29
2.38
, , 26.8
✏ ✏
20.1
( ) 13.0
– ✏
9.7
: ✏
8.1
advcl
0.77
ccomp
0.53
lista
1.33
, , 18.8 ✏ ✏ 60.0 ✏ ✏ 73.8
90.8
: ✏ 18.1 , , 22.3 , , 21.2 “ ” 2.4
– ✏ 15.9 , ✏ 5.3 ✏ , 3.1
, , 2.4
✏ ✏ 14.4 < > 3.0 ( ) 0.74 :“ ” 0.9
( ) 13.1 ( ) 3.0 ✏ – 0.21 “ ," 0.8
✏ ✏
5.2 Attachment Model
What does our model learn about how dependency
relations are marked by underlying punctuation?
,advmod,
raíz.
,“ccomp”
,nmod,
Mesa 4: The top 5 relations that are most likely to
generate symmetric punctemes, the entropy of their
puncteme pair (row 2), and their top 5 puncteme pairs
(rows 3–7) with their probabilities shown as percent-
siglos. The symmetric punctemes are in boldface.
ˆ,Earlier, Kerry said ,“ … ,En realidad, answer the question”.
ˆ Earlier, Kerry said ,“ … ,En realidad, answer the question.”
6.1 Punctuation Restoration
The above example17 illustrates the use of specific
puncteme pairs to set off the advmod, ccomp,
and nmod relations. Notice that said takes
a complement (ccomp) eso
is symmetrically
quoted but also left delimited by a comma, cual
is indeed how direct speech is punctuated in
Inglés. This example also illustrates quotation
transposition. The top five relations that are most
likely to generate symmetric punctemes and their
arriba (yo, r) pairs are shown in Table 4.
conj
,conj,
,conj,
cc
Sección 1
Sección 1
,2 , ,… 7 , y 8 …
,… 7 , y 8 …
,2
The above example18 shows how our model han-
dles commas in conjunctions of 2 or more phrases.
UD format dictates that each conjunct after the
first is attached by the conj relation. As shown
arriba, each such conjunct is surrounded by under-
lying commas (via the N.,.,.conj feature from
Apéndice A), except for the one that bears the
conjunction and (via an even stronger weight on
the C.✏.✏.!conj.cc feature). Our learned feature
weights indeed yield p(` = ✏, r = ✏) > 0.5 para el
final conjunct in this example. Some writers omit
the “Oxford comma” before the conjunction: este
style can be achieved simply by changing “sur-
rounded” to “preceded” (eso es, changing the N
feature to N.,.✏.conj).
6 Performance on Extrinsic Tasks
en esta tarea, we are given a depunctuated sentence
¯d19 and must restore its (surface) punctuation. Nuestro
model supposes that the observed punctuated sen-
tence ¯x would have arisen via the generative pro-
impuesto (1). De este modo, we try to find T , t 0, and ¯x that are
consistent with ¯d (a partial observation of ¯x).
The first step is to reconstruct T from ¯d. Este
initial parsing step is intended to choose the T that
¯d).20 This step depends only
maximizes psyn(t
on psyn and not on our punctuation model (p✓, pag).
En la práctica, we choose T via a dependency parser
that has been trained on an unpunctuated treebank
with examples of the form (¯d, t ).21
|
Ecuación (2) now defines a distribution over
(t 0, X) given this T . To obtain a single prediction
for x, we adopt the minimum Bayes risk (MBR)
approach of choosing surface punctuation ˆx that
minimizes the expected loss with respect to the
unknown truth x⇤. Our loss function is the total
edit distance over all slots (where edits operate on
punctuation tokens). Finding ˆx exactly would be
intractable, so we use a sampling-based approx-
imation and draw m = 1000 samples from the
posterior distribution over (t 0, X). We then define
ˆx = argmin
S(t )
X
2
S(t )
Xx⇤2
ˆp(x⇤
t )
|
·
loss(X, x⇤) (10)
where S(t ) is the set of unique x values in the
sample and ˆp is the empirical distribution given by
the sample. This can be evaluated in O(m2) tiempo.
We evaluate the trained punctuation model by us-
ing it in the following three tasks.
19 To depunctuate a treebank sentence, we remove all to-
kens with POS-tag PUNCT or dependency relation punct.
These are almost always leaves; else we omit the sentence.
17[en] Earlier, Kerry said, “Just because you
get an honorable discharge does not, En realidad,
answer that question.”
18[en]
Secciones 1, 2, 5, 6, 7, y 8 will
survive any termination of this License.
365
20Idealmente, rather than maximize, one would integrate over
possible trees T , in practice by sampling many values Tk
¯u) and replacing S(t ) en (10) con
k S(Tk).
from psyn(
the Yara parser (Rasooli and Tetreault,
2015), a fast non-probabilistic transition-based parser that
uses rich non-local features (Zhang and Nivre, 2011).
21Específicamente,
· |
S
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
yo
a
C
_
a
_
0
0
2
7
3
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
We evaluate on Arabic, Inglés, Chino,
Hindi, and Spanish. For each language, we train
both the parser and the punctuation model on
the training split of that UD treebank (§4), y
evaluate on held-out data. We compare to the
BiLSTM-CRF baseline in §4 (Xu et al., 2016).22
We also compare to a “trivial” deterministic base-
line, which merely places a period at the end of the
oración (o un «|» in the case of Hindi) and adds no
other punctuation. Because most slots do not in
fact have punctuation, the trivial baseline already
does very well; to improve on it, we must fix its
errors without introducing new ones.
Our final comparison on test data is shown in
the table in Figure 4. On all 5 idiomas, nuestro
method beats (usually significantly) es 3 com-
petitors:
el
BiLSTM-CRF, and the ablated version of our
modelo (ATTACH) that omits the noisy channel.
the trivial deterministic baseline,
Por supuesto, the success of our method depends
on the quality of the parse trees T (which is par-
ticularly low for Chinese and Arabic). The graph
En figura 4 explores this relationship, by evaluat-
En g (on dev data) with noisier trees obtained from
parsers that were variously trained on only the first
10%, 20%, . . . of the training data. On all 5 lan-
calibres, provided that the trees are at least 75%
correcto, our punctuation model beats both the triv-
ial baseline and the BiLSTM-CRF (which do not
use trees). It also beats the ATTACH ablation base-
line at all levels of tree accuracy (these curves are
omitted from the graph to avoid clutter). In all lan-
calibres, better parses give better performance, y
gold trees yield the best results.
6.2 Punctuation Correction
Our next goal is to correct punctuation errors in
a learner corpus. Each sentence is drawn from
the Cambridge Learner Corpus treebanks, cual
provide original (en_esl) and corrected (en_cesl)
oraciones. All kinds of errors are corrected, semejante
22We copied their architecture exactly but re-tuned the hy-
perparameters on our data. We also tried tripling the amount
of training data by adding unannotated sentences (provided
along with the original annotated sentences by Ginter et al.
(2017)), taking advantage of the fact that the BiLSTM-CRF
does not require its training sentences to be annotated with
árboles. Sin embargo, this actually hurt performance slightly, por-
haps because the additional sentences were out-of-domain.
We also tried the BiGRU-with-attention architecture of Tilk
and Alumäe (2016), but it was also weaker than the BiLSTM-
CRF (just as in Table 3). We omit all these results from Fig-
ura 4 to reduce clutter.
pag
8 ATTACH a–
–a
Arábica 0.064 0.064 0.063
Chino 0.110 0.109 0.104
Inglés 0.100 0.108 0.092
0.025 0.023 0.019
Hindi
Español 0.093 0.092 0.085
0.059 0.053
0.102 0.048
0.090 0.079
0.018 0.013
0.078 0.068
Cifra 4: Edit distance per slot (which we call average
edit distance, or AED) for each of the 5 corpus. Lower
is better. The table gives the final AED on the test data.
Its first 3 columns show the baseline methods just as in
Mesa 3: the trivial deterministic method, the BiLSTM-
CRF, and the ATTACH ablation baseline that attaches
the surface punctuation directly to the tree. Columna 4
is our method that incorporates a noisy channel, y
columna 5 (in gray) is our method using oracle (oro)
árboles. We boldface the best non-oracle result as well as
all that are not significantly worse (paired permutation
prueba, pag < 0.05). The curves show how our method’s
AED (on dev data) varies with the labeled attachment
score (LAS) of the trees, where --a at x = 100 uses
the oracle (gold) trees, a-- at x < 100 uses trees from
our parser trained on 100% of the training data, and the
100 use increasingly worse parsers.
-- points at x
The p and 8 at the right of the graph show the AED of
#
the trivial deterministic baseline and the BiLSTM-CRF
baseline, which do not use trees.
⌧
as syntax errors, but we use only the 30% of sen-
tences whose depunctuated trees T are isomorphic
between en_esl and en_cesl. These en_cesl
trees may correct word and/or punctuation errors
in en_esl, as we wish to do automatically.
We assume that an English learner can make
mistakes in both the attachment and the noisy
channel steps. A common attachment mistake is
the failure to surround a non-restrictive relative
In the noisy channel step,
clause with commas.
mistakes in quote transposition are common.
Correction model. Based on the assumption
about the two error sources, we develop a dis-
Let ¯xe de-
criminative model for this task.
note the full input sentence, and let xe and xc
denote the input (possibly errorful) and output
(corrected) punctuation sequences. We model
p✓(T 0c |
p(xc |
T
|
T, xe)
T 0c). Here T is the depunctu-
ated parse tree, T 0c is the corrected underlying tree,
T 0e is the error underlying tree, and we assume
T 0e).
p(T 0e |
p✓(T 0c |
¯xe) =
p(xc |
P
p✓(T 0c |
T, xe) =
psyn(T
T, xe)
¯xe)
P
T 0e
T 0c
·
·
·
366
P
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
In practice we use a 1-best pipeline rather than
summing. Our first step is to reconstruct T from
the error sentence ¯xe. We choose T that max-
¯xe) from a dependency parser
imizes psyn(T
trained on en_esl treebank examples (¯xe, T ). The
second step is to reconstruct T 0e based on our punc-
tuation model trained on en_esl. We choose T 0e
that maximizes p(T 0e |
T, xe). We then reconstruct
T 0c by
|
p(T 0c |
T 0e) =
we
T 0e
2
p(l, r
we)
|
(11)
where we is the node in T 0e, and p(l, r
we) is a
Q
similar log-linear model to equation (4) with addi-
tional features (Appendix C4) which look at we.
|
Finally, we reconstruct xc based on the noisy
T 0c) in § 2.2. During training, is
channel p(xc |
regularized to be close to the noisy channel param-
eters in the punctuation model trained on en_cesl.
We use the same MBR decoder as in § 6.1 to
choose the best action. We evaluate using AED
as in § 6.1. As a second metric, we use the script
from the CoNLL 2014 Shared Task on Grammati-
cal Error Correction (Ng et al., 2014): it computes
the F0.5-measure of the set of edits found by the
system, relative to the true set of edits.
As shown in Table 5, our method achieves bet-
ter performance than the punctuation restoration
baselines (which ignore input punctuation). On
the other hand, it is soundly beaten by a new
BiLSTM-CRF that we trained specifically for the
task of punctuation correction. This is the same
as the BiLSTM-CRF in the previous section, ex-
cept that the BiLSTM now reads a punctuated
input sentence (with possibly erroneous punctua-
n, the BiL-
tion). To be precise, at step 0
STM reads a concatenation of the embedding of
word i (or BOS if i = 0) with an embedding of
the punctuation token sequence xi. The BiLSTM-
CRF wins because it is a discriminative model tai-
lored for this task: the BiLSTM can extract arbi-
trary contextual features of slot i that are corre-
lated with whether xi is correct in context.
i
6.3 Sentential Rephrasing
We suspect that syntactic transformations on a
sentence should often preserve the underlying
punctuation attached to its tree. The surface punc-
tuation can then be regenerated from the trans-
formed tree. Such transformations include ed-
its that are suggested by a writing assistance tool
(Heidorn, 2000), or subtree deletions in compres-
sive summarization (Knight and Marcu, 2002).
p
8
parsed gold 8-corr
a--
AED 0.052 0.051 0.047 0.034 0.033 0.005
F0.5 0.779 0.787 0.827 0.876 0.881 0.984
Table 5: AED and F0.5 results on the test split of
English-ESL data. Lower AED is better; higher F0.5
is better. The first three columns (markers corre-
spond to Figure 4) are the punctuation restoration base-
lines, which ignore the input punctuation. The fourth
and fifth columns are our correction models, which
use parsed and gold trees. The final column is the
BiLSTM-CRF model tailored for the punctuation cor-
rection task.
For our experiment, we evaluate an interesting
case of syntactic transformation. Wang and Eis-
ner (2016) consider a systematic rephrasing pro-
cedure by rearranging the order of dependent sub-
trees within a UD treebank, in order to synthesize
new languages with different word order that can
then be used to help train multi-lingual systems
(i.e., data augmentation with synthetic data).
As Wang and Eisner acknowledge (2016, foot-
note 9), their permutations treat surface punctua-
tion tokens like ordinary words, which can result
in synthetic sentences whose punctuation is quite
unlike that of real languages.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
In our experiment, we use Wang and Eisner’s
(2016) “self-permutation” setting, where the de-
pendents of each noun and verb are stochastically
reordered, but according to a dependent ordering
model that has been trained on the same language.
rephrasing a English sentence
For example,
advcl
punct
root
mark
det
SCONJ
If true , the caper failed .
nsubj
PUNCT
NOUN
VERB
DET
ADJ
PUNCT
punct
under an English ordering model may yield
det
nsubj
DET
the caper failed .
punct
VERB
NOUN
root
punct
advcl
PUNCT SCONJ
mark
ADJ
If true ,
PUNCT
which is still grammatical except that , and . are
wrongly swapped (after all, they have the same
POS tag and relation type). Worse, permutation
may yield bizarre punctuation such as , , at the
start of a sentence.
Our punctuation model gives a straightforward
remedy—instead of permuting the tree directly,
we first discover its most likely underlying tree
,advcl,
root.
mark
det
ˆ, If true, the caper failed .
nsubj
by the maximizing variant of Algorithm 1 (§ 3.1).
Then, we permute the underlying tree and sample
the surface punctuation from the distribution
modeled by the trained PFST, yielding
367
Punctuation
All
Base Half Full Base Half Full
Arabic 156.0 231.3 186.1 540.8 590.3 553.4
Chinese 165.2 110.0 61.4 205.0 174.4 78.7
English 98.4 74.5 51.0 140.9 131.4 75.4
Hindi
118.4 118.8 91.8
Spanish 266.2 259.2 194.5 346.3 343.4 239.3
10.8 11.0 9.7
“All” evaluates on all
Table 6: Perplexity (evaluated on the train split to
avoid evaluating generalization) of a trigram language
model trained (with add-0.001 smoothing) on differ-
ent versions of rephrased training sentences. “Punc-
tuation” only evaluates perplexity on the trigrams that
the tri-
have punctuation.
grams. “Base” permutes all surface dependents includ-
ing punctuation (Wang and Eisner, 2016). “Full” is
our full approach: recover underlying punctuation, per-
mute remaining dependents, regenerate surface punc-
tuation. “Half” is like “Full” but it permutes the non-
punctuation tokens identically to “Base.” The permu-
tation model is trained on surface trees or recovered
underlying trees T 0, respectively. In each 3-way com-
parison, we boldface the best result (always significant
under a paired permutation test over per-sentence log-
probabilities, p < 0.05).
det
nsubj
root.
,advcl,
mark
ˆthe caper failed ,If true ,.
ˆthe caper failed ,If true .
We
leave the handling of capitalization to future work.
We test the naturalness of the permuted sen-
tences by asking how well a word trigram lan-
guage model trained on them could predict the
original sentences.23 As shown in Table 6, our per-
mutation approach reduces the perplexity over the
baseline on 4 of the 5 languages, often dramati-
cally.
7 Related Work
Punctuation can aid syntactic analysis, since it
signals phrase boundaries and sentence structure.
Briscoe (1994) and White and Rajkumar (2008)
parse punctuated sentences using hand-crafted
constraint-based grammars that implement Nun-
berg’s approach in a declarative way. These gram-
mars treat surface punctuation symbols as ordi-
nary words, but annotate the nonterminal cate-
gories so as to effectively keep track of the under-
lying punctuation. This is tantamount to crafting
a grammar for underlyingly punctuated sentences
and composing it with a finite-state noisy channel.
23So the two approaches to permutation yield different
training data, but are compared fairly on the same test data.
368
The parser of Ma et al. (2014) takes a differ-
ent approach and treats punctuation marks as fea-
tures of their neighboring words. Zhang et al.
(2013) use a generative model for punctuated sen-
tences, leting them restore punctuation marks dur-
ing transition-based parsing of unpunctuated sen-
tences. Li et al. (2005) use punctuation marks to
segment a sentence: this "divide and rule" strat-
egy reduces ambiguity in parsing of long Chinese
sentences. Punctuation can similarly be used to
constrain syntactic structure during grammar in-
duction (Spitkovsky et al., 2011).
Punctuation restoration (§ 6.1) is useful for tran-
scribing text from unpunctuated speech. The task
is usually treated by tagging each slot with zero
or more punctuation tokens, using a traditional
sequence labeling method: conditional random
fields (Lui and Wang, 2013; Lu and Ng, 2010), re-
current neural networks (Tilk and Alumäe, 2016),
or transition-based systems (Ballesteros and Wan-
ner, 2016).
8 Conclusion and Future Work
We have provided a new computational approach
to modeling punctuation. In our model, syntactic
constituents stochastically generate latent under-
lying left and right punctemes. Surface punctu-
ation marks are not directly attached to the syn-
tax tree, but are generated from sequences of adja-
cent punctemes by a (stochastic) finite-state string
rewriting process . Our model is inspired by Nun-
berg’s (1990) formal grammar for English punctu-
ation, but is probabilistic and trainable. We give
exact algorithms for training and inference.
We trained Nunberg-like models for 5 lan-
guages and L2 English. We compared the English
model to Nunberg’s, and showed how the trained
models can be used across languages for punctua-
tion restoration, correction, and adjustment.
In the future, we would like to study the
usefulness of the recovered underlying trees on
tasks such as syntactically sensitive sentiment
analysis (Tai et al., 2015), machine translation
(Cowan et al., 2006), relation extraction (Cu-
lotta and Sorensen, 2004), and coreference reso-
lution (Kong et al., 2010). We would also like
to investigate how underlying punctuation could
aid parsing. For discriminative parsing, features
for scoring the tree could refer to the underly-
ing punctuation, not just the surface punctuation.
For generative parsing (§3), we could follow the
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
scheme in equation (1). For example, the psyn
factor in equation (1) might be a standard re-
current neural network grammar (RNNG) (Dyer
et al., 2016); when a subtree of T is completed by
the REDUCE operation of psyn, the punctuation-
augmented RNNG (1) would stochastically attach
subtree-external left and right punctemes with p✓
and transduce the subtree-internal slots with p.
In the future, we are also interested in enriching
the T 0 representation and making it more differ-
ent from T , to underlyingly account for other phe-
nomena in T such as capitalization, spacing, mor-
phology, and non-projectivity (via reordering).
Acknowledgments
This material is based upon work supported by
the National Science Foundation under Grant Nos.
1423276 and 1718846, including a REU supple-
ment to the first author. We are grateful to the state
of Maryland for the Maryland Advanced Research
Computing Center, a crucial resource. We thank
Xiaochen Li for early discussion, Argo lab mem-
bers for further discussion, and the three reviewers
for quality comments.
References
Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen
Rambow, and Rebecca Passonneau. 2011. Sen-
timent analysis of Twitter data. In Proceedings
of the Workshop on Language in Social Media
(LSM 2011), pages 30–38.
Miguel Ballesteros and Leo Wanner. 2016. A neu-
ral network architecture for multilingual punc-
tuation generation. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, pages 1048–1053.
On formal properties of
Yehoshua Bar-Hillel, M. Perles, and E. Shamir.
simple
1961.
für
phrase structure grammars.
Phonetik, Sprachwissenschaft und Kommunika-
tionsforschung, 14:143–172. Reprinted in Y.
Bar-Hillel (1964), Language and Information:
Selected Essays on their Theory and Applica-
tion, Addison-Wesley 1964, pages 116–150.
Zeitschrift
Jean Berstel and Christophe Reutenauer. 1988.
Rational Series and their Languages. Springer-
Verlag.
369
Ann Bies, Mark Ferguson, Karen Katz, Robert
MacIntyre, Victoria Tredinnick, Grace Kim,
Mary Ann Marcinkiewicz, and Britta Schas-
berger. 1995. Bracketing guidelines for Tree-
bank II style: Penn Treebank project. Technical
Report MS-CIS-95-06, University of Pennsyl-
vania.
Ted Briscoe. 1994. Parsing (with) punctuation,
etc. Technical report, Xerox European Re-
search Laboratory.
Danqi Chen and Christopher Manning. 2014. A
fast and accurate dependency parser using neu-
ral networks. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 740–750.
Noam Chomsky and Morris Halle. 1968. The
Sound Pattern of English. Harper and Row,
New York.
Ryan Cotterell, Nanyun Peng, and Jason Eisner.
2014. Stochastic contextual edit distance and
probabilistic FSTs. In Proceedings of the 52nd
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 2: Short Papers),
pages 625–630.
Ryan Cotterell, Nanyun Peng, and Jason Eisner.
2015. Modeling word forms using latent under-
lying morphs and phonology. Transactions of
the Association for Computational Linguistics
(TACL), 3:433–447.
Ryan Cotterell, Tim Vieira, and Hinrich Schütze.
2016. A joint model of orthography and mor-
In Proceedings of
phological segmentation.
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies
(NAACL-HLT), pages 664–669.
Brooke Cowan, Ivona Kuˇcerová, and Michael
Collins. 2006. A discriminative model for tree-
to-tree translation. In Proceedings of the 2006
Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 232–
241.
Aron Culotta and Jeffrey Sorensen. 2004. De-
pendency tree kernels for relation extraction.
In Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics
(ACL).
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Timothy Dozat and Christopher Manning. 2017.
In
Efficient third-order dependency parsers.
Proceedings of the 5th International Confer-
ence on Learning Representations (ICLR).
Chris Dyer, Adhiguna Kuncoro, Miguel Balles-
teros, and Noah A. Smith. 2016. Recurrent
In Proceedings of
neural network grammars.
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies
(NAACL-HLT), pages 199–209.
Jason Eisner. 1996. Three new probabilistic mod-
els for dependency parsing: An exploration. In
Proceedings of the 16th International Confer-
ence on Computational Linguistics (COLING),
pages 340–345.
Jason Eisner. 2016.
Inside-outside and forward-
backward algorithms are just backprop. In Pro-
ceedings of the EMNLP Workshop on Struc-
tured Prediction for NLP.
A. Caracciolo di Forino. 1968. String process-
ing languages and generalized Markov algo-
rithms. In D. G. Bobrow, editor, Symbol Manip-
ulation Languages and Techniques, pages 191–
206. North-Holland Publishing Company, Am-
sterdam.
Kuzman Ganchev, Jennifer Gillenwater, and Ben
Taskar. 2010. Posterior regularization for struc-
tured latent variable models. Journal of Ma-
chine Learning Research, 11:2001–2049.
Filip Ginter, Jan Hajiˇc, Juhani Luotolahti, Milan
Straka, and Daniel Zeman. 2017. CoNLL 2017
shared task - automatically annotated raw texts
and word embeddings. LINDAT/CLARIN dig-
ital library at the Institute of Formal and Ap-
plied Linguistics (ÚFAL), Faculty of Mathe-
matics and Physics, Charles University.
Yoav Goldberg and Michael Elhadad. 2010. An
efficient algorithm for easy-first non-directional
In Human Language
dependency parsing.
Technologies: The 2010 Annual Conference of
the North American Chapter of the Association
for Computational Linguistics (NAACL-HLT),
pages 742–750.
Joshua Goodman. 1999. Semiring parsing. Com-
putational Linguistics, 25(4):573–605.
George Heidorn. 2000.
Intelligent writing assis-
In Robert Dale, Herman Moisl, and
tance.
Harold Somers, editors, Handbook of Natural
Language Processing, pages 181–207. Marcel
Dekker, New York.
C. Douglas Johnson. 1972. Formal Aspects of
Phonological Description. Mouton.
Bernard E. M. Jones. 1994. Exploring the role of
punctuation in parsing natural text. In COLING
1994 Volume 1: The 15th International Confer-
ence on Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. In Proceed-
ings of the International Conference on Learn-
ing Representations (ICLR).
Eliyahu Kiperwasser and Yoav Goldberg. 2016.
Simple and accurate dependency parsing us-
ing bidirectional LSTM feature representations.
Transactions of the Association for Computa-
tional Linguistics (TACL), 4:313–327.
Kevin Knight and Daniel Marcu. 2002. Summa-
rization beyond sentence extraction: A proba-
bilistic approach to sentence compression. Ar-
tificial Intelligence, 139(1):91–107.
Fang Kong, Guodong Zhou, Longhua Qian, and
Dependency-driven
Qiaoming Zhu. 2010.
anaphoricity determination for coreference res-
In Proceedings of the 23rd Interna-
olution.
tional Conference on Computational Linguis-
tics (COLING), pages 599–607.
Terry Koo and Michael Collins. 2010. Efficient
third-order dependency parsers. In Proceedings
of the 48th Annual Meeting of the Association
for Computational Linguistics (ACL), pages 1–
11.
Albert E. Krahn. 2014. A New Paradigm for
Punctuation. Ph.D. thesis, The University of
Wisconsin-Milwaukee.
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay,
and Tommi Jaakkola. 2014. Low-rank tensors
for scoring dependency structures. In Proceed-
ings of the 52nd Annual Meeting of the As-
sociation for Computational Linguistics (ACL),
pages 1381–1391.
370
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Roger Levy. 2008. A noisy-channel model of
human sentence comprehension under uncer-
In Proceedings of the 2008 Con-
tain input.
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 234–243.
Xing Li, Chengqing Zong, and Rile Hu. 2005. A
hierarchical parsing approach with punctuation
processing for long Chinese sentences. In Pro-
ceedings of the International Joint Conference
on Natural Language Processing (IJCNLP).
Zhifei Li and Jason Eisner. 2009.
First- and
second-order expectation semirings with appli-
cations to minimum-risk training on translation
In Proceedings of the Conference on
forests.
Empirical Methods in Natural Language Pro-
cessing (EMNLP), pages 40–51.
Wei Lu and Hwee Tou Ng. 2010. Better punctu-
ation prediction with dynamic conditional ran-
In Proceedings of the 2010 Con-
dom fields.
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 177–186.
Marco Lui and Li Wang. 2013.
Recovering
casing and punctuation using conditional ran-
dom fields. In Proceedings of the Australasian
Language Technology Association Workshop
(ALTA), pages 137–141.
Ji Ma, Yue Zhang, and Jingbo Zhu. 2014. Punc-
tuation processing for projective dependency
In Proceedings of the 52nd Annual
parsing.
Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages
791–796.
Gideon S. Mann and Andrew McCallum. 2010.
Generalized expectation criteria for
semi-
supervised learning with weakly labeled
data. Journal of Machine Learning Research,
11:955–984.
Andrey Andreevich Markov. 1960. The theory
of algorithms. American Mathematical Society
Translations, series 2(15):1–14.
Ilia Markov, Vivi Nastase, and Carlo Strappar-
ava. 2018. Punctuation as native language in-
In Proceedings of the 27th Inter-
terference.
national Conference on Computational Linguis-
tics (COLING), pages 3456–3466.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-
frey Dean. 2013. Efficient estimation of word
representations in vector space. Computing Re-
search Repository (CoRR), arXiv:1301.3781.
Mark-Jan Nederhof and Giorgio Satta. 2003.
Probabilistic parsing as intersection. In 8th In-
ternational Workshop on Parsing Technologies
(IWPT), pages 137–148.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Chris-
tian Hadiwinoto, Raymond Hendy Susanto, and
Christopher Bryant. 2014. The CoNLL-2014
shared task on grammatical error correction.
In Proceedings of the Eighteenth Conference
on Computational Natural Language Learning:
Shared Task, pages 1–14.
Joakim Nivre, Željko Agi´c, Lars Ahrenberg,
Maria Jesus Aranzabe, Masayuki Asahara,
John
Aitziber Atutxa, Miguel Ballesteros,
Bauer, Kepa Bengoetxea, Yevgeni Berzak,
Riyaz Ahmad Bhat, Eckhard Bick, Carl
Börstell, Cristina Bosco, Gosse Bouma, Sam
Bowman, Gül¸sen Cebiro˘glu Eryi˘git, Giuseppe
G. A. Celano, Fabricio Chalub, Ça˘grı Çöl-
tekin, Miriam Connor, Elizabeth Davidson,
Marie-Catherine de Marneffe, Arantza Diaz de
Ilarraza, Kaja Dobrovoljc, Timothy Dozat,
Kira Droganova, Puneet Dwivedi, Marhaba
Eli, Tomaž Erjavec, Richárd Farkas, Jennifer
Foster, Claudia Freitas, Katarína Gajdošová,
Daniel Galbraith, Marcos Garcia, Moa Gär-
denfors, Sebastian Garza, Filip Ginter, Iakes
Goenaga, Koldo Gojenola, Memduh Gökır-
mak, Yoav Goldberg, Xavier Gómez Guino-
vart, Berta Gonzáles Saavedra, Matias Gri-
oni, Normunds Gr¯uz¯ıtis, Bruno Guillaume,
Jan Hajiˇc, Linh Hà M˜y, Dag Haug, Barbora
Hladká, Radu Ion, Elena Irimia, Anders Jo-
hannsen, Fredrik Jørgensen, Hüner Ka¸sıkara,
Hiroshi Kanayama,
Jenna Kanerva, Boris
Katz, Jessica Kenney, Natalia Kotsyba, Si-
mon Krek, Veronika Laippala, Lucia Lam,
Phuong Lê H`ông, Alessandro Lenci, Nikola
Ljubeši´c, Olga Lyashevskaya, Teresa Lynn,
Aibek Makazhanov, Christopher Manning,
C˘at˘alina M˘ar˘anduc, David Mareˇcek, Héctor
Martínez Alonso, André Martins, Jan Mašek,
Yuji Matsumoto, Ryan McDonald, Anna Mis-
silä, Verginica Mititelu, Yusuke Miyao, Simon-
etta Montemagni, Keiko Sophie Mori, Shun-
suke Mori, Bohdan Moskalevskyi, Kadri Muis-
371
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
chnek, Nina Mustafina, Kaili Müürisep, Luong
Nguy˜ên Thi., Huy`ên Nguy˜ên Thi. Minh, Vitaly
Nikolaev, Hanna Nurmi, Petya Osenova, Robert
Östling, Lilja Øvrelid, Valeria Paiva, Elena Pas-
cual, Marco Passarotti, Cenel-Augusto Perez,
Slav Petrov, Jussi Piitulainen, Barbara Plank,
Martin Popel, Lauma Pretkalnin, a, Prokopis
Prokopidis, Tiina Puolakainen, Sampo Pyysalo,
Alexandre Rademaker, Loganathan Ramasamy,
Livy Real, Laura Rituma, Rudolf Rosa, Shadi
Saleh, Baiba Saul¯ıte, Sebastian Schuster, Wolf-
gang Seeker, Mojgan Seraji, Lena Shakurova,
Mo Shen, Natalia Silveira, Maria Simi, Radu
Simionescu, Katalin Simkó, Mária Šimková,
Kiril Simov, Aaron Smith, Carolyn Spadine,
Alane Suhr, Umut Sulubacak, Zsolt Szántó,
Takaaki Tanaka, Reut Tsarfaty, Francis Ty-
ers, Sumire Uematsu, Larraitz Uria, Gertjan
van Noord, Viktor Varga, Veronika Vincze,
Lars Wallin, Jing Xian Wang, Jonathan North
Washington, Mats Wirén, Zdenˇek Žabokrt-
ský, Amir Zeldes, Daniel Zeman, and Hanzhi
Zhu. 2016.
Universal Dependencies 1.4.
the In-
LINDAT/CLARIN digital
stitute of Formal and Applied Linguistics
(ÚFAL), Faculty of Mathematics and Physics,
Charles University. Data available at http:
//universaldependencies.org.
library at
Joakim Nivre, Johan Hall, Sandra Kübler, Ryan
McDonald, Jens Nilsson, Sebastian Riedel, and
Deniz Yuret. 2007a. The CoNLL 2007 shared
In Proceedings
task on dependency parsing.
of the CoNLL Shared Task Session of EMNLP-
CoNLL 2007, pages 915–932.
Joakim Nivre, Johan Hall, Jens Nilsson, Atanas
Chanev, Gül¸sen Eryigit, Sandra Kübler, Sve-
toslav Marinov, and Erwin Marsi. 2007b. Malt-
parser: A language-independent system for
data-driven dependency parsing. Natural Lan-
guage Engineering, 13(2):95–135.
Joakim Nivre et al. 2018. Universal depen-
dencies annotation guidelines. Available at
universaldependencies.org.
Geoffrey Nunberg. 1990. The Linguistics of Punc-
tuation. Number 18 in CSLI Lecture Notes.
Center for the Study of Language and Informa-
tion.
Bo Pang,
Lillian Lee,
Vaithyanathan. 2002.
and Shivakumar
Senti-
Thumbs up?
ment classification using machine learning
the 2002
techniques.
Conference on Empirical Methods in Natural
Language Processing (EMNLP 2002).
In Proceedings of
Jeffrey Pennington, Richard Socher, and Christo-
pher Manning. 2014. GloVe: Global vectors
In Proceedings of
for word representation.
the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages
1532–1543.
Fernando C. N. Pereira and Michael D. Riley.
1996. Speech recognition by composition of
weighted finite automata. Computing Research
Repository (CoRR), arXiv:cmp-lg/9603001.
Mohammad Sadegh Rasooli and Joel R. Tetreault.
2015. Yara parser: A fast and accurate depen-
dency parser. Computing Research Repository,
arXiv:1503.06733 (version 2).
Radim ˇReh˚uˇrek and Petr Sojka. 2010. Software
framework for topic modelling with large cor-
pora. In Proceedings of the LREC 2010 Work-
shop on New Challenges for NLP Frameworks,
pages 45–50.
Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel
Jurafsky. 2011. Punctuation: Making a point in
unsupervised dependency parsing. In Proceed-
ings of the Fifteenth Conference on Computa-
tional Natural Language Learning, CoNLL ’11,
pages 19–28.
Kai Sheng Tai, Richard Socher, and Christo-
pher D. Manning. 2015.
Improved semantic
representations from tree-structured long short-
term memory networks. In Proceedings of the
53rd Annual Meeting of the Association for
Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language
Processing (ACL-COLING), pages 1556–1566.
Ottokar Tilk and Tanel Alumäe. 2016. Bidirec-
tional recurrent neural network with attention
mechanism for punctuation restoration. In In-
terspeech, pages 3047–3051.
Ke M. Tran, Yonatan Bisk, Ashish Vaswani,
Daniel Marcu, and Kevin Knight. 2016. Unsu-
pervised neural hidden Markov models. In Pro-
ceedings of the Workshop on Structured Predic-
tion for NLP, pages 63–71.
372
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
University of Chicago. 2010. The Chicago Man-
ual of Style. University of Chicago Press.
Dingquan Wang and Jason Eisner. 2016. The
Galactic Dependencies treebanks: Getting more
data by synthesizing new languages. Transac-
tions of the Association for Computational Lin-
guistics (TACL), 4:491–505.
Michael White and Rajakrishnan Rajkumar. 2008.
A more precise analysis of punctuation for
broad-coverage surface realization with CCG.
In Proceedings of the COLING 2008 Workshop
on Grammar Engineering Across Frameworks,
pages 17–24.
K. Xu, L. Xie, and K. Yao. 2016.
Investigat-
ing LSTM for punctuation prediction. In 2016
10th International Symposium on Chinese Spo-
ken Language Processing (ISCSLP), pages 1–5.
Richard Zens, Franz Josef Och, and Hermann Ney.
2002. Phrase-based statistical machine transla-
tion. In Annual Conference on Artificial Intelli-
gence, pages 18–32.
Dongdong Zhang, Shuangzhi Wu, Nan Yang, and
Punctuation prediction with
Mu Li. 2013.
In Proceedings of
transition-based parsing.
the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), pages 752–
760.
Yue Zhang and Joakim Nivre. 2011. Transition-
based dependency parsing with rich non-local
In Proceedings of the 49th Annual
features.
Meeting of the Association for Computational
Linguistics: Human Language Technologies
(NAACL-HLT), pages 188–193.
373
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
7
3
1
9
2
3
4
6
6
/
/
t
l
a
c
_
a
_
0
0
2
7
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3