A Generative Model for Punctuation in Dependency Trees

A Generative Model for Punctuation in Dependency Trees

Xiang Lisa Li⇤ and Dingquan Wang⇤ and Jason Eisner
Department of Computer Science, Johns Hopkins Universität
xli150@jhu.edu, {wdd,jason}@cs.jhu.edu

Abstrakt

Treebanks traditionally treat punctuation
marks as ordinary words, but linguists have
suggested that a tree’s “true” punctuation
marks are not observed (Nunberg, 1990).
These latent “underlying” marks serve to
delimit or separate constituents in the syn-
tax tree. When the tree’s yield is rendered as
a written sentence, a string rewriting mech-
anism transduces the underlying marks into
“surface” marks, which are part of the ob-
serviert (surface) string but should not be re-
garded as part of the tree. We formalize
this idea in a generative model of punc-
tuation that admits efficient dynamic pro-
gramming. We train it without observing
the underlying marks, by locally maximiz-
ing the incomplete data likelihood (simi-
larly to the EM algorithm). When we use
the trained model to reconstruct the tree’s
underlying punctuation, the results appear
plausible across 5 languages, and in par-
ticular are consistent with Nunberg’s anal-
ysis of English. We show that our gener-
ative model can be used to beat baselines
on punctuation restoration. Auch, our recon-
struction of a sentence’s underlying punctu-
ation lets us appropriately render the surface
punctuation (via our trained underlying-to-
surface mechanism) when we syntactically
transform the sentence.

1

Einführung

Punctuation enriches the expressiveness of writ-
ten language. When converting from spoken to
written language, punctuation indicates pauses or
pitches; expresses propositional attitude; und ist
conventionally associated with certain syntactic
constructions such as apposition, parenthesis, quo-
Station, and conjunction.

In diesem Papier, we present a latent-variable
model of punctuation usage, inspired by the rule-
based approach to English punctuation of Nun-
berg (1990). Training our model on English data

⇤Equal contribution.

learns rules that are consistent with Nunberg’s
hand-crafted rules. Our system is automatic, so we
use it to obtain rules for Arabic, Chinese, Spanish,
and Hindi as well.

Darüber hinaus, our rules are stochastic, which al-
lows us to reason probabilistically about ambigu-
ous or missing punctuation. Across the 5 lan-
guages, our model predicts surface punctuation
better than baselines, as measured both by per-
plexity (§4) and by accuracy on a punctuation
restoration task (§ 6.1). We also use our model
to correct the punctuation of non-native writers
of English (§ 6.2), and to maintain natural punc-
tuation style when syntactically transforming En-
glish sentences (§ 6.3).
In principle, our model
could also be used within a generative parser, al-
lowing the parser to evaluate whether a candidate
tree truly explains the punctuation observed in the
input sentence (§8).

Punctuation is interesting In The Linguistics of
Punctuation, Nunberg (1990) argues that punctu-
ation (in English) is more than a visual counter-
part of spoken-language prosody, but forms a lin-
guistic system that involves “interactions of point
indicators (i.e. commas, semicolons, colons, pe-
riods and dashes).” He proposes that much as in
phonology (Chomsky and Halle, 1968), a gram-
mar generates underlying punctuation which then
transforms into the observed surface punctuation.
Consider generating a sentence from a syntactic

grammar as follows:

Hail the king [, Arthur Pendragon ,]

[, who wields [ “ Excalibur ” ] ,] .

Although the full tree is not depicted here, some of
the constituents are indicated with brackets. In diesem
underlying generated tree, each appositive NP is
surrounded by commas. On the surface, Jedoch,
the two adjacent commas after Pendragon will
now be collapsed into one, and the final comma
will be absorbed into the adjacent period. Fur-
thermore, in American English, the typographic

357
Transactions of the Association for Computational Linguistics, Bd. 7, S. 357–373, 2019. Action Editor: Trevor Cohn.
Submission batch: 10/2018; Revision batch: 1/2019; Published 7/2019.
C(cid:2) 2019 Verein für Computerlinguistik. Distributed under a CC-BY 4.0 Lizenz.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

convention is to move the final punctuation inside
the quotation marks. Thus a reader sees only this
modified surface form of the sentence:

Hail the king, Arthur Pendragon,
who wields “Excalibur.”

Note that these modifications are string transfor-
mations that do not see or change the tree. Der
resulting surface punctuation marks may be clues
to the parse tree, Aber (contrary to NLP convention)
they should not be included as nodes in the parse
tree. Only the underlying marks play that role.

Punctuation is meaningful Pang et al. (2002)
use question and exclamation marks as clues to
sentiment. Ähnlich, quotation marks may be
used to mark titles, quotations, reported speech,
or dubious terminology (Universität von Chicago,
2010). Because of examples like this, methods for
determining the similarity or meaning of syntax
trees, such as a tree kernel (Agarwal et al., 2011)
or a recursive neural network (Tai et al., 2015),
should ideally be able to consider where the un-
derlying punctuation marks attach.

Punctuation is helpful Surface punctuation re-
mains correlated with syntactic phrase structure.
NLP systems for generating or editing text must be
able to deploy surface punctuation as human writ-
ers do. Parsers and grammar induction systems
benefit from the presence of surface punctuation
marks (Jones, 1994; Spitkovsky et al., 2011). Es ist
plausible that they could do better with a linguisti-
cally informed model that explains exactly why the
surface punctuation appears where it does. Pat-
terns of punctuation usage can also help identify
the writer’s native language (Markov et al., 2018).

Punctuation is neglected Work on syntax and
parsing tends to treat punctuation as an af-
terthought rather than a phenomenon governed by
its own linguistic principles. Treebank annota-
tion guidelines for punctuation tend to adopt sim-
ple heuristics like “attach to the highest possi-
ble node that preserves projectivity” (Bies et al.,
1995; Nivre et al., 2018).1 Many dependency
parsing works exclude punctuation from evalua-
tion (Nivre et al., 2007B; Koo and Collins, 2010;
Chen and Manning, 2014; Lei et al., 2014; Kiper-
wasser and Goldberg, 2016), although some others
retain punctuation (Nivre et al., 2007A; Goldberg
and Elhadad, 2010; Dozat and Manning, 2017).

Unpunctuated Tree: T

ATTACH

tree:

T 0
Underlying sequence: u
Satz: ¯u

NOISYCHANNEL

Dale

means river valley

nsubj

root

dobj

root.

nsubj
u0
u1
“ Dale ”

dobj


u3

u4
means “ river valley ” .

u2

Surface

Satz: ¯x
sequence: X

“ Dale ”
x0
x1

means “ river valley . ”
x4

x3

x2

Figur 1: The generative story of a sentence. Gegeben
an unpunctuated tree T at top, at each node w
2
T , the ATTACH process stochastically attaches a left
puncteme l and a right puncteme r, which may be
leer. The resulting tree T 0 has underlying punctua-
tion u. Each slot’s punctuation ui 2
u is rewritten to
x by NOISYCHANNEL.
xi 2

In tasks such as word embedding induction
(Mikolov et al., 2013; Pennington et al., 2014) Und
maschinelle Übersetzung (Zens et al., 2002), punctua-
tion marks are usually either removed or treated as
ordinary words ( ˇReh˚uˇrek and Sojka, 2010).

Yet to us, building a parse tree on a surface
sentence seems as inappropriate as morphologi-
cally segmenting a surface word. In both cases,
one should instead analyze the latent underlying
bilden, jointly with recovering that form. For exam-
Bitte, the proper segmentation of English hoping
is not hop-ing but hope-ing (with underlying
e), and the proper segmentation of stopping
is neither stopp-ing nor stop-ping but
stop-ing (with only one underlying p). Cot-
terell et al. (2015, 2016) get this right for morphol-
Ogy. We attempt to do the same for punctuation.

2 Formal Model

We propose a probabilistic generative model of
Sätze (Figur 1):

(1)
¯u(T 0))

P(¯x) =

psyn(T )

T,T 0

p✓(T 0 |

·

T )

·

P(¯x

|

P

Erste, an unpunctuated dependency tree T is
stochastically generated by some recursive pro-
cess psyn (z.B., Eisner, 1996, Model C).2 Zweite,
each constituent (d.h., dependency subtree) sprouts
optional underlying punctuation at its left and right
edges, according to a probability distribution p✓
that depends on the constituent’s syntactic role
(z.B., dobj for “direct object”). This punctuated
tree T 0 yields the underlying string ¯u = ¯u(T 0),
which is edited by a finite-state noisy channel p
to arrive at the surface sentence ¯x.

1http://universaldependencies.org/u/

2Our model could be easily adapted to work on con-

dep/punct.html

stituency trees instead.

358

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

This third step may alter the sequence of punc-
tuation tokens at each slot between words—for ex-
reichlich, in §1, collapsing the double comma , ,
between Pendragon and who. u and x denote
just the punctuation at the slots of ¯u and ¯x respec-
aktiv, with ui and xi denoting the punctuation to-
ken sequences at the ith slot. Daher, the transfor-
mation at the ith slot is ui 7!
Since this model is generative, we could train
it without any supervision to explain the observed
surface string ¯x: maximize the likelihood p(¯x) In
(1), marginalizing out the possible T, T 0 Werte.

xi.

In the present paper, Jedoch, we exploit known
T values (as observed in the “depunctuated” ver-
sion of a treebank). Because T is observed, we can
jointly train ✓, to maximize just

P(X

|

T ) =

XT 0

p✓(T 0

T )

·

|

P(X

|

u(T 0))

(2)

the psyn model that generated T be-
Das ist,
comes irrelevant, but we still try to predict what
surface punctuation will be added to T . Wir
still marginalize over the underlying punctuation
marks u. These are never observed, but they must
explain the surface punctuation marks x (§ 2.2),
and they must be explained in turn by the syntax
tree T (§ 2.1). The trained generative model then
lets us restore or correct punctuation in new trees
T (§6).

2.1 Generating Underlying Punctuation

The ATTACH model characterizes the probability
of an underlying punctuated tree T 0 given its cor-
responding unpunctuated tree T , which is given by

p✓(T 0

|

T ) =

T
Yw
2

p✓(lw, rw |

w)

(3)

where lw, rw 2V
are the left and right punctemes
that T 0 attaches to the tree node w. Each puncteme
is a string of 0 oder
(Krahn, 2014) in the finite set
more underlying punctuation tokens.3 The proba-
bility p✓(l, R

w) is given by a log-linear model

V

|

3Multi-token punctemes are occasionally useful. Zum Beispiel-
reichlich, the punctememight consist of either 1 oder 3 Zu-
kens, depending on how the tokenizer works; similarly, Die
puncteme ?! might consist of 1 oder 2 tokens. Auch, if a sin-
gle constituent of T gets surrounded by both parentheses and
quotation marks, this gives rise to punctemes (“ and ”).
(A better treatment would add the parentheses as a separate
puncteme pair at a unary node above the quotation marks, Aber
that would have required T 0 to introduce this extra node.)

359

7!

, ,.

1. Point Absorption
. -,

-;
.
2. Quote Transposition 4. Bracket Absorptions
(
”,

3. Period Absorption
.!
7!
abbv

?
7!
abbv.

; ;.

– .?

.”

,”

”.

7!

7!

7!

7!

7!

!

7!

7!

,)
,”

7!
7!

) -)
” “,

) (,

7!

7!
7!

Tisch 1: Some of Nunberg’s punctuation interaction
rules in English, in priority order. The absorption rules
ensure that when there are two adjacent tokens, Die
“weaker” one is deleted (where the strength ordering
>> ,), except
Ist
that bracketing tokens such as () and “” do not absorb
tokens outside the material they bracket.

?, !, (, ), „, ”
{

;, :
}
{

> . >

}

p✓(l, R

w)

|

/ (

exp ✓>f (l, R, w) Wenn (l, R)
0

ansonsten

2W d(w)
(4)

V

Und

is the finite set of possible punctemes and
Wo
V
2 gives the possible puncteme pairs for a
Wd ✓V
node w that has dependency relation d = d(w) Zu
its parent.
Wd are estimated heuristically
from the tokenized surface data (§4). F (l, R, w) Ist
a sparse binary feature vector, and ✓ is the cor-
responding parameter vector of feature weights.
The feature templates in Appendix A4 consider the
symmetry between l and r, and their compatibility
mit (A) the POS tag of w’s head word, (B) die de-
pendency paths connecting w to its children and
the root of T , (C) the POS tags of the words flank-
ing the slots containing l and r, (D) surface punc-
tuation already added to w’s subconstituents.

2.2 From Underlying to Surface

From the tree T 0, we can read off the sequence
of underlying punctuation tokens ui at each slot i
between words. Namely, ui concatenates the right
punctemes of all constituents ending at i with the
left punctemes of all constituents starting at i (als
illustrated by the examples in §1 and Figure 1).
The NOISYCHANNEL model then transduces ui to
a surface token sequence xi, for each i = 0, . . . , N
unabhängig (where n is the sentence length).

Nunberg’s formalism Much like Chomsky and
Halle’s (1968) phonological grammar of English,
Nunberg’s (1990) descriptive English punctuation
grammar (Tisch 1) can be viewed computationally
as a priority string rewriting system, or Markov
Algorithmus (Markov, 1960; Caracciolo di Forino,
1968). The system begins with a token string u.

4The appendices (supplementary material) are available at

https://arxiv.org/abs/1906.11298.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

abcde
abcde
a bde
a dbe
a d e

. ab
. v. Chr
. bd
. Sei

ab
B
db
e

7!
7!
7!
7!

7!

ade with a sliding win-
Figur 2: Editing abcde
dow. (When an absorption rule maps 2 tokens to 1, unser
diagram leaves blank space that is not part of the out-
put string.) At each step, the left-to-right process has
already committed to the green tokens as output; hat
not yet looked at the blue input tokens; and is currently
considering how to (weiter) rewrite the black tokens.
The right column shows the chosen edit.

At each step it selects the highest-priority local
rewrite rule that can apply, and applies it as far
left as possible. When no more rules can apply,
the final state of the string is returned as x.

Simplifying the formalism Markov algorithms
are Turing complete. Glücklicherweise, Johnson (1972)
noted that in practice, phonological u
x maps
described in this formalism can usually be imple-
mented with finite-state transducers (FSTs).

7!

For computational simplicity, we will formu-
late our punctuation model as a probabilistic FST
(PFST)—a locally normalized left-to-right rewrite
Modell (Cotterell et al., 2014). The probabilities
for each language must be learned, using gradient
descent. Normally we expect most probabilities to
be near 0 oder 1, making the PFST nearly determin-
istic (d.h., close to a subsequential FST). Jedoch,
permitting low-probability choices remains useful
to account for typographical errors, dialectal dif-
Referenzen, and free variation in the training corpus.
Our PFST generates a surface string, aber die
invertibility of FSTs will allow us to work back-
wards when analyzing a surface string (§3).

Instead of having rule
A sliding-window model
Prioritäten, we apply Nunberg-style rules within a
2-token window that slides over u in a single left-
to-right pass (Figur 2). Conditioned on the cur-
rent window contents ab, a single edit is selected
B
stochastically: either ab
(left absorption), ab
A (right absorption), oder
ab
ba (transposition). Then the window slides
rightward to cover the next input token, together
with the token that is (Jetzt) to its left. a and b are
always real tokens, never boundary symbols.
specifies the conditional edit probabilities.5

ab (no change), ab

7!

7!

7!

7!

5Rather than learn a separate edit probability distribution
for each bigram ab, one could share parameters across bi-
Gramm. Zum Beispiel, Table 1’s caption says that “stronger”
tokens tend to absorb “weaker” ones. A model that incor-

360

These specific edit rules (like Nunberg’s) can-
not insert new symbols, nor can they delete all of
the underlying symbols. Daher, surface xi is a good
clue to ui: all of its tokens must appear underly-
ingly, and if xi = ✏ (the empty string) then ui = ✏.
The model can be directly implemented as
a PFST (Appendix D4) using Cotterell et al.’s
(2014) more general PFST construction.

Our single-pass formalism is less expressive
than Nunberg’s. It greedily makes decisions based
on at most one token of right context (“label
bias”). It cannot rewrite ’”.
.”
weil das . is encountered too late to percolate
leftward; luckily, obwohl, we can handle such En-
glish examples by sliding the window right-to-left
instead of left-to-right. We treat the sliding direc-
tion as a language-specific parameter.6

.’” or ”,.

7!

7!

2.3 Training Objective

Building on equation (2), we train ✓, to lo-
cally maximize the regularized conditional log-
likelihood

&

· ||

2

||

(5)

log p(X

T )

|

· E
T 0

[C(T 0)]2

·

P(

⇣ Xx,T
where the sum is over a training treebank.7
] is over T 0

The expectation E[

|
· · ·
T, X). This generalized expectation term pro-
vides posterior regularization (Mann and McCal-
lum, 2010; Ganchev et al., 2010), by encourag-
ing parameters that reconstruct trees T 0 that use
symmetric punctuation marks in a “typical” way.
The function c(T 0) counts the nodes in T 0 whose
punctemes contain “unmatched” symmetric punc-
tuation tokens: Zum Beispiel, ) is “matched” only
when it appears in a right puncteme with ( Bei der
comparable position in the same constituent’s left
puncteme. The precise definition is given in Ap-
pendix B.4

|

|

2) separate
porated this insight would not have to learn O(
)
absorption probabilities (two per bigram ab), but only O(
|
strengths (one per unigram a, which may be regarded as a
1-dimensional embedding of the punctuation token a). Wir
figured that the punctuation vocabulary ⌃ was small enough
(Tisch 2) that we could manage without the additional com-
plexity of embeddings or other featurization, although this
does presumably hurt our generalization to rare bigrams.

|

ing

6We could have handled all languages uniformly by mak-
2 passes of the sliding window (via a composition of

2 PFSTs), with at least one pass in each direction.
7In retrospect, there was no good reason to square the
ET 0 [C(T 0)] Begriff. Jedoch, when we started redoing the ex-
periments, we found the results essentially unchanged.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

In our development experiments on English, Die
posterior regularization term was necessary to dis-
cover an aesthetically appealing theory of under-
lying punctuation. When we dropped this term
(⇠ = 0) and simply maximized the ordinary regu-
larized likelihood, we found that the optimization
problem was underconstrained: different training
runs would arrive at different, rather arbitrary un-
derlying punctemes. Zum Beispiel, one training run
learned an ATTACH model that used underlying
„. to terminate sentences, along with a NOISY-
CHANNEL model that absorbed the left quotation
mark into the period. By encouraging the under-
lying punctuation to be symmetric, we broke the
Krawatten. We also tried making this a hard constraint
(⇠ =
), but then the model was unable to explain
some of the training sentences at all, giving them
probability of 0. Zum Beispiel, I went to the
“ special place ” cannot be explained, Sei-
cause special place is not a constituent.8

1

3

Inference

In principle, working with the model (1) Ist
straightforward, thanks to the closure properties
of formal languages. Provided that psyn can be en-
coded as a weighted CFG, it can be composed with
the weighted tree transducer p✓ and the weighted
FST p to yield a new weighted CFG (similarly to
Bar-Hillel et al., 1961; Nederhof and Satta, 2003).
Under this new grammar, one can recover the opti-
mal T, T 0 for ¯x by dynamic programming, or sum
over T, T 0 by the inside algorithm to get the likeli-
hood p(¯x). A similar approach was used by Levy
(2008) with a different FST noisy channel.

In this paper we assume that T is observed, al-
lowing us to work with equation (2). This cuts the
computation time from O(n3) to O(N).9 Whereas
the inside algorithm for (1) must consider O(n2)
possible constituents of ¯x and O(N) ways of build-
ing each, our algorithm for (2) only needs to iterate
over the O(N) true constituents of T and the 1 true
way of building each. Jedoch, it must still con-
puncteme pairs for each constituent.
sider the

|Wd|

8Recall that the NOISYCHANNEL model family (§ 2.2)
requires the surface “ before special to appear under-
lyingly, and also requires the surface ✏ after special to
be empty underlyingly. These hard constraints clash with
the ⇠ =
hard constraint that the punctuation around
special must be balanced. The surface ” after place
causes a similar problem: no edge can generate the match-
ing underlying “.

1

9We do O(N) multiplications of N

N matrices where

361

Algorithm 1 The algorithm for scoring a given
(T, X) pair. The code in blue is used during train-
ing to get the posterior regularization term in (5).
Input: T , X
. Training pair (omits T 0, u)
Output: P(X

T ), E[C(T 0)]

|

1: procedure TOTALSCORE(T , X)
2:

for i = 1 to n do

compute WFSA (Mi, ich, ⇢i)

0

E
procedure IN(w)

. exp. count of unmatched punctemes
T

. w

2
slots at left, right of w constit

slot at right of w headword

ich, k
J
Mleft
(
w02
Mright
>j (
Q
M0
1
Mleft ·
Q
0
M
für (l, R)
P
M
E

return M

2W d(w) do
w)
p✓(l, R
|
M + P
E + P

·

·

Mroot
IN(root(T ))
return >0 Mroot⇢n, E

leftkids(w) IN(w0))⇢j

1
rightkids(w) IN(w0))
Nj
Nj ⇥

w02
Mright . R
·

1, R
. R


Ni⇥

Nk

1

Mi(l)M0Mk(R)
l,r have unmatched punc

. R

Ni⇥

Nk

. R, R

3:

4:

5:

6:

7:

8:

9:

10:

11:

12:

13:

14:

15:

16:

17:

18:

3.1 Algorithms

ich

xi|

ui|

is unbounded whenever

|
For each slot 0

Given an input sentence ¯x of length n, our job is
to sum over possible trees T 0 that are consistent
with T and ¯x, or to find the best such T 0. Das
is roughly a lattice parsing problem—made easier
by knowing T . Jedoch, the possible ¯u values
are characterized not by a lattice but by a cyclic
> 0).
WFSA (als
|
N, transduce the sur-

face punctuation string xi by the inverted PFST
for p to obtain a weighted finite-state automa-
Tonne (WFSA) that describes all possible underly-
ing strings ui.10 This WFSA accepts each pos-
sible ui with weight p(xi
If it has Ni
Staaten, we can represent it (Berstel and Reutenauer,
1988) with a family of sparse weight matrices
Ni, whose element at row s and
Mi()
column t is the weight of the s
t arc labeled
!
mit , oder 0 if there is no such arc. Additional
RNi specify the initial and final
vectors i, ⇢i 2
weights. (i is one-hot if the PFST has a single

RNi⇥

ui).

2

|

·

N = O(# of punc types

max # of punc tokens per slot).

10Constructively, compose the u-to-x PFST (from the end
of § 2.2) with a straight-line FSA accepting only xi, Und
project the resulting WFST to its input tape (Pereira and Ri-
ley, 1996), as explained at the end of Appendix D.

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

initial state, of weight 1.)

For any puncteme l (or r) In
Mi(l

, we define
V
Mi(l) = Mi(l1)Mi(l2)
), a product
l
|
over the 0 or more tokens in l. This gives the total
!⇤ t WFSA paths labeled with l.
weight of all s
The subprocedure in Algorithm 1 essentially

· · ·

|

extends this to obtain a new matrix IN(w)
2
RNi⇥
Nk , where the subtree rooted at w stretches
Its element IN(w)st gives
from slot i to slot k.
the total weight of all extended paths in the ¯u
WFSA from state s at slot i to state t at slot k. Ein
extended path is defined by a choice of underly-
ing punctemes at w and all its descendants. Diese
punctemes determine an s-to-final path at i, Dann
initial-to-final paths at i + 1 through k
1, then an
initial-to-t path at k. The weight of the extended
path is the product of all the WFSA weights on
these paths (which correspond to transition prob-
abilities in p PFST) times the probability of the
choice of punctemes (from p✓).

Das

inside algorithm computes quantities
needed for training (§ 2.3). Useful variants arise
via well-known methods for weighted derivation
forests (Berstel and Reutenauer, 1988; Guter Mann,
1999; Li and Eisner, 2009; Eisner, 2016).

Konkret, to modify Algorithm 1 to maximize
over T 0 Werte (§§ 6.2–6.3) instead of summing
over them, we switch to the derivation semiring
(Guter Mann, 1999), as follows. Whereas IN(w)st
used to store the total weight of all extended paths
from state s at slot i to state t at slot j, now it will
store the weight of the best such extended path. Es
will also store that extended path’s choice of un-
derlying punctemes, in the form of a puncteme-
annotated version of the subtree of T that is rooted
at w. This is a potential subtree of T 0.

Daher, each element of IN(w) has the form
R and D is a tree. We define

(R, D) where r
addition and multiplication over such pairs:

2

(R, D) + (r0, D0) =

(R, D)
(r0, D0)

(
(r0, D0) = (rr0, DD0)

if r > r0
ansonsten

(6)

(7)

(R, D)

·

where DD0 denotes an ordered combination of
two trees. Matrix products UV and scalar-matrix
products p
V are defined in terms of element ad-
dition and multiplication as usual:

·

(UV)st =
(P

V)st = p

·

P
·

Vrt

rUsr ·
Vst

(8)

(9)

362

What is DD0? For presentational purposes, es ist
convenient to represent a punctuated dependency
tree as a bracketed string. Zum Beispiel, the under-
lying tree T 0 in Abbildung 1 would be [ [“ Dale ”]
means [„ [ river ] valley ”] ] Wo
the words correspond to nodes of T . In this case,
we can represent every D as a partial bracketed
string and define DD0 by string concatenation.
that multiplication
This presentation ensures
(7) is a complete and associative (allerdings nicht
commutative) Betrieb, as in any semiring. Als
base cases, each real-valued element of Mi(l)
or Mk(R) is now paired with the string [l or r]
jeweils,11 and the real number 1 at line 10 Ist
paired with the string w. The real-valued elements
of the i and ⇢i vectors and the 0 matrix at line 11
are paired with the empty string ✏, as is the real
number p at line 13.

In der Praxis, the D strings that appear within the
matrix M of Algorithm 1 will always represent
complete punctuated trees. Daher, they can actu-
ally be represented in memory as such, and differ-
ent trees may share subtrees for efficiency (verwenden
pointers). The product in line 10 constructs a ma-
trix of trees with root w and differing sequences
of left/right children, while the product in line 14
annotates those trees with punctemes l, R.

To sample a possible T 0 from the derivation for-
est in proportion to its probability (§ 6.1), wir gebrauchen
the same algorithm but replace equation (6) mit

(R, D) + (r0, D0) =

(R + r0, D)
(R + r0, D0)

(

if u < r r+r0 otherwise with u ⇠ Uniform(0, 1) being a random number. 3.2 Optimization Having computed the objective (5), we find the gradient via automatic differentiation, and opti- mize ✓, via Adam (Kingma and Ba, 2014)—a variant of stochastic gradient decent—with learn- ing rate 0.07, batchsize 5, sentence per epoch 400, and L2 regularization. (These hyperparam- eters, along with the regularization coefficients & and ⇠ from equation (5), were tuned on dev data (§4) for each language respectively.) We train 11We still construct the real matrix Mi(l) by ordinary ma- trix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme l to a certain substring of the surface slot xi. Similarly for Mk(r). l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 the punctuation model for 30 epochs. The initial NOISYCHANNEL parameters () are drawn from (0, 1), and the initial ATTACH parameters (✓) (0, 1) (with one minor excep- N are drawn from tion described in Appendix A). N 4 Intrinsic Evaluation of the Model Data. Throughout §§4–6, we will examine the punctuation model on a subset of the Univer- sal Dependencies (UD) version 1.4 (Nivre et al., 2016)—a collection of dependency treebanks across 47 languages with unified POS-tag and de- pendency label sets. Each treebank has designated training, development, and test portions. We ex- periment on Arabic, English, Chinese, Hindi, and Spanish (Table 2)—languages with diverse punc- tuation vocabularies and punctuation interaction rules, not to mention script directionality. For each treebank, we use the tokenization provided by UD, and take the punctuation tokens (which may be multi-character, such as ...) to be the tokens with the PUNCT tag. We replace each straight dou- ble quotation mark " with either “ or ” as appro- priate, and similarly for single quotation marks.12 We split each non-punctuation token that ends in . (such as etc.) into a shorter non-punctuation token (etc) followed by a special punctuation to- ken called the “abbreviation dot” (which is distinct from a period). We prepend a special punctuation mark ˆ to every sentence ¯x, which can serve to absorb an initial comma, for example.13 We then replace each token with the special symbol UNK if its type appeared fewer than 5 times in the training portion. This gives the surface sentences. V To estimate the vocabulary of underlying punctemes, we simply collect all surface token se- quences xi that appear at any slot in the training portion of the processed treebank. This is a gener- ous estimate. Similarly, we estimate Wd (§ 2.1) as all pairs (l, r) 2 that flank any d constituent. Recall that our model generates surface punctu- ation given an unpunctuated dependency tree. We train it on each of the 5 languages independently. We evaluate on conditional perplexity, which will be low if the trained model successfully assigns a high probability to the actual surface punctuation in a held-out corpus of the same language. 2V 12For en and en_esl, “ and ” are distinguished by language-specific part-of-speech tags. For the other 4 lan- guages, we identify two " dependents of the same head word, 363 Language Treebank #Token %Punct #Omit #Type 282K Arabic 123K Chinese 255K 97.7K 352K es_ancora 560K ar zh en en_esl hi 7.9 13.8 11.7 9.8 6.7 11.7 255 3 40 2 21 25 18 23 35 16 15 16 Hindi Spanish English Table 2: Statistics of our datasets. “Treebank” is the UD treebank identifier, “#Token” is the number of to- kens, “%Punct” is the percentage of punctuation to- kens, “#Omit” is the small number of sentences con- taining non-leaf punctuation tokens (see footnote 19), and “#Type” is the number of punctuation types after preprocessing. (Recall from §4 that preprocessing dis- tinguishes between left and right quotation mark types, and between abbreviation dot and period dot types.) Baselines. We compare our model against three baselines to show that its complexity is necessary. Our first baseline is an ablation study that does not use latent underlying punctuation, but generates the surface punctuation directly from the tree. (To implement this, we fix the parameters of the noisy channel so that the surface punctuation equals the underlying with probability 1.) If our full model performs significantly better, it will demonstrate the importance of a distinct underlying layer. Our other two baselines ignore the tree struc- ture, so if our full model performs significantly better, it will demonstrate that conditioning on ex- plicit syntactic structure is useful. These baselines are based on previously published approaches that reduce the problem to tagging: Xu et al. (2016) use a BiLSTM-CRF tagger with bigram topology; Tilk and Alumäe (2016) use a BiGRU tagger with attention. In both approaches, the model is trained to tag each slot i with the correct string xi 2V ⇤ (possibly ✏ or ˆ). These are discriminative proba- bilistic models (in contrast to our generative one). Each gives a probability distribution over the tag- gings (conditioned on the unpunctuated sentence), so we can evaluate their perplexity.14 Results. As shown in Table 3, our full model beats the baselines in perplexity in all 5 languages. Also, in 4 of 5 languages, allowing a trained NOISYCHANNEL (rather than the identity map) replacing the left one with “ and the right one with ”. 13For symmetry, we should also have added a final mark. 14These methods learn word embeddings that optimize log-likelihood on the punctuation restoration conditional training data. They might do better if these embeddings were shared with other tasks, as multi-task learning might lead them to discover syntactic categories of words. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Attn. CRF ATTACH +NC DIR 1.4676 1.3016 1.2230 1.1526 L Arabic Chinese 1.6850 1.4436 1.1921 1.1464 L English 1.5737 1.5247 1.5636 1.4276 R 1.1201 1.1032 1.0630 1.0598 L Hindi Spanish 1.4397 1.3198 1.2364 1.2103 R Table 3: Results of the conditional perplexity experi- ment (§4), reported as perplexity per punctuation slot, where an unpunctuated sentence of n words has n + 1 slots. Column “Attn.” is the BiGRU tagger with atten- tion, and “CRF” stands for the BiLSTM-CRF tagger. “ATTACH” is the ablated version of our model where surface punctuation is directly attached to the nodes. Our full model “+NC” adds NOISYCHANNEL to trans- duce the attached punctuation into surface punctuation. DIR is the learned direction (§ 2.2) of our full model’s noisy channel PFST: Left-to-right or Right-to-left. Our models are given oracle parse trees T . The best per- plexity is boldfaced, along with all results that are not significantly worse (paired permutation test, p < 0.05). significantly improves the perplexity. 5 Analysis of the Learned Grammar 5.1 Rules Learned from the Noisy Channel 7! 7! 7! 7! b, ab a, ab We study our learned probability distribution over noisy channel rules (ab ab, ba) for English. The probability distributions ab corresponding to six of Nunberg’s English rules are shown in Figure 3. By comparing the orange and blue bars, observe that the model trained on the en_cesl treebank learned different quotation rules from the one trained on the en treebank. This is because en_cesl follows British style, whereas en has American-style quote transposition.15 We now focus on the model learned from the en treebank. Nunberg’s rules are deterministic, and our noisy channel indeed learned low-entropy rules, in the sense that for an input ab with un- 25,16 at least one of the possi- derlying count ble outputs (a, b, ab or ba) always has probability > 0.75. The one exception is ”.
.” for which
0.5, Weil
the argmax output has probability
writers do not apply this quote transposition rule
konsequent. As shown by the blue bars in Fig-
ure 3, the high-probability transduction rules are

7!

15American style places commas and periods inside the
quotation marks, even if they are not logically in the quote.
British style (more sensibly) places unquoted periods and
commas in their logical place, sometimes outside the quo-
tation marks if they are not part of the quote.

16For rarer underlying pairs ab, the estimated distributions

sometimes have higher entropy due to undertraining.

Figur 3: Rewrite probabilities learned for English,
averaged over the last 4 epochs on en treebank (Blau
bars) or en_esl treebank (orange bars). The header
above each figure is the underlying punctuation string
(input to NOISYCHANNEL). The two counts in the fig-
ure headers are the number of occurrences of the under-
lying punctuation strings in the 1-best reconstruction of
underlying punctuation sequences (by Algorithm 1) Re-
spectively in the en and en_esl treebank. Each bar
represents one surface punctuation string (output of
NOISYCHANNEL), its height giving the probability.

consistent with Nunberg’s hand-crafted determin-
istic grammar in Table 1.

Our system has high precision when we look at
the confident rules. Of the 24 learned edits with
conditional probability > 0.75, Nunberg lists 20.
Our system also has good recall. Nunberg’s
hand-crafted schemata consider 16 punctuation
types and generate a total of 192 edit rules, In-
cluding the specimens in Table 1. Das ist, of the
162 = 256 possible underlying punctuation bi-
grams ab, 3
4 are supposed to undergo absorption
or transposition. Our method achieves fairly high
recall, in the sense that when Nunberg proposes
ab) usually ranks highly
ab
ab). 75
among all probabilities of the form p(0 |
of Nunberg’s rules got rank 1, 48 got rank 2, Und
the remaining 69 got rank > 2. The mean recipro-
cal rank was 0.621. Recall is quite high when we
restrict to those Nunberg rules ab
for which
our model is confident how to rewrite ab, im
sense that some p(0
(This tends
to eliminate rare ab: see footnote 5.) Of these 55
Nunberg rules, 38 rules got rank 1, 15 got rank 2,
und nur 2 got rank worse than 2. The mean recip-
rocal rank was 0.836.

, our learned p(

ab) > 0.5.

7!

7!

|

|

¿What about Spanish? Spanish uses inverted
question marks ¿ and exclamation marks ¡, welche
form symmetric pairs with the regular question
marks and exclamation marks.
If we try to ex-
trapolate to Spanish from Nunberg’s English for-

364

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

malization, the English mark most analogous to ¿
Ist (. Our learned noisy channel for Spanish (nicht
graphed here) includes the high-probability rules
,¿
¿ which match
:¿ and ¿,
7!
Nunberg’s treatment of ( in English.

,¿ and :¿

7!

7!

parataxis appos
2.29

2.38
, , 26.8
✏ ✏
20.1
( ) 13.0

9.7
:
8.1

advcl
0.77

ccomp
0.53

list
1.33
, , 18.8 ✏ ✏ 60.0 ✏ ✏ 73.8
90.8
: 18.1 , , 22.3 , , 21.2 “ ” 2.4
15.9 , 5.3 , 3.1
, , 2.4
✏ ✏ 14.4 < > 3.0 ( ) 0.74 :“ ” 0.9
( ) 13.1 ( ) 3.0 – 0.21 „ ,” 0.8

✏ ✏

5.2 Attachment Model

What does our model learn about how dependency
relations are marked by underlying punctuation?

,advmod,

root.

,“ccomp”

,nmod,

Tisch 4: The top 5 relations that are most likely to
generate symmetric punctemes, the entropy of their
puncteme pair (Reihe 2), and their top 5 puncteme pairs
(rows 3–7) with their probabilities shown as percent-
Alter. The symmetric punctemes are in boldface.

ˆ,Earlier, Kerry said ,„ … ,in fact, answer the question”.
ˆ Earlier, Kerry said ,„ … ,in fact, answer the question.”

6.1 Punctuation Restoration

The above example17 illustrates the use of specific
puncteme pairs to set off the advmod, ccomp,
and nmod relations. Notice that said takes
a complement (ccomp) Das
is symmetrically
quoted but also left delimited by a comma, welche
is indeed how direct speech is punctuated in
English. This example also illustrates quotation
transposition. The top five relations that are most
likely to generate symmetric punctemes and their
top (l, R) pairs are shown in Table 4.

conj

,conj,

,conj,

cc

Abschnitt 1
Abschnitt 1

,2 , ,… 7 , Und 8 …
,… 7 , Und 8 …
,2

The above example18 shows how our model han-
dles commas in conjunctions of 2 or more phrases.
UD format dictates that each conjunct after the
first is attached by the conj relation. As shown
über, each such conjunct is surrounded by under-
lying commas (via the N.,.,.conj feature from
Appendix A), except for the one that bears the
conjunction and (via an even stronger weight on
the C.✏.✏.!conj.cc feature). Our learned feature
weights indeed yield p(` = ✏, r = ✏) > 0.5 für die
final conjunct in this example. Some writers omit
the “Oxford comma” before the conjunction: Das
style can be achieved simply by changing “sur-
rounded” to “preceded” (das ist, changing the N
feature to N.,.✏.conj).

6 Performance on Extrinsic Tasks

In this task, we are given a depunctuated sentence
¯d19 and must restore its (surface) punctuation. Unser
model supposes that the observed punctuated sen-
tence ¯x would have arisen via the generative pro-
Prozess (1). Daher, we try to find T , T 0, and ¯x that are
consistent with ¯d (a partial observation of ¯x).

The first step is to reconstruct T from ¯d. Das
initial parsing step is intended to choose the T that
¯d).20 This step depends only
maximizes psyn(T
on psyn and not on our punctuation model (p✓, P).
In der Praxis, we choose T via a dependency parser
that has been trained on an unpunctuated treebank
with examples of the form (¯d, T ).21

|

Gleichung (2) now defines a distribution over
(T 0, X) given this T . To obtain a single prediction
for x, we adopt the minimum Bayes risk (MBR)
approach of choosing surface punctuation ˆx that
minimizes the expected loss with respect to the
unknown truth x⇤. Our loss function is the total
edit distance over all slots (where edits operate on
punctuation tokens). Finding ˆx exactly would be
intractable, so we use a sampling-based approx-
imation and draw m = 1000 samples from the
posterior distribution over (T 0, X). We then define

ˆx = argmin
S(T )
X

2

S(T )

Xx⇤2

ˆp(x⇤

T )

|

·

loss(X, x⇤) (10)

where S(T ) is the set of unique x values in the
sample and ˆp is the empirical distribution given by
the sample. This can be evaluated in O(m2) Zeit.

We evaluate the trained punctuation model by us-
ing it in the following three tasks.

19 To depunctuate a treebank sentence, we remove all to-
kens with POS-tag PUNCT or dependency relation punct.
These are almost always leaves; else we omit the sentence.

17[In] Earlier, Kerry said, “Just because you
get an honorable discharge does not, in fact,
answer that question.”

18[In]

Abschnitte 1, 2, 5, 6, 7, Und 8 Wille

survive any termination of this License.

365

20Im Idealfall, rather than maximize, one would integrate over
possible trees T , in practice by sampling many values Tk
¯u) and replacing S(T ) In (10) mit
k S(Tk).
from psyn(
the Yara parser (Rasooli and Tetreault,
2015), a fast non-probabilistic transition-based parser that
uses rich non-local features (Zhang and Nivre, 2011).

21Konkret,

· |

S

l

D
Ö
w
N
Ö
A
D
e
D

F
R
Ö
M
H

T
T

P

:
/
/

D
ich
R
e
C
T
.

M

ich
T
.

e
D
u

/
T

A
C
l
/

l

A
R
T
ich
C
e

P
D

F
/

D
Ö

ich
/

.

1
0
1
1
6
2

/
T

l

A
C
_
A
_
0
0
2
7
3
1
9
2
3
4
6
6

/

/
T

l

A
C
_
A
_
0
0
2
7
3
P
D

.

F

B
j
G
u
e
S
T

T

Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

We evaluate on Arabic, English, Chinese,
Hindi, and Spanish. For each language, we train
both the parser and the punctuation model on
the training split of that UD treebank (§4), Und
evaluate on held-out data. We compare to the
BiLSTM-CRF baseline in §4 (Xu et al., 2016).22
We also compare to a “trivial” deterministic base-
Linie, which merely places a period at the end of the
Satz (or a “|” in the case of Hindi) and adds no
other punctuation. Because most slots do not in
fact have punctuation, the trivial baseline already
does very well; to improve on it, we must fix its
errors without introducing new ones.

Our final comparison on test data is shown in
the table in Figure 4. On all 5 languages, unser
method beats (usually significantly) its 3 com-
petitors:
Die
BiLSTM-CRF, and the ablated version of our
Modell (ATTACH) that omits the noisy channel.

the trivial deterministic baseline,

Natürlich, the success of our method depends
on the quality of the parse trees T (which is par-
ticularly low for Chinese and Arabic). The graph
in Abbildung 4 explores this relationship, by evaluat-
ing (on dev data) with noisier trees obtained from
parsers that were variously trained on only the first
10%, 20%, . . . of the training data. On all 5 lan-
guages, provided that the trees are at least 75%
correct, our punctuation model beats both the triv-
ial baseline and the BiLSTM-CRF (which do not
use trees). It also beats the ATTACH ablation base-
line at all levels of tree accuracy (these curves are
omitted from the graph to avoid clutter). In all lan-
guages, better parses give better performance, Und
gold trees yield the best results.

6.2 Punctuation Correction

Our next goal is to correct punctuation errors in
a learner corpus. Each sentence is drawn from
the Cambridge Learner Corpus treebanks, welche
provide original (en_esl) and corrected (en_cesl)
Sätze. All kinds of errors are corrected, solch

22We copied their architecture exactly but re-tuned the hy-
perparameters on our data. We also tried tripling the amount
of training data by adding unannotated sentences (provided
along with the original annotated sentences by Ginter et al.
(2017)), taking advantage of the fact that the BiLSTM-CRF
does not require its training sentences to be annotated with
trees. Jedoch, this actually hurt performance slightly, pro-
haps because the additional sentences were out-of-domain.
We also tried the BiGRU-with-attention architecture of Tilk
and Alumäe (2016), but it was also weaker than the BiLSTM-
CRF (just as in Table 3). We omit all these results from Fig-
ure 4 to reduce clutter.

P

8 ATTACH a

–A

Arabic 0.064 0.064 0.063
Chinese 0.110 0.109 0.104
English 0.100 0.108 0.092
0.025 0.023 0.019
Hindi
Spanish 0.093 0.092 0.085

0.059 0.053
0.102 0.048
0.090 0.079
0.018 0.013
0.078 0.068

Figur 4: Edit distance per slot (which we call average
edit distance, or AED) for each of the 5 corpora. Lower
is better. The table gives the final AED on the test data.
Its first 3 columns show the baseline methods just as in
Tisch 3: the trivial deterministic method, the BiLSTM-
CRF, and the ATTACH ablation baseline that attaches
the surface punctuation directly to the tree. Column 4
is our method that incorporates a noisy channel, Und
column 5 (in gray) is our method using oracle (gold)
trees. We boldface the best non-oracle result as well as
all that are not significantly worse (paired permutation
test, P < 0.05). The curves show how our method’s AED (on dev data) varies with the labeled attachment score (LAS) of the trees, where --a at x = 100 uses the oracle (gold) trees, a-- at x < 100 uses trees from our parser trained on 100% of the training data, and the 100 use increasingly worse parsers. -- points at x The p and 8 at the right of the graph show the AED of # the trivial deterministic baseline and the BiLSTM-CRF baseline, which do not use trees. ⌧ as syntax errors, but we use only the 30% of sen- tences whose depunctuated trees T are isomorphic between en_esl and en_cesl. These en_cesl trees may correct word and/or punctuation errors in en_esl, as we wish to do automatically. We assume that an English learner can make mistakes in both the attachment and the noisy channel steps. A common attachment mistake is the failure to surround a non-restrictive relative In the noisy channel step, clause with commas. mistakes in quote transposition are common. Correction model. Based on the assumption about the two error sources, we develop a dis- Let ¯xe de- criminative model for this task. note the full input sentence, and let xe and xc denote the input (possibly errorful) and output (corrected) punctuation sequences. We model p✓(T 0c | p(xc | T | T, xe) T 0c). Here T is the depunctu- ated parse tree, T 0c is the corrected underlying tree, T 0e is the error underlying tree, and we assume T 0e). p(T 0e | p✓(T 0c | ¯xe) = p(xc | P p✓(T 0c | T, xe) = psyn(T T, xe) ¯xe) P T 0e T 0c · · · 366 P l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 In practice we use a 1-best pipeline rather than summing. Our first step is to reconstruct T from the error sentence ¯xe. We choose T that max- ¯xe) from a dependency parser imizes psyn(T trained on en_esl treebank examples (¯xe, T ). The second step is to reconstruct T 0e based on our punc- tuation model trained on en_esl. We choose T 0e that maximizes p(T 0e | T, xe). We then reconstruct T 0c by | p(T 0c | T 0e) = we T 0e 2 p(l, r we) | (11) where we is the node in T 0e, and p(l, r we) is a Q similar log-linear model to equation (4) with addi- tional features (Appendix C4) which look at we. | Finally, we reconstruct xc based on the noisy T 0c) in § 2.2. During training, is channel p(xc | regularized to be close to the noisy channel param- eters in the punctuation model trained on en_cesl. We use the same MBR decoder as in § 6.1 to choose the best action. We evaluate using AED as in § 6.1. As a second metric, we use the script from the CoNLL 2014 Shared Task on Grammati- cal Error Correction (Ng et al., 2014): it computes the F0.5-measure of the set of edits found by the system, relative to the true set of edits. As shown in Table 5, our method achieves bet- ter performance than the punctuation restoration baselines (which ignore input punctuation). On the other hand, it is soundly beaten by a new BiLSTM-CRF that we trained specifically for the task of punctuation correction. This is the same as the BiLSTM-CRF in the previous section, ex- cept that the BiLSTM now reads a punctuated input sentence (with possibly erroneous punctua- n, the BiL- tion). To be precise, at step 0 STM reads a concatenation of the embedding of word i (or BOS if i = 0) with an embedding of the punctuation token sequence xi. The BiLSTM- CRF wins because it is a discriminative model tai- lored for this task: the BiLSTM can extract arbi- trary contextual features of slot i that are corre- lated with whether xi is correct in context.   i 6.3 Sentential Rephrasing We suspect that syntactic transformations on a sentence should often preserve the underlying punctuation attached to its tree. The surface punc- tuation can then be regenerated from the trans- formed tree. Such transformations include ed- its that are suggested by a writing assistance tool (Heidorn, 2000), or subtree deletions in compres- sive summarization (Knight and Marcu, 2002). p 8 parsed gold 8-corr a-- AED 0.052 0.051 0.047 0.034 0.033 0.005 F0.5 0.779 0.787 0.827 0.876 0.881 0.984 Table 5: AED and F0.5 results on the test split of English-ESL data. Lower AED is better; higher F0.5 is better. The first three columns (markers corre- spond to Figure 4) are the punctuation restoration base- lines, which ignore the input punctuation. The fourth and fifth columns are our correction models, which use parsed and gold trees. The final column is the BiLSTM-CRF model tailored for the punctuation cor- rection task. For our experiment, we evaluate an interesting case of syntactic transformation. Wang and Eis- ner (2016) consider a systematic rephrasing pro- cedure by rearranging the order of dependent sub- trees within a UD treebank, in order to synthesize new languages with different word order that can then be used to help train multi-lingual systems (i.e., data augmentation with synthetic data). As Wang and Eisner acknowledge (2016, foot- note 9), their permutations treat surface punctua- tion tokens like ordinary words, which can result in synthetic sentences whose punctuation is quite unlike that of real languages. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 In our experiment, we use Wang and Eisner’s (2016) “self-permutation” setting, where the de- pendents of each noun and verb are stochastically reordered, but according to a dependent ordering model that has been trained on the same language. rephrasing a English sentence For example, advcl punct root mark det SCONJ If true , the caper failed . nsubj PUNCT NOUN VERB DET ADJ PUNCT punct under an English ordering model may yield det nsubj DET the caper failed . punct VERB NOUN root punct advcl PUNCT SCONJ mark ADJ If true , PUNCT which is still grammatical except that , and . are wrongly swapped (after all, they have the same POS tag and relation type). Worse, permutation may yield bizarre punctuation such as , , at the start of a sentence. Our punctuation model gives a straightforward remedy—instead of permuting the tree directly, we first discover its most likely underlying tree ,advcl, root. mark det ˆ, If true, the caper failed . nsubj by the maximizing variant of Algorithm 1 (§ 3.1). Then, we permute the underlying tree and sample the surface punctuation from the distribution modeled by the trained PFST, yielding 367 Punctuation All Base Half Full Base Half Full Arabic 156.0 231.3 186.1 540.8 590.3 553.4 Chinese 165.2 110.0 61.4 205.0 174.4 78.7 English 98.4 74.5 51.0 140.9 131.4 75.4 Hindi 118.4 118.8 91.8 Spanish 266.2 259.2 194.5 346.3 343.4 239.3 10.8 11.0 9.7 “All” evaluates on all Table 6: Perplexity (evaluated on the train split to avoid evaluating generalization) of a trigram language model trained (with add-0.001 smoothing) on differ- ent versions of rephrased training sentences. “Punc- tuation” only evaluates perplexity on the trigrams that the tri- have punctuation. grams. “Base” permutes all surface dependents includ- ing punctuation (Wang and Eisner, 2016). “Full” is our full approach: recover underlying punctuation, per- mute remaining dependents, regenerate surface punc- tuation. “Half” is like “Full” but it permutes the non- punctuation tokens identically to “Base.” The permu- tation model is trained on surface trees or recovered underlying trees T 0, respectively. In each 3-way com- parison, we boldface the best result (always significant under a paired permutation test over per-sentence log- probabilities, p < 0.05). det nsubj root. ,advcl, mark ˆthe caper failed ,If true ,. ˆthe caper failed ,If true . We leave the handling of capitalization to future work. We test the naturalness of the permuted sen- tences by asking how well a word trigram lan- guage model trained on them could predict the original sentences.23 As shown in Table 6, our per- mutation approach reduces the perplexity over the baseline on 4 of the 5 languages, often dramati- cally. 7 Related Work Punctuation can aid syntactic analysis, since it signals phrase boundaries and sentence structure. Briscoe (1994) and White and Rajkumar (2008) parse punctuated sentences using hand-crafted constraint-based grammars that implement Nun- berg’s approach in a declarative way. These gram- mars treat surface punctuation symbols as ordi- nary words, but annotate the nonterminal cate- gories so as to effectively keep track of the under- lying punctuation. This is tantamount to crafting a grammar for underlyingly punctuated sentences and composing it with a finite-state noisy channel. 23So the two approaches to permutation yield different training data, but are compared fairly on the same test data. 368 The parser of Ma et al. (2014) takes a differ- ent approach and treats punctuation marks as fea- tures of their neighboring words. Zhang et al. (2013) use a generative model for punctuated sen- tences, leting them restore punctuation marks dur- ing transition-based parsing of unpunctuated sen- tences. Li et al. (2005) use punctuation marks to segment a sentence: this "divide and rule" strat- egy reduces ambiguity in parsing of long Chinese sentences. Punctuation can similarly be used to constrain syntactic structure during grammar in- duction (Spitkovsky et al., 2011). Punctuation restoration (§ 6.1) is useful for tran- scribing text from unpunctuated speech. The task is usually treated by tagging each slot with zero or more punctuation tokens, using a traditional sequence labeling method: conditional random fields (Lui and Wang, 2013; Lu and Ng, 2010), re- current neural networks (Tilk and Alumäe, 2016), or transition-based systems (Ballesteros and Wan- ner, 2016). 8 Conclusion and Future Work We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent under- lying left and right punctemes. Surface punctu- ation marks are not directly attached to the syn- tax tree, but are generated from sequences of adja- cent punctemes by a (stochastic) finite-state string rewriting process . Our model is inspired by Nun- berg’s (1990) formal grammar for English punctu- ation, but is probabilistic and trainable. We give exact algorithms for training and inference. We trained Nunberg-like models for 5 lan- guages and L2 English. We compared the English model to Nunberg’s, and showed how the trained models can be used across languages for punctua- tion restoration, correction, and adjustment. In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Cu- lotta and Sorensen, 2004), and coreference reso- lution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underly- ing punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 scheme in equation (1). For example, the psyn factor in equation (1) might be a standard re- current neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of T is completed by the REDUCE operation of psyn, the punctuation- augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with p✓ and transduce the subtree-internal slots with p. In the future, we are also interested in enriching the T 0 representation and making it more differ- ent from T , to underlyingly account for other phe- nomena in T such as capitalization, spacing, mor- phology, and non-projectivity (via reordering). Acknowledgments This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including a REU supple- ment to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab mem- bers for further discussion, and the three reviewers for quality comments. References Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sen- timent analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38. Miguel Ballesteros and Leo Wanner. 2016. A neu- ral network architecture for multilingual punc- tuation generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1048–1053. On formal properties of Yehoshua Bar-Hillel, M. Perles, and E. Shamir. simple 1961. für phrase structure grammars. Phonetik, Sprachwissenschaft und Kommunika- tionsforschung, 14:143–172. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Applica- tion, Addison-Wesley 1964, pages 116–150. Zeitschrift Jean Berstel and Christophe Reutenauer. 1988. Rational Series and their Languages. Springer- Verlag. 369 Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schas- berger. 1995. Bracketing guidelines for Tree- bank II style: Penn Treebank project. Technical Report MS-CIS-95-06, University of Pennsyl- vania. Ted Briscoe. 1994. Parsing (with) punctuation, etc. Technical report, Xerox European Re- search Laboratory. Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neu- ral networks. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 740–750. Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper and Row, New York. Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 2: Short Papers), pages 625–630. Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent under- lying morphs and phonology. Transactions of the Association for Computational Linguistics (TACL), 3:433–447. Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016. A joint model of orthography and mor- In Proceedings of phological segmentation. the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 664–669. Brooke Cowan, Ivona Kuˇcerová, and Michael Collins. 2006. A discriminative model for tree- to-tree translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 232– 241. Aron Culotta and Jeffrey Sorensen. 2004. De- pendency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL). l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Timothy Dozat and Christopher Manning. 2017. In Efficient third-order dependency parsers. Proceedings of the 5th International Confer- ence on Learning Representations (ICLR). Chris Dyer, Adhiguna Kuncoro, Miguel Balles- teros, and Noah A. Smith. 2016. Recurrent In Proceedings of neural network grammars. the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 199–209. Jason Eisner. 1996. Three new probabilistic mod- els for dependency parsing: An exploration. In Proceedings of the 16th International Confer- ence on Computational Linguistics (COLING), pages 340–345. Jason Eisner. 2016. Inside-outside and forward- backward algorithms are just backprop. In Pro- ceedings of the EMNLP Workshop on Struc- tured Prediction for NLP. A. Caracciolo di Forino. 1968. String process- ing languages and generalized Markov algo- rithms. In D. G. Bobrow, editor, Symbol Manip- ulation Languages and Techniques, pages 191– 206. North-Holland Publishing Company, Am- sterdam. Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for struc- tured latent variable models. Journal of Ma- chine Learning Research, 11:2001–2049. Filip Ginter, Jan Hajiˇc, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIN dig- ital library at the Institute of Formal and Ap- plied Linguistics (ÚFAL), Faculty of Mathe- matics and Physics, Charles University. Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional In Human Language dependency parsing. Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 742–750. Joshua Goodman. 1999. Semiring parsing. Com- putational Linguistics, 25(4):573–605. George Heidorn. 2000. Intelligent writing assis- In Robert Dale, Herman Moisl, and tance. Harold Somers, editors, Handbook of Natural Language Processing, pages 181–207. Marcel Dekker, New York. C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Mouton. Bernard E. M. Jones. 1994. Exploring the role of punctuation in parsing natural text. In COLING 1994 Volume 1: The 15th International Confer- ence on Computational Linguistics. Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceed- ings of the International Conference on Learn- ing Representations (ICLR). Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing us- ing bidirectional LSTM feature representations. Transactions of the Association for Computa- tional Linguistics (TACL), 4:313–327. Kevin Knight and Daniel Marcu. 2002. Summa- rization beyond sentence extraction: A proba- bilistic approach to sentence compression. Ar- tificial Intelligence, 139(1):91–107. Fang Kong, Guodong Zhou, Longhua Qian, and Dependency-driven Qiaoming Zhu. 2010. anaphoricity determination for coreference res- In Proceedings of the 23rd Interna- olution. tional Conference on Computational Linguis- tics (COLING), pages 599–607. Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1– 11. Albert E. Krahn. 2014. A New Paradigm for Punctuation. Ph.D. thesis, The University of Wisconsin-Milwaukee. Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceed- ings of the 52nd Annual Meeting of the As- sociation for Computational Linguistics (ACL), pages 1381–1391. 370 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Roger Levy. 2008. A noisy-channel model of human sentence comprehension under uncer- In Proceedings of the 2008 Con- tain input. ference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 234–243. Xing Li, Chengqing Zong, and Rile Hu. 2005. A hierarchical parsing approach with punctuation processing for long Chinese sentences. In Pro- ceedings of the International Joint Conference on Natural Language Processing (IJCNLP). Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with appli- cations to minimum-risk training on translation In Proceedings of the Conference on forests. Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 40–51. Wei Lu and Hwee Tou Ng. 2010. Better punctu- ation prediction with dynamic conditional ran- In Proceedings of the 2010 Con- dom fields. ference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 177–186. Marco Lui and Li Wang. 2013. Recovering casing and punctuation using conditional ran- dom fields. In Proceedings of the Australasian Language Technology Association Workshop (ALTA), pages 137–141. Ji Ma, Yue Zhang, and Jingbo Zhu. 2014. Punc- tuation processing for projective dependency In Proceedings of the 52nd Annual parsing. Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 791–796. Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi- supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984. Andrey Andreevich Markov. 1960. The theory of algorithms. American Mathematical Society Translations, series 2(15):1–14. Ilia Markov, Vivi Nastase, and Carlo Strappar- ava. 2018. Punctuation as native language in- In Proceedings of the 27th Inter- terference. national Conference on Computational Linguis- tics (COLING), pages 3456–3466. Tomas Mikolov, Kai Chen, Greg Corrado, and Jef- frey Dean. 2013. Efficient estimation of word representations in vector space. Computing Re- search Repository (CoRR), arXiv:1301.3781. Mark-Jan Nederhof and Giorgio Satta. 2003. Probabilistic parsing as intersection. In 8th In- ternational Workshop on Parsing Technologies (IWPT), pages 137–148. Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Chris- tian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14. Joakim Nivre, Željko Agi´c, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, John Aitziber Atutxa, Miguel Ballesteros, Bauer, Kepa Bengoetxea, Yevgeni Berzak, Riyaz Ahmad Bhat, Eckhard Bick, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Gül¸sen Cebiro˘glu Eryi˘git, Giuseppe G. A. Celano, Fabricio Chalub, Ça˘grı Çöl- tekin, Miriam Connor, Elizabeth Davidson, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Claudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gär- denfors, Sebastian Garza, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökır- mak, Yoav Goldberg, Xavier Gómez Guino- vart, Berta Gonzáles Saavedra, Matias Gri- oni, Normunds Gr¯uz¯ıtis, Bruno Guillaume, Jan Hajiˇc, Linh Hà M˜y, Dag Haug, Barbora Hladká, Radu Ion, Elena Irimia, Anders Jo- hannsen, Fredrik Jørgensen, Hüner Ka¸sıkara, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Jessica Kenney, Natalia Kotsyba, Si- mon Krek, Veronika Laippala, Lucia Lam, Phuong Lê H`ông, Alessandro Lenci, Nikola Ljubeši´c, Olga Lyashevskaya, Teresa Lynn, Aibek Makazhanov, Christopher Manning, C˘at˘alina M˘ar˘anduc, David Mareˇcek, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Mis- silä, Verginica Mititelu, Yusuke Miyao, Simon- etta Montemagni, Keiko Sophie Mori, Shun- suke Mori, Bohdan Moskalevskyi, Kadri Muis- 371 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 chnek, Nina Mustafina, Kaili Müürisep, Luong Nguy˜ên Thi., Huy`ên Nguy˜ên Thi. Minh, Vitaly Nikolaev, Hanna Nurmi, Petya Osenova, Robert Östling, Lilja Øvrelid, Valeria Paiva, Elena Pas- cual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Lauma Pretkalnin, a, Prokopis Prokopidis, Tiina Puolakainen, Sampo Pyysalo, Alexandre Rademaker, Loganathan Ramasamy, Livy Real, Laura Rituma, Rudolf Rosa, Shadi Saleh, Baiba Saul¯ıte, Sebastian Schuster, Wolf- gang Seeker, Mojgan Seraji, Lena Shakurova, Mo Shen, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Carolyn Spadine, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Francis Ty- ers, Sumire Uematsu, Larraitz Uria, Gertjan van Noord, Viktor Varga, Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington, Mats Wirén, Zdenˇek Žabokrt- ský, Amir Zeldes, Daniel Zeman, and Hanzhi Zhu. 2016. Universal Dependencies 1.4. the In- LINDAT/CLARIN digital stitute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. Data available at http: //universaldependencies.org. library at Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007a. The CoNLL 2007 shared In Proceedings task on dependency parsing. of the CoNLL Shared Task Session of EMNLP- CoNLL 2007, pages 915–932. Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gül¸sen Eryigit, Sandra Kübler, Sve- toslav Marinov, and Erwin Marsi. 2007b. Malt- parser: A language-independent system for data-driven dependency parsing. Natural Lan- guage Engineering, 13(2):95–135. Joakim Nivre et al. 2018. Universal depen- dencies annotation guidelines. Available at universaldependencies.org. Geoffrey Nunberg. 1990. The Linguistics of Punc- tuation. Number 18 in CSLI Lecture Notes. Center for the Study of Language and Informa- tion. Bo Pang, Lillian Lee, Vaithyanathan. 2002. and Shivakumar Senti- Thumbs up? ment classification using machine learning the 2002 techniques. Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). In Proceedings of Jeffrey Pennington, Richard Socher, and Christo- pher Manning. 2014. GloVe: Global vectors In Proceedings of for word representation. the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Fernando C. N. Pereira and Michael D. Riley. 1996. Speech recognition by composition of weighted finite automata. Computing Research Repository (CoRR), arXiv:cmp-lg/9603001. Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate depen- dency parser. Computing Research Repository, arXiv:1503.06733 (version 2). Radim ˇReh˚uˇrek and Petr Sojka. 2010. Software framework for topic modelling with large cor- pora. In Proceedings of the LREC 2010 Work- shop on New Challenges for NLP Frameworks, pages 45–50. Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011. Punctuation: Making a point in unsupervised dependency parsing. In Proceed- ings of the Fifteenth Conference on Computa- tional Natural Language Learning, CoNLL ’11, pages 19–28. Kai Sheng Tai, Richard Socher, and Christo- pher D. Manning. 2015. Improved semantic representations from tree-structured long short- term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Interna- tional Joint Conference on Natural Language Processing (ACL-COLING), pages 1556–1566. Ottokar Tilk and Tanel Alumäe. 2016. Bidirec- tional recurrent neural network with attention mechanism for punctuation restoration. In In- terspeech, pages 3047–3051. Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsu- pervised neural hidden Markov models. In Pro- ceedings of the Workshop on Structured Predic- tion for NLP, pages 63–71. 372 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 University of Chicago. 2010. The Chicago Man- ual of Style. University of Chicago Press. Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transac- tions of the Association for Computational Lin- guistics (TACL), 4:491–505. Michael White and Rajakrishnan Rajkumar. 2008. A more precise analysis of punctuation for broad-coverage surface realization with CCG. In Proceedings of the COLING 2008 Workshop on Grammar Engineering Across Frameworks, pages 17–24. K. Xu, L. Xie, and K. Yao. 2016. Investigat- ing LSTM for punctuation prediction. In 2016 10th International Symposium on Chinese Spo- ken Language Processing (ISCSLP), pages 1–5. Richard Zens, Franz Josef Och, and Hermann Ney. 2002. Phrase-based statistical machine transla- tion. In Annual Conference on Artificial Intelli- gence, pages 18–32. Dongdong Zhang, Shuangzhi Wu, Nan Yang, and Punctuation prediction with Mu Li. 2013. In Proceedings of transition-based parsing. the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 752– 760. Yue Zhang and Joakim Nivre. 2011. Transition- based dependency parsing with rich non-local In Proceedings of the 49th Annual features. Meeting of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 188–193. 373 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 7 3 1 9 2 3 4 6 6 / / t l a c _ a _ 0 0 2 7 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3A Generative Model for Punctuation in Dependency Trees image

PDF Herunterladen