Unsupervised Discourse Constituency Parsing Using Viterbi EM
Noriki Nishida and Hideki Nakayama
Graduate School of Information Science and Technology
The University of Tokyo
{nishida,nakayama}@nlab.ci.i.u-tokyo.ac.jp
Abstract
In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.
1 Introduction
Natural language text is generally coherent (Halliday and Hasan, 1976) and can be analyzed as discourse structures, which formally describe how text is coherently organized. In discourse structure, linguistic units (e.g., clauses, sentences, or larger textual spans) are connected together semantically and pragmatically, and no unit is independent or isolated. Discourse parsing aims to uncover discourse structures automatically for given text and has been proven to be useful in various NLP applications, such as document summarization (Marcu, 2000; Louis et al., 2010; Yoshida et al., 2014), sentiment analysis (Polanyi and Van den Berg, 2011; Bhatia et al., 2015), and automated essay scoring (Miltsakaki and Kukich, 2004).
Despite the promising progress achieved in recent decades (Carlson et al., 2001; Hernault et al., 2010; Ji and Eisenstein, 2014; Feng and Hirst, 2014; Li et al., 2014; Joty et al., 2015; Morey et al., 2017), discourse parsing still remains a significant challenge. The difficulty is due in part to the shortage and low reliability of hand-annotated discourse structures. To develop a better-generalized parser, existing algorithms require larger amounts of training data. However, manually annotating discourse structures is expensive, time-consuming, and sometimes highly ambiguous (Marcu et al., 1999).
One possible solution to these problems is grammar induction (or unsupervised syntactic parsing) algorithms for discourse parsing. However, existing studies on unsupervised parsing mainly focus on sentence structures, such as phrase structures (Lari and Young, 1990; Klein and Manning, 2002; Golland et al., 2012; Jin et al., 2018) or dependency structures (Klein and Manning, 2004; Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Jiang et al., 2016), though text-level structural regularities can also exist beyond the scope of a single sentence. For example, in order to convey information to readers as intended, a writer should arrange utterances in a coherent order.
We tackle these problems by introducing unsupervised discourse parsing, which induces discourse structures for given text without relying on human-annotated discourse structures. Based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), which is one of the most widely accepted theories of discourse structure, we assume that coherent text can be represented as tree structures, such as the one in Figure 1. The leaf nodes correspond to non-overlapping clause-level text spans called elementary discourse units (EDUs). Consecutive text spans are combined to each other recursively in a bottom-up manner to form larger text spans (represented by internal nodes) up to a global document span. These text spans are called discourse constituents. The internal nodes are labeled with both nuclearity statuses (e.g., Nucleus-Satellite or NS) and rhetorical
Transactions of the Association for Computational Linguistics, Vol. 8, pp. 215–230, 2020. https://doi.org/10.1162/tacl_a_00312
Action Editor: Yuji Matsumoto. Submission batch: 07/2019; Revision batch: 12/2019; Published 4/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An example of the RST-based discourse constituent structure we assume in this paper. Leaf nodes x_i correspond to non-overlapping clause-level text segments, and internal nodes consist of three complementary elements: discourse constituents x_{i:j}, discourse nuclearities (e.g., NS), and discourse relations (e.g., ELABORATION).
relations (e.g., ELABORATION, CONTRAST) that hold between connected text spans.
In this paper, we especially focus on unsupervised induction of an unlabeled discourse constituent structure (i.e., a set of unlabeled discourse constituent spans) given a sequence of EDUs, which corresponds to the first tree-building step in conventional RST parsing. Such constituent structures provide hierarchical information of the input text, which is useful in downstream tasks (Louis et al., 2010). For example, a constituent structure [X [Y Z]] indicates that text span Y is preferentially combined with Z (rather than X) to form a constituent span, and then the text span [Y Z] is connected with X. In other words, this structure implies that [X Y] is a distituent span and requires Z to become a constituent span. Our challenge is to find such discourse-level constituentness from EDU sequences.
The core hypothesis of this paper is that discourse tree structures and syntactic tree structures share the same (or similar) constituent properties at a metalevel, and thus, learning algorithms developed for grammar induction are transferable to unsupervised discourse constituency parsing with proper modifications. In fact, RST structures can be formulated in a similar way as phrase structures in the Penn Treebank, though there are a few differences: The leaf nodes are not words but EDUs (e.g., clauses), and the internal nodes do not contain phrase labels but hold nuclearity statuses and rhetorical relations.
The expectation-maximization (EM) algorithm (Klein and Manning, 2004) has been the dominant unsupervised learning algorithm for grammar induction. Based on our hypothesis and this fact, we develop a span-based discourse parser (in an unsupervised manner) by using Viterbi EM (or ‘‘hard’’ EM) (Neal and Hinton, 1998; Spitkovsky et al., 2010; DeNero and Klein, 2008; Choi and Cardie, 2007; Goldwater and Johnson, 2005) with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018).1 Unlike the classic EM algorithm using inside–outside re-estimation (Baker, 1979), Viterbi EM allows us to avoid explicitly counting discourse constituent patterns, which are generally too sparse to estimate reliable scores of text spans.
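As a concrete illustration of this training regime, the sketch below alternates hard E-steps (decode the current best trees) with simple margin-style M-steps (reward spans of the current target trees, penalize spans of the predicted trees under a linear scorer). This is our own toy rendering, not the authors' implementation: `decode`, `feats`, and the perceptron-style update are hypothetical stand-ins for the neural parser and margin-based objective described in Section 3.

```python
def viterbi_em(docs, decode, feats, init_trees, iters=5, lr=0.1):
    """Hard (Viterbi) EM sketch. `decode(doc, score_fn)` returns the
    best tree (a list of spans) under a span scorer; `feats(doc, span)`
    returns feature names for a span; `init_trees` are initial target
    trees sampled from prior knowledge of document structure."""
    w = {}  # weights of a linear span scorer

    def s(doc, span):
        return sum(w.get(f, 0.0) for f in feats(doc, span))

    targets = init_trees
    for _ in range(iters):
        # M-step: move the scorer toward the current Viterbi targets.
        for doc, target in zip(docs, targets):
            pred = decode(doc, lambda sp: s(doc, sp))
            for sp in target:
                for f in feats(doc, sp):
                    w[f] = w.get(f, 0.0) + lr
            for sp in pred:
                for f in feats(doc, sp):
                    w[f] = w.get(f, 0.0) - lr
        # E-step: re-decode the single best tree per document.
        targets = [decode(doc, lambda sp: s(doc, sp)) for doc in docs]
    return w, targets
```

Because only the best-scoring tree is used as the training target, no expected counts over all trees are required, which is the property that makes hard EM attractive at document length.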
The other technical contribution is to present
effective initialization methods for Viterbi training
of discourse constituents. We introduce initial-tree
sampling methods based on our prior knowledge
of document structures. We show that proper
initialization is crucial in this task, as observed
in grammar induction (Klein and Manning, 2004;
Gimpel and Smith, 2012).
On the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), we compared our parse trees with manually annotated ones. We observed that our method achieves a Micro F1 score of 68.6% (84.6%) in the (corrected) RST-PARSEVAL (Marcu, 2000; Morey et al., 2018), which is comparable with or even superior to fully supervised parsers. We also investigated the discourse constituents that can or cannot be learned well by our method.

1 Our code can be found at https://github.com/norikinishida/DiscourseConstituencyInduction-ViterbiEM.
The rest of this paper is organized as follows: Section 2 introduces the related work. Section 3 gives the details of our parsing model and training algorithm. Section 4 describes the experimental setting and Section 5 discusses the experimental results. Conclusions are given in Section 6.
2 Related Work
The earliest studies that use EM in unsupervised parsing are Lari and Young (1990) and Carroll and Charniak (1992), which attempted to induce probabilistic context-free grammars (PCFG) and probabilistic dependency grammars using the classic inside–outside algorithm (Baker, 1979). Klein and Manning (2001b, 2002) perform a weakened version of constituent tests (Radford, 1988) with the Constituent-Context Model (CCM), which, unlike a PCFG, describes whether a contiguous text span (such as DT JJ NN) is a constituent or a distituent. The CCM uses EM to learn constituenthood over part-of-speech (POS) tags and for the first time outperformed the strong right-branching baseline in unsupervised constituency parsing. Klein and Manning (2004) proposed the Dependency Model with Valence (DMV), which is a head automata model (Alshawi, 1996) for unsupervised dependency parsing over POS tags and also relies on EM. These two models have been extended in various works for further improvements (Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Golland et al., 2012; Jiang et al., 2016).
Generally, these methods use the inside–outside (dynamic programming) re-estimation (Baker, 1979) in the E step. However, Spitkovsky et al. (2010) showed that Viterbi training (Brown et al., 1993), which uses only the best-scoring tree to count the grammatical patterns, is not only computationally more efficient but also empirically more accurate on longer sentences. These properties are, therefore, suitable for ‘‘document-level’’ grammar induction, where the document length (i.e., the number of EDUs) tends to be long.2 In addition, as explained later in Section 3, we incorporate Viterbi EM with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018); this allows us to avoid explicitly counting each possible discourse constituent pattern symbolically, which is generally too sparse and often appears only once.

2 Prior studies on grammar induction generally use sentences up to length 10, 15, or 40. On the other hand, about half the documents in the RST-DT corpus (Carlson et al., 2001) are longer than 40.
Prior studies (Klein and Manning, 2004; Gimpel and Smith, 2012; Naseem et al., 2010) have shown that initialization or linguistic knowledge plays an important role in EM-based grammar induction. Gimpel and Smith (2012) demonstrated that a properly initialized DMV achieves improvements in attachment accuracy of 20–40 points (i.e., 21.3% → 64.3%) compared with uniform initialization. Naseem et al. (2010) also found that controlling the learning process with prior (universal) linguistic knowledge improves the parsing performance of DMV. These studies usually rely on insights into syntactic structures. In this paper, we explore discourse-level prior knowledge for effective initialization of the Viterbi training of discourse constituency parsers.
Our method also relies on recent work on RST parsing. In particular, one of the initialization methods in our EM training (in Section 3.3 (i)) is inspired by the inter-sentential and multi-sentential approach used in RST parsing (Feng and Hirst, 2014; Joty et al., 2013, 2015). We also follow prior studies (Sagae, 2009; Ji and Eisenstein, 2014) and utilize syntactic information, i.e., dependency heads, which contributes to further performance gains in our method.
The most similar work to that presented here is Kobayashi et al. (2019), who propose unsupervised RST parsing algorithms in parallel with our work. Their method builds an unlabeled discourse tree by using the CKY dynamic programming algorithm. The tree-merging (splitting) scores in CKY are defined as the similarity (dissimilarity) between adjacent text spans. The similarity scores are calculated based on distributed representations using pre-trained embeddings. However, similarity between adjacent elements is not always a good indicator of constituentness. Consider the tag sequences ‘‘VBD IN’’ and ‘‘IN NN’’. The former is an example of a distituent sequence, whereas the latter is a constituent. ‘‘VBD’’, ‘‘IN’’, and ‘‘NN’’ may have similar distributed representations because these tags cooccur frequently in corpora. This implies that it is difficult to distinguish constituents and distituents if we use
only similarity (dissimilarity) measures. In this paper, we aim to mitigate this issue by introducing parameterized models to learn discourse constituentness.
3 Methodology
In this section, we first describe the parsing model we develop. Next, we explain how to train the model in an unsupervised manner by using Viterbi EM. Finally, we present the initialization methods we use for further improvements.
3.1 Parsing Model
The parsing problem in this study is to find the unlabeled constituent structure with the highest score for an input text x, that is,

    T̂ = argmax_{T ∈ valid(x)} s(x, T),    (1)

where s(x, T) ∈ R denotes a real-valued score of a tree T, and valid(x) represents the set of all valid trees for x. We assume that x has already been manually segmented into a sequence of EDUs: x = x_0, …, x_{n−1}.
Inspired by the success of recent span-based constituency parsers (Stern et al., 2017; Gaddy et al., 2018), we define the tree score as the sum of constituent scores over internal nodes, that is,

    s(x, T) = Σ_{(i,j) ∈ T} s(i, j).    (2)

Thus, our parsing model consists of a single scoring function s(i, j) that computes a constituent score of a contiguous text span x_{i:j} = x_i, …, x_j, or simply (i, j). The higher the value of s(i, j), the more likely it is that x_{i:j} is a discourse constituent.
We show our parsing model in Figure 2. Our implementation of s(i, j) can be decomposed into three modules: EDU-level feature extraction, span-level feature extraction, and span scoring. We discuss each of these in turn. Later, we also explain the decoding algorithm that we use to find the globally best-scoring tree.
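The decomposition in Eq. (2) can be sketched as follows; the nested-tuple tree encoding and the `score` lookup are our own illustrative choices, not the paper's data structures.

```python
def tree_score(tree, score):
    """Score a binary discourse tree as the sum of span scores over its
    internal nodes (Eq. 2). A tree is a nested tuple ((i, j), left, right)
    for internal nodes and (i, i) for leaf EDUs; `score` maps a span
    (i, j) to a real value."""
    if len(tree) == 2 and isinstance(tree[0], int):
        return 0.0  # leaf EDUs carry no constituent score
    (i, j), left, right = tree
    return score((i, j)) + tree_score(left, score) + tree_score(right, score)
```

For example, a right-branching tree over three EDUs, ((0, 2), (0, 0), ((1, 2), (1, 1), (2, 2))), is scored as s(0, 2) + s(1, 2).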
Feature Extraction and Scoring
Inspired by existing RST parsers (Ji and Eisenstein, 2014; Li et al., 2014; Joty et al., 2015), we first encode the beginning and end words of an EDU:

    v_i^{bw} = Embed_w(bw),    (3)
    v_i^{ew} = Embed_w(ew),    (4)

where bw and ew denote the beginning and end words of the i-th EDU, and Embed_w is a function that returns a parameterized embedding of the input word.
We also encode the POS tags corresponding to bw and ew as follows:

    v_i^{bp} = Embed_p(bp),    (5)
    v_i^{ep} = Embed_p(ep),    (6)

where Embed_p is an embedding function for POS tags.
Prior work (Sagae, 2009; Ji and Eisenstein, 2014) has shown that syntactic cues can improve discourse parsing performance. We therefore extract syntactic features from each EDU. We apply a (syntactic) dependency parser to each sentence in the input text,3 and then choose a head word for each EDU. A head word is a token whose parent in the dependency graph is ROOT or is not within the EDU.4 We also extract the POS tag and the dependency label corresponding to the head word. A dependency label is the relation between a head word and its parent.
To sum up, we now have triplets of head information, {(hw, hp, hr)_i}_{i=0}^{n−1}, each denoting the head word, the head POS, and the head relation of the i-th EDU, respectively. We embed these symbols using look-up tables:

    v_i^{hw} = Embed_w(hw),    (7)
    v_i^{hp} = Embed_p(hp),    (8)
    v_i^{hr} = Embed_r(hr),    (9)

where Embed_r is an embedding function for dependency relations.
Finally, we concatenate these embeddings:

    v′_i = [v_i^{bw}; v_i^{ew}; v_i^{bp}; v_i^{ep}; v_i^{hw}; v_i^{hp}; v_i^{hr}],    (10)

and then transform them using a linear projection and the Rectified Linear Unit (ReLU) activation function:

    v_i = ReLU(W v′_i + b).    (11)

In the following, we use {v_i}_{i=0}^{n−1} as the feature vectors for the EDUs, {x_i}_{i=0}^{n−1}.
3 We apply the Stanford CoreNLP parser (Manning et al., 2014) to the concatenation of the EDUs; https://stanfordnlp.github.io/CoreNLP/.
4 If there are multiple head words in an EDU, we choose the leftmost one.
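The look-up and concatenation steps of Eqs. (3)–(11) can be sketched as below. The toy random embedder and the identity projection standing in for W and b are hypothetical simplifications for illustration, not the trained parameters of the model.

```python
import random

def make_embedder(dim, seed=0):
    """Toy look-up-table embedding: each new symbol gets a fixed random
    vector on first use (a stand-in for the learned Embed_w/Embed_p/Embed_r)."""
    rng, table = random.Random(seed), {}
    def embed(symbol):
        if symbol not in table:
            table[symbol] = [rng.uniform(-1, 1) for _ in range(dim)]
        return table[symbol]
    return embed

def relu(v):
    return [max(0.0, x) for x in v]

def edu_feature(edu, embed_w, embed_p, embed_r):
    """Concatenate the seven embeddings of Eqs. (3)-(10) for one EDU,
    given as a dict with keys bw, ew, bp, ep, hw, hp, hr. The linear
    projection of Eq. (11) is replaced by the identity for brevity."""
    parts = [embed_w(edu["bw"]), embed_w(edu["ew"]),
             embed_p(edu["bp"]), embed_p(edu["ep"]),
             embed_w(edu["hw"]), embed_p(edu["hp"]), embed_r(edu["hr"])]
    v = [x for part in parts for x in part]
    return relu(v)  # Eq. (11) with W = I, b = 0 for illustration
```

Note that the same symbol always maps to the same vector, which is the defining property of a look-up table.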
Figure 2: Our span-based discourse parsing model. We first encode each EDU based on the beginning and ending words and POS tags using embeddings. We also embed head information of each EDU. We then run a bidirectional LSTM and concatenate the span differences. The resulting vector is used to predict the constituent score of the text span (i, j). This figure illustrates the process for the span (1, 2).
Following the span-based parsing models developed in the syntax domain (Stern et al., 2017; Gaddy et al., 2018), we then run a bidirectional Long Short-Term Memory (LSTM) over the sequence of EDU representations, {v_i}_{i=0}^{n−1}, resulting in forward and backward representations for each step i (0 ≤ i ≤ n − 1):

    h→_0, …, h→_{n−1} = LSTM→(v_0, …, v_{n−1}),    (12)
    h←_0, …, h←_{n−1} = LSTM←(v_0, …, v_{n−1}).    (13)

We then compute a feature vector for a span (i, j) by concatenating the forward and backward span differences:

    h_{i,j} = [h→_j − h→_{i−1}; h←_i − h←_{j+1}].    (14)
The feature vector, hi,J, is assumed to represent
the content of the contiguous text span xi:j along
with contextual
information captured by the
LSTMs.5
We did not use any feature templates because
we found that they did not improve parsing per-
formance in our unsupervised setting, though we
observed that template features roughly follow-
ing Joty et al. (2015) improved performance in a
supervised setting.
Endlich, given a span-level feature vector, hi,J,
we use two-layer perceptrons with the ReLU
activation function:
S(ich, J) = MLP (hi,J),
(15)
which computes the constituent score of the
contiguous text span xi:J.
5A detailed investigation of the span-based parsing model
using LSTM can be found in Gaddy et al. (2018).
219
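The span-difference features of Eq. (14) can be sketched with plain lists as below; treating out-of-range boundary states as zero vectors is one common convention, and the paper's exact boundary handling may differ.

```python
def span_feature(fwd, bwd, i, j):
    """Span representation via bidirectional state differences (Eq. 14):
    [f_j - f_{i-1}; b_i - b_{j+1}]. `fwd` and `bwd` are lists of per-EDU
    state vectors; out-of-range boundary states are taken as zeros."""
    d = len(fwd[0])
    zero = [0.0] * d
    f_prev = fwd[i - 1] if i - 1 >= 0 else zero
    b_next = bwd[j + 1] if j + 1 < len(bwd) else zero
    forward_diff = [a - b for a, b in zip(fwd[j], f_prev)]
    backward_diff = [a - b for a, b in zip(bwd[i], b_next)]
    return forward_diff + backward_diff
```

The differences isolate what the LSTMs accumulated strictly inside the span, which is why the representation captures span content plus surrounding context.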
Decoding
We use a Cocke-Kasami-Younger (CKY)-style dynamic programming algorithm to perform a global search over the space of valid trees and find the highest-scoring tree. For a document with n EDUs, we use an n × n table C, the cell C[i, j] of which stores the subtree score spanning from i to j. For spans of length one (i.e., i = j), we assign constant scalar values:

    C[i, i] = 1.    (16)

For general spans 0 ≤ i < j ≤ n − 1, we define the following recursion:

    C[i, j] = s(i, j) + max_{i ≤ k < j} (C[i, k] + C[k+1, j]).    (17)
As shown in the upper part of Table 4, we can observe that our method learns some aspects of discourse constituentness that seem linguistically reasonable. In particular, we found that our method has the potential to predict brackets for (1) clauses with connectives qualifying other clauses from right to left (e.g., ‘‘X [because B.]’’) and (2) attribution structures (e.g., ‘‘say that [B]’’). These results indicate that our method is good at identifying discourse constituents near the end
[The bankruptcy-court reorganization is being challenged … by a dissident group of claimants] [because it places a cap on the total amount of money available] [to settle claims.] [It also bars future suits against …] (11.74)
[The first two GAF trials were watched closely on Wall Street] [because they were considered to be important tests of the government’s ability] [to convince a jury of allegations] [stemming from its insider-trading investigations.] [In an eight-count indictment, the government charged GAF, …] (10.16)
[The posters were sold for $1,300 to $6,000,] [although the government says] [they had a value of only $53 to $200 apiece.] [Henry Pitman, the assistant U.S. attorney] [handling the case,] [said] [to …] (11.31)
[The office, an arm of the Treasury, said] [it doesn’t have data on the financial position of applications] [and thus can’t determine] [why blacks are rejected more often.] [Nevertheless, on Capitol Hill,] [where …] (11.57)
[After 93 hours of deliberation, the jurors in the second trial said] [they were hopelessly deadlocked,] [and another mistrial was declared on March 22.] [Meanwhile, a federal jury found Mr. Bilzerian …] (11.66)
[(‘‘I think | she knows me,] [but I’m not sure’’)] [and Bridget Fonda, the actress] [(‘‘She knows me,] [but we’re not really the best of friends’’).] [Mr. Revson, the gossip columnist, says] [there are people] [who …] (11.11)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (4.44)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (11.04)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (5.50)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (7.68)
Table 4: Discourse constituents and their predicted scores (in parentheses). We show the discourse constituents (in bold) in the RST-DT test set that have relatively high span scores. We did NOT use any sentence/paragraph boundaries for scoring.
of sentences (or paragraphs), which is natural
because RB is mainly used for generating initial
trees in EM training. The bottom part of Table 4
demonstrates that the beginning position of the text
span is also important to estimate constituenthood,
along with the ending position.
6 Conclusion
In this paper, we introduced an unsupervised discourse constituency parsing algorithm that uses Viterbi EM with a margin-based criterion to train a span-based neural parser. We also introduced initialization methods for the Viterbi training of discourse constituents. We observed that our unsupervised parser achieves comparable or even superior performance to the baselines and fully supervised parsers. We also found that the learned discourse constituents depend strongly on the initialization used in Viterbi EM, and it is necessary to explore other initialization techniques to capture more diverse discourse phenomena.
We have two limitations in this study. First, this work focuses only on unlabeled discourse constituent structures. Although such hierarchical information is useful in downstream applications (Louis et al., 2010), both nuclearity statuses and rhetorical relations are also necessary for a more complete RST analysis. Second, our study uses only English documents for evaluation. However, different languages may have different structural regularities. Thus, it would be interesting to investigate whether the initialization methods are effective in different languages, which we believe would give suggestions on discourse-level universals. We leave these issues as future work.
Acknowledgments
The research results have been achieved by ‘‘Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation’’, the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. This work was also supported by JSPS KAKENHI grant numbers JP19K22861 and JP18J12366.
References
Hiyan Alshawi. 1996. Head automata and bilingual tiling: Translation with minimal representations. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation, Cambridge University Press.
James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America.
Taylor Berg-Kirkpatrick, Alexandre Bouchard-
Cˆot´e, John DeNero, and Dan Klein. 2010.
Painless unsupervised learning with features.
In Proceedings of the 2010 Conference of the
North American Chapter of the Association for
Computerlinguistik: Human Language
Technologies.
Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue.
Glenn Carroll and Eugene Charniak. 1992. Two
experiments on learning probabilistic depen-
dency grammars from corpora. In Working Notes of the Workshop on Statistically-Based NLP Techniques.
Eugene Charniak. 1993. Statistical Language Learning. MIT Press.
Yejin Choi and Claire Cardie. 2007. Structured
local training and biased potential functions
for conditional random fields with application
to coreference resolution. In Proceedings of
the 2007 Conference of the North American
Chapter of the Association for Computational
Linguistik: Human Language Technologies.
John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics.
Vanessa Wei Feng and Graeme Hirst. 2014. A
linear-time bottom-up discourse parser with
constraints and post-editing. In Proceedings of
the 52nd Annual Meeting of the Association for
Computerlinguistik.
David Gaddy, Mitchell Stern, and Dan Klein.
2018. What’s going on in neural constituency
parsers? An analysis. In Proceedings of the
2018 Conference of
the North American
Chapter of the Association for Computational
Linguistik: Human Language Technologies.
Kevin Gimpel and Noah A. Smith. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies.
Sharon Goldwater and Mark Johnson. 2005.
Representation bias in unsupervised learning
of syllable structure. In Proceedings of the 9th
Conference on Natural Language Learning.
Dave Golland, John DeNero, and Jakob Uszkoreit.
2012. A feature-rich constituent context model
for grammar induction. In Proceedings of the
50th Annual Meeting of the Association for
Computerlinguistik.
Michael Halliday and Ruqaiya Hasan. 1976.
Cohesion in English, Longman.
Hugo Hernault, Helmut Prendinger, David A.
DuVerle, and Mitsuru Ishizuka. 2010. HILDA:
A discourse parser using support vector
machine classification. Dialogue & Discourse,
1(3):1–33.
Yangfeng Ji and Jacob Eisenstein. 2014. Repre-
sentation learning for text-level discourse pars-
ing. In Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2016.
Unsupervised neural dependency parsing. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Unsupervised grammar induction with depth-bounded PCFG. Transactions of the Association for Computational Linguistics, 6:211–224.
Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.
Shafiq Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. 2013. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
Dan Klein. 2005. The unsupervised learning of natural language structure. Ph.D. thesis, Stanford University.
Dan Klein and Christopher D. Manning. 2001a. Distributional phrase structure induction. In Proceedings of the 2001 Workshop on Computational Natural Language Learning.
Dan Klein and Christopher D. Manning. 2001b. Natural language grammar induction using a constituent-context model. In Advances in Neural Information Processing Systems.
Dan Klein and Christopher D. Manning. 2002.
A generative constituent-context model for
improved grammar induction. In Proceedings
of the 40th Annual Meeting of the Association for Computational Linguistics.
Dan Klein and Christopher D. Manning. 2004.
Corpus-based induction of syntactic structure:
Models of constituency and dependency. In
Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics.
Naoki Kobayashi, Tsutomu Hirao, Kengo
Nakamura, Hidetaka Kamigaito, Manabu
Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35–56.
Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie
Li. 2014. Text-level discourse dependency
parsing. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics.
Annie Louis, Aravind Joshi, and Ani Nenkova.
2010. Discourse indicators for content selection
in summarization. In SIGDIAL’10.
William C. Mann and Sandra A. Thompson.
1988. Rhetorical Structure Theory: Towards
a functional theory of text organization. Text-
Interdisciplinary Journal for the Study of Dis-
course, 8(3):243–281.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
Daniel Marcu, Magdalena Romera, and Estibaliz
Amorrortu. 1999. Experiments in constructing a
corpus of discourse trees: Problems, annotation
choices, issues. In Proceedings of the ACL’99
Workshop on Standards and Tools for Dis-
course Tagging.
Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.
Mitchell P. Marcus, Mary Ann Marcinkiewicz,
and Beatrice Santorini. 1993. Building a large
annotated corpus of English: The Penn Tree-
bank. Computational Linguistics, 19(2):313–330.
David McClosky, Eugene Charniak, and Mark
Johnson. 2006a. Effective self-training for
parsing. In Proceedings of the 2006 Conference
of the North American Chapter of the Associa-
tion for Computational Linguistics: Human
Language Technologies.
David McClosky, Eugene Charniak, and Mark
Johnson. 2006b. Reranking and self-training for
parser adaptation. In Proceedings of the 21st
International Conference on Computational
Linguistics and the 44th Annual Meeting of the
Association for Computational Linguistics.
Eleni Miltsakaki and Karen Kukich. 2004. Eval-
uation of text coherence for electronic essay
scoring systems. Natural Language Engineer-
ing, 10(1):25–55.
Mathieu Morey, Philippe Muller, and Nicholas
Asher. 2017. How much progress have we made
on RST discourse parsing? A replication study of
recent results on the RST-DT. In Proceedings
of the 2017 Conference on Empirical Methods
in Natural Language Processing.
Mathieu Morey, Philippe Muller, and Nicholas
Asher. 2018. A dependency perspective on
RST discourse parsing and evaluation. Compu-
tational Linguistics, 44(2):197–235.
Tahira Naseem, Harr Chen, Regina Barzilay,
and Mark Johnson. 2010. Using universal
linguistic knowledge to guide grammar induc-
tion. In Proceedings of the 2010 Conference
on Empirical Methods in Natural Language
Processing.
Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
Livia Polanyi. 1985. A theory of discourse struc-
ture and discourse coherence. In Proceedings
of the 21st Regional Meeting of the Chicago Linguistic Society.
Livia Polanyi and Martin Van den Berg. 2011.
Discourse structure and sentiment. In 2011
IEEE 11th International Conference on Data
Mining Workshops.
Andrew Radford. 1988. Transformational Grammar. Cambridge University Press.
Kenji Sagae. 2009. Analysis of discourse struc-
ture with syntactic dependencies and data-
driven shift-reduce parsing. In Proceedings of
the 11th International Workshop on Parsing Technologies.
Yoav Seginer. 2007. Fast unsupervised incre-
mental parsing. In Proceedings of the 45th
Annual Meeting of the Association of Compu-
tational Linguistics.
Noah A. Smith and Jason Eisner. 2006. Anneal-
ing structural bias in multilingual weighted
grammar induction. In Proceedings of the 21st
International Conference on Computational
Linguistics and the 44th Annual Meeting of the
Association for Computational Linguistics.
Noah Ashton Smith. 2006. Novel estimation
methods for unsupervised discovery of latent
structure in natural language text. Ph.D. Thesis,
Johns Hopkins University.
Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, and Christopher D. Manning. 2010. Viterbi training improves unsupervised
dependency parsing. In Proceedings of the 14th Conference on Computational Natural Language Learning.
Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao,
and Masaaki Nagata. 2014. Dependency-based
discourse parser for single-document summari-
zation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing.