Unsupervised Discourse Constituency Parsing Using Viterbi EM
Noriki Nishida and Hideki Nakayama
Graduate School of Information Science and Technology
The University of Tokyo
{nishida,nakayama}@nlab.ci.i.u-tokyo.ac.jp
Abstract
In this paper, we introduce an unsupervised
discourse constituency parsing algorithm. We
use Viterbi EM with a margin-based crite-
rion to train a span-based discourse parser in
an unsupervised manner. We also propose
initialization methods for Viterbi training of
discourse constituents based on our prior
knowledge of text structures. Experimental re-
sults demonstrate that our unsupervised parser
achieves comparable or even superior perfor-
mance to fully supervised parsers. We also
investigate discourse constituents that are
learned by our method.
1 Introduction
Natural language text is generally coherent (Halliday
and Hasan, 1976) and can be analyzed as discourse
structures, which formally describe how text is
coherently organized. In discourse structure, lin-
guistic units (e.g., clauses, sentences, or larger
textual spans) are connected together semantically
and pragmatically, and no unit is independent or
isolated. Discourse parsing aims to uncover dis-
course structures automatically for given text and
has been proven to be useful in various NLP
applications, such as document summarization
(Marcu, 2000; Louis et al., 2010; Yoshida et al.,
2014), sentiment analysis (Polanyi and Van den
Berg, 2011; Bhatia et al., 2015), and automated
essay scoring (Miltsakaki and Kukich, 2004).
Despite the promising progress achieved in re-
cent decades (Carlson et al., 2001; Hernault et al.,
2010; Ji and Eisenstein, 2014; Feng and Hirst,
2014; Li et al., 2014; Joty et al., 2015; Morey et al.,
2017), discourse parsing still remains a significant
challenge. The difficulty is due in part to shortage
and low reliability of hand-annotated discourse
structures. To develop a better-generalized parser,
existing algorithms require large amounts of
training data. However, manually annotating dis-
course structures is expensive, time-consuming,
and sometimes highly ambiguous (Marcu et al.,
1999).
One possible solution to these problems is gram-
mar induction (or unsupervised syntactic parsing)
algorithms for discourse parsing. Tuttavia, exist-
ing studies on unsupervised parsing mainly focus
on sentence structures, such as phrase structures
(Lari and Young, 1990; Klein and Manning,
2002; Golland et al., 2012; Jin et al., 2018) or
dependency structures (Klein and Manning, 2004;
Berg-Kirkpatrick et al., 2010; Naseem et al., 2010;
Jiang et al., 2016), though text-level structural reg-
ularities can also exist beyond the scope of a single
sentence. For instance, in order to convey infor-
mation to readers as intended, a writer should ar-
range utterances in a coherent order.
We tackle these problems by introducing unsu-
pervised discourse parsing, which induces dis-
course structures for given text without relying on
human-annotated discourse structures. Based on
Rhetorical Structure Theory (RST) (Mann and
Thompson, 1988), which is one of the most
widely accepted theories of discourse structure,
we assume that coherent text can be represented
as tree structures, such as the one in Figure 1. The
leaf nodes correspond to non-overlapping clause-
level text spans called elementary discourse units
(EDUs). Consecutive text spans are combined to
each other recursively in a bottom–up manner
to form larger text spans (represented by inter-
nal nodes) up to a global document span. These
text spans are called discourse constituents. The
internal nodes are labeled with both nuclearity sta-
tuses (e.g., Nucleus-Satellite or NS) and rhetorical
Transactions of the Association for Computational Linguistics, vol. 8, pp. 215–230, 2020. https://doi.org/10.1162/tacl_a_00312
Action Editor: Yuji Matsumoto. Submission batch: 7/2019; Revision batch: 12/2019; Published 4/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An example of the RST-based discourse constituent structure we assume in this paper. Leaf nodes
xi correspond to non-overlapping clause-level text segments, and internal nodes consist of three comple-
mentary elements: discourse constituents xi:j, discourse nuclearities (e.g., NS), and discourse relations (e.g.,
ELABORATION).
relations (e.g., ELABORATION, CONTRAST) that
hold between connected text spans.
In this paper, we focus specifically on unsuper-
vised induction of an unlabeled discourse con-
stituent structure (i.e., a set of unlabeled discourse
constituent spans) given a sequence of EDUs,
which corresponds to the first tree-building step in
conventional RST parsing. Such constituent struc-
tures provide hierarchical information of input
text, which is useful in downstream tasks (Louis
et al., 2010). For instance, a constituent structure
[X [Y Z]] indicates that text span Y is preferen-
tially combined with Z (rather than X) to form a
constituent span, and then the text span [Y Z] È
connected with X. In other words, this structure
implies that [X Y] is a distituent span and requires
Z to become a constituent span. Our challenge is
to find such discourse-level constituentness from
EDU sequences.
The core hypothesis of this paper is that dis-
course tree structures and syntactic tree structures
share the same (or similar) constituent proper-
ties at a metalevel, and thus, learning algorithms
developed for grammar inductions are transferable
to unsupervised discourse constituency parsing by
proper modifications. Actually, RST structures
can be formulated in a similar way as phrase
structures in the Penn Treebank, though there are
a few differences: The leaf nodes are not words
but EDUs (per esempio., clauses), and the internal nodes
do not contain phrase labels but hold nuclearity
statuses and rhetorical relations.
The expectation-maximization (EM) algorithm
(Klein and Manning, 2004) has been the dominant
unsupervised learning algorithm for grammar
induction. Based on our hypothesis and this fact,
we develop a span-based discourse parser (in an
unsupervised manner) by using Viterbi EM (O
‘‘hard’’ EM) (Neal and Hinton, 1998; Spitkovsky
et al., 2010; DeNero and Klein, 2008; Choi and
Cardie, 2007; Goldwater and Johnson, 2005) con
a margin-based criterion (Stern et al., 2017; Gaddy
et al., 2018).1 Unlike the classic EM algorithm
using inside-outside re-estimation (Baker, 1979),
Viterbi EM allows us to avoid explicitly counting
discourse constituent patterns, which are generally
too sparse to estimate reliable scores of text spans.
The other technical contribution is to present
effective initialization methods for Viterbi training
of discourse constituents. We introduce initial-tree
sampling methods based on our prior knowledge
of document structures. We show that proper
initialization is crucial in this task, as observed
in grammar induction (Klein and Manning, 2004;
Gimpel and Smith, 2012).
On the RST Discourse Treebank (RST-DT)
(Carlson et al., 2001), we compared our parse trees
with manually annotated ones. We observed that
our method achieves a Micro F1 score of 68.6%
(84.6%) in RST-PARSEVAL (corrected RST-
PARSEVAL)
1Our code can be found at https://github.com/
norikinishida/DiscourseConstituencyInduction-
ViterbiEM.
(Marcu, 2000; Morey et al., 2018), which is com-
parable with or even superior to fully super-
vised parsers. We also investigated the discourse
constituents that can or cannot be learned well by
our method.
The rest of this paper is organized as follows:
Sezione 2 introduces the related work. Sezione 3
gives the details of our parsing model and training
algorithm. Sezione 4 describes the experimental
setting and Section 5 discusses the experimental
results. Conclusions are given in Section 6.
2 Related Work
The earliest studies that use EM in unsupervised
parsing are Lari and Young (1990) and Carroll
and Charniak (1992), which attempted to induce
probabilistic context-free grammars (PCFG) E
probabilistic dependency grammars using the clas-
sic inside–outside algorithm (Baker, 1979). Klein
and Manning (2001B, 2002) perform a weakened
version of constituent tests (Radford, 1988) by the
Constituent-Context Model (CCM), Quale, unlike
a PCFG, describes whether a contiguous text span
(such as DT JJ NN) is a constituent or a distituent.
The CCM uses EM to learn constituenthood over
part-of-speech (POS) tags and for the first time
outperformed the strong right-branching baseline
in unsupervised constituency parsing. Klein and
Equipaggio (2004) proposed the Dependency Model
with Valence (DMV), which is a head automata
modello (Alshawi, 1996) for unsupervised depen-
dency parsing over POS tags and also relies on
EM. These two models have been extended in
various works for further improvements (Berg-
Kirkpatrick et al., 2010; Naseem et al., 2010;
Golland et al., 2012; Jiang et al., 2016).
Generally, these methods use the inside–outside
(dynamic programming) re-estimation (Baker, 1979)
in the E step. However, Spitkovsky et al. (2010)
showed that Viterbi training (Brown et al., 1993),
which uses only the best-scoring tree to count the
grammatical patterns, is not only computationally
more efficient but also empirically more accurate
in longer sentences. These properties are, thus,
suitable for ‘‘document-level’’ grammar induc-
tion, where the document length (i.e., the number
of EDUs) tends to be long.2 In addition, as ex-
2Prior studies on grammar induction generally use sen-
tences up to length 10, 15, or 40. On the other hand, about
half the documents in the RST-DT corpus (Carlson et al.,
2001) are longer than 40.
plained later in Section 3, we incorporate Viterbi
EM with a margin-based criterion (Stern et al.,
2017; Gaddy et al., 2018); this allows us to avoid
explicitly counting each possible discourse con-
stituent pattern symbolically; such patterns are
generally too sparse and often appear only once.
Prior studies (Klein and Manning, 2004; Gimpel
and Smith, 2012; Naseem et al., 2010) have shown
that initialization or linguistic knowledge plays an
important role in EM-based grammar induction.
Gimpel and Smith (2012) demonstrated that
properly initialized DMV achieves improvements
in attachment accuracies by 20∼40 points (i.e.,
21.3% → 64.3%), compared with the uniform
initialization. Naseem et al. (2010) also found
that controlling the learning process with the
prior (universal) linguistic knowledge improves
the parsing performance of DMV. These studies
usually rely on insights on syntactic structures. In
this paper, we explore discourse-level prior knowl-
edge for effective initialization of the Viterbi train-
ing of discourse constituency parsers.
Our method also relies on recent work on
RST parsing. In particular, one of the initializa-
tion methods in our EM training (in Section 3.3
(io)) is inspired by the inter-sentential and multi-
sentential approach used in RST parsing (Feng
and Hirst, 2014; Joty et al., 2013, 2015). We also
follow prior studies (Sagae, 2009; Ji and Eisenstein,
2014) and utilize syntactic information, i.e., de-
pendency heads, which contributes to further per-
formance gains in our method.
The most similar work to that presented here
is Kobayashi et al. (2019), who propose unsu-
pervised RST parsing algorithms in parallel with
our work. Their method builds an unlabeled dis-
course tree by using the CKY dynamic pro-
gramming algorithm. The tree-merging (splitting)
scores in CKY are defined as similarity (dissimi-
larity) between adjacent text spans. The similarity
scores are calculated based on distributed repre-
sentations using pre-trained embeddings. How-
ever, similarity between adjacent elements is not
always a good indicator of constituentness. Con-
sider the tag sequences ‘‘VBD IN’’ and ‘‘IN NN’’.
The former is an example of a distituent sequence,
whereas the latter is a constituent. ‘‘VBD’’, ‘‘IN’’,
and ‘‘NN’’ may have similar distributed represen-
tations because these tags cooccur frequently in
corpora. This implies that it is difficult to dis-
tinguish constituents and distituents if we use
only similarity (dissimilarity) measures. In this
paper, we aim to mitigate this issue by intro-
ducing parameterized models to learn discourse
constituentness.
3 Methodology
In this section, we first describe the parsing model
we develop. Next, we explain how to train the
model in an unsupervised manner by using Viterbi
EM. Finally, we present the initialization methods
we use for further improvements.
3.1 Parsing Model
The parsing problem in this study is to find the
unlabeled constituent structure with the highest
score for an input text x, questo è,
ˆT = arg max_{T ∈ valid(x)} s(x, T )   (1)
where s(X, T ) ∈ R denotes a real-valued score of
a tree T , and valid (X) represents a set of all valid
trees for x. We assume that x has already been
manually segmented into a sequence of EDUs:
x = x0, . . . , xn−1.
Inspired by the success of recent span-based
constituency parsers (Stern et al., 2017; Gaddy
et al., 2018), we define the tree scores as the sum
of constituent scores over internal nodes, questo è,
S(X, T ) =
S(io, j).
(2)
X(io,j)∈T
Così, our parsing model consists of a single
scoring function s(io, j) that computes a constituent
score of a contiguous text span xi:j = xi, . . . , xj,
or simply (io, j). The higher the value of s(io, j),
the more likely that xi:j is a discourse constituent.
We show our parsing model in Figure 2. Nostro
implementation of s(io, j) can be decomposed into
three modules: EDU-level
feature extraction,
span-level feature extraction, and span scoring.
We discuss each of these in turn. Later, we also
explain the decoding algorithm that we use to find
the globally best-scoring tree.
Feature Extraction and Scoring
Inspired by existing RST parsers (Ji and Eisenstein,
2014; Li et al., 2014; Joty et al., 2015), we first
encode the beginning and end words of an EDU:
v^{bw}_i = Embed_w(bw),   (3)
v^{ew}_i = Embed_w(ew),   (4)
where bw and ew denote the beginning and end
words of the i-th EDU, and Embed_w is a function
that returns a parameterized embedding of the
input word.
We also encode the POS tags corresponding to
bw and ew as follows:
v^{bp}_i = Embed_p(bp),   (5)
v^{ep}_i = Embed_p(ep),   (6)
where Embed_p is an embedding function for POS
tags.
Prior work (Sagae, 2009; Ji and Eisenstein,
2014) has shown that syntactic cues can improve
discourse parsing performance. We therefore ex-
tract syntactic features from each EDU. We apply
UN (syntactic) dependency parser to each sentence
in the input text,3 and then choose a head word for
each EDU. A head word is a token whose parent
in the dependency graph is ROOT or is not within
the EDU.4 We also extract the POS tag and the
dependency label corresponding to the head word.
A dependency label is a relation between a head
word and its parent.
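The head-word selection just described can be sketched in a few lines of Python. This is a minimal illustration under an assumed encoding (a token-index-to-parent map with `None` standing in for ROOT); the function name and data layout are ours, not from the paper's released implementation.

```python
def edu_head(edu_token_ids, parents):
    """Pick an EDU's head word: a token whose dependency parent is
    ROOT (None here) or lies outside the EDU; if several qualify,
    take the leftmost one (footnote 4)."""
    inside = set(edu_token_ids)
    candidates = [t for t in edu_token_ids
                  if parents[t] is None or parents[t] not in inside]
    return min(candidates)

# Toy dependency tree over one sentence (token index -> parent index):
parents = {0: 1, 1: None, 2: 1, 3: 5, 4: 5, 5: 1}
assert edu_head([3, 4, 5], parents) == 5   # only token 5 attaches outside
assert edu_head([0, 2], parents) == 0      # two candidates -> leftmost
```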
To sum up, we now have triplets of head infor-
mation, {(hw, hp, hr)_i}_{i=0}^{n−1}, each denoting the
head word, the head POS, and the head relation
of the i-th EDU, rispettivamente. We embed these
symbols using look-up tables:
v^{hw}_i = Embed_w(hw),   (7)
v^{hp}_i = Embed_p(hp),   (8)
v^{hr}_i = Embed_r(hr),   (9)
where Embed_r is an embedding function for de-
pendency relations.
Finalmente, we concatenate these embeddings:
v′_i = [v^{bw}_i; v^{ew}_i; v^{bp}_i; v^{ep}_i; v^{hw}_i; v^{hp}_i; v^{hr}_i],   (10)
and then transform it using a linear projection and
Rectified Linear Unit (ReLU) activation function:
v_i = ReLU(W v′_i + b).   (11)
In the following, we use {v_i}_{i=0}^{n−1} as the feature
vectors for the EDUs, {x_i}_{i=0}^{n−1}.
3We apply the Stanford CoreNLP parser (Manning
et al., 2014) to the concatenation of the EDUs; https://
stanfordnlp.github.io/CoreNLP/.
4If there are multiple head words in an EDU, we choose
the leftmost one.
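Equations (3)–(11) amount to looking up seven embeddings, concatenating them, and applying a ReLU-activated linear layer. The sketch below uses NumPy with randomly initialized toy tables as stand-ins for the learned Embed_w / Embed_p / Embed_r parameters; all names and dimensions are our assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_HIDDEN = 4, 8   # toy sizes; the paper's dimensions differ

def embed(table, key):
    # Look-up table with lazy random initialization, standing in for
    # a learned embedding matrix.
    if key not in table:
        table[key] = rng.standard_normal(D_EMB)
    return table[key]

Embed_w, Embed_p, Embed_r = {}, {}, {}
W = rng.standard_normal((D_HIDDEN, 7 * D_EMB))
b = np.zeros(D_HIDDEN)

def encode_edu(bw, ew, bp, ep, hw, hp, hr):
    """Eqs. (3)-(11): concatenate the seven embeddings and apply a
    ReLU-activated linear projection."""
    v = np.concatenate([
        embed(Embed_w, bw), embed(Embed_w, ew),   # begin/end words
        embed(Embed_p, bp), embed(Embed_p, ep),   # begin/end POS tags
        embed(Embed_w, hw), embed(Embed_p, hp),   # head word / head POS
        embed(Embed_r, hr),                       # head dependency relation
    ])
    return np.maximum(0.0, W @ v + b)             # v_i = ReLU(W v'_i + b)

v0 = encode_edu("because", "cap", "IN", "NN", "places", "VBZ", "advcl")
assert v0.shape == (D_HIDDEN,) and (v0 >= 0).all()
```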
Figure 2: Our span-based discourse parsing model. We first encode each EDU based on the beginning and ending
words and POS tags using embeddings. We also embed head information of each EDU. We then run a bidirectional
LSTM and concatenate the span differences. The resulting vector is used to predict the constituent score of the
text span (io, j). This figure illustrates the process for the span (1, 2).
Following the span-based parsing models de-
veloped in the syntax domain (Stern et al., 2017;
Gaddy et al., 2018), we then run a bidirectional
Long Short-Term Memory (LSTM) over the se-
quence of EDU representations, {v_i}_{i=0}^{n−1}, result-
ing in forward and backward representations for
each step i (0 ≤ i ≤ n − 1):

→h_0, . . . , →h_{n−1} = →LSTM(v_0, . . . , v_{n−1}),   (12)
←h_0, . . . , ←h_{n−1} = ←LSTM(v_0, . . . , v_{n−1}).   (13)
We then compute a feature vector for a span
(io, j) by concatenating the forward and backward
span differences:
h_{i,j} = [→h_j − →h_{i−1}; ←h_i − ←h_{j+1}].   (14)
The feature vector, CIAO,j, is assumed to represent
the content of the contiguous text span xi:j along
with contextual information captured by the
LSTMs.5
We did not use any feature templates because
we found that they did not improve parsing per-
formance in our unsupervised setting, though we
observed that template features roughly follow-
ing Joty et al. (2015) improved performance in a
supervised setting.
Finalmente, given a span-level feature vector, CIAO,j,
we use a two-layer perceptron with the ReLU
activation function:
S(io, j) = MLP (CIAO,j),
(15)
which computes the constituent score of the
contiguous text span xi:j.
5A detailed investigation of the span-based parsing model
using LSTM can be found in Gaddy et al. (2018).
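Given the forward and backward LSTM states, the span features of Eq. (14) and the scorer of Eq. (15) reduce to vector differences, concatenation, and a small MLP. Below is a minimal NumPy sketch with random stand-ins for the LSTM outputs and the MLP weights (toy dimensions; all names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 6                  # toy: 5 EDUs, hidden size 6 per direction

# Stand-ins for the forward/backward LSTM outputs of Eqs. (12)-(13);
# fwd[i] plays the role of ->h_i and bwd[i] of <-h_i.
fwd = rng.standard_normal((n, d))
bwd = rng.standard_normal((n, d))
zero = np.zeros(d)           # padding for positions -1 and n

def span_feature(i, j):
    """Eq. (14): h_{i,j} = [->h_j - ->h_{i-1}; <-h_i - <-h_{j+1}]."""
    f = fwd[j] - (fwd[i - 1] if i > 0 else zero)
    bk = bwd[i] - (bwd[j + 1] if j < n - 1 else zero)
    return np.concatenate([f, bk])

# Eq. (15): a two-layer ReLU perceptron maps h_{i,j} to a scalar score.
W1, b1 = rng.standard_normal((8, 2 * d)), np.zeros(8)
W2, b2 = rng.standard_normal(8), 0.0

def s(i, j):
    h = span_feature(i, j)
    return float(W2 @ np.maximum(0.0, W1 @ h + b1) + b2)

assert span_feature(1, 3).shape == (2 * d,)
assert isinstance(s(0, n - 1), float)
```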
Decoding
We use a Cocke-Kasami-Younger (CKY)-style
dynamic programming algorithm to perform a
global search over the space of valid trees and
find the highest-scoring tree. For a document with
n EDUs, we use an n × n table C, the cell C[io, j]
of which stores the subtree score spanning from
i to j. For spans of length one (i.e., i = j), we
assign constant scalar values:
C[io, io] = 1.
(16)
For general spans 0 ≤ i < j ≤ n − 1, we define
the following recursion:
C[i, j] = s(i, j) + max_{i≤k<j} (C[i, k] + C[k + 1, j]).   (17)
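The CKY recursion above can be sketched as a small decoder that fills the chart bottom-up, keeps backpointers, and recovers the best set of constituent spans. This is a minimal, self-contained illustration over a generic span-scoring function; the toy scores and names are ours.

```python
def cky_decode(n, s):
    """CKY-style decoding: C[i][i] = 1 and
    C[i][j] = s(i, j) + max_{i<=k<j} (C[i][k] + C[k+1][j]).
    Returns the best score and the set of internal constituent spans."""
    C = [[0.0] * n for _ in range(n)]
    back = {}
    for i in range(n):
        C[i][i] = 1.0                       # length-one spans
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            k, best = max(((k, C[i][k] + C[k + 1][j])
                           for k in range(i, j)), key=lambda t: t[1])
            C[i][j] = s(i, j) + best
            back[(i, j)] = k                # best split point

    def spans(i, j):                        # follow backpointers
        if i == j:
            return set()
        k = back[(i, j)]
        return {(i, j)} | spans(i, k) | spans(k + 1, j)

    return C[0][n - 1], spans(0, n - 1)

# Toy scores favoring the bracketing [x0 [x1 x2]]:
score = {(0, 2): 1.0, (1, 2): 2.0, (0, 1): -1.0}
best, tree = cky_decode(3, lambda i, j: score.get((i, j), 0.0))
assert tree == {(0, 2), (1, 2)}
```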
As shown in the upper part of Table 4, we
can observe that our method learns some aspects
of discourse constituentness that seem linguistic-
ally reasonable. In particular, we found that our
method has the potential to predict brackets for (1)
clauses with connectives qualifying other clauses
from right to left (e.g., ‘‘X [because B.]’’) and
(2) attribution structures (e.g., ‘‘say that [B]’’).
These results indicate that our method is good
at identifying discourse constituents near the end
[The bankruptcy-court reorganization is being challenged … by a dissident group of claimants]
[because it places a cap on the total amount of money available] [to settle claims.] [It also
bars future suits against …] (11.74)
[The first two GAF trials were watched closely on Wall Street] [because they were considered
to be important tests of the government’s ability] [to convince a jury of allegations] [stemming
from its insider-trading investigations.] [In an eight-count indictment, the government charged
GAF, …] (10.16)
[The posters were sold for $1,300 to $6,000,] [although the government says] [they had a
value of only $53 to $200 apiece.] [Henry Pitman, the assistant U.S. attorney] [handling the
case,] [said] [of …] (11.31)
[The office, an arm of the Treasury, said] [it doesn’t have data on the financial position of
applications] [and thus can’t determine] [why blacks are rejected more often.] [Nevertheless,
on Capitol Hill,] [where …] (11.57)
[After 93 hours of deliberation, the jurors in the second trial said] [they were hopelessly
deadlocked,] [and another mistrial was declared on March 22.] [Meanwhile, a federal jury
found Mr. Bilzerian …] (11.66)
[(‘‘I think | she knows me,] [but I’m not sure ’’)] [and Bridget Fonda, the actress] [(‘‘She knows
me,] [but we’re not really the best of friends’’).] [Mr. Revson, the gossip columnist, says]
[there are people] [who …] (11.11)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, of
about 15%.] [The maker of hand-held computers and computer systems said] [the personnel
changes were needed] [to improve the efficiency of its manufacturing operation.] [The company
said] [it hasn’t named a successor …] (4.44)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, of
about 15%.] [The maker of hand-held computers and computer systems said] [the personnel
changes were needed] [to improve the efficiency of its manufacturing operation.] [The
company said] [it hasn’t named a successor…] (11.04)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, of
about 15%.] [The maker of hand-held computers and computer systems said] [the personnel
changes were needed] [to improve the efficiency of its manufacturing operation.] [The
company said] [it hasn’t named a successor…] (5.50)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people,
of about 15%.] [The maker of hand-held computers and computer systems said] [the
personnel changes were needed] [to improve the efficiency of its manufacturing operation.]
[The company said] [it hasn’t named a successor…] (7.68)
Table 4: Discourse constituents and their predicted scores (in parentheses). We show the discourse
constituents (in bold) in the RST-DT test set, which have relatively high span scores. We did NOT use
any sentence/paragraph boundaries for scoring.
of sentences (or paragraphs), which is natural
because RB is mainly used for generating initial
trees in EM training. The bottom part of Table 4
demonstrates that the beginning position of the text
span is also important to estimate constituenthood,
along with the ending position.
6 Conclusion
In this paper, we introduced an unsupervised
discourse constituency parsing algorithm that uses
Viterbi EM with a margin-based criterion to train
a span-based neural parser. We also introduced
initialization methods for the Viterbi training
of discourse constituents. We observed that our
unsupervised parser achieves comparable or even
superior performance to the baselines and fully
supervised parsers. We also found that learned
discourse constituents depend strongly on initial-
ization used in Viterbi EM, and it is necessary to
explore other initialization techniques to capture
more diverse discourse phenomena.
This study has two limitations. First,
this work focuses only on unlabeled discourse
constituent structures. Although such hierarchical
information is useful in downstream applications
(Louis et al., 2010), both nuclearity statuses and
rhetorical relations are also necessary for a more
complete RST analysis. Second, our study uses
only English documents for evaluation. However,
different languages may have different structural
regularities. Hence, it would be interesting to in-
vestigate whether the initialization methods are
effective in different languages, which we believe
would offer insights into discourse-level universals.
We leave these issues for future work.
Acknowledgments
The research results have been achieved by
‘‘Research and Development of Deep Learning
Technology for Advanced Multilingual Speech
Translation’’, the Commissioned Research of Na-
tional Institute of Information and Communica-
tions Technology (NICT), Japan. This work was
also supported by JSPS KAKENHI grant numbers
JP19K22861 and JP18J12366.
References
Hiyan Alshawi. 1996. Head automata and bilin-
gual tiling: Translation with minimal represen-
tations. In Proceedings of the 34th Annual
Meeting of the Association for Computational
Linguistics.
Nicholas Asher and Alex Lascarides. 2003. Logics
and Conversation, Cambridge University Press.
James K. Baker. 1979. Trainable grammars for
speech recognition. In Speech Communication
Papers for the 97th Meeting of the Acoustical
Society of America.
Taylor Berg-Kirkpatrick, Alexandre Bouchard-
Cˆot´e, John DeNero, and Dan Klein. 2010.
Painless unsupervised learning with features.
In Proceedings of the 2010 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies.
Parminder Bhatia, Yangfeng Ji, and Jacob
Eisenstein. 2015. Better document-level senti-
ment analysis from RST discourse parsing.
In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language
Processing.
Peter F. Brown, Vincent J. Della Pietra, Stephen A.
Della Pietra, and Robert L. Mercer. 1993.
The mathematics of statistical machine trans-
lation: Parameter estimation. Computational
Linguistics, 19(2):263–311.
Lynn Carlson, Daniel Marcu, and Mary Ellen
Okurowski. 2001. Building a discourse-tagged
corpus in the framework of Rhetorical Structure
Theory. In Proceedings of the 2nd SIGdial
Workshop on Discourse and Dialogue.
Glenn Carroll and Eugene Charniak. 1992. Two
experiments on learning probabilistic depen-
dency grammars from corpora. In Working
Notes of the Workshop on Statistically-Based NLP
Techniques.
Eugene Charniak. 1993. Statistical Language
Learning. MIT Press.
Yejin Choi and Claire Cardie. 2007. Structured
local training and biased potential functions
for conditional random fields with application
to coreference resolution. In Proceedings of
the 2007 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies.
John DeNero and Dan Klein. 2008. The com-
plexity of phrase alignment problems. In Pro-
ceedings of the 46th Annual Meeting of the
Association for Computational Linguistics.
Vanessa Wei Feng and Graeme Hirst. 2014. A
linear-time bottom-up discourse parser with
constraints and post-editing. In Proceedings of
the 52nd Annual Meeting of the Association for
Computational Linguistics.
David Gaddy, Mitchell Stern, and Dan Klein.
2018. What’s going on in neural constituency
parsers? An analysis. In Proceedings of the
2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies.
Kevin Gimpel and Noah A. Smith. 2012. Con-
cavity and initialization for unsupervised de-
pendency parsing. Negli Atti del 2012
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies.
Sharon Goldwater and Mark Johnson. 2005.
Representation bias in unsupervised learning
of syllable structure. In Proceedings of the 9th
Conference on Computational Natural Language
Learning.
Dave Golland, John DeNero, and Jakob Uszkoreit.
2012. A feature-rich constituent context model
for grammar induction. In Proceedings of the
50th Annual Meeting of the Association for
Computational Linguistics.
Michael Halliday and Ruqaiya Hasan. 1976.
Cohesion in English, Longman.
Hugo Hernault, Helmut Prendinger, David A.
DuVerle, and Mitsuru Ishizuka. 2010. HILDA:
A discourse parser using support vector
machine classification. Dialogue & Discourse,
1(3):1–33.
Yangfeng Ji and Jacob Eisenstein. 2014. Repre-
sentation learning for text-level discourse pars-
ing. In Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2016.
Unsupervised neural dependency parsing. In
Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller,
William Schuler, and Lane Schwartz. 2018.
Unsupervised grammar induction with depth-
bounded PCFG. Transactions of the Association
for Computational Linguistics, 6:211–224.
Shafiq Joty, Giuseppe Carenini, and Raymond T.
Ng. 2015. CODRA: A novel discriminative
framework for rhetorical analysis. Computa-
tional Linguistics, 41(3):385–435.
Shafiq Joty, Giuseppe Carenini, Raymond T.
Ng, and Yashar Mehdad. 2013. Combining
intra- and multi-sentential rhetorical parsing
for document-level discourse analysis. In
Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2015. Adam:
A method for stochastic optimization. In
Proceedings of the International Conference
on Learning Representations.
Dan Klein. 2005. The unsupervised learning
of natural language structure. Ph.D. Thesis,
Stanford University.
Dan Klein and Christopher D. Equipaggio. 2001UN.
Distributional phrase structure induction. In
Proceedings of the 2001 Workshop on Compu-
tational Natural Language Learning.
Dan Klein and Christopher D. Equipaggio. 2001B.
Natural
language grammar induction using
a constituent-context model. In Advances in
Neural Information Processing Systems.
Dan Klein and Christopher D. Equipaggio. 2002.
A generative constituent-context model for
improved grammar induction. In Proceedings
of the 40th Annual Meeting of the Association
for Computational Linguistics.
Dan Klein and Christopher D. Equipaggio. 2004.
Corpus-based induction of syntactic structure:
Models of constituency and dependency. In
Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics.
Naoki Kobayashi, Tsutomu Hirao, Kengo
Nakamura, Hidetaka Kamigaito, Manabu
Okumura, and Masaaki Nagata. 2019. Split or
merge: Which is better for unsupervised RST
parsing? In Proceedings of the 2019 Conference
on Empirical Methods in Natural Language
Processing.
Karim Lari and Steve J. Young. 1990. The esti-
mation of stochastic context-free grammars
using the Inside-Outside algorithm. Computer
Speech and Language, 4:35–56.
Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie
Li. 2014. Text-level discourse dependency
parsing. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics.
Annie Louis, Aravind Joshi, and Ani Nenkova.
2010. Discourse indicators for content selection
in summarization. In SIGDIAL’10.
William C. Mann and Sandra A. Thompson.
1988. Rhetorical Structure Theory: Towards
a functional theory of text organization. Text-
Interdisciplinary Journal for the Study of Dis-
course, 8(3):243–281.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
Daniel Marcu, Magdalena Romera, and Estibaliz Amorrortu. 1999. Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In Proceedings of the ACL'99 Workshop on Standards and Tools for Discourse Tagging.
Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
David McClosky, Eugene Charniak, and Mark Johnson. 2006a. Effective self-training for parsing. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
David McClosky, Eugene Charniak, and Mark Johnson. 2006b. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, 10(1):25–55.
Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Mathieu Morey, Philippe Muller, and Nicholas Asher. 2018. A dependency perspective on RST discourse parsing and evaluation. Computational Linguistics, 44(2):197–235.
Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.
Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
Livia Polanyi. 1985. A theory of discourse structure and discourse coherence. In Proceedings of the 21st Regional Meeting of the Chicago Linguistics Society.
Livia Polanyi and Martin Van den Berg. 2011.
Discourse structure and sentiment. In 2011
IEEE 11th International Conference on Data
Mining Workshops.
Andrew Radford. 1988. Transformational Grammar. Cambridge University Press.
Kenji Sagae. 2009. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Workshop on Parsing Technologies.
Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
Noah A. Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
Noah Ashton Smith. 2006. Novel estimation methods for unsupervised discovery of latent structure in natural language text. Ph.D. thesis, Johns Hopkins University.
Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, and Christopher D. Manning. 2010. Viterbi training improves unsupervised
dependency parsing. In Proceedings of the 14th Conference on Computational Natural Language Learning.
Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.