Unsupervised Discourse Constituency Parsing Using Viterbi EM
Noriki Nishida and Hideki Nakayama
Graduate School of Information Science and Technology
The University of Tokyo
{nishida,nakayama}@nlab.ci.i.u-tokyo.ac.jp
Abstract
In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.
1 Introduction
Natural language text is generally coherent (Halliday and Hasan, 1976) and can be analyzed as discourse structures, which formally describe how text is coherently organized. In discourse structure, linguistic units (e.g., clauses, sentences, or larger textual spans) are connected together semantically and pragmatically, and no unit is independent or isolated. Discourse parsing aims to uncover discourse structures automatically for given text and has been proven to be useful in various NLP applications, such as document summarization (Marcu, 2000; Louis et al., 2010; Yoshida et al., 2014), sentiment analysis (Polanyi and Van den Berg, 2011; Bhatia et al., 2015), and automated essay scoring (Miltsakaki and Kukich, 2004).
Despite the promising progress achieved in recent decades (Carlson et al., 2001; Hernault et al., 2010; Ji and Eisenstein, 2014; Feng and Hirst, 2014; Li et al., 2014; Joty et al., 2015; Morey et al., 2017), discourse parsing still remains a significant challenge. The difficulty is due in part to the shortage and low reliability of hand-annotated discourse structures. To develop a better-generalized parser, existing algorithms require larger amounts of training data. However, manually annotating discourse structures is expensive, time-consuming, and sometimes highly ambiguous (Marcu et al., 1999).
One possible solution to these problems is grammar induction (or unsupervised syntactic parsing) algorithms for discourse parsing. However, existing studies on unsupervised parsing mainly focus on sentence structures, such as phrase structures (Lari and Young, 1990; Klein and Manning, 2002; Golland et al., 2012; Jin et al., 2018) or dependency structures (Klein and Manning, 2004; Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Jiang et al., 2016), though text-level structural regularities can also exist beyond the scope of a single sentence. For example, in order to convey information to readers as intended, a writer should arrange utterances in a coherent order.
We tackle these problems by introducing unsupervised discourse parsing, which induces discourse structures for given text without relying on human-annotated discourse structures. Based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), which is one of the most widely accepted theories of discourse structure, we assume that coherent text can be represented as tree structures, such as the one in Figure 1. The leaf nodes correspond to non-overlapping clause-level text spans called elementary discourse units (EDUs). Consecutive text spans are combined to each other recursively in a bottom–up manner to form larger text spans (represented by internal nodes) up to a global document span. These text spans are called discourse constituents. The internal nodes are labeled with both nuclearity statuses (e.g., Nucleus-Satellite or NS) and rhetorical
Transactions of the Association for Computational Linguistics, vol. 8, pp. 215–230, 2020. https://doi.org/10.1162/tacl_a_00312
Action Editor: Yuji Matsumoto. Submission batch: 07/2019; Revision batch: 12/2019; Published 4/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: An example of the RST-based discourse constituent structure we assume in this paper. Leaf nodes x_i correspond to non-overlapping clause-level text segments, and internal nodes consist of three complementary elements: discourse constituents x_{i:j}, discourse nuclearities (e.g., NS), and discourse relations (e.g., ELABORATION).
relations (e.g., ELABORATION, CONTRAST) that hold between connected text spans.
In this paper, we especially focus on unsupervised induction of an unlabeled discourse constituent structure (i.e., a set of unlabeled discourse constituent spans) given a sequence of EDUs, which corresponds to the first tree-building step in conventional RST parsing. Such constituent structures provide hierarchical information of input text, which is useful in downstream tasks (Louis et al., 2010). For example, a constituent structure [X [Y Z]] indicates that text span Y is preferentially combined with Z (rather than X) to form a constituent span, and then the text span [Y Z] is connected with X. In other words, this structure implies that [X Y] is a distituent span and requires Z to become a constituent span. Our challenge is to find such discourse-level constituentness from EDU sequences.
The core hypothesis of this paper is that discourse tree structures and syntactic tree structures share the same (or similar) constituent properties at a metalevel, and therefore, learning algorithms developed for grammar induction are transferable to unsupervised discourse constituency parsing with proper modifications. In fact, RST structures can be formulated in a similar way as phrase structures in the Penn Treebank, though there are a few differences: The leaf nodes are not words but EDUs (e.g., clauses), and the internal nodes do not contain phrase labels but hold nuclearity statuses and rhetorical relations.
The expectation-maximization (EM) algorithm (Klein and Manning, 2004) has been the dominant unsupervised learning algorithm for grammar induction. Based on our hypothesis and this fact, we develop a span-based discourse parser (in an unsupervised manner) by using Viterbi EM (or ‘‘hard’’ EM) (Neal and Hinton, 1998; Spitkovsky et al., 2010; DeNero and Klein, 2008; Choi and Cardie, 2007; Goldwater and Johnson, 2005) with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018).1 Unlike the classic EM algorithm using inside–outside re-estimation (Baker, 1979), Viterbi EM allows us to avoid explicitly counting discourse constituent patterns, which are generally too sparse to estimate reliable scores of text spans.
The other technical contribution is to present
effective initialization methods for Viterbi training
of discourse constituents. We introduce initial-tree
sampling methods based on our prior knowledge
of document structures. We show that proper
initialization is crucial in this task, as observed
in grammar induction (Klein and Manning, 2004;
Gimpel and Smith, 2012).
On the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), we compared our parse trees with manually annotated ones. We observed that our method achieves a Micro F1 score of 68.6% (84.6%) on the (corrected) RST-PARSEVAL (Marcu, 2000; Morey et al., 2018), which is comparable with or even superior to fully supervised parsers. We also investigated the discourse constituents that can or cannot be learned well by our method.
1Our code can be found at https://github.com/norikinishida/DiscourseConstituencyInduction-ViterbiEM.
The rest of this paper is organized as follows: Section 2 introduces the related work. Section 3 gives the details of our parsing model and training algorithm. Section 4 describes the experimental setting, and Section 5 discusses the experimental results. Conclusions are given in Section 6.
2 Related Work
The earliest studies that use EM in unsupervised parsing are Lari and Young (1990) and Carroll and Charniak (1992), which attempted to induce probabilistic context-free grammars (PCFGs) and probabilistic dependency grammars using the classic inside–outside algorithm (Baker, 1979). Klein and Manning (2001b, 2002) perform a weakened version of constituent tests (Radford, 1988) with the Constituent-Context Model (CCM), which, unlike a PCFG, describes whether a contiguous text span (such as DT JJ NN) is a constituent or a distituent. The CCM uses EM to learn constituenthood over part-of-speech (POS) tags and for the first time outperformed the strong right-branching baseline in unsupervised constituency parsing. Klein and Manning (2004) proposed the Dependency Model with Valence (DMV), which is a head automata model (Alshawi, 1996) for unsupervised dependency parsing over POS tags and also relies on EM. These two models have been extended in various works for further improvements (Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Golland et al., 2012; Jiang et al., 2016).
In general, these methods use the inside–outside (dynamic programming) re-estimation (Baker, 1979) in the E step. However, Spitkovsky et al. (2010) showed that Viterbi training (Brown et al., 1993), which uses only the best-scoring tree to count the grammatical patterns, is not only computationally more efficient but also empirically more accurate on longer sentences. These properties are, thus, suitable for ‘‘document-level’’ grammar induction, where the document length (i.e., the number of EDUs) tends to be long.2 In addition, as explained later in Section 3, we incorporate Viterbi EM with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018); this allows us to avoid explicitly counting each possible discourse constituent pattern symbolically, which is generally too sparse and appears only once.
2Prior studies on grammar induction generally use sentences up to length 10, 15, or 40. On the other hand, about half the documents in the RST-DT corpus (Carlson et al., 2001) are longer than 40.
Prior studies (Klein and Manning, 2004; Gimpel and Smith, 2012; Naseem et al., 2010) have shown that initialization or linguistic knowledge plays an important role in EM-based grammar induction. Gimpel and Smith (2012) demonstrated that a properly initialized DMV achieves improvements in attachment accuracy by 20 ∼ 40 points (i.e., 21.3% → 64.3%), compared with the uniform initialization. Naseem et al. (2010) also found that controlling the learning process with prior (universal) linguistic knowledge improves the parsing performance of DMV. These studies usually rely on insights on syntactic structures. In this paper, we explore discourse-level prior knowledge for effective initialization of the Viterbi training of discourse constituency parsers.
Our method also relies on recent work on RST parsing. In particular, one of the initialization methods in our EM training (in Section 3.3 (i)) is inspired by the inter-sentential and multi-sentential approach used in RST parsing (Feng and Hirst, 2014; Joty et al., 2013, 2015). We also follow prior studies (Sagae, 2009; Ji and Eisenstein, 2014) and utilize syntactic information, i.e., dependency heads, which contributes to further performance gains in our method.
The most similar work to that presented here is Kobayashi et al. (2019), who propose unsupervised RST parsing algorithms in parallel with our work. Their method builds an unlabeled discourse tree by using the CKY dynamic programming algorithm. The tree-merging (splitting) scores in CKY are defined as similarity (dissimilarity) between adjacent text spans. The similarity scores are calculated based on distributed representations using pre-trained embeddings. However, similarities between adjacent elements are not always good indicators of constituentness. Consider the tag sequences ‘‘VBD IN’’ and ‘‘IN NN’’. The former is an example of a distituent sequence, whereas the latter is a constituent. ‘‘VBD’’, ‘‘IN’’, and ‘‘NN’’ may have similar distributed representations because these tags co-occur frequently in corpora. This implies that it is difficult to distinguish constituents and distituents if we use
only similarity (dissimilarity) measures. In this paper, we aim to mitigate this issue by introducing parameterized models to learn discourse constituentness.
3 Methodology
In this section, we first describe the parsing model we develop. Next, we explain how to train the model in an unsupervised manner by using Viterbi EM. Finally, we present the initialization methods we use for further improvements.
3.1 Parsing Model
The parsing problem in this study is to find the unlabeled constituent structure with the highest score for an input text x, that is,

ˆT = argmax_{T ∈ valid(x)} s(x, T),   (1)

where s(x, T) ∈ R denotes a real-valued score of a tree T, and valid(x) represents the set of all valid trees for x. We assume that x has already been manually segmented into a sequence of EDUs: x = x_0, …, x_{n−1}.
Inspired by the success of recent span-based constituency parsers (Stern et al., 2017; Gaddy et al., 2018), we define the tree score as the sum of constituent scores over internal nodes, that is,

s(x, T) = Σ_{(i,j) ∈ T} s(i, j).   (2)

Thus, our parsing model consists of a single scoring function s(i, j) that computes a constituent score of a contiguous text span x_{i:j} = x_i, …, x_j, or simply (i, j). The higher the value of s(i, j), the more likely it is that x_{i:j} is a discourse constituent.
We show our parsing model in Figure 2. Our implementation of s(i, j) can be decomposed into three modules: EDU-level feature extraction, span-level feature extraction, and span scoring. We discuss each of these in turn. Later, we also explain the decoding algorithm that we use to find the globally best-scoring tree.
Feature Extraction and Scoring
Inspired by existing RST parsers (Ji and Eisenstein, 2014; Li et al., 2014; Joty et al., 2015), we first encode the beginning and end words of an EDU:

v_i^{bw} = Embed_w(bw),   (3)
v_i^{ew} = Embed_w(ew),   (4)

where bw and ew denote the beginning and end words of the i-th EDU, and Embed_w is a function that returns a parameterized embedding of the input word.
We also encode the POS tags corresponding to bw and ew as follows:

v_i^{bp} = Embed_p(bp),   (5)
v_i^{ep} = Embed_p(ep),   (6)

where Embed_p is an embedding function for POS tags.
Prior work (Sagae, 2009; Ji and Eisenstein, 2014) has shown that syntactic cues can accelerate discourse parsing performance. We therefore extract syntactic features from each EDU. We apply a (syntactic) dependency parser to each sentence in the input text,3 and then choose a head word for each EDU. A head word is a token whose parent in the dependency graph is ROOT or is not within the EDU.4 We also extract the POS tag and the dependency label corresponding to the head word. A dependency label is a relation between a head word and its parent.
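The head-word rule above can be sketched in a few lines: within an EDU (a contiguous token range), the head is a token whose dependency parent is ROOT or lies outside the EDU, taking the leftmost if several qualify (footnote 4). The toy parse and the ROOT-as-−1 encoding below are illustrative assumptions, not Stanford CoreNLP output.

```python
# Sketch of the head-word selection rule: the head of an EDU is a token
# whose dependency parent is ROOT (encoded as -1 here) or falls outside
# the EDU's token range; ties go to the leftmost token (footnote 4).
def edu_head(start, end, parents):
    """Return the index of the head token of the EDU covering tokens [start, end)."""
    for t in range(start, end):  # scanning left to right picks the leftmost head
        p = parents[t]
        if p == -1 or not (start <= p < end):
            return t
    return start  # fallback; unreachable for a well-formed dependency tree

# Toy sentence, tokens: 0:"John" 1:"said" 2:"that" 3:"it" 4:"rained"
# Arcs: John->said, said->ROOT, that->rained, it->rained, rained->said
parents = [1, -1, 4, 4, 1]
print(edu_head(0, 2, parents))  # 1 ("said": attached to ROOT)
print(edu_head(2, 5, parents))  # 4 ("rained": parent lies outside the EDU)
```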
To sum up, we now have triplets of head information, {(hw, hp, hr)_i}_{i=0}^{n−1}, each denoting the head word, the head POS, and the head relation of the i-th EDU, respectively. We embed these symbols using look-up tables:

v_i^{hw} = Embed_w(hw),   (7)
v_i^{hp} = Embed_p(hp),   (8)
v_i^{hr} = Embed_r(hr),   (9)

where Embed_r is an embedding function for dependency relations.
Finalmente, we concatenate these embeddings:
i = [vbw
v′
i
; vew
i
; vbp
i
; vep
i ; vhw
i
; vhp
i
; vhr
i ],
(10)
and then transform it using a linear projection and
Rectified Linear Unit (ReLU) activation function:
vi = ReLU (W v′
i + b).
(11)
En el siguiente, we use {vi}n−1
vectors for the EDUs, {xi}n−1
i=0 .
i=0 as the feature
3We apply the Stanford CoreNLP parser (Manning et al., 2014) to the concatenation of the EDUs; https://stanfordnlp.github.io/CoreNLP/.
4If there are multiple head words in an EDU, we choose the leftmost one.
Cifra 2: Our span-based discourse parsing model. We first encode each EDU based on the beginning and ending
words and POS tags using embeddings. We also embed head information of each EDU. We then run a bidirectional
LSTM and concatenate the span differences. The resulting vector is used to predict the constituent score of the
text span (i, j). This figure illustrates the process for the span (1, 2).
Following the span-based parsing models developed in the syntax domain (Stern et al., 2017; Gaddy et al., 2018), we then run a bidirectional Long Short-Term Memory (LSTM) network over the sequence of EDU representations, {v_i}_{i=0}^{n−1}, resulting in forward and backward representations for each step i (0 ≤ i ≤ n − 1):

→h_0, …, →h_{n−1} = →LSTM(v_0, …, v_{n−1}),   (12)
←h_0, …, ←h_{n−1} = ←LSTM(v_0, …, v_{n−1}).   (13)
We then compute a feature vector for a span (i, j) by concatenating the forward and backward span differences:

h_{i,j} = [→h_j − →h_{i−1}; ←h_i − ←h_{j+1}].   (14)

The feature vector, h_{i,j}, is assumed to represent the content of the contiguous text span x_{i:j} along with contextual information captured by the LSTMs.5
We did not use any feature templates because we found that they did not improve parsing performance in our unsupervised setting, though we observed that template features roughly following Joty et al. (2015) improved performance in a supervised setting.
Finalmente, given a span-level feature vector, hi,j,
we use two-layer perceptrons with the ReLU
activation function:
s(i, j) = MLP (hi,j),
(15)
which computes the constituent score of the
contiguous text span xi:j.
5A detailed investigation of the span-based parsing model
using LSTM can be found in Gaddy et al. (2018).
Decoding
We use a Cocke–Kasami–Younger (CKY)-style dynamic programming algorithm to perform a global search over the space of valid trees and find the highest-scoring tree. For a document with n EDUs, we use an n × n table C, the cell C[i, j] of which stores the subtree score spanning from i to j. For spans of length one (i.e., i = j), we assign constant scalar values:

C[i, i] = 1.   (16)

For general spans 0 ≤ i < j ≤ n − 1, we define the following recursion:

C[i, j] = s(i, j) + max_{i ≤ k < j} (C[i, k] + C[k + 1, j]).   (17)
As shown in the upper part of Table 4, we can observe that our method learns some aspects of discourse constituentness that seem linguistically reasonable. In particular, we found that our method has a potential to predict brackets for (1) clauses with connectives qualifying other clauses from right to left (e.g., ‘‘X [because B.]’’) and (2) attribution structures (e.g., ‘‘say that [B]’’). These results indicate that our method is good at identifying discourse constituents near the end
[The bankruptcy-court reorganization is being challenged … by a dissident group of claimants] [because it places a cap on the total amount of money available] [to settle claims.] [It also bars future suits against …] (11.74)
[The first two GAF trials were watched closely on Wall Street] [because they were considered to be important tests of the government’s ability] [to convince a jury of allegations] [stemming from its insider-trading investigations.] [In an eight-count indictment, the government charged GAF, …] (10.16)
[The posters were sold for $1,300 to $6,000,] [although the government says] [they had a value of only $53 to $200 apiece.] [Henry Pitman, the assistant U.S. attorney] [handling the case,] [said] [about …] (11.31)
[The office, an arm of the Treasury, said] [it doesn’t have data on the financial position of applicants] [and thus can’t determine] [why blacks are rejected more often.] [However, on Capitol Hill,] [where …] (11.57)
[After 93 hours of deliberation, the jurors in the second trial said] [they were hopelessly deadlocked,] [and another mistrial was declared on March 22.] [Meanwhile, a federal jury found Mr. Bilzerian …] (11.66)
[(‘‘I think | she knows me,] [but I’m not sure ’’)] [and Bridget Fonda, the actress] [(‘‘She knows me,] [but we’re not really the best of friends’’).] [Mr. Revson, the gossip columnist, says] [there are people] [who …] (11.11)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (4.44)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (11.04)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (5.50)
[its vice president … resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (7.68)
Table 4: Discourse constituents and their predicted scores (in parentheses). We show the discourse constituents (in bold) in the RST-DT test set, which have relatively high span scores. We did NOT use any sentence/paragraph boundaries for scoring.
of sentences (or paragraphs), which is natural
because RB is mainly used for generating initial
trees in EM training. The bottom part of Table 4
demonstrates that the beginning position of the text
span is also important to estimate constituenthood,
along with the ending position.
6 Conclusion
In this paper, we introduced an unsupervised discourse constituency parsing algorithm that uses Viterbi EM with a margin-based criterion to train a span-based neural parser. We also introduced initialization methods for the Viterbi training of discourse constituents. We observed that our unsupervised parser achieves comparable or even superior performance to the baselines and fully supervised parsers. We also found that the learned discourse constituents depend strongly on the initialization used in Viterbi EM, and it is necessary to explore other initialization techniques to capture more diverse discourse phenomena.
This study has two limitations. First, this work focuses only on unlabeled discourse constituent structures. Although such hierarchical information is useful in downstream applications (Louis et al., 2010), both nuclearity statuses and rhetorical relations are also necessary for a more complete RST analysis. Second, our study uses only English documents for evaluation. However, different languages may have different structural regularities. Hence, it would be interesting to investigate whether the initialization methods are effective in different languages, which we believe would give suggestions on discourse-level universals. We leave these issues as future work.
Acknowledgments
The research results have been achieved by ‘‘Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation’’, the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. This work was also supported by JSPS KAKENHI grant numbers JP19K22861 and JP18J12366.
References
Hiyan Alshawi. 1996. Head automata and bilingual tiling: Translation with minimal representations. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.
James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America.
Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Lynn Carlson, Daniel Marcu, and Mary Ellen
Okurowski. 2001. Building a discourse-tagged
corpus in the framework of Rhetorical Structure
Theory. In Proceedings of the 2nd SIGdial
Workshop on Discourse and Dialogue.
Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. In Working Notes of the Workshop on Statistically-based NLP Techniques.
Eugene Charniak. 1993. Statistical Language Learning. MIT Press.
Yejin Choi and Claire Cardie. 2007. Structured local training and biased potential functions for conditional random fields with application to coreference resolution. In Proceedings of the 2007 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics.
Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
David Gaddy, Mitchell Stern, and Dan Klein. 2018. What’s going on in neural constituency parsers? An analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Kevin Gimpel and Noah A. Smith. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies.
Sharon Goldwater and Mark Johnson. 2005.
Representation bias in unsupervised learning
of syllable structure. In Proceedings of the 9th
Conference on Natural Language Learning.
Dave Golland, John DeNero, and Jakob Uszkoreit. 2012. A feature-rich constituent context model for grammar induction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.
Michael Halliday and Ruqaiya Hasan. 1976.
Cohesion in English, Longman.
Hugo Hernault, Helmut Prendinger, David A.
DuVerle, and Mitsuru Ishizuka. 2010. HILDA:
A discourse parser using support vector
machine classification. Dialogue & Discourse, 1(3):1–33.
Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Unsupervised grammar induction with depth-bounded PCFG. Transactions of the Association for Computational Linguistics, 6:211–224.
Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.
Shafiq Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. 2013. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
Dan Klein. 2005. The unsupervised learning of natural language structure. Ph.D. thesis, Stanford University.
Dan Klein and Christopher D. Manning. 2001a. Distributional phrase structure induction. In Proceedings of the 2001 Workshop on Computational Natural Language Learning.
Dan Klein and Christopher D. Manning. 2001b. Natural language grammar induction using a constituent-context model. In Advances in Neural Information Processing Systems.
Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of constituency and dependency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.
Naoki Kobayashi, Tsutomu Hirao, Kengo Nakamura, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the Inside–Outside algorithm. Computer Speech and Language, 4:35–56.
Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie Li. 2014. Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
Annie Louis, Aravind Joshi, and Ani Nenkova.
2010. Discourse indicators for content selection
in summarization. In Proceedings of the SIGDIAL 2010 Conference.
William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
Daniel Marcu, Magdalena Romera, and Estibaliz Amorrortu. 1999. Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In Proceedings of the ACL'99 Workshop on Standards and Tools for Discourse Tagging.
Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. MIT Press.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
David McClosky, Eugene Charniak, and Mark Johnson. 2006a. Effective self-training for parsing. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
David McClosky, Eugene Charniak, and Mark
Johnson. 2006b. Reranking and self-training for
parser adaptation. In Proceedings of the 21st
International Conference on Computational
Linguistics and the 44th Annual Meeting of the
Association for Computational Linguistics.
Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, 10(1):25–55.
Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Mathieu Morey, Philippe Muller, and Nicholas
Asher. 2018. A dependency perspective on
RST discourse parsing and evaluation. Computational Linguistics, 44(2):197–235.
Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.
Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
Livia Polanyi. 1985. A theory of discourse structure and discourse coherence. In Proceedings of the 21st Regional Meeting of the Chicago Linguistics Society.
Livia Polanyi and Martin Van den Berg. 2011.
Discourse structure and sentiment. In 2011
IEEE 11th International Conference on Data
Mining Workshops.
Andrew Radford. 1988. Transformational Grammar. Cambridge University Press.
Kenji Sagae. 2009. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Workshop on Parsing Technologies.
Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics.
Noah A. Smith and Jason Eisner. 2006. Anneal-
ing structural bias in multilingual weighted
grammar induction. In Proceedings of the 21st
International Conference on Computational
Linguistics and the 44th Annual Meeting of the
Association for Computational Linguistics.
Noah Ashton Smith. 2006. Novel estimation
methods for unsupervised discovery of latent
structure in natural language text. Ph.D. thesis, Johns Hopkins University.
Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, and Christopher D. Manning. 2010. Viterbi training improves unsupervised
dependency parsing. In Proceedings of the 14th Conference on Computational Natural Language Learning.
Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.