Unsupervised Discourse Constituency Parsing Using Viterbi EM

Noriki Nishida and Hideki Nakayama

Graduate School of Information Science and Technology
The University of Tokyo
{nishida,nakayama}@nlab.ci.i.u-tokyo.ac.jp

Abstract

In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.

1 Introduction

Natural language text is generally coherent (Halliday and Hasan, 1976) and can be analyzed as discourse structures, which formally describe how text is coherently organized. In a discourse structure, linguistic units (e.g., clauses, sentences, or larger textual spans) are connected together semantically and pragmatically, and no unit is independent or isolated. Discourse parsing aims to uncover discourse structures automatically for given text and has been proven to be useful in various NLP applications, such as document summarization (Marcu, 2000; Louis et al., 2010; Yoshida et al., 2014), sentiment analysis (Polanyi and Van den Berg, 2011; Bhatia et al., 2015), and automated essay scoring (Miltsakaki and Kukich, 2004).

Despite the promising progress achieved in recent decades (Carlson et al., 2001; Hernault et al., 2010; Ji and Eisenstein, 2014; Feng and Hirst, 2014; Li et al., 2014; Joty et al., 2015; Morey et al., 2017), discourse parsing still remains a significant challenge. The difficulty is due in part to the shortage and low reliability of hand-annotated discourse structures. To develop a better-generalized parser, existing algorithms require larger amounts of training data. However, manually annotating discourse structures is expensive, time-consuming, and sometimes highly ambiguous (Marcu et al., 1999).

One possible solution to these problems is grammar induction (or unsupervised syntactic parsing) algorithms for discourse parsing. However, existing studies on unsupervised parsing mainly focus on sentence structures, such as phrase structures (Lari and Young, 1990; Klein and Manning, 2002; Golland et al., 2012; Jin et al., 2018) or dependency structures (Klein and Manning, 2004; Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Jiang et al., 2016), though text-level structural regularities can also exist beyond the scope of a single sentence. For example, in order to convey information to readers as intended, a writer should arrange utterances in a coherent order.

We tackle these problems by introducing unsupervised discourse parsing, which induces discourse structures for given text without relying on human-annotated discourse structures. Based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), one of the most widely accepted theories of discourse structure, we assume that coherent text can be represented as a tree structure, such as the one in Figure 1. The leaf nodes correspond to non-overlapping clause-level text spans called elementary discourse units (EDUs). Consecutive text spans are combined with each other recursively in a bottom-up manner to form larger text spans (represented by internal nodes) up to a global document span. These text spans are called discourse constituents. The internal nodes are labeled with both nuclearity statuses (e.g., Nucleus-Satellite or NS) and rhetorical relations (e.g., ELABORATION, CONTRAST) that hold between connected text spans.


Figure 1: An example of the RST-based discourse constituent structure we assume in this paper. Leaf nodes x_i correspond to non-overlapping clause-level text segments, and internal nodes consist of three complementary elements: discourse constituents x_{i:j}, discourse nuclearities (e.g., NS), and discourse relations (e.g., ELABORATION).


In this paper, we especially focus on unsupervised induction of an unlabeled discourse constituent structure (i.e., a set of unlabeled discourse constituent spans) given a sequence of EDUs, which corresponds to the first tree-building step in conventional RST parsing. Such constituent structures provide hierarchical information about the input text, which is useful in downstream tasks (Louis et al., 2010). For example, a constituent structure [X [Y Z]] indicates that text span Y is preferentially combined with Z (rather than X) to form a constituent span, and then the text span [Y Z] is connected with X. In other words, this structure implies that [X Y] is a distituent span and requires Z to become a constituent span. Our challenge is to find such discourse-level constituentness from EDU sequences.

The core hypothesis of this paper is that discourse tree structures and syntactic tree structures share the same (or similar) constituent properties at a metalevel, and thus learning algorithms developed for grammar induction are transferable to unsupervised discourse constituency parsing with proper modifications. In fact, RST structures can be formulated in a similar way to phrase structures in the Penn Treebank, though there are a few differences: The leaf nodes are not words but EDUs (e.g., clauses), and the internal nodes do not contain phrase labels but instead hold nuclearity statuses and rhetorical relations.

The expectation-maximization (EM) algorithm (Klein and Manning, 2004) has been the dominant unsupervised learning algorithm for grammar induction. Based on our hypothesis and this fact, we develop a span-based discourse parser (in an unsupervised manner) by using Viterbi EM (or ''hard'' EM) (Neal and Hinton, 1998; Spitkovsky et al., 2010; DeNero and Klein, 2008; Choi and Cardie, 2007; Goldwater and Johnson, 2005) with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018).1 Unlike the classic EM algorithm using inside-outside re-estimation (Baker, 1979), Viterbi EM allows us to avoid explicitly counting discourse constituent patterns, which are generally too sparse to estimate reliable scores of text spans.
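To make this concrete, the loop below sketches Viterbi EM at a high level; it is an illustration only, and initial_tree, update, and decode are hypothetical callables standing in for the initial-tree sampler (Section 3.3), the margin-based parameter update, and the CKY decoder described in Section 3.

def viterbi_em(documents, initial_tree, update, decode, num_iters=10):
    # Start from heuristically sampled trees; proper initialization
    # is crucial in this task (Section 3.3).
    trees = [initial_tree(doc) for doc in documents]
    for _ in range(num_iters):
        # M step: treat the current best trees as pseudo-gold and
        # update the span scorer with a margin-based (hinge) criterion,
        # so each tree out-scores competing trees by some margin.
        for doc, tree in zip(documents, trees):
            update(doc, tree)
        # E step ("hard"): keep only the single best-scoring tree per
        # document, instead of inside-outside expectations over all trees.
        trees = [decode(doc) for doc in documents]
    return trees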
The other technical contribution is to present effective initialization methods for the Viterbi training of discourse constituents. We introduce initial-tree sampling methods based on our prior knowledge of document structures. We show that proper initialization is crucial in this task, as observed in grammar induction (Klein and Manning, 2004; Gimpel and Smith, 2012); a minimal example of an initial tree is sketched below.
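For concreteness, the sketch below builds the simplest such initial structure, a fully right-branching (RB) tree over n EDUs; RB is the baseline initializer referred to in Section 5, while the paper's actual sampling schemes (Section 3.3) are more elaborate and are not reproduced here.

def right_branching_tree(n):
    # A fully right-branching tree over EDUs x_0 .. x_{n-1}:
    # [x_0 [x_1 [ ... [x_{n-2} x_{n-1}] ... ]]], returned as the
    # list of internal-node spans (i, j).
    return [(i, n - 1) for i in range(n - 1)]

# For a 4-EDU document: [(0, 3), (1, 3), (2, 3)].
print(right_branching_tree(4))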

On the RST Discourse Treebank (RST-DT) (Carlson et al., 2001), we compared our parse trees with manually annotated ones. We observed that our method achieves a Micro F1 score of 68.6% (84.6%) in the (corrected) RST-PARSEVAL (Marcu, 2000; Morey et al., 2018), which is comparable with or even superior to fully supervised parsers. We also investigated the discourse constituents that can or cannot be learned well by our method.

1Our code can be found at https://github.com/norikinishida/DiscourseConstituencyInduction-ViterbiEM.

The rest of this paper is organized as follows: Section 2 introduces the related work. Section 3 gives the details of our parsing model and training algorithm. Section 4 describes the experimental setting, and Section 5 discusses the experimental results. Conclusions are given in Section 6.

2 Related Work

The earliest studies that use EM in unsupervised parsing are Lari and Young (1990) and Carroll and Charniak (1992), which attempted to induce probabilistic context-free grammars (PCFGs) and probabilistic dependency grammars using the classic inside-outside algorithm (Baker, 1979). Klein and Manning (2001b, 2002) perform a weakened version of constituent tests (Radford, 1988) with the Constituent-Context Model (CCM), which, unlike a PCFG, describes whether a contiguous text span (such as DT JJ NN) is a constituent or a distituent. The CCM uses EM to learn constituenthood over part-of-speech (POS) tags and for the first time outperformed the strong right-branching baseline in unsupervised constituency parsing. Klein and Manning (2004) proposed the Dependency Model with Valence (DMV), which is a head automata model (Alshawi, 1996) for unsupervised dependency parsing over POS tags and also relies on EM. These two models have been extended in various ways for further improvements (Berg-Kirkpatrick et al., 2010; Naseem et al., 2010; Golland et al., 2012; Jiang et al., 2016).

In general, these methods use the inside-outside (dynamic programming) re-estimation (Baker, 1979) in the E step. However, Spitkovsky et al. (2010) showed that Viterbi training (Brown et al., 1993), which uses only the best-scoring tree to count the grammatical patterns, is not only computationally more efficient but also empirically more accurate on longer sentences. These properties are, therefore, suitable for ''document-level'' grammar induction, where the document length (i.e., the number of EDUs) tends to be long.2 In addition, as explained later in Section 3, we incorporate Viterbi EM with a margin-based criterion (Stern et al., 2017; Gaddy et al., 2018); this allows us to avoid explicitly counting each possible discourse constituent pattern symbolically, as such patterns are generally too sparse and often appear only once.

2Prior studies on grammar induction generally use sentences up to length 10, 15, or 40. On the other hand, about half the documents in the RST-DT corpus (Carlson et al., 2001) are longer than 40.

Prior studies (Klein and Manning, 2004; Gimpel and Smith, 2012; Naseem et al., 2010) have shown that initialization or linguistic knowledge plays an important role in EM-based grammar induction. Gimpel and Smith (2012) demonstrated that a properly initialized DMV achieves improvements in attachment accuracy of 20-40 points (i.e., 21.3% → 64.3%) compared with uniform initialization. Naseem et al. (2010) also found that controlling the learning process with prior (universal) linguistic knowledge improves the parsing performance of the DMV. These studies usually rely on insights into syntactic structures. In this paper, we explore discourse-level prior knowledge for effective initialization of the Viterbi training of discourse constituency parsers.

Our method also relies on recent work on RST parsing. In particular, one of the initialization methods in our EM training (in Section 3.3 (i)) is inspired by the intra-sentential and multi-sentential approach used in RST parsing (Feng and Hirst, 2014; Joty et al., 2013, 2015). We also follow prior studies (Sagae, 2009; Ji and Eisenstein, 2014) and utilize syntactic information, i.e., dependency heads, which contributes to further performance gains in our method.

The most similar work to that presented here is Kobayashi et al. (2019), who proposed unsupervised RST parsing algorithms in parallel with our work. Their method builds an unlabeled discourse tree by using the CKY dynamic programming algorithm. The tree-merging (splitting) scores in CKY are defined as similarity (dissimilarity) between adjacent text spans. The similarity scores are calculated based on distributed representations using pre-trained embeddings. However, similarity between adjacent elements is not always a good indicator of constituentness. Consider the tag sequences ''VBD IN'' and ''IN NN''. The former is an example of a distituent sequence, whereas the latter is a constituent. ''VBD'', ''IN'', and ''NN'' may have similar distributed representations because these tags co-occur frequently in corpora. This implies that it is difficult to distinguish constituents from distituents if we use

only similarity (dissimilarity) measures. In this paper, we aim to mitigate this issue by introducing parameterized models to learn discourse constituentness.

3 Method

In this section, we first describe the parsing model we develop. Next, we explain how to train the model in an unsupervised manner by using Viterbi EM. Finally, we present the initialization methods we use for further improvements.

3.1 Parsing Model

The parsing problem in this study is to find the unlabeled constituent structure with the highest score for an input text x, that is,

T̂ = arg max_{T∈valid(x)} s(x, T),    (1)

where s(x, T) ∈ R denotes a real-valued score of a tree T, and valid(x) represents the set of all valid trees for x. We assume that x has already been manually segmented into a sequence of EDUs: x = x_0, . . . , x_{n−1}.

Inspired by the success of recent span-based constituency parsers (Stern et al., 2017; Gaddy et al., 2018), we define the tree score as the sum of constituent scores over internal nodes, that is,

s(x, T) = Σ_{(i,j)∈T} s(i, j).    (2)

Thus, our parsing model consists of a single scoring function s(i, j) that computes a constituent score of a contiguous text span x_{i:j} = x_i, . . . , x_j, or simply (i, j). The higher the value of s(i, j), the more likely it is that x_{i:j} is a discourse constituent.
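In other words, a tree is represented by the spans of its internal nodes, and Eq. (2) reduces to the following one-line sketch (shown only to fix notation; s is the learned scoring function described next):

def tree_score(s, internal_spans):
    # Eq. (2): s(x, T) is the sum of s(i, j) over internal nodes of T.
    return sum(s(i, j) for (i, j) in internal_spans)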
We show our parsing model in Figure 2. Our implementation of s(i, j) can be decomposed into three modules: EDU-level feature extraction, span-level feature extraction, and span scoring. We discuss each of these in turn. Afterward, we also explain the decoding algorithm that we use to find the globally best-scoring tree.

Feature Extraction and Scoring
Inspired by existing RST parsers (Ji and Eisenstein, 2014; Li et al., 2014; Joty et al., 2015), we first encode the beginning and end words of an EDU:

v_i^{bw} = Embed_w(b_w),    (3)
v_i^{ew} = Embed_w(e_w),    (4)

where b_w and e_w denote the beginning and end words of the i-th EDU, and Embed_w is a function that returns a parameterized embedding of the input word.

We also encode the POS tags corresponding to b_w and e_w as follows:

v_i^{bp} = Embed_p(b_p),    (5)
v_i^{ep} = Embed_p(e_p),    (6)

where Embed_p is an embedding function for POS tags.

Prior work (Sagae, 2009; Ji and Eisenstein, 2014) has shown that syntactic cues can improve discourse parsing performance. We therefore extract syntactic features from each EDU. We apply a (syntactic) dependency parser to each sentence in the input text,3 and then choose a head word for each EDU. A head word is a token whose parent in the dependency graph is ROOT or is not within the EDU.4 We also extract the POS tag and the dependency label corresponding to the head word. A dependency label is the relation between a head word and its parent.
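The head-word rule can be sketched as follows; the inputs (edu_tokens, a list of token ids in left-to-right order, and parent, a token-to-parent map with ROOT encoded as -1) are hypothetical stand-ins for the dependency parser's output, not an actual CoreNLP API.

def edu_head_word(edu_tokens, parent):
    # A head word is a token whose parent in the dependency graph is
    # ROOT or lies outside the EDU; with multiple candidates, take the
    # leftmost one (footnote 4).
    inside = set(edu_tokens)
    for t in edu_tokens:
        if parent[t] == -1 or parent[t] not in inside:
            return t
    return edu_tokens[0]  # defensive fallback

# Toy usage: tokens 2-4 form an EDU; token 3's parent lies outside.
print(edu_head_word([2, 3, 4], {2: 3, 3: 0, 4: 3}))  # -> 3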

In summary, we now have triplets of head information, {(h_w, h_p, h_r)_i}_{i=0}^{n−1}, each denoting the head word, the head POS tag, and the head relation of the i-th EDU, respectively. We embed these symbols using look-up tables:

v_i^{hw} = Embed_w(h_w),    (7)
v_i^{hp} = Embed_p(h_p),    (8)
v_i^{hr} = Embed_r(h_r),    (9)

where Embed_r is an embedding function for dependency relations.

Finally, we concatenate these embeddings:

v′_i = [v_i^{bw}; v_i^{ew}; v_i^{bp}; v_i^{ep}; v_i^{hw}; v_i^{hp}; v_i^{hr}],    (10)

and then transform it using a linear projection and a Rectified Linear Unit (ReLU) activation function:

v_i = ReLU(W v′_i + b).    (11)

In the following, we use {v_i}_{i=0}^{n−1} as the feature vectors for the EDUs, {x_i}_{i=0}^{n−1}.

3We apply the Stanford CoreNLP parser (Manning et al., 2014) to the concatenation of the EDUs; https://stanfordnlp.github.io/CoreNLP/.

4If there are multiple head words in an EDU, we choose the leftmost one.
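A compact PyTorch sketch of the EDU-level feature extraction in Eqs. (3)-(11) is given below; all vocabulary sizes and dimensions are hypothetical placeholders, and the module simply embeds the seven symbols per EDU, concatenates them, and applies the linear-plus-ReLU projection.

import torch
import torch.nn as nn

class EDUEncoder(nn.Module):
    def __init__(self, n_words, n_pos, n_rels,
                 d_word=100, d_pos=10, d_rel=10, d_out=200):
        super().__init__()
        self.embed_w = nn.Embedding(n_words, d_word)  # Embed_w
        self.embed_p = nn.Embedding(n_pos, d_pos)     # Embed_p
        self.embed_r = nn.Embedding(n_rels, d_rel)    # Embed_r
        # Input size of [v_bw; v_ew; v_bp; v_ep; v_hw; v_hp; v_hr], Eq. (10).
        self.proj = nn.Linear(3 * d_word + 3 * d_pos + d_rel, d_out)

    def forward(self, bw, ew, bp, ep, hw, hp, hr):
        # Each argument: LongTensor of shape (n,), one symbol id per EDU.
        v = torch.cat([self.embed_w(bw), self.embed_w(ew),
                       self.embed_p(bp), self.embed_p(ep),
                       self.embed_w(hw), self.embed_p(hp),
                       self.embed_r(hr)], dim=-1)      # Eq. (10)
        return torch.relu(self.proj(v))                # Eq. (11)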


Figure 2: Our span-based discourse parsing model. We first encode each EDU based on its beginning and end words and POS tags using embeddings. We also embed head information of each EDU. We then run a bidirectional LSTM and concatenate the span differences. The resulting vector is used to predict the constituent score of the text span (i, j). This figure illustrates the process for the span (1, 2).

Following the span-based parsing models developed in the syntax domain (Stern et al., 2017; Gaddy et al., 2018), we then run a bidirectional Long Short-Term Memory (LSTM) network over the sequence of EDU representations, {v_i}_{i=0}^{n−1}, resulting in forward and backward representations for each step i (0 ≤ i ≤ n − 1):

→h_0, . . . , →h_{n−1} = →LSTM(v_0, . . . , v_{n−1}),    (12)
←h_0, . . . , ←h_{n−1} = ←LSTM(v_0, . . . , v_{n−1}).    (13)

We then compute a feature vector for a span (i, j) by concatenating the forward and backward span differences:

h_{i,j} = [→h_j − →h_{i−1}; ←h_i − ←h_{j+1}].    (14)

The feature vector h_{i,j} is assumed to represent the content of the contiguous text span x_{i:j} along with contextual information captured by the LSTMs.5

We did not use any feature templates because we found that they did not improve parsing performance in our unsupervised setting, though we observed that template features roughly following Joty et al. (2015) improved performance in a supervised setting.

Finally, given a span-level feature vector h_{i,j}, we use a two-layer perceptron with the ReLU activation function:

s(i, j) = MLP(h_{i,j}),    (15)

which computes the constituent score of the contiguous text span x_{i:j}.
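The span-level steps, Eqs. (12)-(15), can be sketched in the same style; note that the handling of the boundary cases h_{−1} and h_n (zero vectors below) is an assumption made for this sketch, and the dimensions are again hypothetical.

import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    def __init__(self, d_in=200, d_hidden=250):
        super().__init__()
        # One bidirectional LSTM over the EDU vectors, Eqs. (12)-(13).
        self.lstm = nn.LSTM(d_in, d_hidden, bidirectional=True,
                            batch_first=True)
        # Two-layer perceptron with ReLU, Eq. (15).
        self.mlp = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden),
                                 nn.ReLU(),
                                 nn.Linear(d_hidden, 1))

    def forward(self, v):
        # v: (n, d_in) matrix of EDU feature vectors.
        h, _ = self.lstm(v.unsqueeze(0))      # (1, n, 2 * d_hidden)
        f, b = h.squeeze(0).chunk(2, dim=-1)  # forward / backward states
        zero = torch.zeros(f.size(1))

        def s(i, j):
            # Eq. (14): forward and backward span differences.
            fwd = f[j] - (f[i - 1] if i > 0 else zero)
            bwd = b[i] - (b[j + 1] if j + 1 < b.size(0) else zero)
            return self.mlp(torch.cat([fwd, bwd]))  # Eq. (15), a scalar
        return s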

5A detailed investigation of the span-based parsing model using LSTM can be found in Gaddy et al. (2018).

Decoding
We use a Cocke-Kasami-Younger (CKY)-style dynamic programming algorithm to perform a global search over the space of valid trees and find the highest-scoring tree. For a document with n EDUs, we use an n × n table C, the cell C[i, j] of which stores the score of the best subtree spanning from i to j. For spans of length one (i.e., i = j), we assign a constant scalar value:

C[i, i] = 1.    (16)

For general spans 0 ≤ i < j ≤ n − 1, we define the following recursion:

C[i, j] = s(i, j) + max_{i≤k<j} (C[i, k] + C[k + 1, j]).    (17)
For general spans 0 ≤ 我 < j ≤ n − 1, we define the following recursion: C[i, j] = s(i, j) + max i≤k 10.0.

As shown in the upper part of Table 4, 我们
can observe that our method learns some aspects
of discourse constituentness that seems linguistic-
ally reasonable. 尤其, we found that our
method has a potential to predict brackets for (1)
clauses with connectives qualifying other clauses
from right to left (例如, ‘‘X [because B.]’’) 和
(2) attribution structures (例如, ‘‘say that [乙]’’).
These results indicate that our method is good
at identifying discourse constituents near the end


[The bankruptcy-court reorganization is being challenged by a dissident group of claimants] [because it places a cap on the total amount of money available] [to settle claims.] [It also bars future suits against …] (11.74)
[The first two GAF trials were watched closely on Wall Street] [because they were considered to be important tests of the government’s ability] [to convince a jury of allegations] [stemming from its insider-trading investigations.] [In an eight-count indictment, the government charged GAF, …] (10.16)
[The posters were sold for $1,300 to $6,000,] [although the government says] [they had a value of only $53 to $200 apiece.] [Henry Pitman, the assistant U.S. attorney] [handling the case,] [said] [about …] (11.31)
[The office, an arm of the Treasury, said] [it doesn’t have data on the financial position of applicants] [and thus can’t determine] [why blacks are rejected more often.] [Nevertheless, on Capitol Hill,] [where …] (11.57)
[After 93 hours of deliberation, the jurors in the second trial said] [they were hopelessly deadlocked,] [and another mistrial was declared on March 22.] [Meanwhile, a federal jury found Mr. Bilzerian …] (11.66)
[(‘‘I think | she knows me,] [but I’m not sure’’)] [and Bridget Fonda, the actress] [(‘‘She knows me,] [but we’re not really the best of friends’’).] [Mr. Revson, the gossip columnist, said] [there are people] [who …] (11.11)
[its vice president resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (4.44)

[its vice president resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (11.04)

[its vice president resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (5.50)

[its vice president resigned] [and its Houston work force has been trimmed by 40 people, or about 15%.] [The maker of hand-held computers and computer systems said] [the personnel changes were needed] [to improve the efficiency of its manufacturing operation.] [The company said] [it hasn’t named a successor …] (7.68)

Table 4: Discourse constituents and their predicted scores (in parentheses). We show the discourse constituents (in bold) in the RST-DT test set that have relatively high span scores. We did NOT use any sentence/paragraph boundaries for scoring.


6 Conclusion

In this paper, we introduced an unsupervised discourse constituency parsing algorithm that uses Viterbi EM with a margin-based criterion to train a span-based neural parser. We also introduced initialization methods for the Viterbi training of discourse constituents. We observed that our unsupervised parser achieves comparable or even superior performance to the baselines and fully supervised parsers. We also found that the learned discourse constituents depend strongly on the initialization used in Viterbi EM, and it is necessary to explore other initialization techniques to capture more diverse discourse phenomena.


This study has two limitations. First, this work focuses only on unlabeled discourse constituent structures. Although such hierarchical information is useful in downstream applications (Louis et al., 2010), both nuclearity statuses and rhetorical relations are also necessary for a more complete RST analysis. Second, our study uses only English documents for evaluation. However, different languages may have different structural regularities. Thus, it would be interesting to investigate whether the initialization methods are effective in different languages, which we believe would offer suggestions about discourse-level universals. We leave these issues as future work.

Acknowledgments

The research results have been achieved by ‘‘Research and Development of Deep Learning Technology for Advanced Multilingual Speech Translation’’, the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan. This work was also supported by JSPS KAKENHI grant numbers JP19K22861 and JP18J12366.

References

Hiyan Alshawi. 1996. Head automata and bilingual tiling: Translation with minimal representations. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation, Cambridge University Press.

James K. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the 2010 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue.

Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. In Working Notes of the Workshop on Statistically-Based NLP Techniques.

Eugene Charniak. 1993. Statistical Language Learning. MIT Press.

Yejin Choi and Claire Cardie. 2007. Structured local training and biased potential functions for conditional random fields with application to coreference resolution. In Proceedings of the 2007 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

John DeNero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics.

Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

David Gaddy, Mitchell Stern, and Dan Klein. 2018. What’s going on in neural constituency parsers? An analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Kevin Gimpel and Noah A. Smith. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Sharon Goldwater and Mark Johnson. 2005.
Representation bias in unsupervised learning
of syllable structure. In Proceedings of the 9th
Conference on Natural Language Learning.

Dave Golland, John DeNero, and Jakob Uszkoreit. 2012. A feature-rich constituent context model for grammar induction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.

Michael Halliday and Ruqaiya Hasan. 1976. Cohesion in English, Longman.

Hugo Hernault, Helmut Prendinger, David A. duVerle, and Mitsuru Ishizuka. 2010. HILDA: A discourse parser using support vector machine classification. Dialogue & Discourse, 1(3):1–33.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Unsupervised grammar induction with depth-bounded PCFG. Transactions of the Association for Computational Linguistics, 6:211–224.

Shafiq Joty, Giuseppe Carenini, and Raymond T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Computational Linguistics, 41(3):385–435.

Shafiq Joty, Giuseppe Carenini, Raymond T. Ng, and Yashar Mehdad. 2013. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Dan Klein. 2005. The unsupervised learning of natural language structure. Ph.D. thesis, Stanford University.

Dan Klein and Christopher D. Manning. 2001a. Distributional phrase structure induction. In Proceedings of the 2001 Workshop on Computational Natural Language Learning.

Dan Klein and Christopher D. Manning. 2001b. Natural language grammar induction using a constituent-context model. In Advances in Neural Information Processing Systems.

Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Naoki Kobayashi, Tsutomu Hirao, Kengo Nakamura, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. 2019. Split or merge: Which is better for unsupervised RST parsing? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56.

Sujian Li, Liang Wang, Ziqiang Cao, and Wenjie Li. 2014. Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Annie Louis, Aravind Joshi, and Ani Nenkova.
2010. Discourse indicators for content selection
in summarization. In SIGDIAL’10.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.


Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Daniel Marcu, Magdalena Romera, and Estibaliz Amorrortu. 1999. Experiments in constructing a corpus of discourse trees: Problems, annotation choices, issues. In Proceedings of the ACL’99 Workshop on Standards and Tools for Discourse Tagging.

Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization, MIT Press.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

David McClosky, Eugene Charniak, and Mark Johnson. 2006a. Effective self-training for parsing. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

David McClosky, Eugene Charniak, and Mark Johnson. 2006b. Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, 10(1):25–55.

Mathieu Morey, Philippe Muller, and Nicholas Asher. 2017. How much progress have we made on RST discourse parsing? A replication study of recent results on the RST-DT. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Mathieu Morey, Philippe Muller, and Nicholas Asher. 2018. A dependency perspective on RST discourse parsing and evaluation. Computational Linguistics, 44(2):197–235.


Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.

Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Livia Polanyi. 1985. A theory of discourse structure and discourse coherence. In Proceedings of the 21st Regional Meeting of the Chicago Linguistics Society.

Livia Polanyi and Martin Van den Berg. 2011. Discourse structure and sentiment. In 2011 IEEE 11th International Conference on Data Mining Workshops.

Andrew Radford. 1988. Transformational Grammar, Cambridge University Press.

Kenji Sagae. 2009. Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Workshop on Parsing Technologies.

Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Noah A. Smith and Jason Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Noah Ashton Smith. 2006. Novel estimation methods for unsupervised discovery of latent structure in natural language text. Ph.D. thesis, Johns Hopkins University.

Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, and Christopher D. Manning. 2010. Viterbi training improves unsupervised dependency parsing. In Proceedings of the 14th Conference on Computational Natural Language Learning.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
