What You Say and How You Say it: Joint Modeling of
Topics and Discourse in Microblog Conversations
Jichuan Zeng1∗ Jing Li2 ∗ Yulan He3 Cuiyun Gao1 Michael R. Lyu1
Irwin King1
1Department of Computer Science and Engineering
The Chinese University of Hong Kong, HKSAR, China
2Tencent AI Lab, Shenzhen, China
3Department of Computer Science, University of Warwick, UK
1{jczeng, cygao, lyu, king}@cse.cuhk.edu.hk
2ameliajli@tencent.com, 3yulan.he@warwick.ac.uk
Abstract
This paper presents an unsupervised frame-
work for jointly modeling topic content and
discourse behavior in microblog conversations.
Concretely, we propose a neural model
to discover word clusters indicating what a
conversation concerns (i.e., topics) and those
reflecting how participants voice their opinions
(i.e., discourse).1 Extensive experiments
show that our model can yield both coherent
topics and meaningful discourse behavior. Further
study shows that our topic and discourse
representations can benefit the classification
of microblog messages, especially when they
are jointly trained with the classifier.
1 Introduction
The last decade has witnessed a revolution
in communication, where ‘‘kitchen table
conversations’’ have been expanded to public discussions
on online platforms. As a result,
in our daily life, the exposure to new information
and the exchange of personal opinions have been
mediated through microblogs, one popular online
platform genre (Bakshy et al., 2015). The flourishing
of microblogs has also led to a sheer quantity
of user-created conversations emerging every
day, exposing individuals to superfluous information.
Facing such an unprecedented number of
conversations relative to the limited attention of individuals,
how shall we automatically extract the
critical points and make sense of these microblog
conversations?
∗This work was partially conducted during Jichuan Zeng’s
internship at Tencent AI Lab. Jing Li is the corresponding author.
1Our data sets and code are available at: http://github.com/zengjichuan/Topic_Disc.
Toward key focus understanding of a conversation,
previous work has shown the benefits of
discourse structure (Li et al., 2016b; Qin et al.,
2017; Li et al., 2018), which shapes how messages
interact with each other, forming the discussion
flow, and can usefully reflect salient topics raised
in the discussion process. After all, the topical
content of a message naturally occurs in the context
of the conversation discourse and hence should not
be modeled in isolation. Conversely, the extracted
topics can reveal the purpose of participants and
further facilitate the understanding of their discourse
behavior (Qin et al., 2017). Further, the
joint effects of topics and discourse will contribute
to a better understanding of social media conversations,
benefiting downstream tasks such as the
management of discussion topics and discourse
behavior of social chatbots (Zhou et al., 2018) and
the prediction of user engagements for conversation
recommendation (Zeng et al., 2018b).
To illustrate how the topics and discourse interplay
in a conversation, Figure 1 displays a snippet
of a Twitter conversation. As can be seen, the
content words reflecting the discussion topics
(such as ‘‘supreme court’’ and ‘‘gun rights’’) appear
in the context of the discourse flow, where
participants carry the conversation forward via
making a statement, giving a comment, asking a
question, and so forth. Motivated by such an observation,
we assume that a microblog conversation
can be decomposed into two crucially
different components: one for topical content
and the other for discourse behavior. Here, the
topic components indicate what a conversation
is centered around and reflect the important discussion
points put forward in the conversation
process. The discourse components signal the
discourse roles of messages, such as making
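Transactions of the Association for Computational Linguistics, vol. 7, pp. 267–281, 2019. Action Editor: Jianfeng Gao.
Submission batch: 12/2018; Revision batch: 2/2019; Published 5/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.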
end-to-end training of topic and discourse repre-
sentation learning with other neural models for
diverse tasks.
For model evaluation, we conduct an extensive
empirical study on two large-scale Twitter data
sets. The intrinsic results show that our model
can produce latent topics and discourse roles
with better interpretability than the state-of-the-art
models from previous studies. The extrinsic
evaluations on a tweet classification task exhibit
the model’s ability to capture useful representations
for microblog messages. Particularly, our
model enables an easy combination with existing
neural models for end-to-end training, such as
convolutional neural networks, which is shown to
perform better in classification than the pipeline
approach without joint training.
2 Related Work
Our work is in line with previous studies
that use non-neural models to leverage discourse
structure for extracting topical content from conversations
(Li et al., 2016b; Qin et al., 2017; Li
et al., 2018). Zeng et al. (2018b) explore how discourse
and topics jointly affect user engagements
in microblog discussions. Different from them,
we build our model in a neural network framework,
where the joint effects of topic and discourse
representations can be exploited for various downstream
deep learning tasks in an end-to-end manner.
In addition, we are inspired by prior research
that only models topics or conversation discourse.
In the following, we discuss them in turn.
Topic Modeling. Our work is closely related
to topic model studies. In this field, despite
the huge success achieved by the springboard topic
models (e.g., pLSA [Hofmann, 1999] and LDA
[Blei et al., 2001]), and their extensions (Blei et al.,
2003; Rosen-Zvi et al., 2004), the applications of
these models have been limited to formal and well-edited
documents, such as news reports (Blei et al.,
2003) and scientific articles (Rosen-Zvi et al.,
2004), owing to their reliance on document-level
word collocations. When processing short
texts, such as the messages on microblogs, the
performance of these models is likely to
be compromised, due to the severe data
sparsity issue.
To deal with such an issue, many previous efforts
incorporate external representations, such
as word embeddings (Nguyen et al., 2015; Li et al.,
Figure 1: A Twitter conversation snippet about the
gun control issue in the U.S. Topic words reflecting the
conversation focus are in boldface. The italic words in
[ ] are our interpretations of the messages’ discourse
roles.
a statement, asking a question, and other dia-
logue acts (Ritter et al., 2010; Joty et al., 2011),
which further shape the discourse structure of a
conversation.2 To distinguish the above two components,
we examine the conversation contexts
and identify two types of words: topic words,
indicating what a conversation focuses on, and
discourse words, reflecting how the opinion is
voiced in each message. For example, in Figure 1,
the topic words ‘‘gun’’ and ‘‘control’’ indicate the
conversation topic while the discourse words ‘‘what’’
and ‘‘?’’ signal the question in M3.
Concretely, we propose a neural framework
built upon topic models, enabling the joint exploration
of word clusters to represent topic and
discourse in microblog conversations. Different
from the prior models trained on annotated data
(Li et al., 2016b; Qin et al., 2017), our model is
fully unsupervised, not dependent on annotations
for either topics or discourse, which ensures its
immediate applicability in any domain or language.
Moreover, taking advantage of the recent
advances in neural topic models (Srivastava and
Sutton, 2017; Miao et al., 2017), we are able to
approximate Bayesian variational inference without
requiring model-specific derivations, whereas
most existing work (Ritter et al., 2010; Joty et al.,
2011; Alvarez-Melis and Saveski, 2016; Zeng
et al., 2018b; Li et al., 2018) requires expertise
to customize model inference algorithms.
In addition, our neural nature enables
2In this paper, the discourse role refers to a certain type of
dialogue act (e.g., statement or question) for each message.
The discourse structure refers to some combination of
discourse roles in a conversation.
2016a; Shi et al., 2017) and knowledge (Song
et al., 2011; Yang et al., 2015; Hu et al., 2016),
pre-trained on large-scale high-quality resources.
Different from them, our model learns topic and
discourse representations only with the internal
data and thus can be widely applied in scenarios
where the specific external resource is unavailable.
In another line of research, most prior work
focuses on how to enrich the context of short
messages. To this end, the biterm topic model (BTM)
(Yan et al., 2013) extends a message into a biterm
set with all combinations of any two distinct words
appearing in the message. On the contrary, our
model allows the richer context in a conversation
to be exploited, where word collocation patterns
can be captured beyond a short message.
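To make the biterm construction concrete, the following minimal sketch enumerates the unordered pairs of distinct words in one tokenized message. The function name `extract_biterms`, the whitespace tokenization, and the example message are illustrative assumptions; this is not the released BTM implementation.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered pairs of distinct words in a message.

    `tokens` is a list of word strings; duplicates are collapsed first,
    so each pair of distinct words appears exactly once.
    """
    distinct = sorted(set(tokens))          # distinct words, in a stable order
    return list(combinations(distinct, 2))  # every two-word combination


# Hypothetical example message, tokenized on whitespace.
message = "gun control gun rights".split()
biterms = extract_biterms(message)
```

A message with n distinct words yields n(n-1)/2 biterms, which illustrates why the available context remains small when the message itself is short.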
In addition, there are many methods using
some heuristic rules to aggregate short messages
into long pseudo-documents, such as those based
on authorship (Hong and Davison, 2010; Zhao
et al., 2011) and hashtags (Ramage et al., 2010;
Mehrotra et al., 2013). Compared with these
methods, we model messages in the context of
their conversations, which has been demonstrated
to be a more natural and effective text aggregation
strategy for topic modeling (Alvarez-Melis and
Saveski, 2016).
Conversation Discourse. Our work is also in
the area of discourse analysis for conversations,
ranging from the prediction of the shallow
discourse roles on utterance level (Stolcke et al.,
2000; Ji et al., 2016; Zhao et al., 2018) to
discourse parsing for a more complex conversation
structure (Elsner and Charniak, 2008, 2010;
Afantenos et al., 2015). In this area, most existing
models heavily rely on the data annotated with
discourse labels for learning (Zhao et al., 2017).
Different from them, our model, in a fully
unsupervised way, identifies distributional word
clusters to represent latent discourse factors in
conversations. Although such latent discourse
variables have been studied in previous work
(Ritter et al., 2010; Joty et al., 2011; Ji et al., 2016;
Zhao et al., 2018), none of them explores the
effects of latent discourse on the identification of
conversation topic, which is a gap our work fills.
3 Our Neural Model for Topics and
Discourse in Conversations
This section introduces our neural model that
jointly explores latent representations for topics
and discourse in conversations. We first present
an overview of our model in Section 3.1, followed
by the model generative process and inference
procedure in Sections 3.2 and 3.3, respectively.
3.1 Model Overview
In general, our model aims to learn coherent word
clusters that reflect the latent topics and discourse
roles embedded in the microblog conversations.
To this end, we distinguish two latent components
in the given collection: topics and discourse, each
represented by a certain type of word distribution
(distributional word cluster). Specifically, at the
corpus level, we assume that there are K topics,
represented by $\phi^T_k$ ($k = 1, 2, \ldots, K$), and D discourse
roles, captured with $\phi^D_d$ ($d = 1, 2, \ldots, D$).
$\phi^T$ and $\phi^D$ are all multinomial word distributions
over a vocabulary of size V. Inspired by the neural
topic models in Miao et al. (2017), our model
encodes topic and discourse distributions ($\phi^T$ and
$\phi^D$) as latent variables in a neural network and
learns the parameters via back propagation.
Before touching the details of our model, we
first describe how we formulate the input. In
microblogs, as a message might have multiple
replies, messages in an entire conversation can
be organized as a tree with replying relations (Li
et al., 2016b, 2018). Though the recent progress
in recursive models allows representation
learning from tree-structured data, previous
studies have pointed out that, in practice, sequence
models serve as a simpler yet robust alternative
(Li et al., 2015). In this work, we follow the
common practice in most conversation modeling
research (Ritter et al., 2010; Joty et al., 2011;
Zhao et al., 2018) to take a conversation as a
sequence of turns. To this end, each conversation
tree is flattened into root-to-leaf paths. Each one of
such paths is hence considered as a conversation
instance, and a message on the path corresponds
to a conversation turn (Zarisheva and Scheffler,
2015; Cerisara et al., 2018; Jiao et al., 2018).
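The flattening step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the reply structure is given as a dictionary from a message id to the ids of its direct replies, and the names `root_to_leaf_paths`, `replies`, and the message ids are hypothetical rather than taken from the released code.

```python
def root_to_leaf_paths(replies, root):
    """Flatten a conversation tree into its root-to-leaf paths.

    `replies` maps a message id to the list of ids replying to it.
    Each returned path is one conversation instance (a sequence of turns).
    """
    paths = []
    stack = [[root]]                      # partial paths still to extend
    while stack:
        path = stack.pop()
        children = replies.get(path[-1], [])
        if not children:                  # leaf reached: the path is complete
            paths.append(path)
        for child in reversed(children):  # reversed keeps left-to-right order
            stack.append(path + [child])
    return paths


# Hypothetical tree: M1 has replies M2 and M3; M2 has reply M4.
tree = {"M1": ["M2", "M3"], "M2": ["M4"]}
paths = root_to_leaf_paths(tree, "M1")
```

Note that messages near the root are shared by several paths, so they appear in more than one conversation instance.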
The overall architecture of our model is shown
in Figure 2. Formally, we formulate a conversation
c as a sequence of messages ($x_1, x_2, \ldots, x_{M_c}$),
where $M_c$ denotes the number of messages in
c. In the conversation, each message x, as the
target message, is fed into our model sequentially.
Here we process the target message x as the
bag-of-words (BoW) term vector $x_{BoW} \in \mathbb{R}^V$,
following the bag-of-words assumption in most
topics (Li et al., 2018; Zeng et al., 2018b). Concretely,
we define the latent topic variable $z \in \mathbb{R}^K$
at the conversation level and generate the topic
mixture of c, denoted as a K-dimensional distribution
θ, via a softmax construction conditioned
on z (Miao et al., 2017).
Latent Discourse. For modeling the discourse
structure of conversations, we capture the message-
level discourse roles reflecting the dialogue acts
of each message, as is done in Ritter et al. (2010).
Concretely, given the target message x, we use
a D-dimensional one-hot vector to represent the
latent discourse variable d, where the high bit
indicates the index of a discourse word distribution
that can best express x’s discourse role. In
the generative process, the latent discourse d is
drawn from a multinomial distribution with parameters
estimated from the input data.
Data Generative Process. As mentioned previously,
our entire framework is based on VAE,
which consists of an encoder and a decoder.
The encoder maps a given input into latent topic
and discourse representations and the decoder
reconstructs the original input from the latent
representations. In the following, we first describe
the decoder followed by the encoder.
In general, our decoder is learned to reconstruct
the words in the target message x (in the BoW
form) from the latent topic z and latent discourse
d. We show the generative story that reflects the
reconstruction process below:
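• Draw the latent topic $z \sim \mathcal{N}(\mu, \sigma^2)$
• c’s topic mixture $\theta = \mathrm{softmax}(f_\theta(z))$
• Draw the latent discourse $d \sim \mathrm{Multi}(\pi)$
• For the n-th word in x
– $\beta_n = \mathrm{softmax}(f_{\phi^T}(\theta) + f_{\phi^D}(d))$
– Draw the word $w_n \sim \mathrm{Multi}(\beta_n)$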
where $f_*(\cdot)$ is a neural perceptron, with a linear
transformation of inputs activated by a non-linear
transformation. Here we use rectified linear units
(Nair and Hinton, 2010) as the activation function.
In particular, the weight matrix of $f_{\phi^T}(\cdot)$
(after the softmax normalization) is considered
as the topic-word distributions $\phi^T$. The discourse-word
distributions $\phi^D$ are similarly obtained from
$f_{\phi^D}(\cdot)$.
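The generative story can be illustrated with a small, dependency-free sketch. Several simplifications are assumptions of this sketch rather than details from the paper: the layers have no bias terms, $f_\theta$ is taken as the identity, the topic layer is fed the topic mixture θ, the weights are random toy values, and z is drawn from a standard normal instead of a learned posterior. All function and variable names are hypothetical.

```python
import math
import random


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def relu_layer(W, v):
    """ReLU(W^T v) for W of shape len(v) x V (bias omitted for brevity)."""
    out = [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]
    return [max(0.0, a) for a in out]


def word_distribution(theta, d_onehot, W_topic, W_disc):
    """beta = softmax(f_topic(theta) + f_disc(d)) over the vocabulary."""
    topic_part = relu_layer(W_topic, theta)
    disc_part = relu_layer(W_disc, d_onehot)
    return softmax([t + s for t, s in zip(topic_part, disc_part)])


rng = random.Random(0)
K, D, V = 3, 2, 5                       # toy sizes, chosen for illustration
W_topic = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
W_disc = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(D)]

z = [rng.gauss(0.0, 1.0) for _ in range(K)]       # z ~ N(0, I) for the demo
theta = softmax(z)                                 # f_theta taken as identity
d = [0.0, 1.0]                                     # one-hot: second discourse role
beta = word_distribution(theta, d, W_topic, W_disc)
words = rng.choices(range(V), weights=beta, k=4)   # draw four word indices

# Topic-word distributions: softmax over each row of the topic weight matrix.
phi_T = [softmax(row) for row in W_topic]
```

Because the one-hot d selects a single row of the discourse weight matrix, each discourse role contributes its own vocabulary-sized score vector, which is what allows the rows to be read as discourse-word distributions after normalization.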
Figure 2: The architecture of our neural framework.
The latent topics z and latent discourse d are jointly
modeled from conversation c and target message x,
respectively. Mutual information penalty (MI) is used
to separate word clusters representing topics and
discourse. Afterward, z and d are used to reconstruct
the target message $x'$.
topic models (Blei et al., 2003; Miao et al., 2017).
The conversation, c, where the target message x
is involved, is considered as the context of x.
It is also encoded in the BoW form (denoted as
$c_{BoW} \in \mathbb{R}^V$) and fed into our model. In doing
so, we ensure that the context of the target
message is incorporated while learning its latent
representations.
Following the previous practice in neural topic
models (Miao et al., 2017; Srivastava and Sutton,
2017), we utilize the variational auto-encoder
(VAE) (Kingma and Welling, 2013) to resemble
the data generative process via two steps. First,
given the target message x and its conversation c,
our model converts them into two latent variables:
topic variable z and discourse variable d. Then,
using the intermediate representations captured by
z and d, we reconstruct the target message, $x'$.
3.2 Generative Process
In this section, we first describe the two latent
variables in our model: the topic variable z and the
discourse variable d. Then, we present our data
generative process from the latent variables.
Latent Topics. For latent topic learning, we
examine the main discussion points in the context
of a conversation. Our assumption is that messages
in the same conversation tend to focus on similar
For the encoder, we learn the parameters µ, σ,
and π from the input $x_{BoW}$ and $c_{BoW}$ (the BoW
form of the target message and its conversation),
according to the following formulas:
$$\mu = f_\mu(f_e(c_{BoW})), \quad \log \sigma = f_\sigma(f_e(c_{BoW})), \quad \pi = \mathrm{softmax}(f_\pi(x_{BoW})) \tag{1}$$
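As a rough illustration of Equation 1, the sketch below builds the two BoW vectors and maps them to µ, log σ, and π. It is a toy under explicit assumptions: only $f_e$ uses a ReLU while the output layers are plain linear maps, biases are omitted, the hidden size H and the random weights are arbitrary, and the names `bow`, `encode`, and `params` are hypothetical.

```python
import math
import random
from collections import Counter


def bow(tokens, vocab):
    """Bag-of-words count vector of `tokens` over the ordered list `vocab`."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def linear(W, v):
    """Matrix-vector product for W of shape len(v) x out_dim (no bias)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def encode(x_bow, c_bow, params):
    """Equation 1: mu and log sigma from the conversation, pi from the message."""
    h = [max(0.0, a) for a in linear(params["W_e"], c_bow)]   # f_e with ReLU
    mu = linear(params["W_mu"], h)                            # f_mu
    log_sigma = linear(params["W_sigma"], h)                  # f_sigma
    pi = softmax(linear(params["W_pi"], x_bow))               # softmax(f_pi)
    return mu, log_sigma, pi


def rand_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab = ["gun", "control", "what", "?", "court"]
V, H, K, D = len(vocab), 4, 3, 2   # H (hidden size) is a hypothetical choice
params = {
    "W_e": rand_matrix(V, H, rng),
    "W_mu": rand_matrix(H, K, rng),
    "W_sigma": rand_matrix(H, K, rng),
    "W_pi": rand_matrix(V, D, rng),
}
c_bow = bow("gun control court what ?".split(), vocab)   # whole conversation
x_bow = bow("what ?".split(), vocab)                     # target message
mu, log_sigma, pi = encode(x_bow, c_bow, params)
```

The split of inputs mirrors the modeling assumption: the topic parameters depend on the whole conversation, while the discourse parameters depend only on the target message.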
3.3 Model Inference
For the objective function of our entire framework,
we take three aspects into account: the learning of
latent topics and discourse, the reconstruction of
the target messages, and the separation of topic-
associated words and discourse-related words.
Learning Latent Topics and Discourse. For
learning the latent topics/discourse in our model,
we utilize variational inference (Blei et al.,
2016) to approximate the posterior distribution over
the latent topic z and the latent discourse d given
all the training data. To this end, we maximize the
variational lower bound $L_z$ for z and $L_d$ for d,
each defined as follows:
$$
\begin{aligned}
L_z &= \mathbb{E}_{q(z \mid c)}[p(c \mid z)] - D_{KL}(q(z \mid c) \,\|\, p(z)) \\
L_d &= \mathbb{E}_{q(d \mid x)}[p(x \mid d)] - D_{KL}(q(d \mid x) \,\|\, p(d))
\end{aligned}
\tag{2}
$$
$q(z \mid c)$ and $q(d \mid x)$ are approximated posterior
probabilities describing how the latent topic z
and the latent discourse d are generated from
data. $p(c \mid z)$ and $p(x \mid d)$ represent the corpus
likelihoods conditioned on the latent variables.
Here, to facilitate coherent topic production, in
$p(c \mid z)$, we penalize the likelihood of stopwords
to be generated from latent topics following Li
et al. (2018). $p(z)$ follows the standard normal
prior $\mathcal{N}(0, I)$ and $p(d)$ is the uniform distribution
$\mathrm{Unif}(0, 1)$. $D_{KL}$ refers to the Kullback-Leibler
divergence that ensures that the approximated
posteriors are close to the true ones. For more
derivation details, we refer readers to Miao et al.
(2017).
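For intuition about the two KL terms in Equation 2, both have simple closed forms under common modeling choices. The sketch below assumes a diagonal Gaussian posterior for z against the standard normal prior and, as an interpretation of the uniform prior on d, a uniform distribution over the D discourse roles. The function names are hypothetical.

```python
import math


def kl_gaussian_standard_normal(mu, log_sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    """
    return 0.5 * sum(
        m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
        for m, ls in zip(mu, log_sigma)
    )


def kl_categorical_uniform(pi):
    """D_KL(pi || uniform over len(pi) categories) = log D - H(pi)."""
    D = len(pi)
    return sum(p * math.log(p * D) for p in pi if p > 0.0)
```

Both terms are zero exactly when the posterior equals the prior, so they act as regularizers that are traded off against the likelihood terms.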
Reconstructing target messages. From the latent
variables z and d, the goal of our model is
to reconstruct the target message x. The corresponding
learning objective is to maximize $L_x$
defined as:
$$L_x = \mathbb{E}_{q(z \mid x)\, q(d \mid c)}[\log p(x \mid z, d)] \tag{3}$$
Here we design $L_x$ to ensure that the learned
latent topics and discourse can reconstruct x.
Distinguishing Topics and Discourse. Our
model aims to distinguish word distributions for
topics ($\phi^T$) and discourse ($\phi^D$), which enables topics
and discourse to capture different information
in conversations. Concretely, we use the mutual
information, given below, to measure the mutual
dependency between the latent topics z and the
latent discourse d.3
$$\mathbb{E}_{q(z) q(d)}\left[\log \frac{p(z, d)}{p(z)\, p(d)}\right] \tag{4}$$
Equation 4 can be further derived as the Kullback-Leibler
divergence of the conditional distribution,
$p(d \mid z)$, and marginal distribution, $p(d)$. The derived
formula, defined as the mutual information
(MI) loss ($L_{MI}$) and shown in Equation 5, is used
to map z and d into the separated semantic space.
$$L_{MI} = \mathbb{E}_{q(z)}[D_{KL}(p(d \mid z) \,\|\, p(d))] \tag{5}$$
We can hence minimize $L_{MI}$ for guiding our
model to separate word distributions that represent
topics and discourse.
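The step from the mutual information to the KL form can be spelled out with the standard identity below. It is written with the expectation taken under the joint distribution p(z, d); reading the outer expectation under q(z), as in Equation 5, is our assumption about how the two equations connect.

```latex
\begin{aligned}
\mathbb{E}_{p(z, d)}\left[\log \frac{p(z, d)}{p(z)\, p(d)}\right]
&= \mathbb{E}_{p(z)}\left[\sum_{d} p(d \mid z) \log \frac{p(d \mid z)\, p(z)}{p(z)\, p(d)}\right] \\
&= \mathbb{E}_{p(z)}\left[\sum_{d} p(d \mid z) \log \frac{p(d \mid z)}{p(d)}\right] \\
&= \mathbb{E}_{p(z)}\left[D_{KL}\big(p(d \mid z) \,\|\, p(d)\big)\right]
\end{aligned}
```

The first line factorizes the joint as p(z, d) = p(d | z) p(z), the second cancels p(z), and the third recognizes the inner sum as a KL divergence. The quantity is zero exactly when d is independent of z.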
The Final Objective. To capture the joint effects
of the learning objectives described above
($L_z$, $L_d$, $L_x$, and $L_{MI}$), we design the final objective
function for our entire framework as
follows:
$$L = L_z + L_d + L_x - \lambda L_{MI} \tag{6}$$
where the hyperparameter λ is the trade-off
parameter for balancing between the MI loss
($L_{MI}$) and the other learning objectives. By
maximizing the final objective L via back propagation,
the word distributions of topics and
discourse can be jointly learned from microblog
conversations.4
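Footnote 4 mentions two sampling techniques that keep the objective differentiable: reparameterization for z and the Gumbel-Softmax trick for d. The sketch below shows the sampling side of both in plain Python. The temperature `tau`, the function names, and the assumption that every entry of π is strictly positive are choices of this sketch and not settings reported in the paper.

```python
import math
import random


def reparameterize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]


def gumbel_softmax(pi, tau, rng):
    """Relaxed one-hot sample from Multi(pi) at temperature tau.

    Assumes every entry of pi is strictly positive (e.g., a softmax output).
    """
    scores = []
    for p in pi:
        u = max(rng.random(), 1e-12)        # guard against log(0)
        g = -math.log(-math.log(u))         # Gumbel(0, 1) noise
        scores.append((math.log(p) + g) / tau)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
d_relaxed = gumbel_softmax([0.2, 0.3, 0.5], 0.5, rng)
```

In both cases the randomness is moved into a noise variable that does not depend on the parameters, so gradients can flow through µ, σ, and π. Lower temperatures push the relaxed sample closer to a one-hot vector.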
4 Experimental Setup
Data Collection. For our experiments, we collected
two microblog conversation data sets from
Twitter. One is released by the TREC 2011
microblog track (henceforth TREC), containing
conversations concerning a wide range of topics.5
3The distributions in Equation 4 are all conditional
probability distributions given the target message x and its
conversation c. We omit the conditions for simplicity.
4To smooth the gradients in implementation, for $z \sim \mathcal{N}(\mu, \sigma)$,
we apply the reparameterization on z (Kingma and
Welling, 2013; Rezende et al., 2014), and for $d \sim \mathrm{Multi}(\pi)$,
we adopt the Gumbel-Softmax trick (Maddison et al., 2016;
Jang et al., 2016).
5http://trec.nist.gov/data/tweets/.
Data sets   # of convs   Avg msgs per conv   Avg words per msg   |Vocab|
TREC        116,612      3.95                11.38               9,463
TWT16       29,502       8.67                14.70               7,544
Table 1: Statistics of the two data sets containing
Twitter conversations.
The other is crawled from January to June 2016
with the Twitter streaming API6 (henceforth TWT16,
short for Twitter 2016), following the way of
building the TREC data set. During this period,
there was a large volume of discussions centered
around the U.S. presidential election. In addition,
for both data sets, we apply the Twitter search API7
to retrieve the missing tweets in the conversation
history, as the Twitter streaming API (used to
collect both data sets) only returns sampled tweets
from the entire pool.
The statistics of the two experiment data sets
are shown in Table 1. For model training and
evaluation, we randomly sampled 80%, 10%, and
10% of the data to form the training, development,
and test sets, respectively.
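A random 80/10/10 split of this kind can be sketched as below. Whether the split unit is a conversation or a single message is not stated here, so the sketch simply splits a list of items; the seed, the function name, and the integer ids are hypothetical.

```python
import random


def split_data(items, seed=0):
    """Shuffle and split items into 80% train, 10% development, 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10     # integer arithmetic avoids float rounding
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]     # remainder goes to the test set
    return train, dev, test


conversations = list(range(100))          # hypothetical conversation ids
train, dev, test = split_data(conversations)
```

Fixing the seed makes the split reproducible across runs, which matters when several models are compared on the same test set.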
Data Preprocessing. We preprocessed the data
with the following steps. First, non-English
tweets were filtered out. Then, hashtags, mentions
(@username), and links were replaced with
generic tags ‘‘HASH’’, ‘‘MENT’’, and ‘‘URL’’,
respectively. Next, the Natural Language Toolkit (NLTK) was
applied for tweet tokenization.8 After that, all
letters were normalized to lower case. Finally,
words that occurred fewer than 20 times were
filtered out from the data.
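The preprocessing steps can be approximated with the standard library alone. This sketch is not the authors' pipeline: it substitutes a simple regular-expression tokenizer for the NLTK tokenizer, assumes language filtering has already been done, and uses hypothetical names and patterns throughout.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")   # words or single punctuation marks


def tokenize(tweet):
    """Replace links, mentions, and hashtags with tags, tokenize, lowercase."""
    text = URL_RE.sub("URL", tweet)
    text = MENTION_RE.sub("MENT", text)
    text = HASHTAG_RE.sub("HASH", text)
    # Lowercasing after tagging follows the order of the steps in the text,
    # so the generic tags themselves also end up in lower case here.
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def preprocess(tweets, min_count=20):
    """Tokenize tweets and drop words occurring fewer than `min_count` times."""
    tokenized = [tokenize(t) for t in tweets]
    counts = Counter(tok for toks in tokenized for tok in toks)
    return [[tok for tok in toks if counts[tok] >= min_count] for toks in tokenized]
```

Punctuation is kept as separate tokens on purpose, since marks such as ‘‘?’’ are treated in this paper as potential discourse indicators.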
Parameter Setting. To ensure comparable
results with Li et al. (2018) (the prior work
focusing on the same task as ours), in the topic
coherence evaluation, we follow their setup to
report the results under two sets of K (number
of topics): K = 50 and K = 100, and with the
number of discourse roles (D) set to 10. The
analysis for the effects of K and D will be further
presented in Section 5.5. For all the other hyperparameters,
we tuned them on the development set by
grid search. The trade-off parameter λ (defined
6https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter.html.
7https://developer.twitter.com/en/docs/tweets/search/api-reference/get-savedsearches-show-id.
8https://www.nltk.org/.
in Equation 6), balancing the MI loss and the
other objective functions, is set to 0.01. In model
training, we use the Adam optimizer (Kingma and
Ba, 2014) and run 100 epochs with the early stopping
strategy adopted.
Baselines. In topic modeling experiments, we
consider five topic model baselines treating
each tweet as a document: LDA (Blei et al., 2003),
BTM (Yan et al., 2013), LF-LDA, LF-DMM
(Nguyen et al., 2015), and NTM (Miao et al.,
2017). In particular, BTM and LF-DMM are the
state-of-the-art topic models for short texts. BTM
explores the topics of all word pairs (biterms) in
each message to alleviate data sparsity in short
texts. LF-DMM incorporates word embeddings
pre-trained on external data to expand semantic
meanings of words, as does LF-LDA. In Nguyen
et al. (2015), LF-DMM, based on the one-topic-per-document
Dirichlet Multinomial Mixture (DMM)
(Nigam et al., 2000), was reported to perform
better than LF-LDA, based on LDA. For LF-LDA
and LF-DMM, we use GloVe Twitter embeddings
(Pennington et al., 2014) as the pre-trained word
embeddings.9
For the discourse modeling experiments, we
compare our results with LAED (Zhao et al.,
2018), a VAE-based representation learning model
for conversation discourse. In addition, for both
topic and discourse evaluation, we compare with
Li et al. (2018), a recently proposed model for
microblog conversations, where topics and discourse
are jointly explored with a non-neural
framework. Besides the existing models from previous
studies, we also compare with the variants
of our model that only model topics (henceforth
TOPIC ONLY) or discourse (henceforth DISC
ONLY).10 Our joint model of topics and discourse
is referred to as TOPIC+DISC.
In the preprocessing procedure for the baselines,
we removed stop words and punctuation for topic
models unable to learn discourse representations
following the common practice in previous work
(Yan et al., 2013; Miao et al., 2017). For the other
models, stop words and punctuation were retained
in the vocabulary, considering their usefulness
as discourse indicators (Li et al., 2018).
9https://nlp.stanford.edu/projects/glove/.
10In our ablation without mutual information loss ($L_{MI}$
defined in Equation 5), topics and discourse are learned independently.
Therefore, its topic representation can be used for the
output of TOPIC ONLY, as can its discourse representation for DISC ONLY.
                    K = 50            K = 100
Models              TREC    TWT16     TREC    TWT16
Baselines
LDA                 0.467   0.454     0.467   0.454
BTM                 0.460   0.461     0.466   0.463
LF-DMM              0.456   0.448     0.463   0.466
LF-LDA              0.470   0.456     0.467   0.453
NTM                 0.478   0.479     0.482   0.443
Li et al. (2018)    0.463   0.433     0.464   0.435
Our models
TOPIC ONLY          0.478   0.482     0.481   0.471
TOPIC+DISC          0.485   0.487     0.496   0.480
Table 2: Cv coherence scores for latent topics
produced by different models. The best result in
each column is highlighted in bold. Our joint
model TOPIC+DISC achieves significantly better
coherence scores than all the baselines (p < 0.01,
paired test).
5 Experimental Results
In this section, we first report the topic coher-
ence results in Section 5.1, followed by a dis-
cussion in Section 5.2 comparing the latent
discourse roles discovered by our model with the
manually annotated dialogue acts. Then, we study
whether we can capture useful representations for
microblog messages in a tweet classification task
(in Section 5.3). A qualitative analysis, showing
some example topics and discourse roles, is further
provided in Section 5.4. Finally, in Section 5.5,
we provide more discussions on our model.
5.1 Topic Coherence
For the topic coherence, we adopt the Cv scores
measured via the open-source Palmetto toolkit as
our evaluation metric.11 Cv scores assume that
the top N words in a coherent topic (ranked by
likelihood) tend to co-occur in the same document
and have shown comparable evaluation results to
human judgments (Röder et al., 2015). Table 2
shows the average Cv scores over the produced
topics given N = 5 and N = 10. The values
range from 0.0 to 1.0, and higher scores indicate
better topic coherence. We can observe that:
• Models assuming a single topic for each mes-
sage do not work well. It has long been pointed
out that the one-topic-per-message assumption
(each message contains only one topic) helps
11https://github.com/dice-group/Palmetto.
topic models alleviate the data sparsity issue in
short texts on microblogs (Zhao et al., 2011;
Quan et al., 2015; Nguyen et al., 2015; Li et al.,
2018). However, we observe contradictory results
because both LF-DMM and Li et al. (2018),
following this assumption, achieve generally worse
performance than the other models. This might
be attributed to the large-scale data used in our
experiments (each data set has over 250K mes-
sages as shown in Table 1), which potentially
provide richer word co-occurrence patterns and
thus partially alleviate the data sparsity issue.
• Pre-trained word embeddings do not bring benefits.
Comparing LF-LDA with LDA, we found
that they result in similar coherence scores.
This shows that, with sufficiently large training
data, using the pre-trained word
embeddings or not makes no difference in the
topic coherence results.
• Neural models perform better than non-neural
baselines. When comparing the results of neural
models (NTM and our models) with the other
baselines, we find the former yield topics with
better coherence scores in most cases.
• Modeling topics in conversations is effective.
Among neural models, we found our models
outperform NTM (without exploiting conversa-
tion contexts). This shows that the conversations
provide useful context and enable more coherent
topics to be extracted from the entire conversation
thread instead of a single short message.
• Modeling topics together with discourse helps
produce more coherent topics. We can observe
better results with the joint model TOPIC+DISC in
comparison with the variant considering topics
only. This shows that TOPIC+DISC, via the joint
modeling of topic- and discourse-word distribu-
tions (reflecting non-topic information), can bet-
ter separate topical words from non-topical ones,
hence resulting in more coherent topics.
5.2 Discourse Interpretability
In this section, we evaluate whether our model
can discover meaningful discourse representa-
tions. To this end, we train the comparison models
for discourse modeling on the TREC data set and
test the learned latent discourse on a benchmark
data set released by Cerisara et al. (2018). The
benchmark data set consists of 2,217 microblog
messages forming 505 conversations collected
Models              Purity   Homogeneity   VI
Baselines
LAED                0.505    0.022         6.418
Li et al. (2018)    0.511    0.096         5.540
Our models
DISC ONLY           0.510    0.112         5.532
TOPIC+DISC          0.521    0.142         5.097
Table 3: The purity, homogeneity, and variation
of information (VI) scores for the latent discourse
roles measured against the human-annotated dia-
logue acts. For purity and homogeneity, higher
scores indicate better performance, while for VI
scores, lower is better. In each column, the best
results are in boldface. Our joint model TOPIC+DISC
significantly outperforms all the baselines (p <
0.01, paired t-test).
from Mastodon,12 a microblog platform exhibit-
ing Twitter-like user behavior (Cerisara et al.,
2018). For each message,
there is a human-
assigned discourse label, selected from one of
the 15 dialogue acts, such as question, answer,
disagreement, and so forth.
For discourse evaluation, we measure whether
the model-produced discourse assignments are
consistent with the human-annotated dialogue
acts. Hence, following Zhao et al. (2018), we
assume that an interpretable latent discourse
role should cluster messages labeled with the
same dialogue act. Therefore, we adopt purity
(Manning et al., 2008), homogeneity (Rosenberg
and Hirschberg, 2007), and variation of information
(VI) (Meila, 2003; Goldwater and Griffiths,
2007) as our automatic evaluation metrics. Here,
we set D = 15 to ensure the number of latent
discourse roles to be the same as the number of
manually labeled dialogue acts. Table 3 shows the
comparison results of the average scores over the
15 latent discourse roles. Higher values indicate
better performance for purity and homogeneity,
while for VI, lower is better.
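For readers unfamiliar with the three metrics, the sketch below computes them from a list of predicted discourse roles and a list of gold dialogue acts. It uses the textbook definitions with natural logarithms, which is an assumption of this sketch; the evaluation scripts behind Table 3 may use a different log base or implementation, so absolute VI values need not match. The function names are hypothetical.

```python
import math
from collections import Counter


def _entropy(counts, n):
    """Entropy (in nats) of the distribution given by raw counts."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def clustering_scores(pred, gold):
    """Return (purity, homogeneity, VI) of clusters `pred` against labels `gold`."""
    n = len(gold)
    joint = Counter(zip(pred, gold))
    # Purity: each cluster votes for its majority label.
    best = Counter()
    for (cluster, _label), c in joint.items():
        best[cluster] = max(best[cluster], c)
    purity = sum(best.values()) / n
    h_pred = _entropy(Counter(pred).values(), n)
    h_gold = _entropy(Counter(gold).values(), n)
    h_joint = _entropy(joint.values(), n)
    h_gold_given_pred = h_joint - h_pred      # H(gold | pred)
    h_pred_given_gold = h_joint - h_gold      # H(pred | gold)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - h_gold_given_pred / h_gold
    vi = h_gold_given_pred + h_pred_given_gold
    return purity, homogeneity, vi
```

A perfect clustering gives purity 1, homogeneity 1, and VI 0, whereas putting every message into a single cluster gives homogeneity 0 even though purity can remain fairly high. This is one reason to report the three metrics together.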
It can be observed that our models generally
perform better, showing the effectiveness of
our framework in inducing interpretable
discourse roles. In particular, the best results
are achieved by our joint model TOPIC+DISC,
which learns to distinguish topic words from
discourse words; this distinction is important for
recognizing the words that indicate latent discourse.
12 https://mastodon.social.
Figure 3: A heatmap showing the alignments of the
latent discourse roles and human-annotated dialogue
act labels. Each row visualizes the distribution of
messages with the corresponding dialogue act label
over varying discourse roles (indexed from 1 to 15),
where darker colors indicate higher values.
To further analyze the consistency of varying
latent discourse roles (produced by our TOPIC+DISC
model) with the human-labeled dialogue acts,
Figure 3 displays a heatmap, where each row
visualizes how the messages with a dialogue act
are distributed over varying discourse roles. Among
all dialogue acts, our model discovers more
interpretable latent discourse for ‘‘greetings’’,
‘‘thanking’’, ‘‘exclamation’’, and ‘‘offer’’,
where most messages are clustered into one or
two dominant discourse roles. This may be because
these dialogue acts are relatively easy to
detect based on their associated indicative words,
such as the word ‘‘thanks’’ for ‘‘thanking’’, and
the word ‘‘wow’’ for ‘‘exclamation’’.
5.3 Message Representations
To further evaluate our ability to capture effec-
tive representations for microblog messages, we
take tweet classification as an example and test
the classification performance with the topic and
discourse representations as features. Here, the
user-generated hashtags capturing the topics of
online messages are used as the proxy class labels
(Li et al., 2016b; Zeng et al., 2018a). We construct
Models        TREC               TWT16
              Acc     Avg F1     Acc     Avg F1
Baselines
  BoW         0.120   0.026      0.132   0.030
  TF-IDF      0.116   0.024      0.153   0.041
  LDA         0.128   0.041      0.146   0.046
  BTM         0.123   0.035      0.167   0.054
  LF-DMM      0.158   0.072      0.162   0.052
  NTM         0.138   0.042      0.186   0.068
Our model     0.259   0.180      0.341   0.269
Table 4: Evaluation of tweet classification re-
sults in accuracy (Acc) and average F1 (Avg F1).
Representations learned by various models serve
as the classification features. For our model, both
the topic and discourse representations are fed
into the classifier.
the classification data set from TREC and TWT16
with the following steps. First, we removed the
tweets without hashtags. Second, we ranked
hashtags by their frequencies. Third, we manually
removed the hashtags that are not topic-related
(e.g., ‘‘#fb’’, which indicates that a tweet was
sourced from Facebook), and combined the hashtags
referring to the same topic (e.g., ‘‘#DonaldTrump’’
and ‘‘#Trump’’). Finally, we selected the 50 most
frequent hashtags and took all tweets containing
them as our classification data set. Here, we
simply use support vector machines as the
classifier, since our focus is to compare the
representations learned by various models. The model
of Li et al. (2018) cannot produce tweet-level vector
representations and is hence not considered here.
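The data set construction steps above can be sketched as follows. This is only an illustration under stated assumptions: the function `build_hashtag_dataset`, its `blocklist` and `merge` arguments (standing in for the manual filtering and merging), and the rule of labeling a tweet with its first retained hashtag are hypothetical and are not taken from the released code.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def build_hashtag_dataset(tweets, top_n=50, blocklist=(), merge=None):
    """Return (tweet, hashtag_label) pairs for the top_n most frequent hashtags.

    blocklist: hashtags judged not topic-related (e.g. "#fb").
    merge: maps a hashtag variant to a canonical hashtag.
    Both stand in for the manual curation step (hypothetical interface).
    """
    blocked = {tag.lower() for tag in blocklist}
    canonical = {k.lower(): v.lower() for k, v in (merge or {}).items()}

    def topic_tags(text):
        tags = []
        for raw in HASHTAG.findall(text):
            tag = canonical.get(raw.lower(), raw.lower())
            if tag not in blocked and tag not in tags:
                tags.append(tag)
        return tags

    # Step 1: drop tweets without (topic-related) hashtags.
    tagged = [(text, topic_tags(text)) for text in tweets]
    tagged = [(text, tags) for text, tags in tagged if tags]
    # Step 2: rank hashtags by frequency.
    freq = Counter(tag for _text, tags in tagged for tag in tags)
    # Final step: keep tweets that contain one of the top_n hashtags.
    keep = {tag for tag, _count in freq.most_common(top_n)}
    dataset = []
    for text, tags in tagged:
        labels = [tag for tag in tags if tag in keep]
        if labels:
            # Assumption: a tweet with several kept hashtags takes the first.
            dataset.append((text, labels[0]))
    return dataset


if __name__ == "__main__":
    sample = ["go #Trump", "vote #DonaldTrump now", "hi #fb", "no tags here"]
    print(build_hashtag_dataset(sample, top_n=1, blocklist=["#fb"],
                                merge={"#DonaldTrump": "#Trump"}))
```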
Table 4 shows the classification results of accu-
racy and average F1 on the two data sets with
the representations learned by various models
serving as the classification features. We observe
that our model outperforms other models by a
large margin. There are two possible reasons.
First, our model derives topics from conversation
threads and thus potentially yields better
message representations. Second, the discourse
representations (only produced by our model) are
indicative features for hashtags, because users
exhibit various discourse behaviors in discussing
diverse topics (hashtags). For instance, we
observe prominent ‘‘argument’’ discourse in
tweets with ‘‘#Trump’’ and ‘‘#Hillary’’, which can
be attributed to the controversial opinions about
the two candidates in the 2016 U.S. presidential election.
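For readers who want to reproduce the two reported measures, a minimal sketch follows. It assumes that "average F1" denotes the unweighted (macro) mean of per-class F1 scores, which the text does not state explicitly; the function name `accuracy_and_macro_f1` is likewise hypothetical.

```python
def accuracy_and_macro_f1(gold, pred):
    """Return (accuracy, macro-averaged F1) for parallel label lists.

    Assumption: "average F1" is the unweighted mean of per-class F1
    over the classes that occur in the gold labels.
    """
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    f1_scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["#trump", "#trump", "#hillary", "#hillary"]
    pred = ["#trump", "#trump", "#trump", "#trump"]
    print(accuracy_and_macro_f1(gold, pred))
```

Under this reading, a classifier that mostly predicts frequent hashtags can reach a moderate accuracy while its average F1 stays low, because every class it never predicts contributes an F1 of zero.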
5.4 Example Topics and Discourse Roles
We have shown that joint modeling of topics
and discourse achieves superior performance on
quantitative measures. In this section, we
qualitatively analyze the interpretability of our outputs
by analyzing the word distributions of some
example topics and discourse roles.
Example Topics. Table 5 lists the top 10 words
of some example latent topics discovered by var-
ious models from the TWT16 data set. According
to the words shown, we can interpret the extracted
topics as ‘‘gun control’’ — discussion about gun
law and the failure of gun control in Chicago. We
observe that LDA wrongly includes the off-topic word
‘‘flag’’. In the outputs of BTM, LF-DMM, Li
et al. (2018), and our TOPIC ONLY variant, we do
not find off-topic words, but there are some
non-topic words, such as ‘‘said’’ and ‘‘understand’’.13
The output of our TOPIC+DISC model appears to be
the most coherent, with words such as ‘‘firearm’’
and ‘‘criminals’’ included, which are clearly rel-
evant to ‘‘gun control’’. Such results indicate the
benefit of examining the conversation contexts and
jointly exploring topics and discourse in them.
Example Discourse Roles. To qualitatively
analyze whether our TOPIC+DISC model can dis-
cover interpretable discourse roles, we select the
top 10 words from the distributions of some exam-
ple discourse roles and list them in Table 6. It can
be observed that there are some meaningful word
clusters reflecting varying discourse roles found
without any supervision. Interestingly, we observe
that the latent discourse roles from TREC and
TWT16, though learned separately, exhibit some
notable overlap in their associated top 10 words,
particularly for ‘‘question’’ and ‘‘statement’’. We
also note that ‘‘argument’’ is represented by very
different words. The reason is that TWT16 con-
tains a large volume of arguments centered around
candidates Clinton and Trump, resulting in the fre-
quent appearance of words like ‘‘he’’ and ‘‘she’’.
5.5 Further Discussions
In this section, we present further discussion
of our joint model, TOPIC+DISC.
Parameter Analysis. Here, we study the two
important hyper-parameters in our model, the
13 Non-topic words do not clearly indicate the
corresponding topic, whereas off-topic words are more
likely to appear in other topics.
Model              Top 10 words
LDA                people trump police violence gun death protest guns flag shot
BTM                gun guns people police wrong right think law agree black
LF-DMM             gun police black said people guns killing ppl amendment laws
Li et al. (2018)   wrong don trump gun understand laws agree guns doesn make
NTM                gun understand yes guns world dead real discrimination trump silence
TOPIC ONLY         shootings gun guns cops charges control mass commit know agreed
TOPIC+DISC         guns gun shootings chicago shooting cops firearm criminals commit laws
Table 5: Top 10 representative words of example latent topics discovered from the TWT16 data set. We
interpret the topics as ‘‘gun control’’ by the displayed words. Non-topic words are wave-underlined
and in blue, and off-topic words are underlined and in red.
Table 6: Top 10 representative words of example discourse roles learned from TREC and TWT16. The
discourse roles of the word clusters are manually assigned according to their associated words.
number of topics (K) and the number of discourse
roles (D). In Figure 4, we show the Cv topic
coherence given varying K in (a) and the homo-
geneity measure given varying D in (b). As can be
seen, the curves corresponding to the performance
on topics and discourse are not monotonic. In par-
ticular, better topic coherence scores are achieved
given relatively larger topic numbers for TREC
with the best result observed at K = 80. In
contrast, the optimum topic number for TWT16
is K = 20, and increasing the number of topics
generally results in worse Cv scores. This may be
attributed to the relatively centralized topics
concerning the U.S. election in the TWT16 corpus. For
discourse homogeneity, the best result is achieved
at D = 15, the same as the number of manually
annotated dialogue acts in the benchmark.
Case Study. To further understand why our
model learns meaningful representations for topics
and discourse, we present a case study based
on the example conversation shown in Figure 1.
Specifically, we visualize the topic words (with
p(w | z) > p(w | d)) in red and the rest of the
words in blue to indicate discourse. Darker red
indicates higher topic likelihood (p(w | z)) and
darker blue indicates higher discourse likelihood
(p(w | d)). The results are shown in Figure 5. We
can observe that topic and discourse words are
well separated by our model, which explains why
it can generate high-quality representations for
both topics and discourse.
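The coloring rule used in this case study can be written down in a few lines. The sketch below is illustrative only: the function `split_topic_discourse` and the toy probabilities are hypothetical, and the per-word likelihoods p(w | z) and p(w | d) are assumed to be available as plain dictionaries rather than read from the trained model.

```python
def split_topic_discourse(words, p_w_given_z, p_w_given_d):
    """Label each word as 'topic' or 'discourse' and attach a shade value.

    A word is a topic word if p(w | z) > p(w | d); otherwise it is treated
    as a discourse word. The shade is the likelihood under the chosen side,
    so larger values would be drawn in a darker color.
    """
    labelled = []
    for word in words:
        pz = p_w_given_z.get(word, 0.0)
        pd = p_w_given_d.get(word, 0.0)
        if pz > pd:
            labelled.append((word, "topic", pz))      # red, darker for larger pz
        else:
            labelled.append((word, "discourse", pd))  # blue, darker for larger pd
    return labelled


if __name__ == "__main__":
    # Toy likelihoods, invented purely for illustration.
    p_z = {"gun": 0.05, "what": 0.001}
    p_d = {"gun": 0.002, "what": 0.04}
    print(split_topic_discourse(["what", "gun"], p_z, p_d))
```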
Model Extensibility. Recall that in the
Introduction, we mentioned that our neural-based
model has the advantage of being easily combined
with other neural network architectures, allowing
for the joint training of both models. Here, we
take message classification (with the setup in
Section 5.3) as an example, and study whether
Figure 4: (a) The impact of topic numbers. The
horizontal axis shows the number of topics; the vertical
axis shows the Cv topic coherence. (b) The impact of
discourse numbers. The horizontal axis represents the
number of discourse roles; the vertical axis represents the
homogeneity measure.
jointly training our model with a convolutional neural
network (CNN) (Kim, 2014), a widely used
model for short text classification, can benefit
the classification performance. We set the
embedding dimension to 200, with random
initialization. The results are shown in Table 7, where
we observe that jointly training our model and the
classifier can successfully boost the classification
performance.
Error Analysis. We further analyze the errors
in our outputs. For topics, taking a closer look
at their word distributions, we found that our
model sometimes mixes sentiment words with
topic words. For example, among the top 10 words
of a topic ‘‘win people illegal americans hate lt
racism social tax wrong’’, there are words ‘‘hate’’
and ‘‘wrong’’, expressing sentiment rather than
conveying topic-related information. This is due
to the prominent co-occurrences of topic words
and sentiment words in our data, which leads to
similar distributions for topics and sentiment.
Future work could focus on the further separation
of sentiment and topic words.
For discourse, we found that our model can
induce some discourse roles beyond the 15
manually defined dialogue acts in the Mastodon data
set (Cerisara et al., 2018). For example, as shown
in Table 6, our model discovers the ‘‘quotation’’
discourse from both TREC and TWT16, which
is, however, not defined in the Mastodon data set.
This perhaps should not be considered an error.
We argue that it is not sensible to pre-define a
fixed set of dialogue acts for diverse microblog
conversations, due to the rapid change and
wide variety of user behaviors in social media.
Therefore, future work should involve a better
alternative to evaluate the latent discourse without
Figure 5: Visualization of the topic-discourse
assignment of a Twitter conversation from TWT16. The
annotated blue words are prone to be discourse words,
and the red are topic words. The shade indicates the
confidence of the current assignment.
relying on manually defined dialogue acts. We
also notice that our model sometimes fails to
identify discourse behaviors requiring more in-depth
semantic understanding, such as sarcasm, irony,
and humor. This is because our model detects
latent discourse purely based on the observed
words, whereas the detection of sarcasm, irony, or
humor requires deeper language understanding,
which is beyond the capacity of our model.
6 Conclusion and Future Work
We have presented a neural framework that jointly
explores topic and discourse from microblog
conversations. Our model, in an unsupervised
manner, examines the conversation contexts and
discovers word distributions that reflect latent
topics and discourse roles. Results from extensive
experiments show that our model can generate
coherent topics and meaningful discourse roles.
In addition, our model can be easily combined
with other neural network architectures (such
as CNN) and allows for joint training, which
yields better message classification results
than the pipeline approach without joint
training.
Our model captures topic and discourse
representations embedded in conversations. They are
potentially useful for a broad range of
downstream applications, which are worth exploring in
future research. For example, our model is useful
for developing social chatbots (Zhou et al., 2018).
By explicitly modeling ‘‘what you say’’ and ‘‘how
you say’’, our model can be adapted to track the
Models
CNN only
Separate-Train
Joint-Train
TREC
TWT16
Acc Avg F1 Acc Avg F1
0.311
0.199
0.390
0.284
0.413
0.297
0.334
0.391
0.428
0.167
0.270
0.286
Table 7: Accuracy (Acc) and average F1 (Avg F1)
on tweet classification (hashtags as labels). CNN
only: CNN without using our representations.
Separate-Train: CNN fed with our pre-trained
representations. Joint-Train: Joint training CNN
and our model.
change of topics in conversation context, which helps
to determine ‘‘what to say and how to say’’ in
the next turn. Also, it would be interesting to
study how our learned latent topics and discourse
affect recommendation (Zeng et al., 2018b) and
summarization of microblog conversations (Li
et al., 2018).
Acknowledgments
This work is partially supported by the Research
Grants Council of the Hong Kong Special
Administrative Region, China (No. CUHK 14208815 and
No. CUHK 14210717 of the General Research Fund),
Innovate UK (grant no. 103652), and Microsoft
Research Asia (2018 Microsoft Research Asia
Collaborative Research Award). We thank Shuming
Shi, Dong Yu, and TACL reviewers and editors
for the insightful suggestions on various aspects
of this work.
References
Stergos D. Afantenos, Eric Kow, Nicholas Asher,
and Jérémy Perret. 2015. Discourse parsing for
multi-party chat dialogues. In Proceedings of
the 2015 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2015,
pages 928–937. Lisbon.
David Alvarez-Melis and Martin Saveski. 2016.
Topic modeling in twitter: Aggregating tweets
by conversations. In Proceedings of the Tenth
International Conference on Web and Social
Media, pages 519–522. Cologne.
Eytan Bakshy, Solomon Messing, and Lada A.
Adamic. 2015. Exposure to ideologically diverse
news and opinion on facebook. Science,
348(6239):1130–1132.
David M. Blei, Thomas L. Griffiths, Michael I.
Jordan, and Joshua B. Tenenbaum. 2003. Hierarchical
topic models and the nested chinese
restaurant process. In Advances in Neural
Information Processing Systems, NIPS 2003,
pages 17–24. Vancouver and Whistler.
David M. Blei, Alp Kucukelbir, and Jon D.
McAuliffe. 2016. Variational inference: A review
for statisticians. CoRR, abs/1601.00670.
David M. Blei, Andrew Y. Ng, and Michael I.
Jordan. 2001. Latent Dirichlet allocation. In
Advances in Neural Information Processing
Systems 14, pages 601–608. Vancouver.
Christophe Cerisara, Somayeh Jafaritazehjani,
Adedayo Oluokun, and Hoa T. Le. 2018. Multi-task
dialog act and sentiment recognition on
mastodon. In Proceedings of the 27th International
Conference on Computational Linguistics,
COLING 2018, pages 745–754. Santa Fe, NM.
Micha Elsner and Eugene Charniak. 2008. You
talking to me? A corpus and algorithm for
conversation disentanglement. In Proceedings
of the 46th Annual Meeting of the Association
for Computational Linguistics, ACL 2008,
pages 834–842. Columbus, OH.
Micha Elsner and Eugene Charniak. 2010.
Disentangling chat. Computational Linguistics,
36(3):389–409.
Sharon Goldwater and Thomas L. Griffiths. 2007.
A fully Bayesian approach to unsupervised
part-of-speech tagging. In Proceedings of the
45th Annual Meeting of the Association for
Computational Linguistics. Prague.
Thomas Hofmann. 1999. Probabilistic latent semantic
indexing. In Proceedings of the 22nd
Annual International ACM SIGIR Conference
on Research and Development in Information
Retrieval, pages 50–57. Berkeley, CA.
Liangjie Hong and Brian D. Davison. 2010. Empirical
study of topic modeling in Twitter. In
Proceedings of the 3rd Workshop on Social
Network Mining and Analysis, SNAKDD 2009,
pages 80–88. Paris.
Zhiting Hu, Gang Luo, Mrinmaya Sachan, Eric P.
Xing, and Zaiqing Nie. 2016. Grounding topic
models with knowledge bases. In Proceedings
of the Twenty-Fifth International Joint Con-
ference on Artificial Intelligence, IJCAI 2016,
pages 1578–1584. New York, NY.
Eric Jang, Shixiang Gu, and Ben Poole. 2016.
Categorical reparameterization with gumbel-
softmax. CoRR, abs/1611.01144.
Yangfeng Ji, Gholamreza Haffari, and Jacob
Eisenstein. 2016. A latent variable recurrent
neural network for discourse-driven language
models. In Proceedings of the 2016 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, NAACL HLT 2016,
pages 332–342. San Diego, CA.
Yunhao Jiao, Cheng Li, Fei Wu, and Qiaozhu Mei.
2018. Find the conversation killers: A predictive
study of thread-ending posts. In Proceedings
of the 2018 World Wide Web Conference on
World Wide Web, WWW 2018, pages 1145–1154.
Lyon.
Shafiq R. Joty, Giuseppe Carenini, and Chin-Yew
Lin. 2011. Unsupervised modeling of dialog acts
in asynchronous conversations. In Proceedings
of the 22nd International Joint Conference on
Artificial Intelligence, IJCAI 2011, pages 1807–1813.
Barcelona.
Yoon Kim. 2014. Convolutional neural networks
for sentence classification. In Proceedings of
the 2014 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2014,
A meeting of SIGDAT, a Special Interest Group
of the ACL, pages 1746–1751. Doha.
Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. CoRR,
abs/1412.6980.
Diederik P. Kingma and Max Welling. 2013.
Auto-encoding variational Bayes. CoRR,
abs/1312.6114.
Chenliang Li, Haoran Wang, Zhiqian Zhang,
Aixin Sun, and Zongyang Ma. 2016a. Topic
modeling for short texts with auxiliary word
embeddings. In Proceedings of the 39th International
ACM SIGIR conference on Research
and Development in Information Retrieval,
SIGIR 2016, pages 165–174. Pisa.
Jing Li, Ming Liao, Wei Gao, Yulan He, and
Kam-Fai Wong. 2016b. Topic extraction from
microblog posts using conversation structures.
In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
ACL 2016, Volume 1: Long Papers. Berlin.
Jing Li, Yan Song, Zhongyu Wei, and Kam-Fai
Wong. 2018. A joint model of conversational
discourse and latent topics on microblogs.
Computational Linguistics, 44(4):719–754.
Jiwei Li, Thang Luong, Dan Jurafsky, and
Eduard H. Hovy. 2015. When are tree structures
necessary for deep learning of representations?
In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing,
EMNLP 2015, pages 2304–2314. Lisbon.
Chris J. Maddison, Andriy Mnih, and Yee Whye
Teh. 2016. The concrete distribution: A contin-
uous relaxation of discrete random variables.
CoRR, abs/1611.00712.
Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze. 2008. Introduction to
Information Retrieval. Cambridge University
Press.
Rishabh Mehrotra, Scott Sanner, Wray L. Buntine,
and Lexing Xie. 2013. Improving LDA topic
models for microblogs via tweet pooling and
automatic labeling. In The 36th International
ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR
’13, pages 889–892. Dublin.
Marina Meila. 2003. Comparing clusterings by the
variation of information. In Computational Learn-
ing Theory and Kernel Machines, 16th Annual
Conference on Computational Learning Theory
and 7th Kernel Workshop, COLT/Kernel,
Proceedings, pages 173–187. Washington, DC.
Yishu Miao, Edward Grefenstette, and Phil
Blunsom. 2017. Discovering discrete latent
topics with neural variational inference. In
Proceedings of the 34th International
Conference on Machine Learning, ICML 2017,
pages 2410–2419. Sydney.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified
linear units improve restricted Boltzmann
machines. In Proceedings of the 27th International
Conference on Machine Learning (ICML 2010),
pages 807–814. Haifa.
Dat Quoc Nguyen, Richard Billingsley, Lan Du,
and Mark Johnson. 2015. Improving topic
models with latent feature word representations.
Transactions of the Association for
Computational Linguistics, 3:299–313.
Kamal Nigam, Andrew McCallum, Sebastian Thrun,
and Tom M. Mitchell. 2000. Text classification
from labeled and unlabeled documents using
EM. Machine Learning, 39(2/3):103–134.
Jeffrey Pennington, Richard Socher, and Christopher
D. Manning. 2014. GloVe: Global vectors for
word representation. In Proceedings of the
2014 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2014,
pages 1532–1543. Doha.
Kechen Qin, Lu Wang, and Joseph Kim. 2017.
Joint modeling of content and discourse relations
in dialogues. In Proceedings of the 55th
Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Volume 1:
Long Papers, pages 974–984. Vancouver.
Xiaojun Quan, Chunyu Kit, Yong Ge, and Sinno
Jialin Pan. 2015. Short and sparse text topic
modeling via self-aggregation. In Proceedings
of the Twenty-Fourth International Joint
Conference on Artificial Intelligence, IJCAI 2015,
pages 2270–2276. Buenos Aires.
Daniel Ramage, Susan T. Dumais, and Daniel J.
Liebling. 2010. Characterizing microblogs with
topic models. In Proceedings of the Fourth
International Conference on Weblogs and Social
Media, ICWSM 2010, Washington, DC.
Danilo Jimenez Rezende, Shakir Mohamed,
and Daan Wierstra. 2014. Stochastic
backpropagation and approximate inference in deep
generative models. In Proceedings of the 31th
International Conference on Machine Learning,
ICML 2014, pages 1278–1286. Beijing.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010.
Unsupervised modeling of twitter conversations.
In Human Language Technologies: Conference
of the North American Chapter of the
Association of Computational Linguistics,
Proceedings, pages 172–180. Los Angeles, CA.
Michael Röder, Andreas Both, and Alexander
Hinneburg. 2015. Exploring the Space of
Topic Coherence Measures. In Proceedings of
the Eighth ACM International Conference on
Web Search and Data Mining, WSDM 2015,
pages 399–408. Shanghai.
Michal Rosen-Zvi, Thomas L. Griffiths, Mark
Steyvers, and Padhraic Smyth. 2004. The
author-topic model for authors and documents.
In UAI ’04, Proceedings of the 20th Conference
in Uncertainty in Artificial Intelligence,
pages 487–494. Banff.
Andrew Rosenberg and Julia Hirschberg. 2007.
V-measure: A conditional entropy-based exter-
nal cluster evaluation measure. In Proceed-
ings of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and
Computational Natural Language Learning,
EMNLP-CoNLL 2007, pages 410–420. Prague.
Bei Shi, Wai Lam, Shoaib Jameel, Steven
Schockaert, and Kwun Ping Lai. 2017. Jointly
learning word embeddings and latent topics.
In Proceedings of the 40th International ACM
SIGIR Conference on Research and Development
in Information Retrieval, pages 375–384.
Tokyo.
Yangqiu Song, Haixun Wang, Zhongyuan Wang,
Hongsong Li, and Weizhu Chen. 2011. Short
text conceptualization using a probabilistic
knowledgebase. In Proceedings of the 22nd
International Joint Conference on Artificial
Intelligence, IJCAI 2011, pages 2330–2336.
Barcelona.
Akash Srivastava and Charles Sutton. 2017.
Autoencoding variational inference for topic models.
In Proceedings of the Fifth International
Conference on Learning Representations, ICLR
2017. Toulon.
Andreas Stolcke, Noah Coccaro, Rebecca Bates,
Paul Taylor, Carol Van Ess-Dykema, Klaus
Ries, Elizabeth Shriberg, Daniel Jurafsky,
Rachel Martin, and Marie Meteer. 2000. Dialogue
act modeling for automatic tagging and
recognition of conversational speech.
Computational Linguistics, 26(3).
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and
Xueqi Cheng. 2013. A biterm topic model for
short texts. In 22nd International World Wide
Web Conference, WWW ’13, pages 1445–1456.
Rio de Janeiro.
Yi Yang, Doug Downey, and Jordan L. Boyd-Graber.
2015. Efficient methods for incorporating
knowledge into topic models. In
Proceedings of the 2015 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2015, pages 308–317. Lisbon.
Elina Zarisheva and Tatjana Scheffler. 2015.
Dialog act annotation for twitter conversations. In
Proceedings of the SIGDIAL 2015 Conference,
pages 114–123. Prague.
Jichuan Zeng, Jing Li, Yan Song, Cuiyun Gao,
Michael R. Lyu, and Irwin King. 2018a. Topic
memory networks for short text classification.
In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2018. Brussels.
Xingshan Zeng, Jing Li, Lu Wang, Nicholas
Beauchamp, Sarah Shugars, and Kam-Fai
Wong. 2018b. Microblog conversation
recommendation via joint modeling of topics and
discourse. In Proceedings of the 2018 Conference
of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2018, Volume 1 (Long Papers), pages 375–385.
New Orleans, LA.
Tiancheng Zhao, Kyusong Lee, and Maxine
Eskénazi. 2018. Unsupervised discrete sentence
representation learning for interpretable neural
dialog generation. In Proceedings of the 56th
Annual Meeting of the Association for
Computational Linguistics, ACL 2018, Volume 1:
Long Papers, pages 1098–1107. Melbourne.
Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi.
2017. Learning discourse-level diversity for
neural dialog models using conditional
variational autoencoders. In Proceedings of the 55th
Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Volume 1:
Long Papers, pages 654–664. Vancouver.
Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing
He, Ee-Peng Lim, Hongfei Yan, and Xiaoming
Li. 2011. Comparing Twitter and traditional
media using topic models. In Proceedings
of Advances in Information Retrieval – 33rd
European Conference on IR Research, ECIR
2011, pages 338–349. Dublin.
Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung
Shum. 2018. The design and implementation of
Xiaoice, an empathetic social chatbot. CoRR,
abs/1812.08989.