What You Say and How You Say it: Joint Modeling of
Topics and Discourse in Microblog Conversations

Jichuan Zeng1∗ Jing Li2∗ Yulan He3 Cuiyun Gao1 Michael R. Lyu1 Irwin King1

1Department of Computer Science and Engineering
The Chinese University of Hong Kong, HKSAR, China
2Tencent AI Lab, Shenzhen, China
3Department of Computer Science, University of Warwick, UK
1{jczeng, cygao, lyu, king}@cse.cuhk.edu.hk
2ameliajli@tencent.com, 3yulan.he@warwick.ac.uk

Abstract

This paper presents an unsupervised framework for jointly modeling
topic content and discourse behavior in microblog conversations.
Concretely, we propose a neural model to discover word clusters
indicating what a conversation concerns (i.e., topics) and those
reflecting how participants voice their opinions (i.e., discourse).1
Extensive experiments show that our model can yield both coherent
topics and meaningful discourse behavior. Further study shows that our
topic and discourse representations can benefit the classification of
microblog messages, especially when they are jointly trained with the
classifier.

1 Introduction

The last decade has witnessed a revolution in communication, in which
‘‘kitchen table conversations’’ have expanded into public discussions
on online platforms. As a consequence, in our daily life, exposure to
new information and the exchange of personal opinions have been
mediated through microblogs, a popular genre of online platform
(Bakshy et al., 2015). The flourishing of microblogs has also led to a
sheer quantity of user-created conversations emerging every day,
exposing individuals to superfluous information. Facing such an
unprecedented number of conversations relative to the limited attention
of individuals, how shall we automatically extract the critical points
and make sense of these microblog conversations?

∗This work was partially conducted during Jichuan Zeng’s
internship at Tencent AI Lab. Corresponding author: Jing Li.
1Our data sets and code are available at:
http://github.com/zengjichuan/Topic_Disc.


Toward understanding the key focus of a conversation, previous work has
shown the benefits of discourse structure (Li et al., 2016b; Qin et al.,
2017; Li et al., 2018), which shapes how messages interact with each
other, forming the discussion flow, and can usefully reflect salient
topics raised in the discussion process. After all, the topical content
of a message naturally occurs in the context of the conversation
discourse and hence should not be modeled in isolation. Conversely, the
extracted topics can reveal the purpose of participants and further
facilitate the understanding of their discourse behavior (Qin et al.,
2017). Further, the joint effects of topics and discourse will
contribute to a better understanding of social media conversations,
benefiting downstream tasks such as the management of discussion topics
and discourse behavior of social chatbots (Zhou et al., 2018) and the
prediction of user engagements for conversation recommendation (Zeng
et al., 2018b).

To illustrate how topics and discourse interplay in a conversation,
Figure 1 displays a snippet of a Twitter conversation. As can be seen,
the content words reflecting the discussion topics (such as ‘‘supreme
court’’ and ‘‘gun rights’’) appear in the context of the discourse
flow, where participants carry the conversation forward by making a
statement, giving a comment, asking a question, and so forth. Motivated
by this observation, we assume that a microblog conversation can be
decomposed into two crucially different components: one for topical
content and the other for discourse behavior. Here, the topic
components indicate what a conversation is centered around and reflect
the important discussion points put forward in the conversation
process. The discourse components signal the discourse roles of
messages, such as making

Transactions of the Association for Computational Linguistics, vol. 7, pp. 267–281, 2019. Action Editor: Jianfeng Gao.
Submission batch: 12/2018; Revision batch: 2/2019; Published 5/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


end-to-end training of topic and discourse representation learning
with other neural models for diverse tasks.

For model evaluation, we conduct an extensive empirical study on two
large-scale Twitter data sets. The intrinsic results show that our
model can produce latent topics and discourse roles with better
interpretability than the state-of-the-art models from previous
studies. The extrinsic evaluations on a tweet classification task
exhibit the model’s ability to capture useful representations for
microblog messages. In particular, our model can be easily combined
with existing neural models, such as convolutional neural networks, for
end-to-end training, which is shown to perform better in classification
than the pipeline approach without joint training.

2 Related Work

Our work is in line with previous studies that use non-neural models to
leverage discourse structure for extracting topical content from
conversations (Li et al., 2016b; Qin et al., 2017; Li et al., 2018).
Zeng et al. (2018b) explore how discourse and topics jointly affect
user engagements in microblog discussions. Different from them, we
build our model in a neural network framework, where the joint effects
of topic and discourse representations can be exploited for various
downstream deep learning tasks in an end-to-end manner. In addition, we
are inspired by prior research that models only topics or only
conversation discourse. In the following, we discuss them in turn.

Topic Modeling. Our work is closely related to topic model studies. In
this field, despite the huge success achieved by the springboard topic
models (e.g., pLSA [Hofmann, 1999] and LDA [Blei et al., 2001]) and
their extensions (Blei et al., 2003; Rosen-Zvi et al., 2004), the
applications of these models have been limited to formal and
well-edited documents, such as news reports (Blei et al., 2003) and
scientific articles (Rosen-Zvi et al., 2004), owing to their reliance
on document-level word collocations. When processing short texts, such
as the messages on microblogs, it is likely that the performance of
these models will be compromised because of the severe data sparsity
issue.

To deal with such an issue, many previous efforts incorporate external
representations, such as word embeddings (Nguyen et al., 2015; Li et al.,

Figure 1: A Twitter conversation snippet about the gun control issue in
the U.S. Topic words reflecting the conversation focus are in boldface.
The italic words in [ ] are our interpretations of the messages’
discourse roles.

a statement, asking a question, and other dialogue acts (Ritter et al.,
2010; Joty et al., 2011), which further shape the discourse structure
of a conversation.2 To distinguish the above two components, we examine
the conversation contexts and identify two types of words: topic words,
indicating what a conversation focuses on, and discourse words,
reflecting how the opinion is voiced in each message. For example, in
Figure 1, the topic words ‘‘gun’’ and ‘‘control’’ indicate the
conversation topic, while the discourse words ‘‘what’’ and ‘‘?’’ signal
the question in M3.

Concretely, we propose a neural framework built upon topic models,
enabling the joint exploration of word clusters to represent topics and
discourse in microblog conversations. Different from the prior models
trained on annotated data (Li et al., 2016b; Qin et al., 2017), our
model is fully unsupervised, not dependent on annotations for either
topics or discourse, which ensures its immediate applicability in any
domain or language. Moreover, taking advantage of the recent advances
in neural topic models (Srivastava and Sutton, 2017; Miao et al., 2017),
we are able to approximate Bayesian variational inference without
requiring model-specific derivations, while most existing work (Ritter
et al., 2010; Joty et al., 2011; Alvarez-Melis and Saveski, 2016; Zeng
et al., 2018b; Li et al., 2018) requires expertise to customize model
inference algorithms. In addition, our neural nature enables

2In this paper, the discourse role refers to a certain type of
dialogue act (e.g., statement or question) for each message,
and the discourse structure refers to some combination of
discourse roles in a conversation.


2016a; Shi et al., 2017) and knowledge (Song et al., 2011; Yang et al.,
2015; Hu et al., 2016), pre-trained on large-scale high-quality
resources. Different from them, our model learns topic and discourse
representations only with the internal data and thus can be widely
applied in scenarios where the specific external resource is
unavailable.

In another line of research, most prior work focuses on how to enrich
the context of short messages. To this end, the biterm topic model
(BTM) (Yan et al., 2013) extends a message into a biterm set with all
combinations of any two distinct words appearing in the message. In
contrast, our model allows the richer context in a conversation to be
exploited, where word collocation patterns can be captured beyond a
short message.
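To make the biterm construction concrete, the following is a minimal sketch of how a single message could be expanded into its biterm set. It follows only the description given here (all unordered pairs of distinct words in one message); the function name `extract_biterms` is hypothetical, and details such as windowing or repeated words in the original BTM implementation are not reproduced.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one message."""
    # Deduplicate and sort so that each pair is produced exactly once
    # and the output does not depend on word order in the message.
    distinct_words = sorted(set(tokens))
    return list(combinations(distinct_words, 2))


# A four-token message with one repeated word yields three biterms.
example_biterms = extract_biterms(["gun", "control", "gun", "rights"])
```

A message with n distinct words yields n(n-1)/2 biterms, which suggests why very short messages still provide limited co-occurrence evidence on their own.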

In addition, there are many methods using heuristic rules to aggregate
short messages into long pseudo-documents, such as those based on
authorship (Hong and Davison, 2010; Zhao et al., 2011) and hashtags
(Ramage et al., 2010; Mehrotra et al., 2013). Compared with these
methods, we model messages in the context of their conversations, which
has been demonstrated to be a more natural and effective text
aggregation strategy for topic modeling (Alvarez-Melis and Saveski,
2016).

Conversation Discourse. Our work is also in the area of discourse
analysis for conversations, ranging from the prediction of shallow
discourse roles at the utterance level (Stolcke et al., 2000; Ji et al.,
2016; Zhao et al., 2018) to discourse parsing for more complex
conversation structure (Elsner and Charniak, 2008, 2010; Afantenos
et al., 2015). In this area, most existing models rely heavily on data
annotated with discourse labels for learning (Zhao et al., 2017).
Different from them, our model, in a fully unsupervised way, identifies
distributional word clusters to represent latent discourse factors in
conversations. Although such latent discourse variables have been
studied in previous work (Ritter et al., 2010; Joty et al., 2011; Ji
et al., 2016; Zhao et al., 2018), none of them explores the effects of
latent discourse on the identification of conversation topics, which is
a gap our work fills.

3 Our Neural Model for Topics and Discourse in Conversations

This section introduces our neural model that jointly explores latent
representations for topics and discourse in conversations. We first
present an overview of our model in Section 3.1, followed by the model
generative process and inference procedure in Sections 3.2 and 3.3,
respectively.

3.1 Model Overview

In general, our model aims to learn coherent word clusters that reflect
the latent topics and discourse roles embedded in microblog
conversations. To this end, we distinguish two latent components in the
given collection: topics and discourse, each represented by a certain
type of word distribution (distributional word cluster). Specifically,
at the corpus level, we assume that there are $K$ topics, represented
by $\phi^T_k$ ($k = 1, 2, \ldots, K$), and $D$ discourse roles,
captured with $\phi^D_d$ ($d = 1, 2, \ldots, D$). $\phi^T$ and $\phi^D$
are all multinomial word distributions over the vocabulary of size $V$.
Inspired by the neural topic models in Miao et al. (2017), our model
encodes topic and discourse distributions ($\phi^T$ and $\phi^D$) as
latent variables in a neural network and learns the parameters via back
propagation.

Before touching on the details of our model, we first describe how we
formulate the input. On microblogs, as a message might have multiple
replies, messages in an entire conversation can be organized as a tree
with replying relations (Li et al., 2016b, 2018). Though the recent
progress in recursive models allows representation learning from
tree-structured data, previous studies have pointed out that, in
practice, sequence models serve as a simpler yet robust alternative (Li
et al., 2015). In this work, we follow the common practice in most
conversation modeling research (Ritter et al., 2010; Joty et al., 2011;
Zhao et al., 2018) and take a conversation as a sequence of turns. To
this end, each conversation tree is flattened into root-to-leaf paths.
Each such path is hence considered a conversation instance, and a
message on the path corresponds to a conversation turn (Zarisheva and
Scheffler, 2015; Cerisara et al., 2018; Jiao et al., 2018).
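The flattening step can be illustrated with a short sketch. It assumes, purely for illustration, that the reply tree is stored as a dictionary mapping each message id to the ids of its direct replies; the function name `root_to_leaf_paths` and this data layout are hypothetical and are not taken from the released code.

```python
def root_to_leaf_paths(replies, root):
    """Flatten a reply tree into its root-to-leaf paths.

    replies maps a message id to the list of ids that reply to it.
    Each returned path is one conversation instance (a sequence of turns).
    """
    paths = []
    stack = [(root, [root])]  # (current message, path from the root to it)
    while stack:
        node, path = stack.pop()
        children = replies.get(node, [])
        if not children:
            paths.append(path)  # a message without replies ends a path
        # Push in reverse so that children are visited in their listed order.
        for child in reversed(children):
            stack.append((child, path + [child]))
    return paths


# m1 has two replies (m2, m3); m2 is itself replied to by m4.
example_tree = {"m1": ["m2", "m3"], "m2": ["m4"]}
example_paths = root_to_leaf_paths(example_tree, "m1")
```

Note that a message near the root appears in several paths, so under this scheme it contributes to more than one conversation instance.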

The overall architecture of our model is shown in Figure 2. Formally,
we formulate a conversation $c$ as a sequence of messages
$(x_1, x_2, \ldots, x_{M_c})$, where $M_c$ denotes the number of
messages in $c$. In the conversation, each message $x$, as the target
message, is fed into our model sequentially. Here we process the target
message $x$ as the bag-of-words (BoW) term vector
$x_{BoW} \in \mathbb{R}^V$, following the bag-of-words assumption in most

topics (Li et al., 2018; Zeng et al., 2018b). Concretely, we define the
latent topic variable $z \in \mathbb{R}^K$ at the conversation level
and generate the topic mixture of $c$, denoted as a $K$-dimensional
distribution $\theta$, via a softmax construction conditioned on $z$
(Miao et al., 2017).

Latent Discourse. For modeling the discourse structure of
conversations, we capture the message-level discourse roles reflecting
the dialogue acts of each message, as is done in Ritter et al. (2010).
Concretely, given the target message x, we use a D-dimensional one-hot
vector to represent the latent discourse variable d, where the high bit
indicates the index of the discourse word distribution that best
expresses x’s discourse role. In the generative process, the latent
discourse d is drawn from a multinomial distribution with parameters
estimated from the input data.

Data Generative Process. As mentioned previously, our entire framework
is based on VAE, which consists of an encoder and a decoder. The
encoder maps a given input into latent topic and discourse
representations, and the decoder reconstructs the original input from
the latent representations. In the following, we first describe the
decoder, followed by the encoder.

In general, our decoder is learned to reconstruct the words in the
target message x (in the BoW form) from the latent topic z and latent
discourse d. We show the generative story that reflects the
reconstruction process below:

• Draw the latent topic $z \sim \mathcal{N}(\mu, \sigma^2)$

• $c$’s topic mixture $\theta = \mathrm{softmax}(f_\theta(z))$

• Draw the latent discourse $d \sim \mathrm{Multi}(\pi)$

• For the $n$-th word in $x$

– $\beta_n = \mathrm{softmax}(f_{\phi^T}(\theta) + f_{\phi^D}(d))$
– Draw the word $w_n \sim \mathrm{Multi}(\beta_n)$

where $f_*(\cdot)$ is a neural perceptron, with a linear transformation
of inputs activated by a non-linear transformation. Here we use
rectified linear units (Nair and Hinton, 2010) as the activation
functions. In particular, the weight matrix of $f_{\phi^T}(\cdot)$
(after the softmax normalization) is considered as the topic-word
distributions $\phi^T$. The discourse-word distributions $\phi^D$ are
similarly obtained from $f_{\phi^D}(\cdot)$.
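As a rough illustration of the generative story above, the sketch below computes the word distribution β for one message from a sampled z and a one-hot d. The weight matrices `w_topic` and `w_disc`, their biases, and the simplification that f_θ is the identity are assumptions made for this example; they are not taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_layer(weights, bias, inputs):
    """One perceptron layer: ReLU(W @ inputs + bias), W given as rows."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def word_distribution(z, d_onehot, w_topic, b_topic, w_disc, b_disc):
    """Return beta, a distribution over the V vocabulary words."""
    theta = softmax(z)  # topic mixture; assumes f_theta is the identity
    topic_scores = relu_layer(w_topic, b_topic, theta)    # f_phiT(theta)
    disc_scores = relu_layer(w_disc, b_disc, d_onehot)    # f_phiD(d)
    return softmax([t + s for t, s in zip(topic_scores, disc_scores)])


# Toy sizes: K = 2 topics, D = 2 discourse roles, V = 3 words.
beta = word_distribution(
    z=[0.0, 0.0],
    d_onehot=[1, 0],
    w_topic=[[1, 0], [0, 1], [1, 1]], b_topic=[0, 0, 0],
    w_disc=[[1, 0], [0, 1], [0, 0]], b_disc=[0, 0, 0],
)
```

Because the topic and discourse scores are added before a single softmax, a word can receive high probability from either component, which is why a separate mechanism is needed to keep the two sets of word clusters apart.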

Figure 2: The architecture of our neural framework. The latent topics z
and latent discourse d are jointly modeled from conversation c and
target message x, respectively. Mutual information penalty (MI) is used
to separate word clusters representing topics and discourse. Afterward,
z and d are used to reconstruct the target message x′.

topic models (Blei et al., 2003; Miao et al., 2017). The conversation,
$c$, in which the target message $x$ is involved, is considered as the
context of $x$. It is also encoded in the BoW form (denoted as
$c_{BoW} \in \mathbb{R}^V$) and fed into our model. In doing so, we
ensure that the context of the target message is incorporated while
learning its latent representations.
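The two BoW inputs can be sketched as follows. This example assumes that messages are already tokenized, that `vocab_index` (a hypothetical name) maps each vocabulary word to a column, and that the conversation vector counts the words of all messages on the path, including the target message itself; the text does not state whether the target is included, so that choice is an assumption.

```python
from collections import Counter


def bow_vector(tokens, vocab_index):
    """Count vector of length V; out-of-vocabulary tokens are skipped."""
    vec = [0] * len(vocab_index)
    for word, count in Counter(tokens).items():
        if word in vocab_index:
            vec[vocab_index[word]] = count
    return vec


def encode_turn(conversation, target_pos, vocab_index):
    """Return (x_bow, c_bow) for one target message in a conversation."""
    x_bow = bow_vector(conversation[target_pos], vocab_index)
    all_tokens = [tok for message in conversation for tok in message]
    c_bow = bow_vector(all_tokens, vocab_index)
    return x_bow, c_bow


vocab_index = {"gun": 0, "control": 1, "what": 2, "?": 3}
conversation = [["gun", "control"], ["what", "gun", "?", "lol"]]
x_bow, c_bow = encode_turn(conversation, 1, vocab_index)
```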

Following the previous practice in neural topic models (Miao et al.,
2017; Srivastava and Sutton, 2017), we utilize the variational
auto-encoder (VAE) (Kingma and Welling, 2013) to resemble the data
generative process via two steps. First, given the target message $x$
and its conversation $c$, our model converts them into two latent
variables: topic variable $z$ and discourse variable $d$. Then, using
the intermediate representations captured by $z$ and $d$, we
reconstruct the target message, $x'$.

3.2 Generative Process

In this section, we first describe the two latent variables in our
model: the topic variable z and the discourse variable d. Then, we
present our data generative process from the latent variables.

Latent Topics. For latent topic learning, we examine the main
discussion points in the context of a conversation. Our assumption is
that messages in the same conversation tend to focus on similar

For the encoder, we learn the parameters $\mu$, $\sigma$, and $\pi$
from the input $x_{BoW}$ and $c_{BoW}$ (the BoW form of the target
message and its conversation), using the following formulas:

$$\mu = f_\mu(f_e(c_{BoW})), \quad \log \sigma = f_\sigma(f_e(c_{BoW})), \quad \pi = \mathrm{softmax}(f_\pi(x_{BoW})) \qquad (1)$$
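Equation 1 can be read as a small feed-forward computation. The sketch below is one possible reading, under the assumptions that f_e is a single ReLU layer shared by µ and log σ, and that f_µ, f_σ, and f_π are plain linear layers; the parameter names in `params` are hypothetical and layer sizes are toy values.

```python
import math


def linear(weights, bias, inputs):
    """W @ inputs + bias, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(x_bow, c_bow, params):
    """Return (mu, log_sigma, pi) following the shape of Equation 1."""
    hidden = relu(linear(params["W_e"], params["b_e"], c_bow))     # f_e
    mu = linear(params["W_mu"], params["b_mu"], hidden)            # f_mu
    log_sigma = linear(params["W_sigma"], params["b_sigma"], hidden)
    pi = softmax(linear(params["W_pi"], params["b_pi"], x_bow))    # f_pi
    return mu, log_sigma, pi


# Toy sizes: V = 3 words, hidden size 2, K = 2 topics, D = 2 roles.
toy_params = {
    "W_e": [[1, 0, 0], [0, 1, 0]], "b_e": [0, 0],
    "W_mu": [[1, 0], [0, 1]], "b_mu": [0, 0],
    "W_sigma": [[0, 0], [0, 0]], "b_sigma": [0, 0],
    "W_pi": [[1, 0, 0], [0, 0, 1]], "b_pi": [0, 0],
}
mu, log_sigma, pi = encode([1, 0, 0], [2, 1, 0], toy_params)
```

The topic parameters depend only on the conversation vector, while the discourse parameters depend only on the target message, which mirrors the conversation-level versus message-level split described in the text.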

3.3 Model Inference

For the objective function of our entire framework,
we take three aspects into account: the learning of
latent topics and discourse, the reconstruction of
the target messages, and the separation of topic-
associated words and discourse-related words.

Learning Latent Topics and Discourse. For learning the latent
topics/discourse in our model, we utilize variational inference (Blei
et al., 2016) to approximate the posterior distribution over the latent
topic z and the latent discourse d given all the training data. To this
end, we maximize the variational lower bound $\mathcal{L}_z$ for z and
$\mathcal{L}_d$ for d, each defined as follows:

$$\mathcal{L}_z = \mathbb{E}_{q(z \mid c)}[p(c \mid z)] - D_{KL}(q(z \mid c) \,\|\, p(z))$$
$$\mathcal{L}_d = \mathbb{E}_{q(d \mid x)}[p(x \mid d)] - D_{KL}(q(d \mid x) \,\|\, p(d)) \qquad (2)$$

$q(z \mid c)$ and $q(d \mid x)$ are approximated posterior
probabilities describing how the latent topic z and the latent
discourse d are generated from the data. $p(c \mid z)$ and
$p(x \mid d)$ represent the corpus likelihoods conditioned on the
latent variables. Here, to facilitate coherent topic production, in
$p(c \mid z)$, we penalize the likelihood of stopwords being generated
from latent topics, following Li et al. (2018). $p(z)$ follows the
standard normal prior $\mathcal{N}(0, I)$ and $p(d)$ is the uniform
distribution $\mathrm{Unif}(0, 1)$. $D_{KL}$ refers to the
Kullback-Leibler divergence that ensures the approximated posteriors
are close to the true ones. For more derivation details, we refer
readers to Miao et al. (2017).
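The two KL terms in Equation 2 have simple closed forms under common assumptions. The sketch below assumes a diagonal Gaussian posterior for z measured against the standard normal prior, and a categorical posterior π for d measured against a uniform prior over the D roles (that is, probability 1/D per role, which is how the uniform prior is interpreted here). Function names are hypothetical.

```python
import math


def kl_gaussian_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls
        for m, ls in zip(mu, log_sigma)
    )


def kl_categorical_uniform(pi):
    """KL( Cat(pi) || Uniform over len(pi) categories )."""
    num_roles = len(pi)
    # Terms with zero probability contribute nothing (0 * log 0 := 0).
    return sum(p * math.log(p * num_roles) for p in pi if p > 0)


kl_z = kl_gaussian_standard_normal([1.0, 0.0], [0.0, 0.0])
kl_d = kl_categorical_uniform([1.0, 0.0])
```

Both terms are zero exactly when the posterior equals the prior, so they act as regularizers that pull the encoder outputs toward N(0, I) and toward a uniform choice of discourse role.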

Reconstructing target messages. From the latent variables z and d, the
goal of our model is to reconstruct the target message x. The
corresponding learning objective is to maximize $\mathcal{L}_x$,
defined as:

$$\mathcal{L}_x = \mathbb{E}_{q(z \mid x)q(d \mid c)}[\log p(x \mid z, d)] \qquad (3)$$

Here we design $\mathcal{L}_x$ to ensure that the learned latent topics
and discourse can reconstruct x.
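For a BoW target, the quantity inside the expectation of Equation 3 reduces to a weighted sum of log-probabilities. The sketch below evaluates log p(x | z, d) for a single sample of (z, d), given the word distribution β that the decoder produces for that sample; approximating the expectation with one sample and the smoothing constant `eps` are assumptions of this example.

```python
import math


def bow_log_likelihood(x_bow, beta, eps=1e-10):
    """log p(x | z, d) = sum over words v of count(v) * log beta[v]."""
    return sum(
        count * math.log(prob + eps)
        for count, prob in zip(x_bow, beta)
        if count  # words absent from the message contribute nothing
    )


# A message containing word 0 twice and word 2 once.
example_ll = bow_log_likelihood([2, 0, 1], [0.5, 0.25, 0.25])
```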

Distinguishing Topics and Discourse. Our model aims to distinguish word
distributions for topics ($\phi^T$) and discourse ($\phi^D$), which
enables topics and discourse to capture different information in
conversations. Concretely, we use the mutual information, given below,
to measure the mutual dependency between the latent topics z and the
latent discourse d.3

$$\mathbb{E}_{q(z)q(d)}\left[\log \frac{p(z, d)}{p(z)\,p(d)}\right] \qquad (4)$$

Equation 4 can be further derived as the Kullback-Leibler divergence
between the conditional distribution, $p(d \mid z)$, and the marginal
distribution, $p(d)$. The derived formula, defined as the mutual
information (MI) loss ($\mathcal{L}_{MI}$) and shown in Equation 5, is
used to map z and d into separated semantic spaces.

$$\mathcal{L}_{MI} = \mathbb{E}_{q(z)}[D_{KL}(p(d \mid z) \,\|\, p(d))] \qquad (5)$$

We can hence minimize $\mathcal{L}_{MI}$ for guiding our model to
separate word distributions that represent topics and discourse.
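The step from the mutual information to an expected KL divergence can be checked numerically on a toy case. The sketch below uses a small discrete joint table p(z, d), which is purely illustrative (in the model z is continuous), and takes the outer expectation under the marginal p(z); it computes the mutual information directly and as an expected KL divergence, and the two agree.

```python
import math


def mutual_information(joint):
    """I(z; d) for a joint table with joint[i][j] = p(z=i, d=j)."""
    p_z = [sum(row) for row in joint]
    p_d = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p_z[i] * p_d[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def expected_kl(joint):
    """E_{p(z)}[ KL( p(d | z) || p(d) ) ] for the same joint table."""
    p_z = [sum(row) for row in joint]
    p_d = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        conditional = [p / p_z[i] for p in row]  # p(d | z=i)
        kl = sum(c * math.log(c / p_d[j])
                 for j, c in enumerate(conditional) if c > 0)
        total += p_z[i] * kl
    return total


dependent = [[0.3, 0.1], [0.1, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
mi_direct = mutual_information(dependent)
mi_as_kl = expected_kl(dependent)
```

When z and d are independent, both quantities are zero, which is the situation the MI loss pushes the model toward.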

The Final Objective. To capture the joint effects of the learning
objectives described above ($\mathcal{L}_z$, $\mathcal{L}_d$,
$\mathcal{L}_x$, and $\mathcal{L}_{MI}$), we design the final objective
function for our entire framework as follows:

$$\mathcal{L} = \mathcal{L}_z + \mathcal{L}_d + \mathcal{L}_x - \lambda \mathcal{L}_{MI} \qquad (6)$$

where the hyperparameter $\lambda$ is the trade-off parameter for
balancing between the MI loss ($\mathcal{L}_{MI}$) and the other
learning objectives. By maximizing the final objective $\mathcal{L}$
via back propagation, the word distributions of topics and discourse
can be jointly learned from microblog conversations.4
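Footnote 4 names two sampling tricks that keep the objective differentiable. The sketch below shows the sampling side of both in plain Python, without gradients: the reparameterization z = µ + σ·ε with ε drawn from a standard normal, and a Gumbel-Softmax relaxation of the one-hot d. The temperature value and the function names are assumptions of this example.

```python
import math
import random


def sample_gaussian(mu, log_sigma, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]


def gumbel_softmax(log_pi, temperature, rng):
    """Relaxed one-hot sample from the categorical with log-probs log_pi."""
    noisy = []
    for lp in log_pi:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((lp + gumbel) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
z_sample = sample_gaussian([0.0, 1.0], [-50.0, -50.0], rng)
d_sample = gumbel_softmax([math.log(0.7), math.log(0.3)], 0.5, rng)
```

Lower temperatures push the relaxed sample closer to a true one-hot vector, at the cost of noisier gradients in an actual training setup.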

4 Experimental Setup

Data Collection. For our experiments, we collected two microblog
conversation data sets from Twitter. One is released by the TREC 2011
microblog track (henceforth TREC), containing conversations concerning
a wide range of topics.5

3The distributions in Equation 4 are all conditional
probability distributions given the target message x and its
conversation c. We omit the conditions for simplicity.

4To smooth the gradients in implementation, for
$z \sim \mathcal{N}(\mu, \sigma^2)$, we apply the reparameterization on
z (Kingma and Welling, 2013; Rezende et al., 2014), and for
$d \sim \mathrm{Multi}(\pi)$, we adopt the Gumbel-Softmax trick
(Maddison et al., 2016; Jang et al., 2016).

5http://trec.nist.gov/data/tweets/.


Data sets   # of convs   Avg msgs per conv   Avg words per msg   |Vocab|
TREC        116,612      3.95                11.38               9,463
TWT16       29,502       8.67                14.70               7,544

Table 1: Statistics of the two data sets containing
Twitter conversations.

The other is crawled from January to June 2016 with the Twitter
streaming API6 (henceforth TWT16, short for Twitter 2016), following
the way the TREC data set was built. During this period, there was a
large volume of discussions centered around the U.S. presidential
election. In addition, for both data sets, we apply the Twitter search
API7 to retrieve the missing tweets in the conversation history, as the
Twitter streaming API (used to collect both data sets) only returns
sampled tweets from the entire pool.

The statistics of the two experiment data sets are shown in Table 1.
For model training and evaluation, we randomly sampled 80%, 10%, and
10% of the data to form the training, development, and test sets,
respectively.
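A random 80/10/10 split of this kind can be sketched as below. The text does not say whether the unit being split is a conversation or a message, nor which random seed was used; splitting at the conversation level with a fixed seed is an assumption of this example, and `split_conversations` is a hypothetical name.

```python
import random


def split_conversations(conversations, seed=0):
    """Shuffle and split into 80% train, 10% development, 10% test."""
    items = list(conversations)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # integer arithmetic avoids float rounding
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


train_set, dev_set, test_set = split_conversations(range(100))
```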

Data Preprocessing. We preprocessed the data with the following steps.
First, non-English tweets were filtered out. Then, hashtags, mentions
(@username), and links were replaced with generic tags ‘‘HASH’’,
‘‘MENT’’, and ‘‘URL’’, respectively. Next, the Natural Language Toolkit
(NLTK) was applied for tweet tokenization.8 After that, all letters
were normalized to lower case. Finally, words that occurred fewer than
20 times were filtered out from the data.
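The preprocessing steps can be approximated with the standard library alone. In the sketch below, a simple regular-expression tokenizer stands in for the NLTK tokenizer used in the paper, language filtering is omitted, and the generic tags are kept in upper case while other tokens are lowercased; all three choices, as well as the regular expressions and function names, are assumptions of this example.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
GENERIC_TAGS = {"URL", "MENT", "HASH"}


def normalize_tweet(text):
    """Replace links, mentions, and hashtags with tags, then tokenize."""
    text = URL_RE.sub(" URL ", text)        # links first, so '#' in URLs is safe
    text = MENTION_RE.sub(" MENT ", text)
    text = HASHTAG_RE.sub(" HASH ", text)
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words or single punctuation marks
    return [tok if tok in GENERIC_TAGS else tok.lower() for tok in tokens]


def filter_rare_words(tokenized_messages, min_count=20):
    """Drop words that occur fewer than min_count times in the data."""
    counts = Counter(tok for msg in tokenized_messages for tok in msg)
    return [[tok for tok in msg if counts[tok] >= min_count]
            for msg in tokenized_messages]


example_tokens = normalize_tweet("@bob What about #GunControl? https://t.co/x")
```

Punctuation such as ‘‘?’’ is kept as its own token here, since the paper later treats punctuation as a useful discourse indicator.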

Parameter Setting. To ensure comparable results with Li et al. (2018)
(the prior work focusing on the same task as ours), in the topic
coherence evaluation, we follow their setup to report the results under
two settings of K (the number of topics): K = 50 and K = 100, with the
number of discourse roles (D) set to 10. The analysis of the effects of
K and D will be further presented in Section 5.5. For all the other
hyperparameters, we tuned them on the development set by grid search.
The trade-off parameter λ (defined

6https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter.html.

7https://developer.twitter.com/en/docs/tweets/search/api-reference/get-savedsearches-show-id.

8https://www.nltk.org/.

in Equation 6), balancing the MI loss and the other objective
functions, is set to 0.01. In model training, we use the Adam optimizer
(Kingma and Ba, 2014) and run 100 epochs with an early stopping
strategy adopted.

Baselines. In topic modeling experiments, we consider five topic model
baselines that treat each tweet as a document: LDA (Blei et al., 2003),
BTM (Yan et al., 2013), LF-LDA, LF-DMM (Nguyen et al., 2015), and NTM
(Miao et al., 2017). In particular, BTM and LF-DMM are the
state-of-the-art topic models for short texts. BTM explores the topics
of all word pairs (biterms) in each message to alleviate data sparsity
in short texts. LF-DMM incorporates word embeddings pre-trained on
external data to expand the semantic meanings of words, as does LF-LDA.
In Nguyen et al. (2015), LF-DMM, based on the one-topic-per-document
Dirichlet Multinomial Mixture (DMM) (Nigam et al., 2000), was reported
to perform better than LF-LDA, based on LDA. For LF-LDA and LF-DMM, we
use GloVe Twitter embeddings (Pennington et al., 2014) as the
pre-trained word embeddings.9

For the discourse modeling experiments, we compare our results with
LAED (Zhao et al., 2018), a VAE-based representation learning model for
conversation discourse. In addition, for both topic and discourse
evaluation, we compare with Li et al. (2018), a recently proposed model
for microblog conversations, where topics and discourse are jointly
explored with a non-neural framework. Besides the existing models from
previous studies, we also compare with the variants of our model that
model only topics (henceforth TOPIC ONLY) or only discourse (henceforth
DISC ONLY).10 Our joint model of topics and discourse is referred to as
TOPIC+DISC.

In the preprocessing procedure for the baselines, we removed stop words
and punctuation for the topic models that are unable to learn discourse
representations, following the common practice in previous work (Yan
et al., 2013; Miao et al., 2017). For the other models, stop words and
punctuation were retained in the vocabulary, considering their
usefulness as discourse indicators (Li et al., 2018).

9https://nlp.stanford.edu/projects/glove/.

10In our ablation without the mutual information loss
($\mathcal{L}_{MI}$, defined in Equation 5), topics and discourse are
learned independently. Thus, its topic representation can be used as
the output of TOPIC ONLY, and its discourse representation as the
output of DISC ONLY.


                     K = 50             K = 100
Models               TREC     TWT16     TREC     TWT16
Baselines
LDA                  0.467    0.454     0.467    0.454
BTM                  0.460    0.461     0.466    0.463
LF-DMM               0.456    0.448     0.463    0.466
LF-LDA               0.470    0.456     0.467    0.453
NTM                  0.478    0.479     0.482    0.443
Li et al. (2018)     0.463    0.433     0.464    0.435
Our models
TOPIC ONLY           0.478    0.482     0.481    0.471
TOPIC+DISC           0.485    0.487     0.496    0.480

Table 2: Cv coherence scores for latent topics produced by different
models. The best result in each column is highlighted in bold. Our
joint model TOPIC+DISC achieves significantly better coherence scores
than all the baselines (p < 0.01, paired test).

5 Experimental Results

In this section, we first report the topic coherence results in
Section 5.1, followed by a discussion in Section 5.2 comparing the
latent discourse roles discovered by our model with the manually
annotated dialogue acts. Then, we study whether we can capture useful
representations for microblog messages in a tweet classification task
(in Section 5.3). A qualitative analysis, showing some example topics
and discourse roles, is further provided in Section 5.4. Finally, in
Section 5.5, we provide more discussions on our model.

5.1 Topic Coherence

For the topic coherence, we adopt the Cv scores measured via the
open-source Palmetto toolkit as our evaluation metric.11 Cv scores
assume that the top N words in a coherent topic (ranked by likelihood)
tend to co-occur in the same document and have shown comparable
evaluation results to human judgments (Röder et al., 2015). Table 2
shows the average Cv scores over the produced topics given N = 5 and
N = 10. The values range from 0.0 to 1.0, and higher scores indicate
better topic coherence. We can observe that:

• Models assuming a single topic for each message do not work well. It
has long been pointed out that the one-topic-per-message assumption
(each message contains only one topic) helps topic models alleviate the
data sparsity issue in short texts on microblogs (Zhao et al., 2011;
Quan et al., 2015; Nguyen et al., 2015; Li et al., 2018). However, we
observe contradictory results because both LF-DMM and Li et al. (2018),
following this assumption, achieve generally worse performance than the
other models. This might be attributed to the large-scale data used in
our experiments (each data set has over 250K messages as shown in
Table 1), which potentially provide richer word co-occurrence patterns
and thus partially alleviate the data sparsity issue.

11https://github.com/dice-group/Palmetto.
• Pre-trained word embeddings do not bring benefits. Comparing LF-LDA with LDA, we found that they result in similar coherence scores. This shows that, with sufficiently large training data, using pre-trained word embeddings makes no difference to the topic coherence results.

• Neural models perform better than non-neural baselines. When comparing the results of neural models (NTM and our models) with the other baselines, we find the former yield topics with better coherence scores in most cases.

• Modeling topics in conversations is effective. Among neural models, we found our models outperform NTM (without exploiting conversation contexts). This shows that the conversations provide useful context and enable more coherent topics to be extracted from the entire conversation thread instead of a single short message.

• Modeling topics together with discourse helps produce more coherent topics. We can observe better results with the joint model TOPIC+DISC in comparison with the variant considering topics only. This shows that TOPIC+DISC, via the joint modeling of topic- and discourse-word distributions (reflecting non-topic information), can better separate topical words from non-topical ones, hence resulting in more coherent topics.

5.2 Discourse Interpretability

In this section, we evaluate whether our model can discover meaningful discourse representations. To this end, we train the comparison models for discourse modeling on the TREC data set and test the learned latent discourse on a benchmark data set released by Cerisara et al. (2018). The benchmark data set consists of 2,217 microblog messages forming 505 conversations collected from Mastodon,12 a microblog platform exhibiting Twitter-like user behavior (Cerisara et al., 2018). For each message, there is a human-assigned discourse label, selected from one of the 15 dialogue acts, such as question, answer, disagreement, and so forth.

Models              Purity   Homogeneity   VI
Baselines
  LAED              0.505    0.022         6.418
  Li et al. (2018)  0.511    0.096         5.540
Our models
  DISC ONLY         0.510    0.112         5.532
  TOPIC+DISC        0.521    0.142         5.097

Table 3: The purity, homogeneity, and variation of information (VI) scores for the latent discourse roles measured against the human-annotated dialogue acts. For purity and homogeneity, higher scores indicate better performance, while for VI scores, lower is better. In each column, the best results are in boldface. Our joint model TOPIC+DISC significantly outperforms all the baselines (p < 0.01, paired t-test).

For discourse evaluation, we measure whether the model-produced discourse assignments are consistent with the human-annotated dialogue acts. Hence, following Zhao et al. (2018), we assume that an interpretable latent discourse role should cluster messages labeled with the same dialogue act. Therefore, we adopt purity (Manning et al., 2008), homogeneity (Rosenberg and Hirschberg, 2007), and variation of information (VI) (Meila, 2003; Goldwater and Griffiths, 2007) as our automatic evaluation metrics. Here, we set D = 15 so that the number of latent discourse roles equals the number of manually labeled dialogue acts. Table 3 shows the comparison results of the average scores over the 15 latent discourse roles. Higher values indicate better performance for purity and homogeneity, while for VI, lower is better. It can be observed that our models exhibit generally better performance, showing the effectiveness of our framework in inducing interpretable discourse roles.
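To make the three clustering metrics concrete, the following is a minimal sketch of how purity, homogeneity, and VI can be computed from two label sequences. It is an illustration rather than the evaluation script used for Table 3: the function names are hypothetical, and the base-2 logarithm is an assumption, since the log base is not stated in this section.

```python
import math
from collections import Counter


def entropy_bits(sizes, n):
    """Shannon entropy (base 2, an assumption) of a partition with the given cluster sizes."""
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)


def clustering_scores(gold, pred):
    """Return (purity, homogeneity, VI) for two equal-length label lists.

    gold: human-annotated dialogue-act labels, one per message.
    pred: latent discourse-role ids assigned by a model, one per message.
    """
    n = len(gold)
    joint = Counter(zip(gold, pred))  # co-occurrence counts n(c, k)
    gold_sizes = Counter(gold)        # n(c)
    pred_sizes = Counter(pred)        # n(k)

    # Purity: each predicted cluster is credited with its majority gold label.
    best = Counter()
    for (_, k), cnt in joint.items():
        best[k] = max(best[k], cnt)
    purity = sum(best.values()) / n

    h_gold = entropy_bits(gold_sizes.values(), n)
    h_pred = entropy_bits(pred_sizes.values(), n)
    # Mutual information I(gold; pred).
    mi = sum(
        (cnt / n) * math.log2(cnt * n / (gold_sizes[c] * pred_sizes[k]))
        for (c, k), cnt in joint.items()
    )

    # Homogeneity = 1 - H(gold | pred) / H(gold) = I(gold; pred) / H(gold).
    homogeneity = 1.0 if h_gold == 0 else mi / h_gold
    # Variation of information = H(gold) + H(pred) - 2 I(gold; pred).
    vi = h_gold + h_pred - 2 * mi
    return purity, homogeneity, vi
```

A clustering that reproduces the gold partition exactly gives purity 1, homogeneity 1, and VI 0, whereas placing every message in a single cluster gives homogeneity 0.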
In particular, we observe the best results achieved by our joint model TOPIC+DISC, which learns to distinguish topic words from discourse words, an ability that is important for recognizing the indicative words reflecting latent discourse.

12https://mastodon.social.

Figure 3: A heatmap showing the alignments of the latent discourse roles and human-annotated dialogue act labels. Each line visualizes the distribution of messages with the corresponding dialogue act label over varying discourse roles (indexed from 1 to 15), where darker colors indicate higher values.

To further analyze the consistency of varying latent discourse roles (produced by our TOPIC+DISC model) with the human-labeled dialogue acts, Figure 3 displays a heatmap, where each line visualizes how the messages with a dialogue act distribute over varying discourse roles. It is seen that among all dialogue acts, our model discovers more interpretable latent discourse for ‘‘greetings’’, ‘‘thanking’’, ‘‘exclamation’’, and ‘‘offer’’, where most messages are clustered into one or two dominant discourse roles. This may be because these dialogue acts are relatively easy to detect based on their associated indicative words, such as the word ‘‘thanks’’ for ‘‘thanking’’, and the word ‘‘wow’’ for ‘‘exclamation’’.

5.3 Message Representations

To further evaluate our ability to capture effective representations for microblog messages, we take tweet classification as an example and test the classification performance with the topic and discourse representations as features. Here, the user-generated hashtags capturing the topics of online messages are used as the proxy class labels (Li et al., 2016b; Zeng et al., 2018a). We construct the classification data set from TREC and TWT16 with the following steps. First, we removed the tweets without hashtags. Second, we ranked hashtags by their frequencies. Third, we manually removed the hashtags that are not topic-related (e.g., ‘‘#fb’’ for indicating the source of tweets from Facebook), and combined the hashtags referring to the same topic (e.g., ‘‘#DonaldTrump’’ and ‘‘#Trump’’). Finally, we selected the top 50 frequent hashtags, and all tweets containing these hashtags as our classification data set. Here, we simply use support vector machines as the classifier, since our focus is to compare the representations learned by various models. Li et al. (2018) is unable to produce vector representations at the tweet level and is hence not considered here.

Models       TREC              TWT16
             Acc     Avg F1    Acc     Avg F1
Baselines
  BoW        0.120   0.026     0.132   0.030
  TF-IDF     0.116   0.024     0.153   0.041
  LDA        0.128   0.041     0.146   0.046
  BTM        0.123   0.035     0.167   0.054
  LF-DMM     0.158   0.072     0.162   0.052
  NTM        0.138   0.042     0.186   0.068
Our model    0.259   0.180     0.341   0.269

Table 4: Evaluation of tweet classification results in accuracy (Acc) and average F1 (Avg F1). Representations learned by various models serve as the classification features. For our model, both the topic and discourse representations are fed into the classifier.

Table 4 shows the classification results of accuracy and average F1 on the two data sets with the representations learned by various models serving as the classification features. We observe that our model outperforms other models by a large margin. The possible reasons are twofold. First, our model derives topics from conversation threads and thus potentially yields better message representations. Second, the discourse representations (only produced by our model) are indicative features for hashtags, because users will exhibit various discourse behaviors in discussing diverse topics (hashtags).
For instance, we observe prominent ‘‘argument’’ discourse from tweets with ‘‘#Trump’’ and ‘‘#Hillary’’, attributed to the controversial opinions on the two candidates in the 2016 U.S. presidential election.

5.4 Example Topics and Discourse Roles

We have shown that joint modeling of topics and discourse presents superior performance on quantitative measures. In this section, we qualitatively analyze the interpretability of our outputs via analyzing the word distributions of some example topics and discourse roles.

Example Topics. Table 5 lists the top 10 words of some example latent topics discovered by various models from the TWT16 data set. According to the words shown, we can interpret the extracted topics as ‘‘gun control’’: discussion about gun law and the failure of gun control in Chicago. We observe that LDA wrongly includes the off-topic word ‘‘flag’’. From the outputs of BTM, LF-DMM, Li et al. (2018), and our TOPIC ONLY variant, though we do not find off-topic words, there are some non-topic words, such as ‘‘said’’ and ‘‘understand’’.13 The output of our TOPIC+DISC model appears to be the most coherent, with words such as ‘‘firearm’’ and ‘‘criminals’’ included, which are clearly relevant to ‘‘gun control’’. Such results indicate the benefit of examining the conversation contexts and jointly exploring topics and discourse in them.

Example Discourse Roles. To qualitatively analyze whether our TOPIC+DISC model can discover interpretable discourse roles, we select the top 10 words from the distributions of some example discourse roles and list them in Table 6. It can be observed that there are some meaningful word clusters reflecting varying discourse roles found without any supervision. Interestingly, we observe that the latent discourse roles from TREC and TWT16, though learned separately, exhibit some notable overlap in their associated top 10 words, particularly for ‘‘question’’ and ‘‘statement’’.
We also note that ‘‘argument’’ is represented by very different words. The reason is that TWT16 contains a large volume of arguments centered around candidates Clinton and Trump, resulting in the frequent appearance of words like ‘‘he’’ and ‘‘she’’.

13Non-topic words do not clearly indicate the corresponding topic, whereas off-topic words are more likely to appear in other topics.

Table 5 (columns: LDA, BTM, LF-DMM, Li et al. (2018), NTM, TOPIC ONLY, TOPIC+DISC): Top 10 representative words of example latent topics discovered from the TWT16 data set. We interpret the topics as ‘‘gun control’’ by the displayed words. Non-topic words are wave-underlined and in blue, and off-topic words are underlined and in red.

Table 6: Top 10 representative words of example discourse roles learned from TREC and TWT16. The discourse roles of the word clusters are manually assigned according to their associated words.

5.5 Further Discussions

In this section, we present further discussions on our joint model, TOPIC+DISC.

Parameter Analysis. Here, we study the two important hyper-parameters in our model, the number of topics (K) and the number of discourse roles (D). In Figure 4, we show the Cv topic coherence given varying K in (a) and the homogeneity measure given varying D in (b). As can be seen, the curves corresponding to the performance on topics and discourse are not monotonic. In particular, better topic coherence scores are achieved given relatively larger topic numbers for TREC, with the best result observed at K = 80. In contrast, the optimum topic number for TWT16 is K = 20, and increasing the number of topics results in worse Cv scores in general. This may be attributed to the relatively centralized topic concerning the U.S. election in the TWT16 corpus. For discourse homogeneity, the best result is achieved given D = 15, the same as the number of manually annotated dialogue acts in the benchmark.

Case Study. To further understand why our model learns meaningful representations for topics and discourse, we present a case study based on the example conversation shown in Figure 1. Specifically, we visualize the topic words (with p(w | z) > p(w | d)) in red and the rest of the words in blue to indicate discourse. Darker red indicates a higher topic likelihood (p(w | z)) and darker blue shows a higher discourse likelihood (p(w | d)). The results are shown in Figure 5. We can observe that topic and discourse words are well separated by our model, which explains why it can generate high-quality representations for both topics and discourse.
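The coloring rule of the case study can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `tag_words` and the dictionary inputs are hypothetical, and the probabilities in the usage note are made-up values, not outputs of our model.

```python
def tag_words(tokens, p_w_given_z, p_w_given_d):
    """Tag each token as a topic or discourse word.

    p_w_given_z: dict mapping a word to p(w | z) under the message's topic.
    p_w_given_d: dict mapping a word to p(w | d) under the message's discourse role.
    Returns (word, tag, likelihood) triples; the likelihood plays the role of
    the color shade, with larger values drawn darker.
    """
    tagged = []
    for w in tokens:
        pz = p_w_given_z.get(w, 0.0)
        pd = p_w_given_d.get(w, 0.0)
        if pz > pd:
            tagged.append((w, "topic", pz))      # shown in red
        else:
            tagged.append((w, "discourse", pd))  # shown in blue
    return tagged
```

For example, with the illustrative inputs `{"gun": 0.05, "laws": 0.03, "what": 0.001}` for p(w | z) and `{"what": 0.04, "gun": 0.002}` for p(w | d), the tokens ‘‘gun’’ and ‘‘laws’’ are tagged as topic words and ‘‘what’’ as a discourse word.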

Model Extensibility. Recall that in the
Introduction, we mentioned that our neural-based
model has the advantage of being easily combined
with other neural network architectures, allowing
for the joint training of both models. Here, we
take message classification (with the setup in
Section 5.3) as an example, and study whether


Figure 4: (a) The impact of topic numbers. The
horizontal axis shows the number of topics; the vertical
axis shows the Cv topic coherence. (b) The impact of
discourse numbers. The horizontal axis represents the
number of discourse roles; the vertical axis represents
the homogeneity measure.

joint training our model with a convolutional neural
network (CNN) (Kim, 2014), a widely used
model for short text classification, can bring
benefits to the classification performance. We set the
embedding dimension to 200, with random
initialization. The results are shown in Table 7, where
we observe that jointly training our model and the
classifier can successfully boost the classification
performance.
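The classification setup reused here takes hashtags as proxy class labels (Section 5.3). A minimal sketch of that label construction is given below. The names `build_dataset`, `non_topic`, and `aliases` are hypothetical, the manual filtering and merging steps are represented by user-supplied inputs, and lowercasing hashtags is an assumption not stated in the text.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def build_dataset(tweets, non_topic=frozenset(), aliases=None, top_n=50):
    """Return (tweet, labels) pairs restricted to the top_n most frequent hashtags.

    non_topic: manually chosen hashtags to discard (e.g. "#fb").
    aliases:   manually chosen merges, e.g. {"#donaldtrump": "#trump"}.
    """
    aliases = aliases or {}

    def topic_tags(text):
        tags = []
        for tag in HASHTAG.findall(text.lower()):
            tag = aliases.get(tag, tag)  # merge hashtags referring to the same topic
            if tag not in non_topic and tag not in tags:
                tags.append(tag)
        return tags

    # Step 1: drop tweets that carry no (topic-related) hashtag.
    tagged = [(text, topic_tags(text)) for text in tweets]
    tagged = [(text, tags) for text, tags in tagged if tags]

    # Step 2: rank hashtags by frequency and keep the top_n.
    freq = Counter(tag for _, tags in tagged for tag in tags)
    kept = {tag for tag, _ in freq.most_common(top_n)}

    # Step 3: keep every tweet containing at least one retained hashtag.
    return [
        (text, [tag for tag in tags if tag in kept])
        for text, tags in tagged
        if any(tag in kept for tag in tags)
    ]
```

A tweet carrying several retained hashtags keeps all of them in this sketch; how such tweets are assigned a single class is not specified in the text, so that choice is left to the caller.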

Error Analysis. We further analyze the errors
in our outputs. For topics, taking a closer look
at their word distributions, we found that our
model sometimes mixes sentiment words with
topic words. For example, among the top 10 words
of a topic ‘‘win people illegal americans hate lt
racism social tax wrong’’, there are words ‘‘hate’’
and ‘‘wrong’’, expressing sentiment rather than
conveying topic-related information. This is due
to the prominent co-occurrences of topic words
and sentiment words in our data, which result in
similar distributions for topics and sentiment.
Future work could focus on the further separation
of sentiment and topic words.

For discourse, we found that our model can
induce some discourse roles beyond the 15
manually defined dialogue acts in the Mastodon data
set (Cerisara et al., 2018). For example, as shown
in Table 6, our model discovers the ‘‘quotation’’
discourse from both TREC and TWT16, which
is, however, not defined in the Mastodon data set.
This perhaps should not be considered as an error.
We argue that it is not sensible to pre-define a
fixed set of dialogue acts for diverse microblog
conversations due to the rapid change and
wide variety of user behaviors in social media.
Therefore, future work should involve a better
alternative to evaluate the latent discourse without

Figure 5: Visualization of the topic-discourse
assignment of a Twitter conversation from TWT16. The
annotated blue words are prone to be discourse words,
and the red are topic words. The shade indicates the
confidence of the current assignment.

relying on manually defined dialogue acts. We
also notice that our model sometimes fails to
identify discourse behaviors requiring more in-depth
semantic understanding, such as sarcasm, irony,
and humor. This is because our model detects
latent discourse purely based on the observed
words, whereas the detection of sarcasm, irony, or
humor requires deeper language understanding,
which is beyond the capacity of our model.

6 Conclusion and Future Work

We have presented a neural framework that jointly
explores topic and discourse from microblog
conversations. Our model, in an unsupervised
manner, examines the conversation contexts and
discovers word distributions that reflect latent
topics and discourse roles. Results from extensive
experiments show that our model can generate
coherent topics and meaningful discourse roles.
Moreover, our model can be easily combined
with other neural network architectures (such
as CNN) and allows for joint training, which
yields better message classification results
compared with the pipeline approach without joint
training.

Our model captures topic and discourse
representations embedded in conversations. They are
potentially useful for a broad range of
downstream applications, which are worth exploring in
future research. For example, our model is useful
for developing social chatbots (Zhou et al., 2018).
By explicitly modeling ‘‘what you say’’ and ‘‘how
you say’’, our model can be adapted to track the
Models            TREC              TWT16
                  Acc     Avg F1    Acc     Avg F1
CNN only          0.199   0.167     0.334   0.311
Separate-Train    0.284   0.270     0.391   0.390
Joint-Train       0.297   0.286     0.428   0.413

Table 7: Accuracy (Acc) and average F1 (Avg F1)
on tweet classification (hashtags as labels). CNN
only: CNN without using our representations.
Separate-Train: CNN fed with our pre-trained
representations. Joint-Train: Joint training CNN
and our model.

change of topics in conversation context, helpful
to determine ‘‘what to say and how to say’’ in
the next turn. Also, it would be interesting to
study how our learned latent topics and discourse
affect recommendation (Zeng et al., 2018b) and
summarization of microblog conversations (Li
et al., 2018).

Acknowledgments

This work is partially supported by the Research
Grants Council of the Hong Kong Special
Administrative Region, China (No. CUHK 14208815 and
No. CUHK 14210717 of the General Research Fund),
Innovate UK (grant no. 103652), and Microsoft
Research Asia (2018 Microsoft Research Asia
Collaborative Research Award). We thank Shuming
Shi, Dong Yu, and TACL reviewers and editors
for the insightful suggestions on various aspects
of this work.

References

Stergos D. Afantenos, Eric Kow, Nicholas Asher,
and Jérémy Perret. 2015. Discourse parsing for
multi-party chat dialogues. In Proceedings of
the 2015 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2015,
pages 928–937. Lisbon.

David Alvarez-Melis and Martin Saveski. 2016.
Topic modeling in twitter: Aggregating tweets
by conversations. In Proceedings of the Tenth
International Conference on Web and Social
Media, pages 519–522. Cologne.

Eytan Bakshy, Solomon Messing, and Lada A.
Adamic. 2015. Exposure to ideologically
diverse news and opinion on facebook. Science,
348(6239):1130–1132.


David M. Blei, Thomas L. Griffiths, Michael I.
Jordan, and Joshua B. Tenenbaum. 2003.
Hierarchical topic models and the nested chinese
restaurant process. In Advances in Neural
Information Processing Systems, NIPS 2003,
pages 17–24. Vancouver and Whistler.

David M. Blei, Alp Kucukelbir, and Jon D.
McAuliffe. 2016. Variational inference: A
review for statisticians. CoRR, abs/1601.00670.

David M. Blei, Andrew Y. Ng, and Michael I.
Jordan. 2001. Latent Dirichlet allocation. In
Advances in Neural Information Processing
Systems 14, pages 601–608. Vancouver.

Christophe Cerisara, Somayeh Jafaritazehjani,
Adedayo Oluokun, and Hoa T. Le. 2018.
Multi-task dialog act and sentiment recognition on
mastodon. In Proceedings of the 27th
International Conference on Computational
Linguistics, COLING 2018, pages 745–754. Santa Fe,
NM.

Micha Elsner and Eugene Charniak. 2008. You
talking to me? A corpus and algorithm for
conversation disentanglement. In Proceedings
of the 46th Annual Meeting of the Association
for Computational Linguistics, ACL 2008,
pages 834–842. Columbus, OH.

Micha Elsner and Eugene Charniak. 2010.
Disentangling chat. Computational Linguistics,
36(3):389–409.

Sharon Goldwater and Thomas L. Griffiths. 2007.
A fully Bayesian approach to unsupervised
part-of-speech tagging. In Proceedings of the
45th Annual Meeting of the Association of
Computational Linguistics. Prague.

Thomas Hofmann. 1999. Probabilistic latent
semantic indexing. In Proceedings of the 22nd
Annual International ACM SIGIR Conference
on Research and Development in Information
Retrieval, pages 50–57. Berkeley, CA.

Liangjie Hong and Brian D. Davison. 2010.
Empirical study of topic modeling in Twitter. In
Proceedings of the 3rd Workshop on Social
Network Mining and Analysis, SNAKDD 2009,
pages 80–88. Paris.

Zhiting Hu, Gang Luo, Mrinmaya Sachan, Eric P.
Xing, and Zaiqing Nie. 2016. Grounding topic

models with knowledge bases. In Proceedings
of the Twenty-Fifth International Joint
Conference on Artificial Intelligence, IJCAI 2016,
pages 1578–1584. New York, NY.

Eric Jang, Shixiang Gu, and Ben Poole. 2016.
Categorical reparameterization with
gumbel-softmax. CoRR, abs/1611.01144.

Yangfeng Ji, Gholamreza Haffari, and Jacob
Eisenstein. 2016. A latent variable recurrent
neural network for discourse-driven language
models. In Proceedings of the 2016 Conference
of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, NAACL HLT 2016,
pages 332–342. San Diego, CA.

Yunhao Jiao, Cheng Li, Fei Wu, and Qiaozhu Mei.
2018. Find the conversation killers: A predictive
study of thread-ending posts. In Proceedings
of the 2018 World Wide Web Conference on
World Wide Web, WWW 2018, pages 1145–1154.
Lyon.

Shafiq R. Joty, Giuseppe Carenini, and Chin-Yew
Lin. 2011. Unsupervised modeling of dialog acts
in asynchronous conversations. In Proceedings
of the 22nd International Joint Conference on
Artificial Intelligence, IJCAI 2011, pages 1807–1813.
Barcelona.

Yoon Kim. 2014. Convolutional neural networks
for sentence classification. In Proceedings of
the 2014 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2014,
A meeting of SIGDAT, a Special Interest Group
of the ACL, pages 1746–1751. Doha.

Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. CoRR,
abs/1412.6980.

Diederik P. Kingma and Max Welling. 2013.
Auto-encoding variational Bayes. CoRR,
abs/1312.6114.

Chenliang Li, Haoran Wang, Zhiqian Zhang,
Aixin Sun, and Zongyang Ma. 2016a. Topic
modeling for short texts with auxiliary word
embeddings. In Proceedings of the 39th
International ACM SIGIR conference on Research
and Development in Information Retrieval,
SIGIR 2016, pages 165–174. Pisa.

Jing Li, Ming Liao, Wei Gao, Yulan He, and
Kam-Fai Wong. 2016b. Topic extraction from
microblog posts using conversation structures.
In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics,
ACL 2016, Volume 1: Long Papers. Berlin.

Jing Li, Yan Song, Zhongyu Wei, and Kam-Fai
Wong. 2018. A joint model of conversational
discourse and latent topics on microblogs.
Computational Linguistics, 44(4):719–754.

Jiwei Li, Thang Luong, Dan Jurafsky, and
Eduard H. Hovy. 2015. When are tree structures
necessary for deep learning of representations?
In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2015, pages 2304–2314. Lisbon.

Chris J. Maddison, Andriy Mnih, and Yee Whye
Teh. 2016. The concrete distribution: A
continuous relaxation of discrete random variables.
CoRR, abs/1611.00712.

Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze. 2008. Introduction to
Information Retrieval. Cambridge University
Press.

Rishabh Mehrotra, Scott Sanner, Wray L. Buntine,
and Lexing Xie. 2013. Improving LDA topic
models for microblogs via tweet pooling and
automatic labeling. In The 36th International
ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR
’13, pages 889–892. Dublin.

Marina Meila. 2003. Comparing clusterings by the
variation of information. In Computational
Learning Theory and Kernel Machines, 16th Annual
Conference on Computational Learning Theory
and 7th Kernel Workshop, COLT/Kernel,
Proceedings, pages 173–187. Washington, DC.

Yishu Miao, Edward Grefenstette, and Phil
Blunsom. 2017. Discovering discrete latent
topics with neural variational inference. In
Proceedings of the 34th International
Conference on Machine Learning, ICML 2017,
pages 2410–2419. Sydney.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified
linear units improve restricted Boltzmann
machines. In Proceedings of the 27th International

Conference on Machine Learning (ICML 2010),
pages 807–814. Haifa.

Dat Quoc Nguyen, Richard Billingsley, Lan Du,
and Mark Johnson. 2015. Improving topic
models with latent feature word
representations. Transactions of the Association for
Computational Linguistics, TACL, 3:299–313.

Kamal Nigam, Andrew McCallum, Sebastian Thrun,
and Tom M. Mitchell. 2000. Text classification
from labeled and unlabeled documents using
EM. Machine Learning, 39(2/3):103–134.

Jeffrey Pennington, Richard Socher, and Christopher
D. Manning. 2014. GloVe: Global vectors for
word representation. In Proceedings of the
2014 Conference on Empirical Methods in
Natural Language Processing, EMNLP 2014,
pages 1532–1543. Doha.

Kechen Qin, Lu Wang, and Joseph Kim. 2017.
Joint modeling of content and discourse
relations in dialogues. In Proceedings of the 55th
Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Volume 1:
Long Papers, pages 974–984. Vancouver.

Xiaojun Quan, Chunyu Kit, Yong Ge, and Sinno
Jialin Pan. 2015. Short and sparse text topic
modeling via self-aggregation. In Proceedings
of the Twenty-Fourth International Joint
Conference on Artificial Intelligence, IJCAI 2015,
pages 2270–2276. Buenos Aires.

Daniel Ramage, Susan T. Dumais, and Daniel J.
Liebling. 2010. Characterizing microblogs with
topic models. In Proceedings of the Fourth
International Conference on Weblogs and
Social Media, ICWSM 2010, Washington, DC.

Danilo Jimenez Rezende, Shakir Mohamed,
and Daan Wierstra. 2014. Stochastic
backpropagation and approximate inference in deep
generative models. In Proceedings of the 31st
International Conference on Machine Learning,
ICML 2014, pages 1278–1286. Beijing.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010.
Unsupervised modeling of twitter
conversations. In Human Language Technologies:
Conference of the North American Chapter of the
Association of Computational Linguistics,
Proceedings, pages 172–180. Los Angeles, CA.

Michael Röder, Andreas Both, and Alexander
Hinneburg. 2015. Exploring the Space of
Topic Coherence Measures. In Proceedings of
the Eighth ACM International Conference on
Web Search and Data Mining, WSDM 2015,
pages 399–408. Shanghai.

Michal Rosen-Zvi, Thomas L. Griffiths, Mark
Steyvers, and Padhraic Smyth. 2004. The
author-topic model for authors and documents.
In UAI ’04, Proceedings of the 20th
Conference in Uncertainty in Artificial Intelligence,
pages 487–494. Banff.

Andrew Rosenberg and Julia Hirschberg. 2007.
V-measure: A conditional entropy-based
external cluster evaluation measure. In Proceedings
of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and
Computational Natural Language Learning,
EMNLP-CoNLL 2007, pages 410–420. Prague.

Bei Shi, Wai Lam, Shoaib Jameel, Steven
Schockaert, and Kwun Ping Lai. 2017. Jointly
learning word embeddings and latent topics.
In Proceedings of the 40th International ACM
SIGIR Conference on Research and
Development in Information Retrieval, pages 375–384.
Tokyo.

Yangqiu Song, Haixun Wang, Zhongyuan Wang,
Hongsong Li, and Weizhu Chen. 2011. Short
text conceptualization using a probabilistic
knowledgebase. In Proceedings of the 22nd
International Joint Conference on Artificial
Intelligence, IJCAI 2011, pages 2330–2336.
Barcelona.

Akash Srivastava and Charles Sutton. 2017.
Autoencoding variational inference for topic models.
In Proceedings of the Fifth International
Conference on Learning Representations, ICLR
2017. Toulon.

Andreas Stolcke, Noah Coccaro, Rebecca Bates,
Paul Taylor, Carol Van Ess-Dykema, Klaus
Ries, Elizabeth Shriberg, Daniel Jurafsky,
Rachel Martin, and Marie Meteer. 2000.
Dialogue act modeling for automatic tagging and
recognition of conversational speech.
Computational Linguistics, 26(3).

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and
Xueqi Cheng. 2013. A biterm topic model for

short texts. In 22nd International World Wide
Web Conference, WWW ’13, pages 1445–1456.
Rio de Janeiro.

Yi Yang, Doug Downey, and Jordan L.
Boyd-Graber. 2015. Efficient methods for
incorporating knowledge into topic models. In
Proceedings of the 2015 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2015, pages 308–317. Lisbon.

Elina Zarisheva and Tatjana Scheffler. 2015.
Dialog act annotation for twitter conversations. In
Proceedings of the SIGDIAL 2015 Conference,
pages 114–123. Prague.

Jichuan Zeng, Jing Li, Yan Song, Cuiyun Gao,
Michael R. Lyu, and Irwin King. 2018a. Topic
memory networks for short text classification.
In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language
Processing, EMNLP 2018. Brussels.

Xingshan Zeng, Jing Li, Lu Wang, Nicholas
Beauchamp, Sarah Shugars, and Kam-Fai
Wong. 2018b. Microblog conversation
recommendation via joint modeling of topics and
discourse. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2018, Volume 1 (Long Papers), pages 375–385.
New Orleans, LA.

Tiancheng Zhao, Kyusong Lee, and Maxine
Eskénazi. 2018. Unsupervised discrete sentence
representation learning for interpretable neural
dialog generation. In Proceedings of the 56th
Annual Meeting of the Association for
Computational Linguistics, ACL 2018, Volume 1:
Long Papers, pages 1098–1107. Melbourne.

Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi.
2017. Learning discourse-level diversity for
neural dialog models using conditional
variational autoencoders. In Proceedings of the 55th
Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Volume 1:
Long Papers, pages 654–664. Vancouver.

Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing
He, Ee-Peng Lim, Hongfei Yan, and Xiaoming
Li. 2011. Comparing Twitter and traditional
media using topic models. In Proceedings
of Advances in Information Retrieval – 33rd
European Conference on IR Research, ECIR
2011, pages 338–349. Dublin.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung
Shum. 2018. The design and implementation of
Xiaoice, an empathetic social chatbot. CoRR,
abs/1812.08989.
