Topic Modeling in Embedding Spaces
Adji B. Dieng
Columbia University
New York, NY, USA
abd2141@columbia.edu
Francisco J. R. Ruiz∗
DeepMind
London, UK
franrruiz@google.com
David M. Blei
Columbia University
New York, NY, USA
david.blei@columbia.edu
Abstract
Topic modeling analyzes documents to learn
meaningful patterns of words. However, exist-
ing topic models fail to learn interpretable
topics when working with large and heavy-
tailed vocabularies. To this end, we develop
the embedded topic model (ETM), a generative
model of documents that marries traditional
topic models with word embeddings. More spe-
cifically, the ETM models each word with a
categorical distribution whose natural param-
eter is the inner product between the word’s
embedding and an embedding of its assigned
话题. To fit the ETM, we develop an effi-
cient amortized variational inference algo-
rithm. The ETM discovers interpretable topics
even with large vocabularies that include rare
words and stop words. It outperforms exist-
ing document models, such as latent Dirichlet
allocation, in terms of both topic quality and
predictive performance.
1 Introduction
Topic models are statistical tools for discovering
the hidden semantic structure in a collection of
documents (Blei et al., 2003; Blei, 2012). Topic
models and their extensions have been applied
to many fields, such as marketing, sociology,
political science, and the digital humanities.
Boyd-Graber et al. (2017) provide a review.
Most topic models build on latent Dirichlet
allocation (LDA) (Blei et al., 2003). LDA is a
hierarchical probabilistic model that represents
each topic as a distribution over terms and re-
presents each document as a mixture of the top-
ics. When fit to a collection of documents, the
topics summarize their contents, and the topic
∗Work done while at Columbia University and the
University of Cambridge.
proportions provide a low-dimensional represen-
tation of each document. LDA can be fit to large
datasets of text by using variational inference
and stochastic optimization (Hoffman et al., 2010,
2013).
LDA is a powerful model and it is widely used.
然而, it suffers from a pervasive technical
problem—it fails in the face of large vocabularies.
Practitioners must severely prune their vocabular-
ies in order to fit good topic models—namely,
those that are both predictive and interpretable.
This is typically done by removing the most and
least frequent words. On large collections, this
pruning may remove important terms and limit
the scope of the models. The problem of topic
modeling with large vocabularies has yet to be
addressed in the research literature.
In parallel with topic modeling came the idea of
word embeddings. Research in word embeddings
begins with the neural language model of Bengio
等人. (2003), published in the same year and
journal as Blei et al. (2003). Word embeddings
eschew the ‘‘one-hot’’ representation of words—a
vocabulary-length vector of zeros with a single
one—to learn a distributed representation, 一
where words with similar meanings are close in
a lower-dimensional vector space (Rumelhart and
Abrahamson, 1973; Bengio et al., 2006). As for
topic models, researchers scaled up embedding
methods to large datasets (Mikolov et al., 2013a,b;
Pennington et al., 2014; Levy and Goldberg, 2014;
Mnih and Kavukcuoglu, 2013). Word embeddings
have been extended and developed in many ways.
They have become crucial in many applications
of natural language processing (Maas et al., 2011;
Li and Yang, 2018), and they have also been
extended to datasets beyond text (Rudolph et al.,
2016).
在本文中, we develop the embedded topic
model (ETM), a document model that marries LDA
and word embeddings. The ETM enjoys the good
properties of topic models and the good properties
Transactions of the Association for Computational Linguistics, vol. 8, pp. 439–453, 2020. https://doi.org/10.1162/tacl_a_00325
Action Editor: Doug Downey. Submission batch: 2/2020; Revision batch: 5/2020; Published 7/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC BY 4.0 license.
Figure 1: Ratio of the held-out perplexity on a document
completion task and the topic coherence as a function
of the vocabulary size for the ETM and LDA on the
20NewsGroup corpus. The perplexity is normalized
by the size of the vocabulary. While the performance
of LDA deteriorates for large vocabularies, the ETM
maintains good performance.
of word embeddings. As a topic model, it discovers
an interpretable latent semantic structure of the
文件; as a word embedding model, it pro-
vides a low-dimensional representation of the
meaning of words. The ETM robustly accommo-
dates large vocabularies and the long tail of lan-
guage data.
Figure 1 illustrates the advantages. This figure
shows the ratio between the perplexity on held-out
文件 (a measure of predictive performance)
and the topic coherence (a measure of the quality
of the topics), as a function of the size of the
词汇. (The perplexity has been normalized
by the vocabulary size.) This is for a corpus of
11.2K articles from the 20NewsGroup and for
100 主题. The red line is LDA; its performance
deteriorates as the vocabulary size increases—the
predictive performance and the quality of the
topics get worse. The blue line is the ETM; it main-
tains good performance, even as the vocabulary
size becomes large.
Like LDA, the ETM is a generative probabilistic
模型: Each document is a mixture of topics and
each observed word is assigned to a particular
话题. In contrast to LDA, the per-topic conditional
probability of a term has a log-linear form that
involves a low-dimensional representation of the
词汇. Each term is represented by an embed-
ding and each topic is a point in that embedding
空间. The topic’s distribution over terms is pro-
portional to the exponentiated inner product of the
Figure 2: A topic about Christianity found by the ETM
on The New York Times. The topic is a point in the
word embedding space.
Figure 3: Topics about sports found by the ETM on The
New York Times. Each topic is a point in the word
embedding space.
topic’s embedding and each term’s embedding.
Figures 2 and 3 show topics from a 300-topic ETM
of The New York Times. The figures show each
topic's embedding and its closest words; these
topics are about Christianity and sports.
Representing topics as points in the embedding
space allows the ETM to be robust to the presence
of stop words, unlike most topic models. When
stop words are included in the vocabulary, the
ETM assigns topics to the corresponding area of
the embedding space (we demonstrate this in
Section 6).
As for most topic models, the posterior of the
topic proportions is intractable to compute. We
derive an efficient algorithm for approximating
the posterior with variational inference (Jordan
等人。, 1999; Hoffman et al., 2013; Blei et al.,
2017) and additionally use amortized inference
to efficiently approximate the topic proportions
(Kingma and Welling, 2014; Rezende et al., 2014).
The resulting algorithm fits the ETM to large
corpora with large vocabularies. This algorithm
can either use previously fitted word embeddings,
or fit them jointly with the rest of the parameters.
(In particular, Figures 1 to 3 were made using the
version of the ETM that uses pre-fitted skip-gram
word embeddings.)
We compared the performance of the ETM to LDA,
the neural variational document model (NVDM)
(Miao et al., 2016), and PRODLDA (Srivastava and
Sutton, 2017).1 The NVDM is a form of multinomial
matrix factorization and PRODLDA is a modern
version of LDA that uses a product of experts
to model the distribution over words. We also
compare to a document model that combines
PRODLDA with pre-fitted word embeddings. The ETM
yields better predictive performance, as measured
by held-out log-likelihood on a document comple-
tion task (Wallach et al., 2009b). It also discovers
more meaningful topics, as measured by topic
coherence (Mimno et al., 2011) and topic diver-
sity. The latter is a metric we introduce in this
paper that, together with topic coherence, gives a
better indication of the quality of the topics. The
ETM is especially robust to large vocabularies.
2 Related Work
This work develops a new topic model that extends
LDA. LDA has been extended in many ways, and
topic modeling has become a subfield of its own.
For a review, see Blei (2012) and Boyd-Graber
et al. (2017).
A broader set of related works are neural topic
models. These mainly focus on improving topic
modeling inference through deep neural networks
(Srivastava and Sutton, 2017; Card et al., 2017;
Cong et al., 2017; Zhang et al., 2018). Specifically,
these methods reduce the dimension of the text
data through amortized inference and the variatio-
nal auto-encoder (Kingma and Welling, 2014;
Rezende et al., 2014). To perform inference in the
ETM, we also avail ourselves of amortized inference
方法 (Gershman and Goodman, 2014).
As a document model, the ETM also relates to
works that learn per-document representations as
part of an embedding model (Le and Mikolov,
2014; Moody, 2016; Miao et al., 2016; Li et al.,
2016). In contrast to these works, the document
variables in the ETM are part of a larger
probabilistic topic model.
1Code is available at https://github.com/
adjidieng/ETM.
One of the goals in developing the ETM is to
incorporate word similarity into the topic model,
and there is previous research that shares this goal.
These methods either modify the topic priors
(Petterson et al., 2010; Zhao et al., 2017b; Shi
et al., 2017; Zhao et al., 2017a) or the topic
assignment priors (Xie et al., 2015). For example,
Petterson et al. (2010) use a word similarity graph
(as given by a thesaurus) to bias LDA towards
assigning similar words to similar topics. As
another example, Xie et al. (2015) model the per-
word topic assignments of LDA using a Markov
random field to account for both the topic pro-
portions and the topic assignments of similar
字. These methods use word similarity as a
type of ‘‘side information’’ about language; in
contrast, the ETM directly models the similarity (via
embeddings) in its generative process of words.
然而, a more closely related set of works
directly combine topic modeling and word
embeddings. One common strategy is to convert
the discrete text into continuous observations of
embeddings, and then adapt LDA to generate
real-valued data (Das et al., 2015; Xun et al.,
2016; Batmanghelich et al., 2016; Xun et al.,
2017). With this strategy, topics are Gaussian
distributions with latent means and covariances,
and the likelihood over the embeddings is modeled
with a Gaussian (Das et al., 2015) or a Von-Mises
Fisher distribution (Batmanghelich et al., 2016).
The ETM differs from these approaches in that
it is a model of categorical data, one that goes
through the embeddings matrix. Thus it does not
require pre-fitted embeddings and, indeed, can
learn embeddings as part of its inference process.
The ETM also differs from these approaches in
that it is amenable to large datasets with large
vocabularies.
There are few other ways of combining LDA
and embeddings. Nguyen et al. (2015) mix the
likelihood defined by LDA with a log-linear model
that uses pre-fitted word embeddings; Bunk and
Krestel (2018) randomly replace words drawn
from a topic with their embeddings drawn from
a Gaussian; Xu et al. (2018) adopt a geometric
perspective, using Wasserstein distances to learn
topics and word embeddings jointly; and Keya
et al. (2019) propose the neural embedding alloca-
tion (NEA), which has a similar generative process
to the ETM but is fit using a pre-fitted LDA model as
a target distribution. Because it requires LDA, the
NEA suffers from the same limitation as LDA. These
models often lack scalability with respect to the
vocabulary size and are fit using Gibbs sampling,
limiting their scalability to large corpora.
3 Background
The ETM builds on two main ideas, LDA and word
embeddings. Consider a corpus of D documents,
where the vocabulary contains V distinct terms.
Let wdn ∈ {1, . . . , V } denote the nth word in the
dth document.
Latent Dirichlet Allocation.
LDA is a probabi-
listic generative model of documents (Blei et al.,
2003). It posits K topics β1:K, each of which is a
distribution over the vocabulary. LDA assumes each
document comes from a mixture of topics, where
the topics are shared across the corpus and the
mixture proportions are unique for each document.
The generative process for each document is the
following:
1. Draw topic proportion θd ∼ Dirichlet(αθ).
2. For each word n in the document:
(a) Draw topic assignment zdn ∼ Cat(θd).
(b) Draw word wdn ∼ Cat(βzdn).
Here, Cat(·) denotes the categorical distribu-
的. LDA places a Dirichlet prior on the topics,
βk ∼ Dirichlet(αβ) for k = 1, . . . , K.
The concentration parameters αβ and αθ of the
Dirichlet distributions are fixed model hyperpa-
rameters.
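For concreteness, the sketch below samples one document from this generative process in NumPy; the sizes and Dirichlet concentrations are toy values chosen for illustration, not settings from the paper.

```python
# A small NumPy sketch of LDA's generative process for one document,
# with assumed sizes (K topics, V terms) and assumed hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_d = 5, 1000, 50                      # topics, vocabulary size, document length (assumed)
alpha_beta, alpha_theta = 0.1, 0.5           # Dirichlet concentrations (assumed)

beta = rng.dirichlet(alpha_beta * np.ones(V), size=K)    # topics beta_1:K, each a dist. over terms
theta_d = rng.dirichlet(alpha_theta * np.ones(K))        # topic proportions for document d
z_d = rng.choice(K, size=N_d, p=theta_d)                 # topic assignment for each word
w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # draw each word from its assigned topic
```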
Word Embeddings. Word embeddings provide
models of language that use vector representations
of words (Rumelhart and Abrahamson, 1973;
Bengio et al., 2003). The word representations
are fitted to relate to meaning, in that words with
similar meanings will have representations that
are close. (In embeddings, the ‘‘meaning’’ of a
word comes from the contexts in which it is used
[哈里斯, 1954].)
We focus on the continuous bag-of-words
(CBOW) variant of word embeddings (Mikolov
et al., 2013b). In CBOW, the likelihood of each
word wdn is
wdn ∼ softmax(ρ⊤αdn).    (1)
The embedding matrix ρ is a L × V matrix whose
columns contain the embedding representations
of the vocabulary, ρv ∈ RL. The vector αdn is
the context embedding. The context embedding is
the sum of the context embedding vectors (αv for
each word v) of the words surrounding wdn.
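The NumPy fragment below illustrates Eq. 1 with assumed toy dimensions and an assumed window of two words on each side; the embedding matrices are random stand-ins rather than fitted values.

```python
# A NumPy sketch of the CBOW likelihood (Eq. 1): the context vector is the
# sum of the context embeddings of the surrounding words (assumed window of 2).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, L = 1000, 300                                 # vocabulary size, embedding dimension (assumed)
rho = rng.normal(scale=0.1, size=(L, V))         # word embedding matrix (columns rho_v)
alpha_ctx = rng.normal(scale=0.1, size=(V, L))   # context embedding vectors alpha_v

doc = [3, 17, 42, 7, 99]                         # toy word indices
n = 2                                            # predict the word at position n
context = [doc[i] for i in (n - 2, n - 1, n + 1, n + 2)]
alpha_dn = alpha_ctx[context].sum(axis=0)        # context vector for w_dn
p_wdn = softmax(rho.T @ alpha_dn)                # categorical distribution over the vocabulary
```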
4 The Embedded Topic Model
The ETM is a topic model that uses embedding
representations of both words and topics.
It
contains two notions of latent dimension. First,
it embeds the vocabulary in an L-dimensional
space. These embeddings are similar in spirit to
classical word embeddings. Second, it represents
each document in terms of K latent topics.
In traditional topic modeling, each topic is a
full distribution over the vocabulary. In the ETM,
然而, the kth topic is a vector αk ∈ RL in the
embedding space. We call αk a topic embedding—
it is a distributed representation of the kth topic in
the semantic space of words.
In its generative process, the ETM uses the topic
embedding to form a per-topic distribution over
the vocabulary. Specifically, the ETM uses a log-
linear model that takes the inner product of the
word embedding matrix and the topic embedding.
With this form, the ETM assigns high probability
to a word v in topic k by measuring the agreement
between the word’s embedding and the topic’s
embedding.
Denote the L × V word embedding matrix by
ρ; the column ρv is the embedding of term v.
Under the ETM, the generative process of the dth
document is the following:
1. Draw topic proportions θd ∼ LN (0, 我).
2. For each word n in the document:
a. Draw topic assignment zdn ∼ Cat(θd).
b. Draw the word wdn ∼ softmax(ρ⊤αzdn).
In Step 1, LN (·) denotes the logistic-normal
distribution (Aitchison and Shen, 1980; Blei and
Lafferty, 2007); it transforms a standard Gaussian
random variable to the simplex. A draw θd from
this distribution is obtained as
δd ∼ N (0, I);    θd = softmax(δd).    (2)
(We replaced the Dirichlet with the logistic normal
to easily use reparameterization in the inference
algorithm; see Section 5.)
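A minimal NumPy sketch of this generative process, with assumed toy sizes and randomly initialized embeddings standing in for fitted ones, is below.

```python
# A NumPy sketch of the ETM's generative process for one document (Steps 1-2);
# the per-topic word distributions come from word and topic embeddings.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V, L, K, N_d = 1000, 300, 10, 50            # vocab, embedding dim, topics, doc length (assumed)
rho = rng.normal(scale=0.1, size=(L, V))    # word embeddings
alpha = rng.normal(scale=0.1, size=(K, L))  # topic embeddings alpha_1:K

delta_d = rng.normal(size=K)                # Step 1: logistic-normal draw (Eq. 2)
theta_d = softmax(delta_d)                  #         topic proportions on the simplex
beta = softmax(alpha @ rho)                 # per-topic distributions over words (K x V)

z_d = rng.choice(K, size=N_d, p=theta_d)                 # Step 2a: topic assignments
w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # Step 2b: words from softmax(rho^T alpha_z)
```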
Steps 1 and 2a are standard for topic modeling:
They represent documents as distributions over
topics and draw a topic assignment for each
observed word. Step 2b is different; it uses the
embeddings of the vocabulary ρ and the assigned
topic embedding αzdn to draw the observed word
from the assigned topic, as given by zdn.
The topic distribution in Step 2b mirrors the
CBOW likelihood in Eq. 1. Recall CBOW uses the
surrounding words to form the context vector αdn.
相比之下, the ETM uses the topic embedding αzdn
as the context vector, where the assigned topic zdn
is drawn from the per-document variable θd. 这
ETM draws its words from a document context,
rather than from a window of surrounding words.
The ETM likelihood uses a matrix of word
embeddings ρ, a representation of the vocabulary
in a lower dimensional space. In practice, it can
either rely on previously fitted embeddings or
learn them as part of its overall fitting procedure.
When the ETM learns the embeddings as part of the
fitting procedure, it simultaneously finds topics
and an embedding space.
When the ETM uses previously fitted embed-
丁斯, it learns the topics of a corpus in a particular
embedding space. This strategy is particularly
useful when there are words in the embedding
that are not used in the corpus. The ETM can
hypothesize how those words fit into the topics
because it can calculate ρv⊤αk even for words v
that do not appear in the corpus.
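The snippet below illustrates this point with a toy vocabulary and random vectors standing in for pre-fitted embeddings: the softmax of the inner products assigns a probability to every term, including one ("umpire" here, chosen purely for illustration) that never occurs in the training corpus.

```python
# A NumPy sketch of how a fitted topic embedding scores any word in the
# embedding space via rho_v^T alpha_k, including out-of-corpus words;
# the vocabulary and all vectors below are assumed toy inputs.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["game", "season", "team", "wine", "sauce", "umpire"]  # "umpire": unseen in the corpus
L = 300
rho = rng.normal(scale=0.1, size=(L, len(vocab)))  # pre-fitted word embeddings (columns rho_v)
alpha_k = rng.normal(scale=0.1, size=L)            # one fitted topic embedding

beta_k = softmax(rho.T @ alpha_k)                  # the topic's distribution over the vocabulary
for v, p in sorted(zip(vocab, beta_k), key=lambda t: -t[1]):
    print(f"{v:10s} {p:.3f}")                      # every word gets a probability, seen or not
```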
5 Inference and Estimation
We are given a corpus of documents {w1, . . . ,
wD}, where the dth document wd is a collection
of Nd words. How do we fit the ETM to this
corpus?
The Marginal Likelihood. The parameters of
the ETM are the word embeddings ρ1:V and the
topic embeddings α1:K ; each αk is a point in
the word embedding space. We maximize the log
marginal likelihood of the documents,
L(α, ρ) = Σ_{d=1}^{D} log p(wd | α, ρ).    (3)

The problem is that the marginal likelihood of
each document—p(wd | α, ρ)—is intractable to
compute. It involves a difficult integral over the
topic proportions, which we write in terms of the
untransformed proportions δd in Eq. 2,

p(wd | α, ρ) = ∫ p(δd) Π_{n=1}^{Nd} p(wdn | δd, α, ρ) dδd.    (4)

The conditional distribution p(wdn | δd, α, ρ) of
each word marginalizes out the topic assignment
zdn,

p(wdn | δd, α, ρ) = Σ_{k=1}^{K} θdk βk,wdn.    (5)

Here, θdk denotes the (transformed) topic propor-
tions (Eq. 2) and βk,v denotes a traditional
‘‘topic,’’ that is, a distribution over words, induced
by the word embeddings ρ and the topic embedding
αk,

βkv = softmax(ρ⊤αk)|v.    (6)

Eqs. 4, 5, and 6 flesh out the likelihood in Eq. 3.

Variational Inference. We sidestep the intrac-
table integral in Eq. 4 with variational inference
(Jordan et al., 1999; Blei et al., 2017). Variational
inference optimizes a sum of per-document bounds
on the log of the marginal likelihood of Eq. 4.

To begin, posit a family of distributions of the
untransformed topic proportions q(δd ; wd, ν).
This family of distributions is parameterized by ν.
We use amortized inference, where q(δd ; wd, ν)
(called a variational distribution) depends on both
the document wd and shared parameters ν. In
particular, q(δd ; wd, ν) is a Gaussian whose mean
and variance come from an ‘‘inference network,’’
a neural network parameterized by ν (Kingma and
Welling, 2014). The inference network ingests
a bag-of-words representation of the document
wd and outputs the mean and covariance of δd.
(To accommodate documents of varying length,
we form the input of the inference network by
normalizing the bag-of-words representation of the
document by the number of words Nd.)

We use this family of distributions to bound
the log of the marginal likelihood in Eq. 4. The
bound is called the evidence lower bound (ELBO)
and is a function of the model parameters and the
variational parameters,
L(α, ρ, ν) = Σ_{d=1}^{D} Σ_{n=1}^{Nd} Eq[ log p(wdn | δd, ρ, α) ]
             − Σ_{d=1}^{D} KL(q(δd ; wd, ν) || p(δd)).    (7)

The first term of the ELBO (Eq. 7) encourages
variational distributions q(δd ; wd, ν) that place
mass on topic proportions δd that explain the
observed words, and the second term encourages
q(δd ; wd, ν) to be close to the prior p(δd).
Maximizing the ELBO with respect to the model
parameters (α, ρ) is equivalent to maximizing
the expected complete log-likelihood,
Σ_d log p(δd, wd | α, ρ).
The ELBO in Eq. 7 is intractable because the
expectation is intractable. However, we can form
a Monte Carlo approximation of the ELBO,

˜L(α, ρ, ν) = Σ_{d=1}^{D} Σ_{n=1}^{Nd} (1/S) Σ_{s=1}^{S} log p(wdn | δd^(s), ρ, α)
             − Σ_{d=1}^{D} KL(q(δd ; wd, ν) || p(δd)),    (8)

where δd^(s) ∼ q(δd ; wd, ν) for s = 1, . . . , S. To
form an unbiased estimator of the ELBO and its
gradients, we use the reparameterization trick
when sampling the unnormalized proportions
δd^(1), . . . , δd^(S) (Kingma and Welling, 2014; Titsias
and L´azaro-Gredilla, 2014; Rezende et al., 2014).
That is, we sample δd^(s) from q(δd ; wd, ν) as

δd^(s) = µd + Σd^(1/2) ǫd^(s),   where ǫd^(s) ∼ N (0, I),    (9)

where µd and Σd are the mean and covariance of
q(δd ; wd, ν), respectively, which depend implicitly
on ν and wd via the inference network. We use a
diagonal covariance matrix Σd.

We also use data subsampling to handle large
collections of documents (Hoffman et al., 2013).
Denote by B a minibatch of documents. Then
the approximation of the ELBO using data sub-
sampling is

˜L(α, ρ, ν) = (D/|B|) Σ_{d∈B} Σ_{n=1}^{Nd} (1/S) Σ_{s=1}^{S} log p(wdn | δd^(s), ρ, α)
             − (D/|B|) Σ_{d∈B} KL(q(δd ; wd, ν) || p(δd)).    (10)
Given that the prior p(δd) and q(δd; wd, ν) 是
both Gaussians, the KL admits a closed-form
expression,
KL(q(δd ; wd, ν) || p(δd)) = (1/2) { tr(Σd) + µd⊤µd − log det(Σd) − K }.    (11)
We optimize the stochastic ELBO in Eq. 10
with respect to both the model parameters (α, ρ)
and the variational parameters ν. We set the
learning rate with Adam (Kingma and Ba, 2015).
The procedure is shown in Algorithm 1, where we
set the number of Monte Carlo samples S = 1
and the notation NN(x ; ν) represents a neural
network with input x and parameters ν.
Algorithm 1: Topic modeling with the ETM
  Initialize model and variational parameters
  for iteration i = 1, 2, . . . do
    Compute βk = softmax(ρ⊤αk) for each topic k
    Choose a minibatch B of documents
    for each document d in B do
      Get normalized bag-of-words representation xd
      Compute µd = NN(xd ; νµ)
      Compute Σd = NN(xd ; νΣ)
      Sample δd using Eq. 9 and set θd = softmax(δd)
      for each word in the document do
        Compute p(wdn | θd, ρ, α) = θd⊤ β·,wdn
      end for
    end for
    Estimate the ELBO using Eq. 10 and Eq. 11
    Take gradients of the ELBO via backpropagation
    Update model parameters α1:K (and ρ if necessary)
    Update variational parameters (νµ, νΣ)
  end for
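To make the updates concrete, here is a minimal PyTorch sketch of one training step following Eqs. 8–11 and Algorithm 1; the sizes, the one-hidden-layer inference network, and the helper name elbo_step are assumptions made for illustration, not the authors' released implementation.

```python
# A minimal PyTorch sketch of one ETM training step (Eqs. 8-11, Algorithm 1),
# with assumed sizes and S = 1; this jointly updates rho, alpha, and nu.
import torch
import torch.nn.functional as F

V, L, K, H = 5000, 300, 50, 800   # vocab size, embedding dim, topics, hidden units (assumed)

rho = torch.nn.Parameter(torch.randn(L, V) * 0.01)     # word embeddings
alpha = torch.nn.Parameter(torch.randn(K, L) * 0.01)   # topic embeddings
enc = torch.nn.Sequential(torch.nn.Linear(V, H), torch.nn.ReLU())  # shared inference network
to_mu, to_logvar = torch.nn.Linear(H, K), torch.nn.Linear(H, K)    # variational parameters nu

params = [rho, alpha, *enc.parameters(), *to_mu.parameters(), *to_logvar.parameters()]
opt = torch.optim.Adam(params, lr=0.002)

def elbo_step(bows, num_docs_total):
    """bows: [batch, V] float tensor of word counts for a minibatch B."""
    x = bows / bows.sum(dim=1, keepdim=True)            # normalized bag-of-words input
    h = enc(x)
    mu, logvar = to_mu(h), to_logvar(h)
    eps = torch.randn_like(mu)                          # reparameterization (Eq. 9)
    delta = mu + (0.5 * logvar).exp() * eps
    theta = F.softmax(delta, dim=1)                     # topic proportions
    beta = F.softmax(alpha @ rho, dim=1)                # K x V topic matrix (Eq. 6)
    log_word_probs = torch.log(theta @ beta + 1e-10)    # [batch, V], mixture over topics (Eq. 5)
    log_lik = (bows * log_word_probs).sum(dim=1)        # per-document reconstruction term
    kl = 0.5 * (logvar.exp().sum(1) + (mu ** 2).sum(1) - logvar.sum(1) - K)  # Eq. 11
    scale = num_docs_total / bows.shape[0]              # data subsampling factor D/|B| (Eq. 10)
    loss = -scale * (log_lik - kl).sum()                # negative stochastic ELBO
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example: one update on a random minibatch of 32 documents.
fake_bows = torch.randint(0, 3, (32, V)).float()
print(elbo_step(fake_bows, num_docs_total=10000))
```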
6 Empirical Study
We study the performance of the ETM and compare
it to other unsupervised document models. A good
document model should provide both coherent
patterns of language and an accurate distribution
of words, so we measure performance in terms
of both predictive accuracy and topic interpret-
能力. We measure accuracy with log-likelihood
on a document completion task (Rosen-Zvi et al.,
2004; Wallach et al., 2009乙); we measure topic
interpretability as a blend of topic coherence and
diversity. We find that, of the interpretable models,
the ETM is the one that provides better predictions
and topics.
In a separate analysis (Section 6.1), we study
the robustness of each method in the presence
Dataset          Minimum DF   #Tokens Train   #Tokens Valid   #Tokens Test   Vocabulary
20Newsgroups     100          604.9 K         5,998           399.6 K        3,102
20Newsgroups     30           778.0 K         7,231           512.5 K        8,496
20Newsgroups     10           880.3 K         6,769           578.8 K        18,625
20Newsgroups     5            922.3 K         8,494           605.9 K        29,461
20Newsgroups     2            966.3 K         8,600           622.9 K        52,258
New York Times   5,000        226.9 M         13.4 M          26.8 M         9,842
New York Times   200          270.1 M         15.9 M          31.8 M         55,627
New York Times   100          272.3 M         16.0 M          32.1 M         74,095
New York Times   30           274.8 M         16.1 M          32.3 M         124,725
New York Times   10           276.0 M         16.1 M          32.5 M         212,237
Table 1: Statistics of the different corpora studied. DF denotes
a thousand, and M denotes a million.
of stop words. Standard topic models fail in
this regime—because stop words appear in many
documents, every learned topic includes some
stop words, leading to poor topic interpretability.
In contrast, the ETM is able to use the information
from the word embeddings to provide interpretable
topics.
Corpora. We study the 20Newsgroups corpus
and the New York Times corpus; the statistics of
both corpora are summarized in Table 1.
The 20Newsgroup corpus is a collection of
newsgroup posts. We preprocess the corpus
by filtering stop words, words with document
frequency above 70%, and tokenizing. To form
the vocabulary, we keep all words that appear in
more than a certain number of documents, and we
vary the threshold from 100 (a smaller vocabulary,
where V = 3,102) to 2 (a larger vocabulary, where
V = 52,258). After preprocessing, we further
remove one-word documents from the validation
and test sets. We split the corpus into a training
set of 11,260 documents, a test set of 7,532
documents, and a validation set of 100 documents.
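A sketch of this kind of document-frequency filtering is below; build_vocab and its defaults are illustrative stand-ins, not the preprocessing scripts used for the paper.

```python
# A sketch of the preprocessing described above: drop words appearing in more
# than 70% of documents and keep words whose document frequency meets a
# threshold; names and defaults here are assumed for illustration.
from collections import Counter

def build_vocab(docs, min_df=10, max_df_frac=0.7, stop_words=frozenset()):
    """docs: list of token lists. Returns the retained vocabulary as a set."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                  # count each word once per document
    return {
        w for w, c in df.items()
        if w not in stop_words and c >= min_df and c / n_docs <= max_df_frac
    }

# Toy example: "word" is dropped because it appears in every document (> 70%).
docs = [["topic", "model", "word"], ["word", "embedding"], ["word", "topic"]]
print(sorted(build_vocab(docs, min_df=2)))      # ['topic']
```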
The New York Times corpus is a larger collec-
tion of news articles. It contains more than 1.8
million articles, spanning the years 1987–2007.
We follow the same preprocessing steps as for
20Newsgroups. We form versions of this corpus
with vocabularies ranging from V = 9,842 to
V = 212,237. After preprocessing, we use 85%
of the documents for training, 10% for testing, and
5% for validation.
Models. We compare the performance of the
ETM against several document models. We briefly
describe each below.
We consider latent Dirichlet allocation (LDA)
(Blei et al., 2003), a standard topic model that
posits Dirichlet priors for the topics βk and topic
proportions θd. (We set the prior hyperparameters
to 1.) It is a conditionally conjugate model,
amenable to variational inference with coordinate
ascent. We consider LDA because it is the most
commonly used topic model, and it has a similar
generative process as the ETM.
We also consider the neural variational docu-
ment model (NVDM) (Miao et al., 2016). The NVDM
is a multinomial factor model of documents; it
posits the likelihood wdn ∼ softmax (β⊤θd),
where the K-dimensional vector θd ∼ N (0, IK)
is a per-document variable, and β is a real-
valued matrix of size K × V . The NVDM uses
a per-document real-valued latent vector θd
to average over the embedding matrix β in the logit
空间. Like the ETM, the NVDM uses amortized
variational inference to jointly learn the approxi-
mate posterior over the document representation
θd and the model parameter β.
NVDM is not interpretable as a topic model;
its latent variables are unconstrained. We study
a more interpretable variant of the NVDM which
constrains θd to lie in the simplex, replacing its
Gaussian prior with a logistic normal (Aitchison
and Shen, 1980). (This can be thought of as a
semi-nonnegative matrix factorization.) We call
this document model ∆-NVDM.
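The toy NumPy fragment below contrasts the two parameterizations just described; the sizes and random parameters are assumed, and the fragment only illustrates how the per-document variable enters the likelihood.

```python
# A NumPy sketch contrasting the NVDM and Delta-NVDM word likelihoods
# described above; all sizes and parameters are assumed toy values.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, V = 10, 1000
beta = rng.normal(scale=0.1, size=(K, V))   # unconstrained real-valued matrix

theta_nvdm = rng.normal(size=K)             # NVDM: Gaussian per-document vector
p_nvdm = softmax(beta.T @ theta_nvdm)       # w_dn ~ softmax(beta^T theta_d)

theta_delta = softmax(rng.normal(size=K))   # Delta-NVDM: theta_d constrained to the simplex
p_delta = softmax(beta.T @ theta_delta)
```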
We also consider PRODLDA (Srivastava and
Sutton, 2017). It posits the likelihood wdn ∼
softmax(β⊤θd) where the topic proportions θd are
from the simplex. Contrary to LDA, the topic-matrix
β is unconstrained.
PRODLDA shares the generative model with
∆-NVDM but it is fit differently. PRODLDA uses
Skip-gram embeddings
love:     loved, passion, loves, affection, adore
family:   families, grandparents, mother, friends, relatives
woman:    man, girl, boy, teenager, person
politics: political, religion, politicking, ideology, partisanship

ETM embeddings
love:     joy, loves, loved, passion, wonderful
family:   children, son, mother, father, wife
woman:    girl, boy, mother, daughter, pregnant
politics: political, politicians, ideology, speeches, ideological
NVDM embeddings
love:     loves, passion, wonderful, joy, beautiful
family:   sons, life, brother, son, lived
woman:    girl, women, men, pregnant, boyfriend
politics: political, politician, politicians, politically, democratic

∆-NVDM embeddings
love:     miss, young, born, dreams, younger
family:   home, father, son, day, mrs
woman:    life, marriage, women, read, young
politics: political, faith, marriage, politicians, elections

PRODLDA embeddings
love:     loves, affection, sentimental, dreams, laugh
family:   husband, wife, daughters, sister, friends
woman:    girl, boyfriend, boy, teenager, ager
politics: political, politicians, liberal, politician, ideological
Table 2: Word embeddings learned by all document models (and skip-gram) on the New York Times
with vocabulary size 118,363.
amortized variational inference with batch
normalization (Ioffe and Szegedy, 2015) and
dropout (Srivastava et al., 2014).
Finally, we consider a document model that
combines PRODLDA with pre-fitted word embed-
dings ρ, by using the likelihood wdn ∼ softmax
(ρ⊤θd). We call this document model PRODLDA-
PWE, where PWE stands for Pre-fitted Word
Embeddings.
We study two variants of the ETM, one where
the word embeddings are pre-fitted and one where
they are learned jointly with the rest of the
参数. The variant with pre-fitted embed-
dings is called the ETM-PWE.
For PRODLDA-PWE and the ETM-PWE, we first
obtain the word embeddings (Mikolov et al.,
2013b) by training skip-gram on each corpus. (We
reuse the same embeddings across the experiments
with varying vocabulary sizes.)
Algorithm Settings. Given a corpus, each
model comes with an approximate posterior
inference problem. We use variational inference
for all of the models and employ SVI (Hoffman
et al., 2013) to speed up the optimization. The
minibatch size is 1,000 documents. For LDA, we
set the learning rate as suggested by Hoffman et al.
(2013): the delay is 10 and the forgetting factor is
0.85.
Within SVI, LDA enjoys coordinate ascent
variational updates; we use five inner steps to
optimize the local variables. For the other models,
we use amortized inference over the local variables
θd. We use 3-layer inference networks and we
set the local learning rate to 0.002. We use ℓ2
regularization on the variational parameters (the
weight decay parameter is 1.2 × 10−6).
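For reference, the SVI step-size schedule implied by the delay and forgetting factor above is the standard one from Hoffman et al. (2013); the small sketch below is illustrative, not code from the experiments.

```python
# A sketch of the SVI step-size schedule with the delay and forgetting
# factor quoted above; the function name is an assumption.
def svi_step_size(t, delay=10.0, forgetting=0.85):
    """Step size rho_t = (t + delay) ** (-forgetting) at iteration t."""
    return (t + delay) ** (-forgetting)

print([round(svi_step_size(t), 4) for t in (1, 10, 100)])  # decreasing step sizes
```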
Qualitative Results. We first examine the
embeddings. The ETM, NVDM, ∆-NVDM, and
PRODLDA all learn word embeddings. We illustrate
them by fixing a set of terms and showing
the closest words in the embedding space (as
measured by cosine distance). For comparison,
we also illustrate word embeddings learned by the
skip-gram model.
桌子 2 illustrates the embeddings of the differ-
ent models. All the methods provide interpretable
embeddings—words with related meanings are
close to each other. The ETM, the NVDM, and
PRODLDA learn embeddings that are similar to those
from the skip-gram. The embeddings of ∆-NVDM
are different; the simplex constraint on the local
variable and the inference procedure change the
nature of the embeddings.
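The nearest-neighbor queries behind Table 2 amount to cosine similarity in the embedding space; a small NumPy sketch with an assumed toy vocabulary and random embeddings is below.

```python
# A NumPy sketch of the nearest-word queries behind Table 2: fix a term and
# list the closest words in the embedding space by cosine similarity.
import numpy as np

def nearest_words(query, vocab, rho, topn=5):
    """rho: L x V matrix whose columns are word embeddings."""
    unit = rho / np.linalg.norm(rho, axis=0, keepdims=True)
    sims = unit.T @ unit[:, vocab.index(query)]      # cosine similarity to the query word
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != query][:topn]

rng = np.random.default_rng(0)
vocab = ["love", "loved", "passion", "family", "mother", "politics"]  # toy vocabulary
rho = rng.normal(size=(300, len(vocab)))                              # toy embeddings
print(nearest_words("love", vocab, rho))
```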
Table 3: Top five words of the seven most used topics from the different document models (LDA, NVDM,
∆-NVDM, PRODLDA, PRODLDA-PWE, ETM-PWE, and the ETM) on 1.8M documents of the New York Times
corpus with vocabulary size 212,237 and K = 300 topics.
We next look at the learned topics. Table 3 dis-
plays the seven most used topics for all methods,
as given by the average of the topic proportions
θd. LDA and both variants of the ETM provide
interpretable topics. The rest of the models do
not provide interpretable topics; their matrices β
are unconstrained and thus are not interpretable
as distributions over the vocabulary that mix to
Figure 4: Interpretability as measured by the exponentiated topic quality (the higher the better) vs. predictive
performance as measured by log-likelihood on document completion (the higher the better) on the 20NewsGroup
dataset. Both interpretability and predictive power metrics are normalized by subtracting the mean and dividing
by the standard deviation across models. Better models are on the top right corner. Overall, the ETM is a better
topic model.
form documents. ∆-NVDM also suffers from this
effect although it is less apparent (see, e.g., the
fifth listed topic for ∆-NVDM).
Quantitative Results. We next study the
models quantitatively. We measure the quality
of the topics and the predictive performance of
the model. We found that among the models with
interpretable topics, the ETM provides the best
预测.
We measure topic quality by blending two
metrics:
topic coherence and topic diversity.
Topic coherence is a quantitative measure of the
interpretability of a topic (Mimno et al., 2011). It is
the average pointwise mutual information of two
words drawn randomly from the same document,
TC = (1/K) Σ_{k=1}^{K} (1/45) Σ_{i=1}^{10} Σ_{j=i+1}^{10} f(wi(k), wj(k)),
where {w1(k), . . . , w10(k)} denotes the top-10 most
likely words in topic k. We choose f(·, ·) as
the normalized pointwise mutual information
(Bouma, 2009; Lau et al., 2014),
f(wi, wj) = log[ P(wi, wj) / (P(wi) P(wj)) ] / ( − log P(wi, wj) ).
Here, P(wi, wj) is the probability of words wi and
wj co-occurring in a document and P(wi) is the
marginal probability of word wi. We approximate
these probabilities with empirical counts.
The idea behind topic coherence is that a
coherent topic will display words that tend to
occur in the same documents. In other words,
the most likely words in a coherent topic should
have high mutual information. Document models
with higher topic coherence are more interpretable
topic models.
We combine coherence with a second metric,
topic diversity. We define topic diversity to be the
percentage of unique words in the top 25 words of
all topics. Diversity close to 0 indicates redundant
topics; diversity close to 1 indicates more varied
topics.
We define the overall quality of a model’s
topics as the product of its topic diversity and
topic coherence.
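The sketch below computes these three quantities on a toy corpus; the helper names and the tiny inputs are illustrative, and a real evaluation uses the top-10 and top-25 word lists described above.

```python
# A NumPy sketch of the topic quality metrics described above: NPMI-based
# topic coherence, topic diversity over top words, and their product.
from itertools import combinations
import numpy as np

def npmi(wi, wj, docs, eps=1e-12):
    n = len(docs)
    p_i = sum(wi in d for d in docs) / n
    p_j = sum(wj in d for d in docs) / n
    p_ij = sum(wi in d and wj in d for d in docs) / n
    if p_ij == 0.0:
        return -1.0                        # convention for pairs that never co-occur
    return np.log(p_ij / (p_i * p_j)) / (-np.log(p_ij) + eps)

def topic_coherence(topics_top10, docs):
    scores = [np.mean([npmi(wi, wj, docs) for wi, wj in combinations(t, 2)])
              for t in topics_top10]       # average NPMI over all pairs (45 for a real top-10 list)
    return float(np.mean(scores))

def topic_diversity(topics_top25):
    words = [w for t in topics_top25 for w in t]
    return len(set(words)) / len(words)    # fraction of unique words across all topics

docs = [{"game", "team", "season"}, {"wine", "food", "game"}, {"team", "season"}]
topics10 = [["game", "team", "season"]]    # toy "top-10" lists (shortened)
topics25 = [["game", "team", "season"]]
quality = topic_coherence(topics10, docs) * topic_diversity(topics25)
print(round(quality, 3))
```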
A good topic model also provides a good
distribution of language. To measure predictive
power, we calculate log likelihood on a document
completion task (Rosen-Zvi et al., 2004; Wallach
et al., 2009b). We divide each test document into
two sets of words. The first half is observed: it
induces a distribution over topics which, in turn,
induces a distribution over the next words in the
document. We then evaluate the second half under
this distribution. A good document model should
provide high log-likelihood on the second half.
(For all methods, we approximate the likelihood
by setting θd to the variational mean.)
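A toy NumPy sketch of this document completion score is below; here θd is a fixed assumed vector standing in for the variational mean inferred from the first half.

```python
# A NumPy sketch of the document completion evaluation: score the second half
# of a held-out document under the mixture implied by inferred topic proportions.
import numpy as np

def completion_log_likelihood(second_half_ids, theta_d, beta):
    """beta: K x V topic matrix; theta_d: inferred topic proportions."""
    word_probs = theta_d @ beta                          # mixture distribution over the vocabulary
    return float(np.sum(np.log(word_probs[second_half_ids] + 1e-12)))

K, V = 3, 10
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)                 # toy topics
theta_d = np.array([0.7, 0.2, 0.1])                      # stands in for the variational mean
second_half = [1, 4, 4, 7]                               # word ids in the held-out half
print(completion_log_likelihood(second_half, theta_d, beta))
```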
We study both corpora with different vo-
cabularies. Figures 4 and 5 show interpretability
of the topics as a function of predictive power. (To
ease visualization, we exponentiate topic quality
and normalize all metrics by subtracting the mean
and dividing by the standard deviation across
Figure 5: Interpretability as measured by the exponentiated topic quality (the higher the better) vs. predictive
performance as measured by log-likelihood on document completion (the higher the better) on the New York Times
dataset. Both interpretability and predictive power metrics are normalized by subtracting the mean and dividing
by the standard deviation across models. Better models are on the top right corner. Overall, the ETM is a better
topic model.
methods.) The best models are on the upper right
corner.
LDA predicts worst in almost all settings. On
the 20NewsGroups, the NVDM’s predictions are in
general better than LDA but worse than for the other
methods; on the New York Times, the NVDM gives
the best predictions. However, topic quality for the
NVDM is far below the other methods. (It does not
provide ‘‘topics’’, so we assess the interpretability
of its β matrix.) In prediction, both versions of the
ETM are at least as good as the simplex-constrained
∆-NVDM. More importantly, both versions of the
ETM outperform the PRODLDA-PWE, signaling that the
ETM provides a better way of integrating word
embeddings into a topic model.
These figures show that, of the interpretable
models, the ETM provides the best predictive
performance while keeping interpretable topics.
It is robust to large vocabularies.
6.1 Stop Words
We now study a version of the New York Times
corpus that includes all stop words. We remove
infrequent words to form a vocabulary of size
10,283. Our goal is to show that the ETM-PWE
provides interpretable topics even in the presence
of stop words, another regime where topic models
typically fail. In particular, given that stop words
appear in many documents, traditional topic
models learn topics that contain stop words,
regardless of the actual semantics of the topic.
This leads to poor topic interpretability. There are
extensions of topic models specifically designed
to cope with stop words (Griffiths et al., 2004;
Chemudugunta et al., 2006; Wallach et al., 2009a);
our goal here is not to establish comparisons with
these methods but to show the performance of the
ETM-PWE in the presence of stop words.

                TC      TD      Quality
LDA             0.13    0.14    0.0182
∆-NVDM          0.17    0.11    0.0187
PRODLDA-PWE     0.03    0.53    0.0159
ETM-PWE         0.18    0.22    0.0396

Table 4: Topic quality on the New York
Times data in the presence of stop words.
Topic quality here is given by the product
of topic coherence and topic diversity
(higher is better). The ETM-PWE is robust
to stop words; it achieves similar topic
coherence to when there are no stop
words.
We fit LDA, the ∆-NVDM, the PRODLDA-PWE,
and the ETM-PWE with K = 300 topics. (We do
not report the NVDM because it does not provide
interpretable topics.) Table 4 shows the topic
quality (the product of topic coherence and topic
diversity). Overall, the ETM-PWE gives the best
performance in terms of topic quality.
While the ETM has a few ‘‘stop topics’’ that
are specific for stop words (see, e.g., Figure 6),
∆-NVDM and LDA have stop words in almost every
topic. (The topics are not displayed here for space
constraints.) The reason is that stop words co-
occur in the same documents as every other word;
therefore traditional topic models have difficulties
telling apart content words and stop words. The
ETM-PWE recognizes the location of stop words
in the embedding space; it sets them off on their
own topic.

Figure 6: A topic containing stop words found by the
ETM-PWE on The New York Times. The ETM is robust
even in the presence of stop words.

7 Conclusion

We developed the ETM, a generative model of doc-
uments that marries LDA with word embeddings.
The ETM assumes that topics and words live in
the same embedding space, and that words are
generated from a categorical distribution whose
natural parameter is the inner product of the word
embeddings and the embedding of the assigned
topic.

The ETM learns interpretable word embeddings
and topics, even in corpora with large vocabular-
ies. We studied the performance of the ETM against
several document models. The ETM learns both
coherent patterns of language and an accurate
distribution of words.

Acknowledgments

DB and AD are supported by ONR
N00014-17-1-2131, ONR N00014-15-1-2209,
NIH 1U01MH115727-01, NSF CCF-1740833,
DARPA SD2 FA8750-18-C-0130, Amazon,
NVIDIA, and the Simons Foundation. FR received
funding from the EU's Horizon 2020 R&I
programme under the Marie Skłodowska-Curie
grant agreement 706760. AD is supported by a
Google PhD Fellowship.
References

John Aitchison and Shir Ming Shen. 1980. Logistic
normal distributions: Some properties and uses.
Biometrika, 67(2):261–272.

Kayhan Batmanghelich, Ardavan Saeedi, Karthik
Narasimhan, and Sam Gershman. 2016. Non-
parametric spherical topic modeling with word
embeddings. In Association for Computational
Linguistics, volume 2016, page 537.

Yoshua Bengio, R´ejean Ducharme, Pascal Vincent,
and Christian Janvin. 2003. A neural probabilistic
language model. Journal of Machine Learning
Research, 3:1137–1155.

Yoshua Bengio, Holger Schwenk, Jean-S´ebastien
Sen´ecal, Fr´ederic Morin, and Jean-Luc Gauvain.
2006. Neural probabilistic language models. In
Innovations in Machine Learning.

David M. Blei. 2012. Probabilistic topic models.
Communications of the ACM, 55(4):77–84.

David M. Blei, Alp Kucukelbir, and Jon D.
McAuliffe. 2017. Variational inference: A review
for statisticians. Journal of the American
Statistical Association, 112(518):859–877.

David M. Blei and Jon D. Lafferty. 2007. A
correlated topic model of Science. The Annals
of Applied Statistics, 1(1):17–35.

David M. Blei, Andrew Y. Ng, and Michael I.
Jordan. 2003. Latent Dirichlet allocation. Journal
of Machine Learning Research, 3(Jan):993–1022.

Gerlof Bouma. 2009. Normalized (pointwise)
mutual information in collocation extraction. In
German Society for Computational Linguistics
and Language Technology Conference.

Jordan Boyd-Graber, Yuening Hu, and David
Mimno. 2017. Applications of topic models.
Foundations and Trends in Information Retrieval,
11(2–3):143–296.

Stefan Bunk and Ralf Krestel. 2018. WELDA:
Enhancing topic models by incorporating local
word context. In ACM/IEEE Joint Conference on
Digital Libraries.

Dallas Card, Chenhao Tan, and Noah A. Smith.
2017. A neural framework for generalized topic
models. arXiv:1705.09296.
Chaitanya Chemudugunta, Padhraic Smyth, and
Mark Steyvers. 2006. Modeling general and
specific aspects of documents with a probab-
ilistic topic model. In Advances in Neural
Information Processing Systems.
Yulai Cong, Bo C. Chen, Hongwei Liu, and
Mingyuan Zhou. 2017. Deep latent Dirichlet
allocation with topic-layer-adaptive stochastic
gradient Riemannian MCMC. In International
Conference on Machine Learning.
Rajarshi Das, Manzil Zaheer, and Chris Dyer.
2015. Gaussian LDA for topic models with
word embeddings. In Association for Computa-
tional Linguistics and International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers).
Samuel J. Gershman and Noah D. Goodman. 2014.
Amortized inference in probabilistic reasoning.
In Annual Meeting of the Cognitive Science
Society.
Thomas L. Griffiths, Mark Steyvers, David M.
Blei, and Joshua B. Tenenbaum. 2004. Integrat-
ing topics and syntax. In Advances in Neural
Information Processing Systems.
Zellig S. Harris. 1954. Distributional structure.
Word, 10(2–3):146–162.
Matthew D. Hoffman, David M. Blei, and Francis
Bach. 2010. Online learning for latent Dirichlet
allocation. In Advances in Neural Information
Processing Systems.
Matthew D. Hoffman, David M. Blei, Chong
Wang, and John Paisley. 2013. Stochastic varia-
tional inference. Journal of Machine Learning
Research, 14:1303–1347.
Sergey Ioffe and Christian Szegedy. 2015. Batch
normalization: Accelerating deep network
training by reducing internal covariate shift.
In International Conference on Machine
学习.
Michael I. Jordan, Zoubin Ghahramani, Tommi S.
Jaakkola, and Lawrence K. Saul. 1999. 一个
introduction to variational methods for graphi-
cal models. Machine Learning, 37(2):183–233.
Kamrun Naher Keya, Yannis Papanikolaou, and
James R. Foulds. 2019. Neural embedding
allocation: Distributed representations of topic
models. arXiv preprint arXiv:1909.04702.
Diederik P. Kingma and Jimmy L. Ba. 2015.
亚当: A method for stochastic optimization.
In International Conference on Learning
Representations.
Diederik P. Kingma and Max Welling. 2014.
Auto-encoding variational Bayes. In Interna-
tional Conference on Learning Representa-
系统蒸发散.
Jey H. Lau, David Newman, and Timothy
Baldwin. 2014. Machine reading tea leaves:
Automatically evaluating topic coherence and
topic model quality. In Conference of the
European Chapter of the Association for
Computational Linguistics.
Quoc Le and Tomas Mikolov. 2014. Distributed
representations of sentences and documents. In
International Conference on Machine Learn-
ing.
Omer Levy and Yoav Goldberg. 2014.
Neural word embedding as implicit matrix
factorization. In Neural Information Processing
系统.
Shaohua Li, Tat-Seng Chua, Jun Zhu, and Chun-
yan Miao. 2016. Generative topic embedding:
A continuous representation of documents. In
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers).
Yang Li and Tao Yang. 2018. Word Embedding
for Understanding Natural Language: A Survey,
Springer International Publishing.
Andrew L. Maas, Raymond E. Daly, Peter T.
Pham, Dan Huang, Andrew Y. Ng, and
Christopher Potts. 2011. Learning word vectors
for sentiment analysis. In Annual Meeting of
the Association for Computational Linguistics:
Human Language Technologies.
Yishu Miao, Lei Yu, and Phil Blunsom. 2016.
Neural variational inference for text processing.
In International Conference on Machine
Learning.
Tomas Mikolov, Kai Chen, Greg S. Corrado,
and Jeffrey Dean. 2013a. Efficient estimation
of word representations in vector space. arXiv
preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S.
Corrado, and Jeff Dean. 2013b. Distributed
representations of words and phrases and
their compositionality. In Neural Information
Processing Systems.
David Mimno, Hanna M. Wallach, Edmund
Talley, Miriam Leenders, and Andrew
McCallum. 2011. Optimizing semantic coher-
ence in topic models. In Conference on
Empirical Methods in Natural Language
Processing.
Andriy Mnih and Koray Kavukcuoglu. 2013.
Learning word embeddings efficiently with
noise-contrastive estimation. In Neural Inform-
ation Processing Systems.
Christopher E. Moody. 2016. Mixing Dirichlet
topic models and word embeddings to make
LDA2vec. arXiv:1605.02019.
Dat Q. Nguyen, Richard Billingsley, Lan Du,
and Mark Johnson. 2015. Improving topic
models with latent feature word representations.
Transactions of the Association for Computa-
tional Linguistics, 3:299–313.
Jeffrey Pennington, Richard Socher, and
Christopher D. Manning. 2014. GloVe: Global
vectors for word representation. In Conference
on Empirical Methods on Natural Language
Processing.
James Petterson, Wray Buntine, Shravan M.
Narayanamurthy, Tib´erio S. Caetano, and
Alex J. Smola. 2010. Word features for latent
Dirichlet allocation. In Advances in Neural
Information Processing Systems.
Danilo J. Rezende, Shakir Mohamed, and Daan
Wierstra. 2014. Stochastic backpropagation and
approximate inference in deep generative
models. In International Conference on
Machine Learning.
Michal Rosen-Zvi, Thomas Griffiths, Mark
Steyvers, and Padhraic Smyth. 2004. The
author-topic model for authors and documents.
In Uncertainty in Artificial Intelligence.
Maja Rudolph, Francisco J. R. Ruiz, Stephan
Mandt, and David M. Blei. 2016. Exponential
family embeddings. In Advances in Neural
Information Processing Systems.
David E. Rumelhart and Adele A. Abrahamson.
1973. A model for analogical reasoning.
Cognitive Psychology, 5(1):1–28.
Bei Shi, Wai Lam, Shoaib Jameel, Steven
Schockaert, and Kwun P. Lai. 2017. Jointly
learning word embeddings and latent topics.
In ACM SIGIR Conference on Research and
Development in Information Retrieval.
Akash Srivastava and Charles Sutton. 2017. Auto-
encoding variational inference for topic models.
In International Conference on Learning
Representations.
Nitish Srivastava, Geoffrey Hinton, Alex
Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: A simple way
to prevent neural networks from overfitting.
Journal of Machine Learning Research,
15(1):1929–1958.
Michalis K. Titsias and Miguel L´azaro-Gredilla.
2014. Doubly stochastic variational Bayes for
non-conjugate inference. In International Con-
ference on Machine Learning.
Hanna M. Wallach, David M. Mimno, and
Andrew McCallum. 2009a. Rethinking LDA:
Why priors matter. In Advances in Neural
Information Processing Systems.
Hanna M. Wallach, Iain Murray, Ruslan
Salakhutdinov, and David Mimno. 2009b.
Evaluation methods for topic models.
In International Conference on Machine
Learning.
Pengtao Xie, Diyi Yang, and Eric Xing. 2015.
Incorporating word correlation knowledge into
topic modeling. In Conference of the North
American Chapter of the Association for
计算语言学: Human Language
Technologies.
Hongteng Xu, Wenlin Wang, Wei Liu, and
Lawrence Carin. 2018. Distilled Wasserstein
learning for word embedding and topic
modeling. In Advances in Neural Information
Processing Systems.
Guangxu Xun, Vishrawas Gopalakrishnan,
Fenglong Ma, Yaliang Li, Jing Gao, and Aidong
Zhang. 2016. Topic discovery for short texts
using word embeddings. In IEEE International
Conference on Data Mining.
Guangxu Xun, Yaliang Li, Wayne Xin Zhao, Jing
Gao, and Aidong Zhang. 2017. A correlated
topic model using word embeddings. In Joint
Conference on Artificial Intelligence.
He Zhao, Lan Du, and Wray Buntine. 2017a.
A word embeddings informed focused topic
model. In Asian Conference on Machine
Learning.
Hao Zhang, Bo Chen, Dandan Guo, and
Mingyuan Zhou. 2018. WHAI: Weibull
hybrid autoencoding inference for deep topic
modeling. In International Conference on
Learning Representations.
He Zhao, Lan Du, Wray Buntine, and Gang
Liu. 2017b. MetaLDA: A topic model that
efficiently incorporates meta information. In
IEEE International Conference on Data
Mining.