Heterogeneous Supervised Topic Models

Dhanya Sridhar∗ and Hal Daumé III† and David Blei♠

∗Université de Montréal and Mila-Quebec AI Institute, Canada
dhanya.sridhar@mila.quebec
†University of Maryland and Microsoft Research, USA
hal3@umd.edu
♠Columbia University, USA
david.blei@columbia.edu

Abstract

Researchers in the social sciences are often in-
terested in the relationship between text and an
outcome of interest, where the goal is to both
uncover latent patterns in the text and predict
outcomes for unseen texts. To this end, this
paper develops the heterogeneous supervised
topic model (HSTM), a probabilistic approach
to text analysis and prediction. HSTMs posit
a joint model of text and outcomes to find
heterogeneous patterns that help with both text
analysis and prediction. The main benefit of
HSTMs is that they capture heterogeneity in
the relationship between text and the outcome
across latent topics. To fit HSTMs, we de-
velop a variational inference algorithm based
on the auto-encoding variational Bayes frame-
work. We study the performance of HSTMs
on eight datasets and find that they consis-
tently outperform related methods, including
fine-tuned black-box models. Finally, we ap-
ply HSTMs to analyze news articles labeled
with pro- or anti-tone. We find evidence of
differing language used to signal a pro- and
anti-tone.

1 Introduction

Researchers in the social sciences are interested
in modeling the relationship between text and an
outcome of interest. In surveys about elections,
how do respondents’ open-ended responses relate
to evaluations of their anger, fear, or concern
(Roberts et al., 2014)? In news media, how do
articles relate to the tone used by the writer towards
the article’s topic (Card et al., 2015)? On social
media, how do tweets on mass shootings relate
to the political ideology of the user (Demszky
et al., 2019)? In these settings, the goal is to both
uncover latent properties about the text and predict
outcomes for unseen texts.

We develop the heterogeneous supervised topic
model (HSTM), a probabilistic approach to mod-
eling outcomes from text. The key innovation in
HSTMs, the heterogeneity, is the idea that individual
words can be predictive of the outcome
not just on their own, but also specifically in
combination with latent topics in the text.

As a running example, consider news articles
about US immigration labeled with the pro- or
anti-tone of the writer towards US immigration
(Card et al., 2015). Table 1 illustrates how an
HSTM can be used to analyze patterns in text that
relate to pro- and anti-immigration tones.

As is standard for topic models (Blei et al.,
2003), HSTMs learn the hidden topics in the cor-
pus, which are captured by the ‘‘neutral’’ words
shown in each topic. For example, US immigra-
tion articles discuss legal aspects (first topic) and
views of the administration (second topic).

However, HSTMs also analyze the per-topic,
heterogeneous associations between the text and
outcome. In Table 1, this is captured by the ‘‘pro’’
and ‘‘anti’’ words in each topic. For example, in
articles about the former administration (second
topic), phrases such as ‘‘families’’ and ‘‘work-
ers’’ reflect a pro-immigration tone, and phrases
such as ‘‘border’’ and ‘‘control’’ reflect an anti-
immigration tone. In articles about legal aspects
(first topic), the words ‘‘children’’ and ‘‘citizen’’
capture a pro-immigration tone while words such
as ‘‘terrorist’’ appear in anti-immigration articles.
A fitted HSTM can also perform supervised
prediction of the outcome from text by using its
model of the outcome based on the interactions
between the topics and words. For example, an
HSTM fit to the immigration articles can predict
the pro- or anti-immigration tone of unseen articles
about US immigration.

Transactions of the Association for Computational Linguistics, vol. 10, pp. 732–745, 2022. https://doi.org/10.1162/tacl_a_00487
Action Editor: Mauro Cettolo. Submission batch: 10/2021; Revision batch: 2/2022; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Table 1: An example of the analysis enabled by HSTM on US immigration articles labeled with the pro or anti tone towards immigration by the writer.

The HSTM contributes to the large body of work on modeling outcomes from text. One common approach is to fine-tune large language models such as BERT (Devlin et al., 2018). These black-box methods capture complex interactions between parts of text and the outcome, which often leads to strong predictive performance. However, it is difficult to visualize the parts of text that drive predictions (Feng et al., 2018; Jacovi and Goldberg, 2020). Even explanation methods such as saliency maps, rationales, and analysis of attention mechanisms are not robust (Jain and Wallace, 2019; Kindermans et al., 2019; Serrano and Smith, 2019) and are often not faithful to the underlying prediction model (Jacovi and Goldberg, 2020).

Since it is difficult to explain the predictions of black-box methods, there has been a push towards more transparent models (Rudin, 2019). Probabilistic models of text such as topic models (Blei et al., 2003) tend to be interpretable to experts (Chang et al., 2009) and are a mainstay in social science research for understanding patterns in text data (Krippendorff, 2018; Grimmer and Stewart, 2013). Their supervised counterparts such as supervised topic models (McAuliffe and Blei, 2008), sparse additive models (Eisenstein et al., 2011), multinomial text regression (Taddy, 2013), and related variants (Card et al., 2017) retain this interpretability, enabling exploratory data analysis. However, these probabilistic methods do not capture heterogeneous relationships between text and outcomes, limiting their predictive performance.

The HSTM enjoys the interpretability of topic models while extending them to capture flexible relationships between text and outcomes. The idea behind HSTMs is that the observed correlation among words provides evidence for latent topics, and latent topics mediate the associations between words and the outcome of interest. To implement this idea, the HSTM posits a joint generative model that captures how latent topics drive both text documents and outcomes. We fit HSTMs with auto-encoding variational Bayes (Kingma and Welling, 2013).

We study the HSTM across eight datasets that range from product reviews to surveys about immigration.1 First, we evaluate the predictive performance of several methods for supervised text prediction. We compare HSTMs to related topic-modeling approaches (Card et al., 2017; McAuliffe and Blei, 2008; Blei et al., 2003) and fine-tuned BERT models (Devlin et al., 2018). The topic modeling approaches are easy to visualize while the black-box BERT models are difficult to interpret. We find that the HSTM achieves significantly better predictive performance than its topic modeling counterparts across all settings, and performs competitively with BERT in five out of eight settings. Next, we perform an ablation study to understand the effect of different modeling choices made in designing the HSTM. We find that the heterogeneous model of outcomes improves text prediction. Finally, we use news articles to study the use of HSTMs for qualitative data exploration.

1The code and data to reproduce the experiments are available at: https://github.com/dsridhar91/hstm.

2 Background

Consider a corpus of n text documents and their associated outcomes D = {(w1, y1), . . . , (wn, yn)}. Each document wi is a sequence of m word tokens wi = {wi1 . . . wim} that come from a vocabulary of size V. The corpus can also be represented as a bag-of-words (BoW) with matrix x, where xiv is the frequency (or occurrence count) of word v in document i. The outcome yi can be real-valued, binary, or categorical.

This work builds on topic models, probabilistic models of latent variables and observed text. We extend a topic model based on products of experts (PoE) (Hinton, 2002). In a PoE topic model,

each word wij comes from a product of K (possibly unnormalized) distributions called experts,

\begin{equation}
p(w_{ij} = v \mid \theta_i, \beta) = \mathrm{Mult}\big(\sigma(\theta_i^\top \beta)\big)_v = \frac{\exp\big(\sum_k \theta_{ik}\beta_{kv}\big)}{\sum_{v'} \exp\big(\sum_k \theta_{ik}\beta_{kv'}\big)}, \tag{1}
\end{equation}

where σ(·) is the softmax function, and v is a word in the vocabulary. Srivastava and Sutton (2017) refer to this model as ProdLDA.

Each document is associated with a local latent variable θi on the K-dimensional simplex ($\Delta^K$).2 Each entry θik is the document's affinity for the k-th expert. The k-th expert is governed by the global variable βk ∈ R^V. Each entry βkv is the k-th expert's affinity for word v.

2This is not necessary to define a PoE but Srivastava and Sutton (2017) show that it results in higher quality topics.

The PoE topic model is different from latent Dirichlet allocation (LDA) (Blei et al., 2003), a mixed membership model. To draw a word from LDA, it must have high probability in at least one component. In contrast, in a PoE topic model, to draw a word, all K experts must have a high affinity for the word. Consequently, PoE topic models tend to result in more contrasting topics (Srivastava and Sutton, 2017).

3 The Heterogeneous Supervised Topic Model

We present the heterogeneous supervised topic model (HSTM). It combines a PoE topic model of words with a heterogeneous model of the outcomes. HSTMs capture how associations between words and the outcome vary across latent topics.

The HSTM is a generative model of a text and outcome pair (wi, yi). To generate the text wi, the HSTM uses the PoE topic model described in § 2, and involves the same global and local latent variables β and θi. Additionally, it includes a global intercept term b ∈ R^V. The model of text is

\begin{equation}
w_{ij} \mid \theta_i, \beta, b \sim \mathrm{Mult}\big(\sigma(b + \theta_i^\top \beta)\big). \tag{2}
\end{equation}

Each entry bv is an intercept that reflects the baseline popularity of the word v in the corpus. In including this intercept, we follow Eisenstein et al. (2011), Roberts et al. (2014), and Card et al. (2017).

The local latent variable θi is parameterized by a logistic-Normal prior,

\begin{align}
r_i \in \mathbb{R}^K &\sim \mathcal{N}(0, I) \tag{3} \\
\theta_i \in \Delta^K &= \sigma(r_i). \tag{4}
\end{align}

The priors on the global latent variables are

\begin{align*}
\beta_k \in \mathbb{R}^V &\sim \mathcal{N}(0, I) \\
b \in \mathbb{R}^V &\sim \mathrm{Laplace}(0, \lambda I).
\end{align*}

The variables β describe the latent structure of text. In Table 1, we visualize the neutral words by ranking the values of each vector βk. In the topic about legal aspects of US immigration, the values of βk were large for words such as "deportation" and "court".

After generating the document wi, the HSTM generates its outcome yi using a generalized linear model (McCullagh and Nelder, 1989). Each outcome y is from an exponential family f (y ; η(·)) (e.g., Bernoulli, Normal, Categorical) whose natural parameter η(·) is a linear function of latent variables and text xi.

The outcome model involves the global variables β and local variables θi. Additionally, the outcome model includes global latent variables a ∈ R^K, ω ∈ R^V, and γk ∈ R^V, which are vectors of coefficients. Let Θi = {xi, θi, β, γ, a, ω}. The outcome model is

\begin{equation*}
y_i \mid \Theta_i \sim f\big(y \,;\, \eta(\Theta_i)\big),
\end{equation*}

where

\begin{equation}
\eta(\Theta_i) = \underbrace{a^\top \theta_i}_{\text{Linear in topics}} + \underbrace{\omega^\top x_i}_{\text{Linear in words}} + \underbrace{\sum_{v} x_{iv} \Big( \sum_{k} \theta_{ik}\beta_{kv}\gamma_{kv} \Big)}_{\text{Heterogeneity}}. \tag{5}
\end{equation}

The priors for the global latent variables in the outcome model are

\begin{equation}
\begin{aligned}
a \in \mathbb{R}^K &\sim \mathcal{N}(0, I) \\
\omega \in \mathbb{R}^V &\sim \mathrm{Laplace}(0, \lambda I) \\
\gamma_k \in \mathbb{R}^V &\sim \mathrm{Laplace}(0, \tau I).
\end{aligned} \tag{6}
\end{equation}

The first term in Eq. 5 is $\theta_i^\top a$. It captures the
relationship between the context of a document
and the outcome, similar to supervised topic mod-
els (McAuliffe and Blei, 2008; Card et al., 2017;
Roberts et al., 2014; Eisenstein et al., 2011).

The second term is $x_i^\top \omega$. The coefficients ω
capture which words are predictive of the out-
come, regardless of the context. When the value
ωv is negative, the vocabulary word v has a neg-
ative pull on the outcome. The interpretation is
similar for positive values.

In the last term of Eq. 5, the word frequency
xiv is multiplied by a quantity that is factor-
ized over the K experts. Each term θikβkv is the
contribution of expert k to the unnormalized log
probability of word v in document i. The variable
γ modulates this term to predict the outcome.

In Table 1, the word ‘‘deportation’’ has a large
value of θikβkv in articles about legal aspects
of US immigration. However, it may not carry
much signal about the pro- or anti-immigration
tone. Thus, the corresponding value of γkv for
‘‘deportation’’ will be small. The word ‘‘citizen’’
has a large value of γkv since it has a positive
pull on the tone taken towards immigration. The
anti-immigration word ‘‘terrorist’’ has a large
negative value of γkv.

Since the variables γ and β are both real-valued,
we cannot immediately interpret positive (and
negative) values of γ as pro and anti words. Table 1
visualizes the pro and anti words in each context
by ranking the values of the ratio γkv / βkv. When this rescaled
variable is positive, the word v is associated with
a pro-immigration tone.
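
To make the three terms of the outcome model concrete, the following is a minimal sketch of the natural parameter in Eq. 5 for a single document, written in plain Python. The function name `natural_parameter` and all toy values are assumptions made for illustration; they are not taken from the authors' released code.

```python
import math


def natural_parameter(x, theta, beta, gamma, a, omega):
    """Natural parameter of Eq. 5 for one document (illustrative sketch).

    x:     length-V list of word frequencies
    theta: length-K list of topic proportions
    beta:  K x V nested list of expert-word affinities
    gamma: K x V nested list of heterogeneity coefficients
    a:     length-K list, omega: length-V list
    """
    K, V = len(theta), len(x)
    linear_topics = sum(a[k] * theta[k] for k in range(K))
    linear_words = sum(omega[v] * x[v] for v in range(V))
    heterogeneity = sum(
        x[v] * sum(theta[k] * beta[k][v] * gamma[k][v] for k in range(K))
        for v in range(V)
    )
    return linear_topics + linear_words + heterogeneity


# Toy example with K = 2 topics and V = 3 words (all numbers are made up).
x = [0.5, 0.5, 0.0]
theta = [0.75, 0.25]
beta = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
gamma = [[2.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
a = [1.0, -1.0]
omega = [0.0, 1.0, 0.0]

eta = natural_parameter(x, theta, beta, gamma, a, omega)
# For a Bernoulli outcome, the natural parameter is the log-odds.
prob_positive = 1.0 / (1.0 + math.exp(-eta))
```

In this toy setting the same word can push the outcome in different directions depending on which topic is active, because its contribution is weighted by θik βkv γkv rather than by a single word-level coefficient.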

4 Inference

Given a dataset D of documents and outcomes,
we perform maximum a posteriori estimation of
the global variables μ = {β, γ, a, ω} and vari-
ational inference for the local latent variables
ri, which is θi on the logit scale (Eq. 3). We
follow the work on auto-encoding variational
Bayes (Kingma and Welling, 2013).

The local latent variable ri comes from an amor-
tized variational family qφ(ri | wi), a multivariate
Normal distribution whose mean and diagonal
covariance are parameterized by a neural network
called the encoder. The encoder has weights φ
and takes the text wi as input.

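We maximize the evidence lower bound (ELBO),

\begin{equation}
\mathcal{L}(\phi, \mu) = \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{E}_{r_i}\big[ \log p(w_i \mid r_i, \mu) + \log p(y_i \mid r_i, w_i, \mu) \big] - \mathrm{KL}\big( q_\phi(r_i \mid w_i) \,\|\, p(r_i) \big) \Big) + \log p(\mu), \tag{7}
\end{equation}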

a bound on the log marginal likelihood of the documents, $\sum_{i=1}^{n} \log p(w_i \mid \mu)$. The expectation is taken with respect to the variational distribution qφ(ri | wi). The term log p(μ) corresponds to L1 and L2 penalty terms,

\begin{equation}
\log p(\mu) = -\Big( \lambda \sum_{v} \big( |b_v| + |\omega_v| \big) + \tau \sum_{k,v} |\gamma_{kv}| + \sum_{k,v} \beta_{kv}^2 + \sum_{k} a_k^2 \Big). \tag{8}
\end{equation}

We maximize the ELBO with stochastic gradient-based optimization. When taking gradients, the term $\nabla_\phi \mathbb{E}_{q_\phi(r_i \mid w_i)}[\cdot]$ poses a challenge because the expectation is taken under a distribution that itself depends on φ. We approximate this gradient with the reparameterization trick (Kingma and Welling, 2013). We approximate the expectation with Monte Carlo and sample r ∼ qφ(ri | wi) by drawing from N(0, 1) and applying the location-scale transformation with the mean and variance of qφ(·). Then ∇L(φ, μ) can be computed with automatic differentiation. (We use PyTorch; implementation details are discussed below.)

After fitting HSTMs, we use the estimated val-
ues ˆβ, ˆω, and ˆγ to visualize the results in Table 1.
To form predictions for a new document wd, we
obtain its representation ˆθd by passing the words
through the encoder. We make predictions using
Eq. 5 with the fitted variables.
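
The two ingredients of the per-document ELBO terms that do not depend on the likelihood can be sketched in a few lines: a reparameterized draw of r and the closed-form KL divergence between a diagonal Normal and the standard Normal prior. This is a plain-Python illustration under stated assumptions, not the authors' PyTorch implementation; the function names and the hypothetical encoder outputs `mean` and `log_var` are chosen here for illustration.

```python
import math
import random


def sample_r(mean, log_var, rng):
    """Reparameterized draw: r = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def softmax(r):
    """Map r to the simplex, so that theta = softmax(r)."""
    top = max(r)
    exps = [math.exp(v - top) for v in r]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical encoder outputs for one document with K = 3 topics.
mean, log_var = [0.5, -0.5, 0.0], [0.0, 0.0, 0.0]

theta = softmax(sample_r(mean, log_var, rng))
kl = kl_to_standard_normal(mean, log_var)
```

Because the noise eps is drawn independently of the encoder outputs, the sampled r is a deterministic, differentiable function of the mean and variance, which is what allows an automatic differentiation library to propagate gradients to the encoder weights.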

5 Related Work

HSTMs contribute to the work on extending topic
models (Blei et al., 2003) for supervised predic-
tion (McAuliffe and Blei, 2008; Lacoste-Julien
et al., 2008; Eisenstein et al., 2011; Roberts et al.,
2014; Card et al., 2017) and modeling covari-
ates (Taddy, 2013) such as authorship (Rosen-Zvi
et al., 2012), political ideology (Nguyen et al.,
2013), voting behavior (Nguyen et al., 2015; Vafa
et al., 2020), or category tags (Ramage et al.,
2009).

In this line of work, HSTMs are most closely
related to the neural models for documents with
metadata (i.e., outcomes) proposed by Card et al.
(2017). Card et al. (2017) develop a general frame-
work for topic models fit using auto-encoding
variational Bayes. In their framework, either the
outcomes are predicted from the inferred topics
or interactions between the topics and outcomes
are used to model text. In the latter task, the in-
teraction effects learned by the neural models are
similar to the heterogeneity that HSTMs capture.
However, the neural model framework does not
accommodate predicting outcomes and capturing
heterogeneity with the same model. In contrast,
HSTMs can simultaneously capture topic-driven
heterogeneity and enable supervised prediction
based on the heterogeneous patterns.

HSTMs also relate to the literature on neural
topic models (Dieng et al., 2020; Miao et al., 2016;
Benton and Dredze, 2018; Cao et al., 2015; Das
et al., 2015; Nguyen et al., 2013; Srivastava and
Sutton, 2017; Lau et al., 2017; He et al., 2017;
Larochelle and Lauly, 2012). Within this line of
work, the HSTM builds on Srivastava and Sutton
(2017), who proposed a PoE topic model and
adapted auto-encoding variational Bayes (Kingma
and Welling, 2013) for topic models. In con-
trast to their work, which proposes a generative
model of text, HSTMs are joint models of text
and outcomes, enabling supervised prediction and
text analysis that reveals how text relates to the
outcome of interest.

6 Empirical Studies

We empirically evaluate HSTMs with eight
datasets. The code and data to reproduce the results
are available at https://github.com/dsridhar91/hstm.

Our empirical analysis is driven by three key
questions: 1) How does the predictive perfor-
mance of HSTMs compare to that of related meth-
ods? 2) How do different modeling components
of HSTMs contribute to their performance? 3)
How do HSTMs help exploratory text analysis?

For predictive performance, we find that: 1)
in all eight settings that we study, the predictive
performance of HSTMs significantly improves
upon the results of approaches that also use a bag-
of-words approach to representing text, making

them easy to visualize; 2) in five out of eight
settings, HSTMs are competitive with fine-tuned
BERT models, a state-of-the-art transformer based
approach that is difficult to interpret (in contrast,
HSTMs are easy to visualize); and 3) in four out
of eight settings, the terms in Eq. 5 that capture
heterogeneity in the outcome model significantly
improve prediction.

For exploratory text analysis, we apply HSTMs
to analyze news articles labeled with the writer’s
pro- or anti-tone. Using the media framing cor-
pus (Card et al., 2015), we study articles on US
immigration, same sex marriage, the death penalty
and gun control. In all article subjects, we find
evidence of differing pro-issue and anti-issue
word choices across learned topics.

6.1 Datasets and Preprocessing

We studied eight text corpora with labeled outcomes.

Amazon Office Products. This dataset contains
the text of Amazon reviews for office-related
products and each review’s corresponding rating
(one to five stars).3 We used a randomly selected
subset of 20k reviews. We normalized the ratings
to be ∈ [0, 1], and studied the regression task of
predicting the normalized rating from text.

Amazon Grocery Products. This dataset con-
tains the text of Amazon reviews and their ratings
for grocery-related products.4 Here, we are inter-
ested in the binary classification task of predicting
a positive or negative review from text. We again
randomly selected 20k reviews. We discarded re-
views that had a 3-star rating and binned the
remaining reviews into two categories: positive re-
views (4- and 5-star ratings) and negative reviews
(1- and 2-star ratings).

Yelp Reviews. This dataset contains Yelp re-
views.5 We considered a subset of 20k reviews
ranging across all businesses. Each review has an
associated rating for the business, ranging from
one to five stars. The reviews were again binned
as positive and negative based on their ratings.
We consider the binary classification of positive
and negative reviews from text.

3http://jmcauley.ucsd.edu/data/amazon/.
4http://jmcauley.ucsd.edu/data/amazon/.
5http://www.yelp.com/dataset_challenge.

Method       Description
BoW          Regression with bag-of-words (BoW) frequencies as features
LDA          Regression with document representation from LDA (Blei et al., 2003)
BoW + LDA    Regression with BoW and document representation from LDA
PCA          Regression with PCA word embeddings, averaged over document
sLDA         Supervised LDA (McAuliffe and Blei, 2008) with auto-encoding VB inference
STM          Supervised PoE topic model (Card et al., 2017)
BERT         Pre-trained BERT embeddings fine-tuned for prediction (Devlin et al., 2018)
HSTM         Full model fit jointly (this paper)

Table 2: Description of the methods compared.

PeerRead. This dataset comes from PeerRead (Kang et al., 2018), a corpus of computer-science papers. The dataset contains each paper's abstract and whether or not it was accepted to a conference. We consider a subset of papers posted to
the arXiv under cs.cl, cs.lg, or cs.ai
between 2007 and 2017 inclusive.6 The dataset
includes 11,778 papers, of which 2,891 are ac-
cepted. We study the binary classification of ac-
ceptance (or not) from abstracts.

Media Framing Corpus. The last four datasets
come from the media framing corpus (Card et al.,
2015), a corpus of news articles labeled with the
pro or anti tone taken by the writer towards the
article subject. We follow Card et al. (2017) in
analyzing this dataset. We consider articles across
four subjects: US immigration, same sex mar-
riage, death penalty, and gun rights. Each subject
consists of about 4k articles. We study the binary
classification of pro/anti tone from articles.

Pre-processing. For each dataset, we constructed a vocabulary of unigrams that occurred in at least 0.07% and in no more than 90% of the documents, and bigrams that occurred in at least 0.7% of the documents.
We considered normalized frequencies as BoW
features.
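
As an illustration of this kind of document-frequency filtering, the sketch below builds a unigram vocabulary and normalized BoW features in plain Python. It is a simplified stand-in, not the authors' pipeline: the helper names are hypothetical, bigrams are omitted, the toy thresholds (50% and 90%) are chosen so that the four-document example is non-trivial, and dividing counts by the number of in-vocabulary tokens is one assumed reading of "normalized frequencies".

```python
from collections import Counter


def build_vocabulary(docs, min_df, max_df):
    """Keep words whose document frequency (a fraction) lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(w for w, c in doc_freq.items()
                  if min_df <= c / n_docs <= max_df)


def normalized_bow(tokens, vocab):
    """Counts over the vocabulary, divided by the number of in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]


docs = [
    "the border court ruled".split(),
    "the court heard the families".split(),
    "the workers and the families".split(),
    "the border control".split(),
]
vocab = build_vocabulary(docs, min_df=0.5, max_df=0.9)
features = normalized_bow(docs[1], vocab)
```

Here "the" is dropped because it appears in every document, and words that appear in only one document are dropped as too rare, leaving a three-word vocabulary.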

6.2 Experimental Setup

Table 2 describes all the methods that we compared in the empirical studies.

6The dataset only includes papers that are not cross listed
with any non-cs categories and are within a month of the
submission deadline for a target conference. The conferences
are ACL, EMNLP, NAACL, EACL, TACL, NeurIPS, ICML,
ICLR, and AAAI. A paper is marked as accepted if it appeared
in one of the target venues. Otherwise, the paper is marked
as rejected.

Implementation Details. For logistic regres-
sion, ridge regression, PCA, and LDA, we used
Python’s sklearn package with default settings.
We implemented the BERT method with the
transformers Python library.7 The BERT
method uses the raw text as input, with a maximum
token length of 128. To predict, we apply a linear
map from the final hidden layer, averaged over
all tokens, to the outcome. The BERT method
was fine-tuned for 5 epochs. We use stochastic
gradient-based optimization with Adam (Kingma
and Ba, 2015), using a learning rate of 1 × 10^-5.
STMs, sLDA, and HSTM were all implemented
with PyTorch. For sLDA, we fit the model pro-
posed by McAuliffe and Blei (2008) but replace
the Dirichlet priors with logistic Normal priors to
enable stochastic optimization. For auto-encoding
VB inference, we used an encoder with two hidden
layers of size 300, ReLU activation, and batch-
normalization after each layer. For stochastic
optimization with Adam, we use automatic differ-
entiation in PyTorch. We used a learning rate of
0.01 based on recommendations from Srivastava
and Sutton (2017); Card et al. (2017). These
methods and BERT were trained on Titan GPUs.

Hyperparameters. For HSTMs, STMs, sLDA, PCA, and LDA, we used 50 topics for PeerRead and Semantic Scholar, 30 for Amazon office products and Yelp, 20 for Amazon grocery products, and 10 for all media framing corpus subjects, chosen based on validation log likelihood.

For HSTMs, we selected values of the Laplace
prior hyperparameters τ and λ based on cross-
validated prediction error.

7https://huggingface.co/transformers/.

Method       Amazon office    Yelp             PeerRead         Amazon grocery
             (MSE)            (Accuracy)       (Accuracy)       (Accuracy)
BoW          0.80 (0.04)      0.83 (0.005)     0.76 (0.01)      0.89 (0.003)
BoW + LDA    0.78 (0.04)      0.84 (0.008)     0.79 (0.004)     0.90 (0.004)
LDA          0.90 (0.04)      0.82 (0.007)     0.79 (0.005)     0.89 (0.004)
PCA          0.87 (0.04)      0.78 (0.005)     0.78 (0.01)      0.88 (0.004)
sLDA         0.88 (0.04)      0.78 (0.03)      0.76 (0.01)      0.90 (0.003)
STM          0.93 (0.05)      0.83 (0.005)     0.78 (0.006)     0.90 (0.002)
HSTM         0.70 (0.03)      0.91 (0.005)     0.81 (0.01)      0.93 (0.004)

Method       US immigration   Death penalty    Same sex marriage  Gun rights
             (Accuracy)       (Accuracy)       (Accuracy)         (Accuracy)
BoW          0.62 (0.01)      0.66 (0.02)      0.73 (0.01)        0.69 (0.01)
BoW + LDA    0.71 (0.03)      0.70 (0.01)      0.75 (0.01)        0.69 (0.02)
LDA          0.70 (0.02)      0.69 (0.02)      0.74 (0.01)        0.69 (0.02)
PCA          0.71 (0.03)      0.70 (0.01)      0.76 (0.01)        0.70 (0.01)
sLDA         0.67 (0.03)      0.69 (0.01)      0.75 (0.01)        0.70 (0.01)
STM          0.72 (0.02)      0.69 (0.01)      0.77 (0.01)        0.69 (0.02)
HSTM         0.80 (0.01)      0.78 (0.02)      0.83 (0.01)        0.78 (0.02)

Table 3: The HSTM achieves the best predictive performance across all eight datasets. Metrics are averages
from cross validation across five folds (with standard errors in parentheses). For regression, we report
mean squared error (lower is better). For binary prediction, we report accuracy (higher is better).
Bolded numbers indicate that the result shows a statistically significant improvement over the next best
performing method (with a significance level of α = 0.01).

Experimental Details. Following the work of
Vafa et al. (2020), we initialize the variables
β with a pre-trained model. Specifically, we fit
LDA and use the log topics to initialize β. With
this initialization, we reweight the KL term to be
2 · DKL(·) following work on β-VAE (Burgess
et al., 2018) to encourage the posterior to be
closer to the prior. For a fair comparison with
STMs (Card et al., 2017), we fit it using the
same optimization strategies as we use for the
HSTM. We also initialize the unnormalized top-
ics using log topics from LDA as we do with
the HSTM.
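
One way to read this initialization is sketched below in plain Python: take the topic-word weights from a fitted LDA model, normalize each topic to a distribution, and use the elementwise logarithm as the starting value of β. The interpretation of "log topics" as log-probabilities, the clipping constant `eps`, and the toy weights are all assumptions made for illustration.

```python
import math


def log_topics(topic_word, eps=1e-10):
    """Row-normalize topic-word weights and take logs (eps guards against log 0)."""
    out = []
    for row in topic_word:
        total = sum(row)
        out.append([math.log(max(w / total, eps)) for w in row])
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical LDA output: 2 topics over a 3-word vocabulary (unnormalized weights).
lda_topic_word = [[6.0, 3.0, 1.0], [1.0, 1.0, 8.0]]
beta_init = log_topics(lda_topic_word)

# With theta concentrated on topic 0 and no intercept,
# softmax(theta^T beta) recovers LDA topic 0.
recovered = softmax(beta_init[0])
```

Under this reading, a document that loads entirely on one expert starts out with exactly the corresponding LDA topic as its word distribution, which gives the PoE model a sensible starting point before joint training.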

6.3 Predictive Performance

We investigate how HSTMs perform on prediction tasks relative to related methods. First, we compare HSTMs against the other methods that also use a bag-of-words approach to representing documents. This includes all the compared methods except for fine-tuned BERT models. The goal is to assess how HSTMs perform compared to other approaches that are also easy to visualize.
Second, we compare HSTMs to fine-tuned BERT
models, a state-of-the-art approach to supervised
text prediction based on transformers. In contrast
to HSTMs, BERT models are difficult to visualize.

Comparisons to BoW-based Approaches. Table 3 compares HSTMs to the related methods in Table 2 across the eight datasets. For binary prediction tasks, we report the average accuracy across five folds. For regression tasks, we report the average mean squared error (MSE).

Setting                HSTM            BERT
Amazon office (MSE)    0.70 (0.03)     0.53 (0.02)
Yelp                   0.91 (0.005)    0.93 (0.002)
PeerRead               0.81 (0.01)     0.78 (0.01)
Amazon grocery         0.93 (0.005)    0.96 (0.006)
US immigration         0.80 (0.01)     0.76 (0.03)
Death penalty          0.78 (0.02)     0.79 (0.01)
Same sex               0.83 (0.01)     0.83 (0.01)
Gun rights             0.78 (0.02)     0.77 (0.02)

Table 4: In five out of eight settings, the HSTM's predictive performance is not statistically significantly different (at a significance level of α = 0.01) than the performance of fine-tuned BERT models, a state-of-the-art approach to supervised text prediction based on transformers. Bolded results indicate when BERT achieves significantly better performance. BERT is a black-box method, making it hard to interpret which parts of text drove prediction. In contrast, HSTMs are easy to visualize. We report MSE for the regression task in the Amazon office setting, and accuracy for the remaining binary classification tasks. The results are averaged across five folds.

Bolded results show a statistically significant improvement over the next best performing method (at significance level α = 0.01).8

The results in Table 3 suggest that among
approaches that are easy to visualize, HSTMs
offer the best predictive performance. In all eight
datasets, the HSTM achieves improvements in
predictive performance that are statistically signif-
icant when compared to the next best performing
method (and by extension, all other compared
methods). There is variation in which method
comes closest to the performance of HSTMs. In
the Amazon office and PeerRead settings, the
BoW + LDA method is the next best performing
method after the HSTM. In the remaining settings,
the STM approach, which relates to the frame-
work from Card et al. (2017), is the second best.

8We apply a Student's t-test to evaluate the null hypothesis that the methods' true mean results are the same. The test statistic follows a Student's t-distribution. For binary classification, accuracy reflects the probability of successfully predicting the true label in a Binomial distribution. However, since the Binomial distribution is well-approximated by a Normal distribution when the sample size is large, we use the t-test even for classification tasks.

Comparison to BERT-based Approach. Table 4 summarizes the comparison between HSTMs and fine-tuned BERT models. In five out of eight settings, the HSTM's predictive performance is not statistically significantly different (at significance level α = 0.01) than the
performance of BERT. For the regression task in
the Amazon office setting, BERT offers drastic
improvements in predictive performance. For the
binary classification tasks in the Amazon gro-
cery and Yelp settings, BERT again achieves
significantly better performance than HSTMs.
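
The significance comparisons can be illustrated with a small calculation from the reported means and standard errors. The sketch below assumes an unpaired two-sample t-test with equal numbers of folds, so that the pooled standard error is the root sum of squares of the two reported standard errors and the test has 2 × 5 − 2 = 8 degrees of freedom. The paper does not state which variant of the t-test was used, so this is an assumption, as are the function name and the critical value taken from a standard t-table.

```python
import math


def t_statistic(mean_a, se_a, mean_b, se_b):
    """Two-sample t statistic from per-method means and standard errors."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Two-sided critical value at alpha = 0.01 with 8 degrees of freedom (t-table).
T_CRITICAL = 3.355

# Accuracies (standard errors) from Table 4, HSTM vs. BERT.
t_yelp = t_statistic(0.91, 0.005, 0.93, 0.002)
t_peerread = t_statistic(0.81, 0.01, 0.78, 0.01)

yelp_significant = abs(t_yelp) > T_CRITICAL
peerread_significant = abs(t_peerread) > T_CRITICAL
```

Under these assumptions, the Yelp difference exceeds the critical value while the PeerRead difference does not.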

We studied BERT and HSTMs further using
the full Yelp reviews dataset, which contains
560,000 training examples and 38,000 test exam-
ples. This has become a well-studied benchmark
dataset for binary classification from text.9 The
BERT model achieves a test-set accuracy of 0.96,
which is consistent with state-of-the-art results.10
Using the best performing hyper-parameters from
the smaller Yelp setting, we trained the HSTM
on this benchmark corpus. The HSTM achieves
a test-set accuracy of 0.91. Taken together with
the findings from Table 4, the results suggest
that while BERT remains a competitive approach
to supervised text prediction, when it is impor-
tant to trade off performance against the ease
of visualizing models, HSTMs might offer an
attractive solution.

6.4 Ablation Study

Which components of the HSTM are important for
its performance? We conduct an ablation study,
removing the heterogeneity and linear-in-words
terms from the outcome model to see the effect
they have on prediction.

We compare the full HSTM to variants that
omit terms from the outcome model in Eq. 5,

\begin{align}
\text{HSTM-Het:} \quad & \theta_i^\top \alpha + x_i^\top \omega \nonumber \\
\text{STM:} \quad & \theta_i^\top \alpha. \tag{9}
\end{align}

For a fair comparison, we apply the initialization
(using LDA) that the HSTM uses.
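
To make the two ablated variants in Eq. 9 concrete, the sketch below computes their linear predictors for a single document. It is an illustration under assumptions: the function names are hypothetical, `theta_i` stands for the document's topic proportions, `x_i` for its bag-of-words vector, and all numeric values are invented. The heterogeneity term of the full HSTM (Eq. 5) is deliberately left out.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("length mismatch")
    return sum(a * b for a, b in zip(u, v))


def stm_predictor(theta_i, alpha):
    """STM variant in Eq. 9: linear in the topic proportions only."""
    return dot(theta_i, alpha)


def hstm_het_predictor(theta_i, x_i, alpha, omega):
    """HSTM-Het variant in Eq. 9: linear-in-topics plus linear-in-words."""
    return dot(theta_i, alpha) + dot(x_i, omega)


# Hypothetical toy values: 2 topics and a 3-word vocabulary.
theta_i = [0.25, 0.75]      # topic proportions for document i
x_i = [2, 0, 1]             # bag-of-words vector for document i
alpha = [1.0, -1.0]         # linear-in-topics weights
omega = [0.5, 0.25, -0.5]   # linear-in-words weights

print(stm_predictor(theta_i, alpha))
print(hstm_het_predictor(theta_i, x_i, alpha, omega))
```

The functions return only the linear predictor; how that value is mapped to a regression or classification output is defined by the outcome model in Eq. 5 and is outside this sketch.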

Table 5 summarizes the findings from this
study. First, compared to STM, which only uses
topics to model the outcome, HSTM-Het, which
includes the linear-in-words term, substantially
improves held out performance. Second, HSTM-
Het performs better than its counterpart, BoW+

9For example, see https://paperswithcode.com/sota/sentiment-analysis-on-yelp-binary.
10https://huggingface.co/textattack/bert-base-uncased-yelp-polarity.


Dataset             Metric     STM            HSTM-Het       HSTM
Amazon office       MSE        0.93 (0.04)    0.72 (0.04)    0.70 (0.03)
PeerRead            Accuracy   0.78 (0.006)   0.80 (0.01)    0.81 (0.01)
Yelp                Accuracy   0.83 (0.005)   0.89 (0.008)   0.91 (0.005)
Amazon grocery      Accuracy   0.90 (0.003)   0.91 (0.005)   0.93 (0.005)
US immigration      Accuracy   0.72 (0.02)    0.77 (0.02)    0.80 (0.01)
Death penalty       Accuracy   0.69 (0.01)    0.76 (0.01)    0.78 (0.02)
Same sex marriage   Accuracy   0.77 (0.01)    0.81 (0.01)    0.83 (0.01)
Gun rights          Accuracy   0.69 (0.02)    0.73 (0.01)    0.78 (0.02)

Table 5: An ablation study of the HSTM outcome model reveals that the heterogeneity term in Eq. 5
improves prediction. We report the average results across five folds. Bolded numbers indicate that
the result shows a statistically significant improvement over the next best performing method (with a
significance level of α = 0.01).

LDA (in Table 3). The difference between the
methods is that HSTM-Het includes a sparsity-
inducing L1 penalty for the linear-in-words term,
suggesting that sparsity helps predictive perfor-
mance in the settings we study.

In four out of the eight datasets, HSTMs see
a statistically significant improvement (at sig-
nificance level α = 0.01) over the HSTM-Het
variant, which only uses the linear-in-words and
linear-in-topics terms. The finding suggests that
the heterogeneity term is useful for text prediction.
The gain over HSTM-Het varies, with the biggest
gain in the media framing corpus setting with
articles about gun control.
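
The size of the gain over HSTM-Het can be read directly off Table 5. The short sketch below recomputes it for the seven classification settings using the mean accuracies reported in the table; the variable names are hypothetical. It compares means only and says nothing about statistical significance.

```python
# Mean accuracies from Table 5 (binary classification settings only).
table5_accuracy = {
    "PeerRead": {"HSTM-Het": 0.80, "HSTM": 0.81},
    "Yelp": {"HSTM-Het": 0.89, "HSTM": 0.91},
    "Amazon grocery": {"HSTM-Het": 0.91, "HSTM": 0.93},
    "US immigration": {"HSTM-Het": 0.77, "HSTM": 0.80},
    "Death penalty": {"HSTM-Het": 0.76, "HSTM": 0.78},
    "Same sex marriage": {"HSTM-Het": 0.81, "HSTM": 0.83},
    "Gun rights": {"HSTM-Het": 0.73, "HSTM": 0.78},
}

# Gain in mean accuracy from adding the heterogeneity term.
gains = {
    name: round(row["HSTM"] - row["HSTM-Het"], 2)
    for name, row in table5_accuracy.items()
}
largest = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{gain:.2f}")
print("largest gain:", largest)
```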

6.5 Exploratory Analysis

Following the example in Table 1, we use HSTMs
to perform exploratory data analysis on the re-
maining article subjects in the media framing
corpus (Card et al., 2015): death penalty, same
sex marriage, and gun rights.

Table 6 visualizes the first four topics learned
by HSTMs in each set of articles. As with the
example in Table 1, the ‘‘neutral’’ words reflect the
learned topics, the ‘‘pro’’ words capture language
used to signal a pro tone (and similarly with ‘‘anti’’
words). As we describe in § 3, we identify the
neutral, pro, and anti words using the fitted HSTM
model parameters β and γ. For each subject in
the media framing corpus, Table 6 suggests that
HSTMs find relevant topics and heterogeneous
associations between words and the outcome (i.e.,
tone) in the pro and anti words, as below.

In articles about the death penalty, different ex-
ecutions and court rulings appear as topics. In one
ruling, pro death penalty language focuses on the
perpetrators (‘‘Tamerlan’’) and the victims. In
contrast, anti death penalty articles invoke phrases
such as ‘‘evidence’’ and ‘‘pleas’’. Pro death pen-
alty language invokes the crimes and their victims
while anti death penalty language includes racially
biased rulings, exoneration, and innocence.

In articles about same sex marriage, the HSTM
topics include supreme court rulings, recently
passed equality bills, and the views of the church.
In articles about recently passed bills, pro same sex
marriage is reflected by highlighting politicians
and states that passed bills (‘‘Cuomo’’, ‘‘Mary-
land’’) while the anti tone is reflected by words
such as ‘‘opposes’’ and ‘‘traditional.’’

In articles about gun rights, some discussed
topics are about marches, background checks,
school shootings, and gun laws. In articles about
marches, being pro gun-rights is signaled with
language about firearms while an anti gun-rights
tone is signaled by invoking gun violence.


Table 6: Topics learned by HSTMs from the media framing corpus. We
visualize the top seven words across four topics. In each fitted topic, the
‘‘neutral’’ words capture the topic, and the ‘‘pro’’ words capture language used to
signal a pro tone towards the subject (and similarly for ‘‘anti’’ words).

Table 7 visualizes the overall pro and anti
words in each article subject. As discussed in § 3,
we use the inferred model parameters ω (i.e.,
linear-in-words term). For each subject, the overall
pro-issue and anti-issue words are different from
the words that have heterogeneous associations
with tone, depending on the topic (in Table 6).
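
Reading overall pro and anti words off the linear-in-words weights amounts to ranking the vocabulary by ω. The sketch below is a minimal illustration, not the authors' code: the function name is hypothetical, the weights are invented, and it assumes that positive weights correspond to a pro tone and negative weights to an anti tone.

```python
def overall_pro_anti_words(vocab, omega, k=7):
    """Return the k words with the largest positive and most negative weights."""
    if len(vocab) != len(omega):
        raise ValueError("vocab and omega must have the same length")
    ranked = sorted(zip(vocab, omega), key=lambda pair: pair[1], reverse=True)
    # Assumption: positive weight signals a pro tone, negative an anti tone.
    pro = [word for word, weight in ranked[:k] if weight > 0]
    anti = [word for word, weight in ranked[::-1][:k] if weight < 0]
    return pro, anti


# Hypothetical vocabulary and weights, invented for illustration only.
vocab = ["equality", "love", "court", "evangelical", "define"]
omega = [0.9, 0.4, 0.0, -0.3, -0.8]

pro_words, anti_words = overall_pro_anti_words(vocab, omega, k=2)
print("Overall pro:", pro_words)
print("Overall anti:", anti_words)
```

Filtering on the sign keeps zero-weight words out of both lists, which matters when an L1 penalty drives many weights to exactly zero.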

Comparisons to Related Methods. To further
investigate the benefits of HSTMs for exploratory
text analysis, Table 8 visualizes the type of analysis
enabled by LDA, STMs, and BoW regression
on articles about same sex marriage. Table 8 re-
veals some qualitative contrasts between HSTM-
based text analysis and the related analyses.


Death penalty
  Overall pro:  first, pronounced, died, victims, upholds, rejected, upheld
  Overall anti: spared, life, freed, exonerated, bias, indigent, cited

Same sex marriage
  Overall pro:  latest, toward, maryland, even, love, struck, equality
  Overall anti: woman, homosexual, define, homosexuality, definition, added, evangelical

Gun rights
  Overall pro:  means, owners, lawsuits, lawful, liberal, convention, defending
  Overall anti: sensible, lobby, found, lost, reasonable, handguns, dangerous

Table 7: Words with the highest positive and negative ω weights learned by HSTMs for various
subjects in the media framing corpus. These words reflect overall pro and anti tone
irrespective of topic.

STM topics
  Topic 1: new, york, marriage, civil, legalize
  Topic 2: marriage, married, god, us, people
  Topic 3: church, rev, methodist, unions, clergy
  Topic 4: marriage, court, states, ruling, state

LDA topics
  Topic 1: marriage, new, gay, state, york, civil, rights
  Topic 2: marriage, gay, people, couples, but, one, right
  Topic 3: church, said, unions, united, rev, methodist, bishop
  Topic 4: marriage, court, state, gay, supreme, couples, states

BoW regression coefficients
  Pro:  benefits, equality, new, married, rights, gay, couples
  Anti: amendment, said, woman, marriages, man, church, ban

Table 8: HSTMs reveal different insights compared to the
analysis enabled by the STM, LDA, and BoW regression on
articles about same sex marriage. First, we visualize topics 1
through 4 found by the STM and LDA. Second, we visualize
the pro and anti words found by BoW regression.

After fitting LDA to the articles, most learned
topics involve words such as ‘‘marriage’’ and
‘‘gay’’, making it harder to distinguish the themes
that each topic captures. Further, from the LDA
output alone, it can be difficult to distinguish
between pro and anti language in each topic,
especially without knowledge of the domain.

Since we initialized both STMs and HSTMs
with the log topics from LDA, it is not surprising
that they find topics related to those from LDA.
However, the topics found by the STM appear
much more similar to LDA topics than the ones
found by HSTM. As with LDA, ‘‘marriage’’ ap-
pears in three out of the four displayed STM topics.


As with LDA, the topics do not appear to reveal
pro and anti language. For example, the third topic
found by the STM includes the word ‘‘unions’’,
which indicates a pro tone towards same sex
marriage, as well as the word ‘‘church,’’ which is
associated more closely with the anti side.

The topics found by HSTM (shown in Table 6)
from same sex marriage articles diverge from the
topics found by both the STM and LDA. For
example, the neutral words in the second topic
include words such as ‘‘moral’’ and ‘‘institution’’,
which were not ranked high in the corresponding
LDA and STM topics.

Moreover, HSTM appears to be more effective
at separating neutral words from pro- and anti-tone
ones. For example, in the third topic about same
sex marriage, the neutral words seem to capture re-
ligion generally. The pro words include ‘‘rabbis’’,
‘‘unitarian’’, and ‘‘episcopalian’’, denominations
which are known to be more open to same sex
marriage compared with other religious groups.
The anti words include ‘‘catholic’’ and ‘‘bishop’’,
which are associated with groups that maintain an
anti same sex marriage stance.

BoW regression infers pro and anti language
that is similar to the overall pro-issue and anti-
issue words found by HSTMs (Table 7), but (by
design) does not cluster these into topics.

7 Discussion

We addressed the problem of modeling the rela-
tionship between text and an outcome of interest
in service of text analysis and supervised predic-
tion. We proposed the HSTM to jointly model the
structure in text and capture heterogeneity in the
relationship between text and outcomes based on
the latent topics that drive the text.

We evaluated the HSTM on eight prediction
settings and found that it outperforms related methods.
We conducted an ablation study to examine
how different components of the HSTM con-
tribute to its performance. The study revealed that
the HSTM’s heterogeneous model of outcomes
improves prediction. In addition to forming pre-
dictions, the HSTM lets us explore text datasets.
We applied the HSTM to four corpora of news
articles labeled with pro/anti tone, and discovered
pro and anti wording choices.

The HSTM opens several avenues of future
work. One area to explore is combining black-box
language models such as BERT (Devlin et al.,
2018) with the HSTM to leverage both word
embeddings and topics. Another area is adapting
the HSTM to produce dynamic topics that vary
over time (Blei and Lafferty, 2006).

Acknowledgments

This work is supported by NSF grants IIS 2127869
and SaTC 2131508, ONR grants N00014-17-1-2131
and N00014-15-1-2209, the Simons Foundation,
and the Sloan Foundation.

References

Adrian Benton and Mark Dredze. 2018. Deep
Dirichlet multinomial regression. In Proceedings
of NAACL-HLT. https://doi.org/10.18653/v1/N18-1034

David M. Blei and John D. Lafferty. 2006. Dynamic
topic models. In Proceedings of ICML.

David M. Blei, Andrew Y. Ng, and Michael I.
Jordan. 2003. Latent Dirichlet allocation.
Journal of Machine Learning Research,
3(Jan):993–1022.

Christopher P. Burgess, Irina Higgins, Arka
Pal, Loic Matthey, Nick Watters, Guillaume
Desjardins, and Alexander Lerchner. 2018.
Understanding disentangling in β-VAE. arXiv
preprint arXiv:1804.03599.

Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li,
and Heng Ji. 2015. A novel neural topic model
and its supervised extension. In Proceedings of
AAAI.

Dallas Card, Amber Boydstun, Justin H. Gross,
Philip Resnik, and Noah A. Smith. 2015.
The media frames corpus: Annotations of
frames across issues. In Proceedings of ACL.
https://doi.org/10.3115/v1/P15-2072

Dallas Card, Chenhao Tan, and Noah A. Smith.
2017. Neural models for documents with metadata.
In Proceedings of ACL. https://doi.org/10.18653/v1/P18-1189

Jonathan Chang, Sean Gerrish, Chong Wang,
Jordan L. Boyd-Graber, and David M. Blei.
2009. Reading tea leaves: How humans interpret
topic models. In Proceedings of NeurIPS.

Rajarshi Das, Manzil Zaheer, and Chris Dyer.
2015. Gaussian LDA for topic models with
word embeddings. In Proceedings of ACL.


Dorottya Demszky, Nikhil Garg, Rob Voigt,
James Zou, Jesse Shapiro, Matthew Gentzkow,
and Dan Jurafsky. 2019. Analyzing polariza-
tion in social media: Method and application
to tweets on 21 mass shootings. In Proceedings
of NAACL-HLT. https://doi.org/10.18653/v1/N19-1304

Dongyeop Kang, Waleed Ammar, Bhavana
Dalvi, Madeleine van Zuylen, Sebastian
Kohlmeier, Eduard Hovy, and Roy Schwartz.
2018. A dataset of peer reviews (PeerRead):
Collection, insights and NLP applications. In
Proceedings of NAACL-HLT. https://doi.org/10.18653/v1/N18-1149

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of NAACL-HLT.

Adji B. Dieng, Francisco J. R. Ruiz, and
David Blei. 2020. Topic modeling in embedding
spaces. Transactions of the Association
for Computational Linguistics, 8:439–453.
https://doi.org/10.1162/tacl_a_00325

Jacob Eisenstein, Amr Ahmed, and Eric P. Xing.
2011. Sparse additive generative models of
text. In Proceedings of ICML.

Shi Feng, Eric Wallace, Alvin Grissom II,
Mohit Iyyer, Pedro Rodriguez, and Jordan
Boyd-Graber. 2018. Pathologies of neural
models make interpretations difficult. In Proceedings
of EMNLP. https://doi.org/10.18653/v1/D18-1407

Justin Grimmer and Brandon M. Stewart. 2013.
Text as data: The promise and pitfalls of
automatic content analysis methods for polit-
ical texts. Political Analysis, 21(3):267–297.
https://doi.org/10.1093/pan/mps028

Junxian He, Zhiting Hu, Taylor Berg-Kirkpatrick,
Ying Huang, and Eric P. Xing. 2017. Ef-
ficient correlated topic modeling with topic
embedding. In Proceedings of KDD.

Geoffrey E. Hinton. 2002. Training products
of experts by minimizing contrastive diver-
gence. Neural Computation, 14(8):1771–1800.
https://doi.org/10.1162/089976602760128018, PubMed: 12180402

Alon Jacovi and Yoav Goldberg. 2020. Towards
faithfully interpretable NLP systems: How
should we define and evaluate faithfulness?
In Proceedings of ACL. https://doi.org/10.18653/v1/2020.acl-main.386

Sarthak Jain and Byron C. Wallace. 2019. At-
tention is not explanation. In Proceedings of
NAACL-HLT.

Pieter-Jan Kindermans, Sara Hooker, Julius
Adebayo, Maximilian Alber, Kristof T. Schütt,
Sven Dähne, Dumitru Erhan, and Been Kim.
2019. The (un)reliability of saliency methods.
In Explainable AI: Interpreting, Explaining
and Visualizing Deep Learning, pages 267–280.
Springer. https://doi.org/10.1007/978-3-030-28954-6_14

Diederik P. Kingma and Jimmy Ba. 2015.
Adam: A method for stochastic optimization.
In Proceedings of ICLR.

Diederik P. Kingma and Max Welling. 2013.
Auto-encoding variational Bayes. arXiv pre-
print arXiv:1312.6114.

Klaus Krippendorff. 2018. Content Analysis:
An Introduction to its Methodology. Sage
publications. https://doi.org/10.4135/9781071878781

Simon Lacoste-Julien, Fei Sha, and Michael I.
Jordan. 2008. DiscLDA: Discriminative learn-
ing for dimensionality reduction and classifica-
tion. In Proceedings of NeurIPS.

Hugo Larochelle and Stanislas Lauly. 2012. A
neural autoregressive topic model. In Proceed-
ings of NeurIPS.

Jey Han Lau, Timothy Baldwin, and Trevor
Cohn. 2017. Topically driven neural language
model. In Proceedings of ACL.

Jon D. McAuliffe and David M. Blei. 2008.
Supervised topic models. In Proceedings of
NeurIPS.

P. McCullagh and J. A. Nelder. 1989. Generalized
Linear Models. Chapman and Hall.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016.
Neural variational inference for text processing.
In Proceedings of ICML.

Viet-An Nguyen, Jordan Boyd-Graber, Philip
Resnik, and Kristina Miler. 2015. Tea party
in the house: A hierarchical ideal point topic
model and its application to republican legislators
in the 112th Congress. In Proceedings
of ACL. https://doi.org/10.3115/v1/P15-1139

Viet-An Nguyen, Jordan L. Boyd-Graber, and Philip
Resnik. 2013. Lexical and hierarchical topic
regression. In Proceedings of NeurIPS.

Daniel Ramage, David Hall, Ramesh Nallapati,
and Christopher D. Manning. 2009. Labeled
LDA: A supervised topic model for credit attribution
in multi-labeled corpora. In Proceedings
of EMNLP. https://doi.org/10.3115/1699510.1699543

Margaret E. Roberts, Brandon M. Stewart,
Dustin Tingley, Christopher Lucas, Jetson
Leder-Luis, Shana Kushner Gadarian, Bethany
Albertson, and David G. Rand. 2014. Structural
topic models for open-ended survey responses.
American Journal of Political Science,
58(4):1064–1082. https://doi.org/10.1111/ajps.12103

Michal Rosen-Zvi, Thomas Griffiths, Mark
Steyvers, and Padhraic Smyth. 2012. The
author-topic model for authors and documents.
In Proceedings of UAI.

Cynthia Rudin. 2019. Stop explaining black
box machine learning models for high stakes
decisions and use interpretable models instead.
Nature Machine Intelligence, 1(5):206–215.
https://doi.org/10.1038/s42256-019-0048-x

Sofia Serrano and Noah A. Smith. 2019. Is
attention interpretable? In Proceedings of ACL.

Akash Srivastava and Charles Sutton. 2017.
Autoencoding variational inference for topic
models. In Proceedings of ICLR. https://doi.org/10.18653/v1/P19-1282

Matt Taddy. 2013. Multinomial inverse regres-
sion for text analysis. Journal of the Ameri-
can Statistical Association, 108(503):755–770.
https://doi.org/10.1080/01621459.2012.734168

Keyon Vafa, Suresh Naidu, and David M. Blei.
2020. Text-based ideal points. In Proceedings
of ACL.
