Weakly Supervised Domain Detection
Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
yumo.xu@ed.ac.uk, mlap@inf.ed.ac.uk
Abstract
In this paper we introduce domain detection
as a new natural language processing task.
We argue that the ability to detect textual seg-
ments that are domain-heavy (i.e., sentences or
phrases that are representative of and provide
evidence for a given domain) could enhance the
robustness and portability of various text clas-
sification applications. We propose an encoder-
detector framework for domain detection and
bootstrap classifiers with multiple instance learn-
ing. The model is hierarchically organized and
suited to multilabel classification. We demon-
strate that despite learning with minimal super-
vision, our model can be applied to text spans
of different granularities, languages, and gen-
res. We also showcase the potential of domain
detection for text summarization.
1 Introduction
Text classification is a fundamental task in
Natural Language Processing (NLP) that has been
found useful in a wide spectrum of applications,
including search engines that help users
identify content on Web sites, sentiment and
social media analysis, customer relationship
management systems, and spam detection. Over the
past several years, text classification has been pre-
dominantly modeled as a supervised learning
problem (e.g., Kim, 2014; McCallum and Nigam,
1998; Iyyer et al., 2015) for which appropriately
labeled data must be collected. Such data are
often domain-dependent (i.e., covering specific
topics such as those relating to ‘‘Business’’ or
‘‘Medicine’’) and a classifier trained using data
from one domain is likely to perform poorly on
another. For example, the phrase ‘‘the mouse
died quickly’’ may indicate negative sentiment in
a customer review describing the hand-held
pointing device or positive sentiment when
describing a laboratory experiment performed on
a rodent. The ability to handle a wide variety of
domains1 has become more pertinent with the rise
of data-hungry machine learning techniques like
neural networks and their application to a plethora
of textual media ranging from news articles to
Twitter, blog posts, medical
journals, Reddit
comments, and parliamentary debates (Kim, 2014;
Yang et al., 2016; Conneau et al., 2017; Zhang
et al., 2016).
The question of how to best deal with multiple
domains when training data are available for one
or few of them has met with much interest in the
literature. The field of domain adaptation (Jiang
and Zhai, 2007; Blitzer et al., 2006; Daume III,
2007; Finkel and Manning, 2009; Lu et al., 2016)
aims at improving the learning of a predictive
function in a target domain where there is little or
no labeled data, using knowledge transferred from
a source domain where sufficient labeled data are
available. Another line of work (Li and Zong,
2008; Wu and Huang, 2015; Chen and Cardie,
2018) assumes that labeled data may exist for
multiple domains, but in insufficient amounts to
train classifiers for one or more of them. The aim of
multi-domain text classification is to leverage all
the available resources in order to improve system
performance across domains simultaneously.
In this paper we investigate the question of how
domain-specific data might be obtained in order
to enable the development of text classification
tools as well as more domain aware applications
such as summarization, question answering, and
1The term ‘‘domain’’ has been permissively used in the
literature to describe (a) a collection of documents related to a
particular topic such as user reviews on Amazon for a product
category (e.g., books, movies), (b) a type of information
source (e.g., Twitter, news articles), and (c) various fields
of knowledge (e.g., Medicine, Law, Sport). In this paper we
adopt the last definition of domains, although nothing in
our approach precludes applying it to different domain labels.
Transactions of the Association for Computational Linguistics, vol. 7, pp. 581–596, 2019. https://doi.org/10.1162/tacl_a_00287
Action Editor: Yusuke Miyao. Submission batch: 2/2019; Revision batch: 5/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
information extraction. We refer to this task as
domain detection and assume a fairly common
setting where the domains of a corpus collection
are known and the aim is to identify textual
segments that are domain-heavy (i.e., documents,
sentences, or phrases providing evidence for a
given domain).
Domain detection can be formulated as a
multilabel classification problem, where a model
is trained to recognize domain evidence at the
sentence-, phrase-, or word-level. By definition
then, domain detection would require training
data with fine-grained domain labels, thereby
increasing the annotation burden; we must provide
labels for training domain detectors and for
modeling the task we care about in the first place.
In this paper we consider the problem of fine-
grained domain detection from the perspective
of Multiple Instance Learning (MIL; Keeler and
Rumelhart, 1992) and develop domain models
with very little human involvement. Instead of
learning from individually labeled segments, our
model only requires document-level supervision
and optionally prior domain knowledge and
learns to introspectively judge the domain of
constituent segments. Importantly, we do not
require document-level domain annotations either
because we obtain these via distant supervision
by leveraging information drawn from Wikipedia.
Our domain detection framework comprises
two neural network modules: an encoder learns
representations for words and sentences together
with prior domain information if the latter is
available (e.g., domain definitions), and a detector
generates domain-specific scores for words,
sentences, and documents. We obtain a segment-
level domain predictor that is trained end-to-end
on document-level labels using a hierarchical,
attention-based neural architecture (Vaswani et al.,
2017). We conduct domain detection experiments
on English and Chinese and measure system
performance using both automatic and human-
based evaluation. Experimental results show that
our model outperforms several strong baselines
and is robust across languages and text genres,
despite learning from weak supervision. We also
showcase our model’s application potential for text
summarization.
Our contributions in this work are threefold: we
propose domain detection as a new fine-grained
multilabel learning problem, which we argue
would benefit the development of domain-aware
NLP tools; we introduce a weakly supervised
encoder-detector model within the context of
multiple instance learning; and we demonstrate
that it can be applied across languages and text
genres without modification.
2 Related Work
Our work lies at
the intersection of multiple
research areas, including domain adaptation, rep-
resentation learning, multiple instance learning,
and topic modeling. We review related work
below.
Domain adaptation A variety of domain
adaptation methods (Jiang and Zhai, 2007; Arnold
et al., 2007; Pan et al., 2010) have been pro-
posed to deal with the lack of annotated data
in novel domains faced by supervised models.
Daume and Marcu (2006) propose to learn three
separate models, one specific to the source do-
main, one specific to the target domain, and a third
one representing domain general
information.
A simple yet effective feature augmentation
technique is further introduced in Daume (2007)
which Finkel and Manning (2009) subsequently
recast within a hierarchical Bayesian framework.
More recently, Lu et al. (2016) present a gen-
eral regularization framework for domain ad-
aptation while Camacho-Collados and Navigli
(2017)
integrate domain information within
lexical resources. A popular approach within text
classification learns features that are invariant
across multiple domains while explicitly modeling
the individual characteristics of each domain
(Chen and Cardie, 2018; Wu and Huang, 2015;
Bousmalis et al., 2016).
Similar to domain adaptation, our detection task
also identifies the most discriminant features for
different domains. However, whereas adaptation
aims to render models more portable by trans-
ferring knowledge, detection focuses on the
domains themselves and identifies the textual
segments that provide the best evidence for
their semantics, making it possible to create data sets with
explicit domain labels to which domain adaptation
techniques can be further applied.
Multiple instance learning MIL handles
problems where labels are associated with groups
or bags of instances (documents in our case), while
instance labels (segment-level domain labels) are
unobserved. The task is then to make aggregate
instance-level predictions, by inferring labels
either for bags (Keeler and Rumelhart, 1992;
Dietterich et al., 1997; Maron and Ratan, 1998) or
jointly for instances and bags (Zhou et al., 2009;
Wei et al., 2014; Kotzias et al., 2015). Our domain
detection model is an example of the latter variant.
Initial MIL models adopted a relatively strong
consistency assumption between bag labels and
instance labels. For instance, in binary classi-
fication, a bag was considered positive only if
all its instances were positive (Dietterich et al.,
1997; Maron and Ratan, 1998; Zhang et al.,
2002; Andrews and Hofmann, 2004; Carbonetto
et al., 2008). The assumption was subsequently
relaxed by investigating prediction combinations
(Weidmann et al., 2003; Zhou et al., 2009).
Within NLP, multiple instance learning has
been predominantly applied to sentiment analysis.
Kotzias et al. (2015) use sentence vectors obtained
by a pre-trained hierarchical convolutional neu-
ral network (Denil et al., 2014) as features under
a MIL objective that simply averages instance
contributions towards bag classification (i.e., positive/
negative document sentiment). Pappas and Popescu-
Belis (2014) adopt a multiple instance regression
model
to assign sentiment scores to specific
product aspects, using a weighted summation of
predictions. More recently, Angelidis and Lapata
(2018) propose MILNET, a multiple instance learn-
ing network model for sentiment analysis. They
use an attention mechanism to flexibly weigh
predictions and recognize sentiment-heavy text
snippets (i.e., sentences or clauses).
We depart from previous MIL-based work in
devising an encoding module with self-attention
and non-recurrent structure, which is particularly
suitable for modeling long documents efficiently.
Compared with MILNET (Angelidis and Lapata,
2018), our approach generalizes to segments of
arbitrary granularity; it introduces an instance
scoring function that supports multilabel rather
than binary classification,
and takes prior
knowledge into account (e.g., domain definitions)
to better inform the model’s predictions.
Topic modeling Topic models are built around
the idea that the semantics of a document col-
lection is governed by latent variables. The aim
is therefore to uncover these latent variables—
topics—that shape the meaning of the document
collection. Latent Dirichlet Allocation (LDA; Blei
et al. 2003) is one of the best-known topic models.
In LDA, documents are generated probabilistically
using a mixture over K topics, each of which is in turn
characterized by a distribution over words.
Words in a document are generated by repeatedly
sampling a topic according to the topic distribution
and selecting a word given the chosen topic.
Although most topic models are unsupervised,
some variants can also accommodate document-
level supervision (Mcauliffe and Blei, 2008;
Lacoste-Julien et al., 2009). However,
these
models are not appropriate for analyzing multiply
labeled corpora because they limit documents
to being associated with a single label. Multi-
multinomial LDA (Ramage et al. 2009b) relaxes
this constraint by modeling each document as a
bag of words with a bag of labels, and topics
for each observation are drawn from a shared
topic distribution. Labeled LDA (L-LDA; Ramage
et al., 2009a) goes one step further by directly
associating labels with latent topics, thereby
learning label-word correspondences. L-LDA is
a natural extension of both LDA, by incorporating
supervision, and multinomial naive Bayes
(McCallum and Nigam, 1998), by incorporating
a mixture model (Ramage et al., 2009a).
Similar to L-LDA, DETNET is also designed
to perform learning and inference in multi-label
settings. Our model adopts a more general solu-
tion to the credit attribution problem (i.e., the
association of textual units in a document with
semantic tags or labels). Despite learning from a
weak and distant signal, our model can produce
domain scores for text spans of varying granular-
ity (e.g., sentences and phrases), not just words,
and achieves this with a hierarchically-organized
neural architecture. Aside from learning through
efficient backpropagation, the proposed framework
can incorporate useful prior information
(e.g., pertaining to the labels and their meaning).
3 Problem Formulation
We formulate domain detection as a multilabel
learning problem. Our model
is trained on
samples of document-label pairs. Each document
consists of s sentences x = {x1, . . . , xs}
and is associated with discrete labels y =
{y(c)|c ∈ [1, C]}. In this work, domain labels
are not annotated manually but extrapolated from
Wikipedia (see Section 6 for details). In a non-
MIL framework, a model
typically learns to
predict document labels by directly conditioning
4 The Encoder Module
We learn representations for words and sentences
using identical encoders with separate learning
parameters. Given a document, the two encoders
implement the following steps:
Z, α = WORDENC(X)
G = [g1; . . . ; gs] where g = Zα
H, β = SENTENC(G)
For each sentence X = [x1; . . . ; xn],
the
word-level encoder yields contextualized word
representations Z and their attention weights α.
Sentence embeddings g are obtained via weighted
averaging and then provided as input
to the
sentence-level encoder, which outputs contex-
tualized representations H and their attention
weights β.
In this work we aim to model fairly long doc-
uments (e.g., Wikipedia articles; see Section 6 for
details). For this reason, our encoder builds on the
Transformer architecture (Vaswani et al., 2017), a
recently proposed highly efficient model that has
achieved state-of-the-art performance in machine
translation (Vaswani et al., 2017) and question
answering (Yu et al., 2018). The Transformer
aims at reducing the fundamental constraint of se-
quential computation that underlies most archi-
tectures based on recurrent neural networks. It
eliminates recurrence in favor of applying a
self-attention mechanism which directly models
relationships between all words in a sentence,
regardless of their position.
Self-attentive encoder As shown in Figure 2,
the Transformer is a non-recurrent framework
comprising m identical layers. Information on the
(relative or absolute) position of each token in a
sequence is represented by the use of positional
encodings which are added to input embeddings
(see the bottom of Figure 2). We denote position-
augmented inputs in a sentence with X. Our
model uses four layers in both word and sentence
encoders. The first three layers are identical to
those in the Transformer (m = 3), comprising a
multi-head self-attention sublayer and a position-
wise fully connected feed-forward network. The
last layer is simply a multi-head self-attention
layer yielding attention weights for subsequent
operations.
Single-head attention takes three parameters as
input in the Transformer (Vaswani et al., 2017):
Figure 1: Overview of DETNET. The encoder learns
document representations in a hierarchical fashion and
the detector generates domain scores, while selectively
attending to previously encoded information. Prior
information can be optionally incorporated when
available at the encoding stage through parameter
sharing.
on its sentence representations h1, . . . , hs or
their aggregate. In contrast, y under MIL is a
learned function fθ of latent instance-level labels,
that is, y = fθ(y1, . . . , ys). A MIL classifier will
therefore first produce domain scores for all instances
(i.e., sentences), and then learn to integrate
instance scores into a bag (i.e., document)
prediction.
In this paper we further assume that the
instance-bag relation applies not only to sentences and
documents but also to words and sentences. In
addition, we incorporate prior domain information
to facilitate learning in a weakly supervised
setting: Each domain is associated with a
definition U (c), namely, a few sentences providing
a high-level description of the domain at hand.
For example, the definition of the ‘‘Lifestyle’’
domain is ‘‘the interests, opinions, behaviors, and
behavioral orientations of an individual, group, or
culture’’.
Figure 1 provides an overview of our Domain
Detection Network, which we call DETNET.
The model includes two modules; an encoder
learns representations for words and sentences
while incorporating prior domain information;
a detector generates domain scores for words,
sentences, and documents by selectively attending
to previously encoded information. We describe
the two modules in more detail below.
in the word encoder yields a set of attention
matrices $A = \{A^{(k)}\}_{k=1}^{r}$ for each sentence, where
$A^{(k)} \in \mathbb{R}^{n \times n}$. Therefore, when measuring the
contributions from words to sentences (e.g., in
terms of domain scores and representations) we
can selectively focus on salient words within the
set $A$:
\[ \alpha = \mathrm{softmax}\left( \frac{1}{\sqrt{nr}} \sum_{k}^{r} \sum_{\ell}^{n} A^{(k)}_{\ast,\ell} \right) \tag{5} \]
where the softmax function outputs the salience
distribution over words:
\[ \mathrm{softmax}(a_{\ell}) = \frac{e^{a_{\ell}}}{\sum_{a_{\ell'} \in a} e^{a_{\ell'}}} \tag{6} \]
We then obtain sentence embeddings as $g = Z\alpha$.
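As an illustration only, the salience computation in Equation (5) and the weighted averaging g = Zα can be sketched in plain Python. The function names `salience` and `weighted_average` are hypothetical, the matrices are nested lists, and we assume that summing A(k) over its second index yields one pooled score per word; the released implementation may organize this differently.

```python
import math

def salience(score_matrices):
    """Equation (5): softmax over head- and column-summed scores.

    score_matrices: list of r matrices, each n x n (a list of rows).
    Assumption: entry i of the pooled vector is
    sum_k sum_l A^(k)[i][l] / sqrt(n * r).
    """
    r = len(score_matrices)
    n = len(score_matrices[0])
    scale = math.sqrt(n * r)
    pooled = [sum(sum(A[i]) for A in score_matrices) / scale for i in range(n)]
    # Numerically stable softmax, as in Equation (6).
    top = max(pooled)
    exps = [math.exp(v - top) for v in pooled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(Z, alpha):
    """g = Z alpha, with Z stored as d_z rows of n word entries."""
    return [sum(z * a for z, a in zip(row, alpha)) for row in Z]
```

The same two functions would apply unchanged at the sentence level, with the score matrices B(k) in place of A(k).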
In the same vein, we adopt another self-attentive
encoder to obtain contextualized sentence representations
$H \in \mathbb{R}^{d_h \times s}$. The final layer outputs
multi-head attention score matrices
$B = \{B^{(k)}\}_{k=1}^{r}$ (with $B^{(k)} \in \mathbb{R}^{s \times s}$), and we calculate
sentence salience as:
\[ \beta = \mathrm{softmax}\left( \frac{1}{\sqrt{sr}} \sum_{k}^{r} \sum_{j}^{s} B^{(k)}_{\ast,j} \right). \tag{7} \]
Prior information In addition to documents
(and their domain labels), we might have some
prior knowledge about the domain, for example,
its general semantic content and the various topics
related to it. For example, we might expect articles
from the ‘‘Lifestyle’’ domain to not talk about
missiles or warfare, as these are recurrent themes
in the ‘‘Military’’ domain. As mentioned earlier,
throughout this paper we assume we have domain
definitions U expressed in a few sentences as prior
knowledge. Domain definitions share parameters
with WORDENC and SENTENC and are encoded in a
definition matrix U ∈ Rdh×C.
Intuitively, identifying the domain of a word
might be harder than that of a sentence; on ac-
count of being longer and more expressive, sen-
tences provide more domain-related cues than
words whose meaning often relies on supporting
context. We thus inject domain definitions U into
our word detector only.
Figure 2: Self-attentive encoder in Transformer
(Vaswani et al., 2017) stacking m identical layers.
a query matrix, a key matrix, and a value matrix.
These three matrices are identical and equal to the
inputs X at the first layer of the word encoder.
The output of a single-head attention is calculated
via:
\[ \mathrm{head}(X, X, X) = \mathrm{softmax}\left( \frac{X X^{\top}}{\sqrt{d_x}} \right) X. \tag{1} \]
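A minimal sketch of Equation (1) may help make the computation concrete. It assumes X is stored with one token vector per row (n rows of length d_x); the helper names `softmax` and `self_attention_head` are hypothetical and the code is not taken from the released implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_head(X):
    """Single-head attention with query = key = value = X.

    X: list of n token vectors, each of length d_x.
    Returns (output, weights): output is n x d_x, weights is n x n.
    """
    n, d_x = len(X), len(X[0])
    scale = math.sqrt(d_x)
    weights = []
    for i in range(n):
        # Scaled dot products between token i and every token j.
        scores = [sum(X[i][t] * X[j][t] for t in range(d_x)) / scale
                  for j in range(n)]
        weights.append(softmax(scores))
    output = [[sum(weights[i][j] * X[j][t] for j in range(n))
               for t in range(d_x)]
              for i in range(n)]
    return output, weights
```

Each row of `weights` is a distribution over all positions in the sentence, which is what allows every token to attend to every other token regardless of distance.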
Multi-head attention allows us to jointly at-
tend to information from different representation
subspaces at different positions. This is done by
first applying different linear projections to inputs
and then concatenating them:
\[ \mathrm{head}^{(k)} = \mathrm{head}(X W_1^{(k)}, X W_2^{(k)}, X W_3^{(k)}) \tag{2} \]
\[ \text{multi-head} = \mathrm{concat}(\mathrm{head}^{(1)}, \ldots, \mathrm{head}^{(r)}) W_4 \tag{3} \]
where we adopt four heads (r = 4) for both
word and sentence encoders. The second sublayer
in the Transformer (see Figure 2) is a fully-
connected feed-forward network applied to each
position separately and identically:2
\[ \mathrm{FFN}(x) = \max(0, x W_5) W_6. \tag{4} \]
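For a single position, Equation (4) amounts to a ReLU-activated linear map followed by a second linear map. The sketch below is illustrative only; the function name `ffn` and the list-of-rows layout of the weight matrices are assumptions, and the bias terms are omitted as in the equation.

```python
def ffn(x, W5, W6):
    """Position-wise feed-forward network: max(0, x W5) W6.

    x: vector of length d; W5: d x d_ff; W6: d_ff x d_out (lists of rows).
    """
    hidden = [max(0.0, sum(x[i] * W5[i][j] for i in range(len(x))))
              for j in range(len(W5[0]))]
    return [sum(hidden[j] * W6[j][k] for j in range(len(hidden)))
            for k in range(len(W6[0]))]
```

Because the same weights are applied to every position independently, the function can simply be mapped over all token vectors of a sentence.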
After sequentially encoding input embeddings
through the first three layers, we obtain contextualized
word representations $Z \in \mathbb{R}^{d_z \times n}$.
Based on $Z$, the last multi-head attention layer
2We omit here the bias term for the sake of simplicity.
5 The Detector Module
DETNET adopts three detectors corresponding to
words, sentences, and documents:
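\[ P = \mathrm{WORDDET}(Z, U) \]
\[ Q^{\mathrm{instc}} = [q^{\mathrm{instc}}_1; \ldots; q^{\mathrm{instc}}_s] \quad \text{where } q^{\mathrm{instc}} = P\alpha \]
\[ Q = \mathrm{SENTDET}(Q^{\mathrm{instc}}, H) \]
\[ \tilde{y} = \mathrm{DOCDET}(Q, \beta) \]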
WORDDET first produces word domain scores
using both lexical semantic information Z and
prior (domain) knowledge U ; SENTDET yields
domain scores for sentences while integrating
downstream instance signals Qinstc and sentence
semantics H; finally, DOCDET makes the final
document-level predictions based on sentence
scores.
Word detector Our first detector yields word
domain scores. For a sentence, we obtain a self-
scoring matrix P self using its own contextual
word semantic information:
\[ P^{\mathrm{self}} = \tanh(W_z Z). \tag{8} \]
In contrast
to the representations used in
Angelidis and Lapata (2018), we generate instance
scores from contextualized representations, that is,
Z. Because the softmax function normally favors
single-mode outputs, we adopt tanh(·) ∈ (−1, 1)
as our domain scoring function to tailor MIL to
our multilabel scenario.
As mentioned earlier, we use domain definitions
as prior information at the word level and compute
the prior score via:
\[ P^{\mathrm{prior}} = \tanh(\max(0, U^{\top} W_u) Z) \tag{9} \]
where $W_u \in \mathbb{R}^{d_u \times d_z}$ projects prior information
$U$ onto the input semantic space. The prior score
matrix $P^{\mathrm{prior}}$ captures the interactions between
domain definitions and sentential contents.
In this work, we flexibly integrate scoring com-
ponents with gates, as shown in Figure 3. The key
idea is to learn a prior gate Γ balancing Equations (8)
and (9) via:
\[ \Gamma = \gamma \sigma(W_{g,p}[Z, P^{\mathrm{self}}, P^{\mathrm{prior}}]) \tag{10} \]
\[ P = \Gamma \odot P^{\mathrm{prior}} + (J - \Gamma) \odot P^{\mathrm{self}} \tag{11} \]
Figure 3: Domain predictions for words and sentences;
the instance-bag relation applies to words-sentences
(red shadow) and sentences-documents (green shadow).
Squares denote representations of words or sentences,
and circles are domain scores.
Here, ⊙ denotes element-wise multiplication and [·, ·]
matrix concatenation. σ(·) ∈ (0, 1) is the sigmoid
function and Γ ∈ (0, γ) the prior gate with scaling
factor γ, a hyperparameter controlling the overall
effect of prior information and instances.3
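The gating step in Equations (10) and (11) can be illustrated with a small element-wise sketch. It assumes the gate pre-activations W_{g,p}[Z, P^self, P^prior] have already been computed and are passed in as `gate_logits`; the function names are hypothetical, and the default γ = 0.1 follows the value reported in the implementation details.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_scores(gate_logits, p_prior, p_self, gamma=0.1):
    """Mix prior and self scores element-wise for C x n matrices.

    gate_logits stands in for W_{g,p}[Z, P_self, P_prior] (assumed given).
    Each gate value lies in (0, gamma), so the self scores dominate
    whenever gamma is small.
    """
    mixed = []
    for g_row, prior_row, self_row in zip(gate_logits, p_prior, p_self):
        row = []
        for g, pp, ps in zip(g_row, prior_row, self_row):
            gate = gamma * sigmoid(g)
            row.append(gate * pp + (1.0 - gate) * ps)
        mixed.append(row)
    return mixed
```

Capping the gate at γ bounds how far prior information can move a score away from the evidence in the text itself, which is one way to read the role of the scaling factor.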
Sentence detector The second detector iden-
tifies sentences with domain-heavy semantics
based on signals from the sentence encoder, prior
information, and word instances. Again, we obtain
a self-scoring matrix Qself via:
\[ Q^{\mathrm{self}} = \tanh(W_h H). \tag{12} \]
After computing sentence scores from sentence-level
signals, we estimate domain scores from
individual words. We do this by reusing α in
Equation (5): $q^{\mathrm{instc}} = P\alpha$. After gathering $q^{\mathrm{instc}}$
for each sentence, we obtain $Q^{\mathrm{instc}} \in \mathbb{R}^{C \times s}$ as
the full instance score matrix.
Analogously to the word-level detector (see
Equation (10)), we use a sentence-level upward
gate Λ to dynamically propagate domain scores
from downstream word instances to sentence bags:
\[ \Lambda = \lambda \sigma(W_{\ell}[H, Q^{\mathrm{instc}}, Q^{\mathrm{self}}]) \tag{13} \]
\[ Q = \Lambda \odot Q^{\mathrm{instc}} + (J - \Lambda) \odot Q^{\mathrm{self}} \tag{14} \]
where Q is the final sentence score matrix.
In Equation (11), J is an all-ones matrix and $P \in \mathbb{R}^{C \times n}$ is
the final domain score matrix at the word level.
Document detector Document-level domain
scores are based on the sentence salience distribution
β (see Equation (7)) and are computed as
3Initially, we expected to balance these effects by purely
relying on the learned function without a scaling factor. This
led to poor performance, however.
                          Wiki-en     Wiki-zh
All Documents              31,562      26,280
Training Documents         25,562      22,280
Development Documents       3,000       2,000
Test Documents              3,000       2,000
Multilabel Ratio           10.18%      29.73%
Average #Words           1,152.08      615.85
Vocabulary Size           175,555     169,179
Synthetic Documents           200         200
Synthetic Sentences        18,922      18,312
Table 1: Statistics of Wikipedia data sets; en
and zh are shorthand for English and Chinese,
respectively. Synthetic documents and sentences
are used in our automatic evaluation experiments
discussed in Section 7.
the weighted average of sentence scores:
\[ \tilde{y} = Q\beta. \tag{15} \]
We use only document-level supervision for
multilabel learning in C domains. Formally, our
training objective is:
\[ \mathcal{L} = \min\; -\frac{1}{N} \sum_{i}^{N} \sum_{c}^{C} \log\left(1 + e^{-\tilde{y}^{(i)}_c y^{(i)}_c}\right) \tag{16} \]
where N is the training set size. At test time,
we partition domains into a relevant set and an
irrelevant set for unseen samples. Because the
domain scoring function is tanh(·) ∈ (−1, 1), we
use a threshold of 0 against which $\tilde{y}$ is calibrated.4
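The test-time procedure can be sketched as follows: document scores are the salience-weighted average of sentence scores (Equation (15)), domains scoring above 0 are treated as relevant, and if no domain qualifies the highest-scoring one is selected, as described in footnote 4. The function names are hypothetical, and treating a score of exactly 0 as irrelevant is our assumption.

```python
def document_scores(Q, beta):
    """y~ = Q beta, with Q stored as C rows of s sentence scores."""
    return [sum(q * b for q, b in zip(row, beta)) for row in Q]

def predict_domains(y_tilde, threshold=0.0):
    """Return indices of relevant domains.

    Falls back to the arg max when every score is at or below the
    threshold, so that each document receives at least one domain.
    """
    relevant = [c for c, score in enumerate(y_tilde) if score > threshold]
    if not relevant:
        relevant = [max(range(len(y_tilde)), key=lambda c: y_tilde[c])]
    return relevant
```

The same thresholding can be applied to the columns of Q or P to read off sentence-level or word-level domain predictions from a model trained only on document labels.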
6 Experimental Set-up
Data sets DETNET was trained on two data sets
created from Wikipedia5 for English and Chinese.6
Wikipedia articles are organized according to a
hierarchy of categories representing the defining
characteristics of a field of knowledge. We re-
cursively collect Wikipedia pages by first deter-
mining the root categories based on their match
with the domain name. We then obtain their
subcategories,
the subcategories of these sub-
categories, and so on. We treat all pages associated
with a category as representative of the domain of
its root category.
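The recursive collection of pages can be pictured as a breadth-first walk over the category hierarchy. The sketch below is an illustration under stated assumptions: `subcategories` and `pages` are hypothetical in-memory dictionaries standing in for a Wikipedia dump, and the visited set is a precaution in case the category graph contains cycles.

```python
from collections import deque

def collect_pages(root, subcategories, pages):
    """Gather all pages filed under a root category or its descendants.

    subcategories: dict mapping a category to its child categories.
    pages: dict mapping a category to the page titles filed under it.
    """
    seen = {root}
    queue = deque([root])
    collected = set()
    while queue:
        category = queue.popleft()
        collected.update(pages.get(category, []))
        for child in subcategories.get(category, []):
            if child not in seen:   # avoid revisiting categories
                seen.add(child)
                queue.append(child)
    return collected
```

Every page returned for a root category would then be labeled with that root's domain; a page reachable from several roots would receive several labels.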
4If $\forall c \in [1, C] : \tilde{y}_c < 0$ holds, we set $\tilde{y}_{c^*} = 1$ and select
$c^*$ as $c^* = \arg\max_c \tilde{y}_c$ to produce a positive prediction.
5http://static.wikipedia.org/downloads/
Algorithm 1 Document Generation
Input: S = {S_d}, d = 1..D: Label combinations
       O = {O_d}, d = 1..D: Sentence subcorpora
       ℓ_max: Maximum document length
Output: A synthetic document
function GENERATE(S, O, ℓ_max)
    Generate a document domain set S^doc ∈ S
    S^sent ← S^doc ∪ {GEN}
    if |S^sent| < C then                ▷ Number of domain labels
        Generate a noisy domain ε ∈ Y \ S^sent
        S^sent ← S^sent ∪ {ε}
    end if
    S^cdt ← ∅                           ▷ A set of candidate domain sets
    for S_d ∈ S do
        if S_d ∈ S^sent then
            S^cdt ← S^cdt ∪ {S_d}
        end if
    end for
    n_label ← |S^cdt|                   ▷ Number of unused labels
    n_sent ← ℓ_max                      ▷ Number of sentence blocks
    L ← ∅                               ▷ For generated sentences
    for S^cdt_d ∈ S^cdt do
        θ = min(|O_d|, n_sent + 1 − n_label, 2 n_sent / n_label)
        Generate ℓ_d ∼ Uniform(1, θ)
        Generate ℓ_d sentences L_d ⊆ O_d
        L ← L ∪ L_d
        n_sent ← ℓ_max − |L|
        n_label ← n_label − 1
    end for
    L ← SHUFFLE(L)
    return L
end function
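A runnable reading of Algorithm 1 is sketched below. Several details are our assumptions rather than statements from the paper: label combinations are represented as frozensets, the candidate test is interpreted as "S_d is a subset of S^sent", Uniform(1, θ) is taken to be an inclusive integer draw, and 2 n_sent / n_label is rounded down. The function and parameter names are hypothetical.

```python
import random

def generate_document(label_sets, subcorpora, max_len, all_labels, rng=random):
    """Sketch of Algorithm 1.

    label_sets: list of frozensets (the label combinations S_d).
    subcorpora: dict mapping each frozenset to its sentences (O_d).
    all_labels: the full label inventory Y, including "GEN".
    Returns a shuffled list of (sentence, label_set) pairs.
    """
    doc_labels = rng.choice(label_sets)
    sent_labels = set(doc_labels) | {"GEN"}
    if len(sent_labels) < len(all_labels):
        # Add one noisy domain that the document does not cover.
        noisy = rng.choice(sorted(set(all_labels) - sent_labels))
        sent_labels.add(noisy)
    # Assumption: a combination is a candidate if all its labels are allowed.
    candidates = [s for s in label_sets if s <= sent_labels]
    n_label, n_sent = len(candidates), max_len
    generated = []
    for s in candidates:
        pool = subcorpora[s]
        theta = min(len(pool), n_sent + 1 - n_label, (2 * n_sent) // n_label)
        if theta >= 1:
            count = rng.randint(1, theta)  # inclusive on both ends
            generated.extend((sent, s) for sent in rng.sample(pool, count))
        n_sent = max_len - len(generated)
        n_label -= 1
    rng.shuffle(generated)
    return generated
```

The second argument of the minimum reserves at least one sentence slot for every label combination still to be processed, so the document never exceeds `max_len` sentences as long as `max_len` is at least the number of candidates.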
In our experiments we used seven target
domains: ‘‘Business and Commerce’’ (BUS),
‘‘Government and Politics’’ (GOV), ‘‘Physical
and Mental Health’’ (HEA), ‘‘Law and Order’’
(LAW), ‘‘Lifestyle’’ (LIF), ‘‘Military’’ (MIL), and
‘‘General Purpose’’ (GEN). Exceptionally, GEN
does not have a natural root category. We leverage
Wikipedia’s 12 Main Categories7 to ensure that
GEN is genuinely different from the other six
domains. We used 5,000 pages for each domain.
Table 1 shows various statistics on our data set.
System comparisons We constructed three
variants of DETNET to explore the contribution
of different model components. DETNET1H has a
single-level hierarchical structure, treating only
sentences as instances and documents as bags;
whereas DETNET2H has a two-level hierarchical
structure (the instance-bag relation applies to
2008-06/en
6Available at https://github.com/yumoxu/detnet
7https://en.wikipedia.org/wiki/Portal:Contents/Categories.
words-sentences and sentences-documents);
finally, DETNET∗ is our full model, which is fully
hierarchical and equipped with prior information
(i.e., domain definitions). We also compared
DETNET to a variety of related systems, which
include:
MAJOR: The Majority domain label applies to
all instances.
L-LDA: Labeled LDA (Ramage et al., 2009a)
is a topic model that constrains LDA by defining a
one-to-one correspondence between LDA’s latent
topics and observed labels. This allows L-LDA to
directly learn word-label correspondences. We
obtain domain scores for words through the
topic-word-count matrix $M \in \mathbb{R}^{C \times V}$, which is
computed during training:
\[ \tilde{M} = \frac{M^{\top} + \beta}{\sum_{c}^{C} M^{\top}_{\ast,c} + C \cdot \beta} \tag{17} \]
where C and V are the number of domain
labels and the size of vocabulary, respectively.
Scalar β is a prior value set to 1/C and
matrix $\tilde{M} \in \mathbb{R}^{V \times C}$ consists of word scores over
domains. Following the snippet extraction approach
proposed in Ramage et al. (2009a),
L-LDA can also be used to score sentences as
the expected probability that the domain label had
generated each word. For more details on L-LDA,
we refer the interested reader to Ramage et al.
(2009a).
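To make Equation (17) concrete, the following sketch normalizes smoothed topic-word counts per word, so that each vocabulary item receives a distribution over domains. The function name `word_scores` and the nested-list layout are assumptions for illustration; the default prior follows the 1/C setting stated above.

```python
def word_scores(M, beta=None):
    """Smoothed per-word normalization of topic-word counts.

    M: C x V count matrix (a list of C rows, one per domain).
    Returns a V x C matrix whose rows each sum to 1.
    """
    C, V = len(M), len(M[0])
    if beta is None:
        beta = 1.0 / C   # prior value used in the text
    scores = []
    for v in range(V):
        total = sum(M[c][v] for c in range(C)) + C * beta
        scores.append([(M[c][v] + beta) / total for c in range(C)])
    return scores
```

A word never seen during training receives a uniform score of 1/C for every domain, which is the effect of the additive prior.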
HIERNET: A hierarchical neural network model
described in Angelidis and Lapata (2018) that
produces document-level predictions by atten-
tively integrating sentence representations. For
this model we used word and sentence encoders
identical to DETNET. HIERNET does not generate
instance-level predictions; however, we assume
that document-level predictions apply to all
sentences.
MILNET: A variant of the MIL-based model
introduced in Angelidis and Lapata (2018) that
considers sentences as instances and documents
as bags (whereas DETNET generalizes the instance-
bag relationship to words and sentences). To make
MILNET comparable to our system, we use an
encoder identical to DETNET, that is, two Transformer
encoders for words and sentences, respectively.
Thus, MILNET differs from DETNET1H in
two respects: (a) word representations are simply
averaged without word-level attention to
build sentence embeddings and (b) context-free
sentence embeddings generate sentence domain
scores before being fed to the sentence encoder.
Implementation details We used 16 shuffled
samples in a batch where the maximum document
length was set to 100 sentences with the ex-
cess clipped. Word embeddings were initialized
randomly with 256 dimensions. All weight matri-
ces in the model were initialized with the fan-
in trick (Glorot and Bengio, 2010) and biases
were initialized with zero. Apart from using layer
normalization (Ba et al., 2016) in the encoders, we
applied batch normalization (Ioffe and Szegedy,
2015) and a dropout rate of 0.1 in the detectors to
accelerate model training. We trained the model
with the Adam optimizer (Kingma and Ba, 2014).
We set all three gate scaling factors in our model
to 0.1. Hyper-parameters were optimized on the
development set. To make our experiments easy
to replicate, we release our PyTorch (Paszke et al.,
2017) source code.8
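As a minimal sketch (not the released implementation), the settings listed above can be gathered in a small configuration object. The class name, field names, and the helper `clip_document` are hypothetical and serve only as illustration; values that the text does not state, such as the learning rate, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyper-parameters quoted in the text; field names are illustrative."""
    batch_size: int = 16          # shuffled samples per batch
    max_sentences: int = 100      # longer documents are clipped
    embedding_dim: int = 256      # randomly initialized word embeddings
    detector_dropout: float = 0.1
    gate_scale: float = 0.1       # shared by all three gates
    optimizer: str = "adam"


def clip_document(sentences, config=TrainingConfig()):
    """Keep at most `max_sentences` sentences and drop the excess."""
    return list(sentences[:config.max_sentences])
```

For instance, `clip_document(["s"] * 150)` would keep the first 100 sentences.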
7 Automatic Evaluation
In this section we present the results of our
automatic evaluation for sentence and document
predictions. Problematically, for sentence pre-
dictions we do not have gold-standard domain
labels (we have only extrapolated these from
Wikipedia for documents). We therefore devel-
oped an automatic approach for creating silver
standard domain labels which we describe below.
Test data generation In order to obtain sen-
tences with domain labels, we exploit lead sen-
tences in Wikipedia articles. Lead sentences
typically define the article’s subject matter and
emphasize its topics of interest.9 As most lead
sentences contain domain-specific content, we
can fairly confidently assume that document-level
domain labels will apply. To validate this assump-
tion, we randomly sampled 20 documents contain-
ing 220 lead sentences and asked two annotators
to label these with domain labels. Annotators
overwhelmingly agreed in their assignments with
the document labels; the (average) agreement was
K = 0.89 using Cohen’s Kappa coefficient.
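To make the agreement statistic concrete, the sketch below computes Cohen's Kappa for two annotators. It assumes each annotation is a single categorical label (multi-label annotations would first have to be mapped to such a label, e.g., per domain); the function name and the toy data are hypothetical and do not reproduce our annotation study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

Values close to 1 indicate agreement well beyond what the marginal label frequencies would produce by chance.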
8 Available at https://github.com/yumoxu/detnet
9 https://en.wikipedia.org/wiki/Lead_paragraph
We used the lead sentences to create pseudo
documents simulating real documents whose sentences
cover multiple domains. To ensure that
sentence labels are combined reasonably (e.g.,
MIL is not likely to coexist with LIF), prior to
generating synthetic documents we traverse the
training set and acquire all domain combinations
S, e.g., S = {{GOV}, {GOV, MIL}}. We then gather
lead sentences representing the same domain
combinations. We generate synthetic documents
with a maximum length of 100 sentences (we also
clip real documents to the same length).
Systems     Sentences           Documents
            zh       en         zh       en
MAJOR       5.99†    2.81†      4.41†    3.81†
L-LDA       37.09†   38.52†     58.74†   63.10†
HIERNET     37.26†   30.01†     68.56†   75.00
MILNET      44.37†   37.12†     69.45†   50.90†
DETNET1H    51.31†   47.93†     72.85    74.91
DETNET2H    52.50†   47.89†     71.96†   75.47
DETNET∗     55.88    54.37      74.24    76.48
Table 2: Performance using Macro-F1% on the
automatically created Wikipedia test set; models
with the symbol † are significantly (p < 0.05)
different from the best system in each task using
the approximate randomization test (Noreen, 1989).
Algorithm 1 shows the pseudocode for docu-
ment generation. We first sample document labels,
then derive candidate label sets for sentences
by introducing GEN and a noisy label (cid:7). After
sampling sentences for each domain, we shuffle
them to achieve domain-varied sentence contexts.
We created two synthetic data sets for English and
Chinese. Detailed statistics are shown in Table 1.
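The generation procedure described above can be sketched roughly as follows. This is a simplified illustration rather than Algorithm 1 itself: the noisy label is omitted, sentence label sets are cycled instead of sampled with the original proportions, and all function and variable names are hypothetical.

```python
import random

GEN = frozenset({"GEN"})  # general-domain label set


def collect_label_sets(training_labels):
    """Domain combinations S observed in the training set."""
    return {frozenset(labels) for labels in training_labels}


def make_synthetic_document(label_sets, lead_sentences, n_sentences,
                            max_len=100, seed=0):
    """Build one pseudo document from lead sentences.

    lead_sentences maps a frozenset of domains to the lead sentences
    carrying exactly that combination.
    """
    rng = random.Random(seed)
    doc_labels = rng.choice(sorted(label_sets, key=sorted))
    # Candidate sentence label sets: combinations contained in the
    # document labels, plus GEN. The noisy label is left out here.
    candidates = sorted(
        (s for s, sents in lead_sentences.items()
         if sents and (s <= doc_labels or s == GEN)),
        key=sorted,
    )
    if not candidates:
        raise ValueError("no lead sentences for the sampled labels")
    document = []
    for i in range(min(n_sentences, max_len)):
        labels = candidates[i % len(candidates)]  # cycle over label sets
        document.append((rng.choice(lead_sentences[labels]), labels))
    rng.shuffle(document)  # mix domains across sentence positions
    return doc_labels, document
```

Restricting document labels to combinations seen in training is what keeps implausible pairings out of the synthetic data.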
Evaluation metric We evaluate system perfor-
mance automatically using label-based Macro-F1
(Zhang and Zhou, 2014), a widely used metric
for multilabel classification. It measures model
performance for each label specifically and then
macro-averages the results. For each class $c$, given a
confusion matrix $\begin{pmatrix} tp_c & fn_c \\ fp_c & tn_c \end{pmatrix}$ containing the number of
samples classified as true positive, false positive,
true negative, and false negative, Macro-F1 is
calculated as $\frac{1}{C} \sum_{c=1}^{C} \frac{2tp_c}{2tp_c + fp_c + fn_c}$, where $C$ is the
number of domain labels.
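The metric can be illustrated with a short sketch. Gold and predicted labels are assumed to be given as one label set per sample; the function name is hypothetical, and scoring a label with no gold and no predicted instances as zero is an assumption, since the text does not specify this case.

```python
def macro_f1(gold, pred, labels):
    """Label-based Macro-F1 for multilabel data.

    gold and pred are equally long lists of label sets, one per sample.
    """
    scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs in gold or pred scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)
```

Because every label contributes equally to the average, rare domains weigh as much as frequent ones.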
Results Our results are summarized in Table 2.
We first
report domain detection results for
documents, since reliable performance on this
task is a prerequisite for more fine-grained domain
detection. As shown in Table 2, DETNET does well
on document-level domain detection, managing
to outperform systems over which it has no clear
advantage (such as HIERNET or MILNET).
As far as sentence-level prediction is concerned,
all DETNET variants significantly outperform all
comparison systems. Overall, DETNET∗ is the best
system achieving 54.37% and 55.88% Macro-F1
on English (en) and Chinese (zh), respectively.
It outperforms MILNET by 17.25 points on English
and 11.51 points on Chinese. The performance of the
fully hierarchical model DETNET2H is better than
DETNET1H, showing positive effects of directly
incorporating word-level domain signals. We also
observe that prior information is generally helpful
on both languages and both tasks.
8 Human Evaluation
Aside from automatic evaluation, we also assessed
model performance against human elicited domain
labels for sentences and words. The purpose of
this experiment was threefold: (a) to validate
the results obtained from automatic evaluation;
(b) to evaluate finer-grained model performance
at the word level; and (c) to examine whether
our model generalizes to non-Wikipedia articles.
For this, we created a third test set from the
New York Times,10 in addition to our Wikipedia-
based English and Chinese data sets. For all three
corpora, we randomly sampled two documents for
each domain, and then from each document, we
sampled one long paragraph or a few consecutive
short paragraphs containing 8–12 sentences.
Amazon Mechanical Turk (AMT) workers were
asked to read these sentences and assign a domain
based on the seven labels used in this paper
(multiple labels were allowed). Participants were
provided with domain definitions. We obtained
five annotations per sentence and adopted the
majority label as the sentence’s domain label.
We obtained two annotated data sets for English
(Wiki-en and NYT-en) and one for Chinese
(Wiki-zh), consisting of 122/14, 111/11, and
117/12 sentences/documents each.
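One possible reading of the aggregation step is sketched below: a label is kept when more than half of the annotators selected it. This strict-majority rule and the function name are assumptions made for illustration; other tie-breaking schemes are equally conceivable.

```python
from collections import Counter


def majority_labels(annotations):
    """Return labels chosen by more than half of the annotators.

    annotations is a list of label sets, one per annotator.
    """
    votes = Counter(label for labels in annotations for label in set(labels))
    threshold = len(annotations) / 2
    return {label for label, n in votes.items() if n > threshold}
```

With five annotators, a label therefore needs at least three votes to be adopted.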
10 https://catalog.ldc.upenn.edu/LDC2008T19
Word-level domain evaluation is more
challenging; taken out-of-context, individual
words might be uninformative or carry meanings
compatible with multiple domains. Expecting
crowdworkers to annotate domain labels word-by-word
with high confidence might therefore be
problematic. In order to reduce annotation complexity,
we opted for a retrieval-style task for
word evaluation. Specifically, AMT workers were
given a sentence and its domain label (obtained
from the sentence-level elicitation study described
above), and asked to highlight which words they
considered consistent with the domain of the
sentence. We used the same corpora/sentences
as in our first AMT study. Analogously, words in
each sentence were annotated by five participants
and their labels were determined by majority
agreement.
Systems     Sentences                     Words
            Wiki-en  Wiki-zh  NYT         Wiki-en  Wiki-zh  NYT
MAJOR       1.34†    6.14†    0.51†       1.39†    14.95†   0.39†
L-LDA       27.81†   28.94†   28.08†      24.58†   42.67    26.24
HIERNET     42.23†   29.93†   44.74†      15.57†   24.25†   18.27†
MILNET      39.30†   45.14†   29.31†      22.11†   33.10†   23.33†
DETNET1H    48.12†   51.76†   57.06†      16.21†   26.90†   21.61†
DETNET2H    54.70†   57.60    55.78†      27.06    43.82    26.52
DETNET∗     58.01    51.28†   60.62       26.08    43.18    27.03
Table 3: System performance using Macro-F1% (test set created via AMT); models with the symbol † are
significantly (p < 0.05) different from the best system in each task using the approximate randomization
test (Noreen, 1989).
Fully hierarchical variants of our model
(i.e., DETNET2H, DETNET∗) and L-LDA are able to
produce word-level predictions; we thus retrieved
the words within a sentence whose domain score
was above the threshold of 0 and compared them
against the labels provided by crowdworkers.
MILNET and DETNET1H can only make sentence-
level predictions. In this case, we assume that
the sentence domain applies to all words therein.
HIERNET can only produce document-level predic-
tions based on which we generate sentence labels
and further assume that these apply to sentence
words too. Again, we report Macro-F1, which we
compute as $\frac{2 p^* r^*}{p^* + r^*}$, where precision $p^*$ and recall $r^*$
are both averaged over all words.
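The retrieval-style scoring can be sketched as follows. The sketch assumes that precision and recall are obtained by pooling word counts over all sentences, which is only one way of reading "averaged over all words"; the function names and input format (one list of word scores and one set of highlighted word positions per sentence) are hypothetical.

```python
def retrieve_words(scores, threshold=0.0):
    """Positions of words whose domain score exceeds the threshold."""
    return {i for i, s in enumerate(scores) if s > threshold}


def word_retrieval_f1(score_lists, gold_sets, threshold=0.0):
    """F1 from precision and recall pooled over all words (an assumption)."""
    tp = n_retrieved = n_gold = 0
    for scores, gold in zip(score_lists, gold_sets):
        retrieved = retrieve_words(scores, threshold)
        tp += len(retrieved & gold)
        n_retrieved += len(retrieved)
        n_gold += len(gold)
    p = tp / n_retrieved if n_retrieved else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For systems without word-level output, every word of a sentence predicted to belong to the domain would be treated as retrieved.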
We show model performance against AMT
domain labels in Table 3. Consistent with the
automatic evaluation results, DETNET variants are
the best performing models on the sentence-level
task. On the Wikipedia data sets, DETNET2H or
DETNET∗ outperform all baselines and DETNET1H by
a large margin, showing that word-level signals
can indeed help detect sentence domains. Although
statistical models are typically less accurate
when they are applied to data that has
a different distribution from the training data,
DETNET∗ works surprisingly well on NYT, substantially
outperforming all other systems. We
also notice that prior information is useful in
making domain predictions for NYT sentences:
Because our models are trained on Wikipedia,
prior domain definitions largely alleviate the genre
shift to non-Wikipedia sentences. Table 4 provides
a breakdown of the performance of DETNET∗
across domains. Overall, the model performs
worst on LIF and GEN domains (which are very
broad) and best on BUS and MIL (which are very
narrow).
Domains   Wiki-en  Wiki-zh  NYT
BUS       78.65    68.66    77.33
HEA       42.11    81.36    64.52
GEN       43.33    37.29    43.90
GOV       80.00    37.74    62.07
LAW       69.77    41.03    46.51
LIF       17.24    27.91    50.00
MIL       75.00    65.00    80.00
Avg       58.01    51.28    60.62
Table 4: Sentence-level DETNET∗ performance
(Macro-F1%) across domains on three data sets.
With regard to word-level evaluation, DETNET2H
and DETNET∗ are the best systems and are significantly
better than all comparison models
by a wide margin, except L-LDA. The latter is
a strong domain detection system at the word-
level since it is able to directly associate words
with domain labels (see Equation (17)) without
resorting to document- or sentence-level predic-
tions. However, our two-level hierarchical model
is superior considering all-around performance
across sentences and documents. The results here
accord with our intuition from previous exper-
iments: hierarchical models outperform simpler
variants (including MILNET) because they are
able to capture and exploit fine-grained domain
signals relatively accurately. Interestingly, prior
information does not seem to have an effect on the
Wikipedia data sets, but is useful when transferring
to NYT. We also observe that models trained
on the Chinese data sets perform consistently
better than those trained on English. Analysis of the annotations
provided by crowdworkers revealed that the ratio
of domain words in Chinese is higher compared
with English (27.47% vs. 13.86% in Wikipedia
and 16.42% in NYT), possibly rendering word
retrieval in Chinese an easier task.
Domains  DETNET∗                                                    L-LDA
BUS      monopolization, enactment, panama, funding, arbitron,      also, business, company, used, one, management,
         maturity, groceries, os, elevator, salary,                 may, business, united, 2007, time, first, new,
         organizations, pietism, contract, mercantilism, sectors    market, new
HEA      psychology, divorce, residence, pilates, dorlands,         also, health, may, used, one, disease, medical,
         culinary, technique, emotion, affiliation, seafood,        use, first, people, 1, many, time, water, care
         famine, malaria, oceans, characters, pregnancy
GEN      gender, destruction, beliefs, schizophrenia, area,         also, one, theory, 1, used, time, two, may, first,
         writers, armor, creativity, propagation,                   example, many, called, form, would, known
         cheminformatics, overpopulation, deity, stimulation,
         mathematical, cosmology
GOV      penology, tenure, governance, alloys, biosecurity,         also, government, political, state, united, party,
         authoritarianism, burundi, motto, criticisms,              one, minister, national, states, first, would, used,
         imperium, mesopotamia, juche, 420, krytocracy, criticism   new, university
LAW      alloys, biosecurity, authoritarianism, mesopotamia,        law, also, united, legal, may, act, states, court,
         electronic, economical, pupil, pathophysiology,            rights, one, case, state, would, v, government
         imperium, phonology, collusion, cantons, auctoritas,
         sigint, juche
LIF      teacher, freight, career, agaricomycetes, casein,          also, used, may, often, one, made, water, food,
         manga, diplogasteria, benefit, pteridophyta,               many, use, usually, called, known, oil, time
         basidiomycota, ascomycota, letters, eukaryota,
         carcinogens, lifespan
MIL      battles, eads, insignia, commanders, artillery, width,     military, war, army, also, air, united, force, states,
         episodes, neurasthenia, reconnaissance, elevation,         one, used, forces, first, royal, british, world
         freedom, length, patrol, manufacturer, demise
Table 5: Top 15 domain words in the Wiki-en development set according to DETNET∗ and L-LDA.
Table 5 shows the 15 most representative domain
words identified by our model (DETNET∗) on
Wiki-en for our seven domains. We obtained this
list by weighting word domain scores P with their
attention scores:
$P^* = P \odot [\alpha; \ldots; \alpha]^\top$ (18)
and ranking all words in the development set
according to P ∗, separately for each domain.
Because words appearing in different contexts
are usually associated with multiple domains, we
determine a word’s ranking for a given domain
based on the highest score. As shown in Table 5,
biosecurity and authoritarianism are prevalent
in both GOV and LAW domains. Interestingly,
with contextualized word representations, fairly
general English words are recognized as domain-
heavy. For example, technique is a strong domain
word in HEA and 420 in GOV (the latter is slang for
the consumption of cannabis and highly associated
with government regulations).
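The ranking step can be sketched as below. The input format is hypothetical: `tokens` holds the words of the development set in order, `scores[d]` holds the corresponding word scores for domain `d` (one row of P), and `attention` holds the attention weight of each token. Each word type keeps its highest weighted score per domain, as described above.

```python
def top_domain_words(tokens, scores, attention, domains, k=15):
    """Rank word types per domain by attention-weighted domain score."""
    ranking = {}
    for d in domains:
        best = {}
        for tok, p, a in zip(tokens, scores[d], attention):
            weighted = p * a  # attention-weighted score of this occurrence
            if tok not in best or weighted > best[tok]:
                best[tok] = weighted  # keep the highest score per word type
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(best, key=lambda t: (-best[t], t))
        ranking[d] = ranked[:k]
    return ranking
```

Since the maximum is taken separately for each domain, the same word can appear in the lists of several domains.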
For comparison, we also show the top domain
words identified by L-LDA via matrix $\tilde{M}$ (see
Equation (17)). To produce meaningful output,
we have removed stop words and punctuation
tokens, which are given very high domain scores
by L-LDA (this is not entirely surprising since
$\tilde{M}$ is based on simple co-occurrence). Notice
that no such post-processing is necessary for our
model. As shown in Table 5, the top domain
words identified by L-LDA (on the right) are
more general and less informative than those
from DETNET∗ (on the left).
9 Domain-Specific Summarization
In this section we illustrate how fine-grained
domain scores can be used to produce domain
summaries, following an extractive, unsupervised
approach. We assume the user specifies the
domains they are interested in a priori (e.g., LAW,
HEA) and the system returns summaries targeting
the semantics of these domains.
Specifically, we introduce DETRANK, an extension
of the well-known TEXTRANK algorithm
(Mihalcea and Tarau, 2004), which incorporates
domain signals acquired by DETNET∗. For each
document, TEXTRANK builds a directed graph
G = (V, E) with nodes V corresponding to sentences,
and edges E whose weights are
computed based on sentence similarity. Edge
weights are represented with matrix
E where each element $E_{i,j}$ corresponds to the
transition probability from vertex i to vertex j.
Following Barrios et al. (2016), $E_{i,j}$ is computed
with the Okapi BM25 algorithm (Robertson et al.,
1995), a probabilistic version of TF-IDF, and small
weights (< 0.001) are set to zero. Unreachable
nodes are further pruned to acquire the final vertex
set V.
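A rough sketch of the graph construction is given below. It uses a textbook BM25 formulation with a non-negative IDF and the common defaults k1 = 1.2 and b = 0.75, whitespace tokenization, and treats a node as unreachable when it has no remaining incoming or outgoing edge. These choices, like the function names, are assumptions and may differ from the variant of Barrios et al. (2016).

```python
import math
from collections import Counter


def bm25_edge_weights(sentences, k1=1.2, b=0.75, min_weight=0.001):
    """Pairwise BM25 scores, with sentence i as query and sentence j as document."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(sum(d.values()) for d in docs) / n or 1.0
    df = Counter(t for d in docs for t in d)  # document frequency per term
    idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}
    weights = [[0.0] * n for _ in range(n)]
    for i, query in enumerate(docs):
        for j, doc in enumerate(docs):
            if i == j:
                continue
            norm = k1 * (1.0 - b + b * sum(doc.values()) / avg_len)
            # Sum over distinct query terms that also occur in the document.
            score = sum(idf[t] * doc[t] * (k1 + 1.0) / (doc[t] + norm)
                        for t in query if t in doc)
            weights[i][j] = score if score >= min_weight else 0.0
    return weights


def connected_nodes(weights):
    """Indices of sentences that keep at least one incoming or outgoing edge."""
    n = len(weights)
    return [i for i in range(n)
            if any(weights[i][j] > 0 or weights[j][i] > 0 for j in range(n))]
```

The surviving rows would still need to be normalized to sum to one before they can serve as transition probabilities.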
To enhance TEXTRANK with domain information,
we first multiply sentence-level domain scores
Q with their corresponding attention scores:
$Q^* = Q \odot [\beta; \ldots; \beta]^\top$ (19)
and for a given domain c, we can extract a
(domain) sentence score vector $q^* = Q^*_{c,*} \in \mathbb{R}^{1 \times s}$.
Then, from $q^*$, we produce vector $\tilde{q} \in \mathbb{R}^{1 \times |V|}$
representing a distribution of domain
signals over sentences:
$\hat{q} = [q^*_i]_{i \in V}$ (20)
$\tilde{q} = \mathrm{softmax}\left(\frac{\hat{q} - \hat{q}_{\min}}{\hat{q}_{\max} - \hat{q}_{\min}}\right)$ (21)
In order to render domain signals in different
sentences more discernible, we scale all elements
in $\hat{q}$ to [0, 1] before obtaining a legitimate
distribution with the softmax function. Finally, we
integrate the domain component into the original
transition matrix as:
$\tilde{E} = \phi * \hat{q} + (1 - \phi) * E$ (22)
where φ ∈ (0, 1) controls the extent to which
domain-specific information influences sentence
selection for the summarization task; higher φ
will lead to summaries which are more domain-
relevant. Here, we empirically set φ = 0.3. The
main difference between DETRANK and TEXTRANK
is that TEXTRANK treats 1 − φ as a damping factor
and a uniform probability distribution is applied
to $\hat{q}$.
Method     Inf      Succ    Coh      All
TEXTRANK   45.45†   51.11   42.50†   46.35†
DETRANK    54.55    48.89   57.50    53.65
Table 6: Human evaluation results for summaries
produced by TEXTRANK and DETRANK; proportion
of times AMT workers found models Informative
(Inf), Succinct (Succ), and Coherent (Coh); All
is the average across ratings; symbol † denotes
that differences between models are statistically
significant (p < 0.05) using a pairwise t-test.
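The domain-aware transition matrix of Equations (20)–(22) can be sketched as follows. The sketch assumes that the softmax-normalized vector of Equation (21) is the domain term mixed into E and that it is broadcast over the rows of E, so that each row of a row-stochastic E remains a probability distribution; falling back to a uniform distribution when all scores are equal is a further assumption. Function names are hypothetical.

```python
import math


def domain_distribution(q_star, kept):
    """Restrict scores to kept nodes, min-max scale them, apply softmax."""
    q_hat = [q_star[i] for i in kept]
    lo, hi = min(q_hat), max(q_hat)
    span = hi - lo
    # Assumption: identical scores map to zeros, i.e. a uniform softmax.
    scaled = [(q - lo) / span if span else 0.0 for q in q_hat]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mix_transitions(E, q_tilde, phi=0.3):
    """Row i becomes phi * q_tilde + (1 - phi) * E[i]."""
    return [[phi * q_tilde[j] + (1.0 - phi) * E[i][j]
             for j in range(len(q_tilde))]
            for i in range(len(E))]
```

Setting `phi` closer to 1 lets the domain distribution dominate the walk, whereas `phi` close to 0 approaches the plain similarity graph.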
In order to decide which sentence to include in
the summary, a node’s centrality is measured using
a graph-based ranking algorithm (Mihalcea and
Tarau, 2004). Specifically, we run a Markov chain
with ˜E on G until it converges to the stationary
distribution e∗ where each element denotes the
salience of a sentence. In the proposed DETRANK
algorithm, e∗ jointly expresses the importance
of a sentence in the document and its relevance
to the given domain (controlled by φ). We rank
sentences according to e∗ and select the top K
ones, subject to a budget (e.g., 100 words).
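The ranking and selection step can be illustrated with plain power iteration. The sketch assumes a row-stochastic transition matrix, counts words by whitespace splitting, skips sentences that would exceed the budget, and returns the selected sentences in document order; these details and the function names are assumptions rather than a description of our implementation.

```python
def stationary_distribution(T, tol=1e-10, max_iter=1000):
    """Power iteration for the stationary distribution of a Markov chain."""
    n = len(T)
    e = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(e[i] * T[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, e)) < tol:
            return nxt
        e = nxt
    return e


def select_sentences(sentences, salience, budget=100):
    """Greedily pick the most salient sentences within a word budget."""
    order = sorted(range(len(sentences)), key=lambda i: -salience[i])
    chosen, used = [], 0
    for i in order:
        length = len(sentences[i].split())
        if used + length <= budget:  # skip sentences that do not fit
            chosen.append(i)
            used += length
    return [sentences[i] for i in sorted(chosen)]
```

In this sketch, the salience vector passed to `select_sentences` would be the stationary distribution computed from the domain-aware transition matrix.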
We ran a judgment elicitation study on sum-
maries produced by TEXTRANK and DETRANK.
Participants were provided with domain defini-
tions and asked to decide which summary was best
according to the criteria of: Informativeness (does
the summary contain more information about a
specific domain, e.g., ‘‘Government and Poli-
tics’’?), Succinctness (does the summary avoid
unnecessary detail and redundant information?),
and Coherence (does the summary make logical
sense?). AMT workers were allowed to answer
‘‘Both’’ or ‘‘Neither’’ in cases where they could
not discriminate between summaries. We sampled
50 summary pairs from the English Wikipedia
development set. We collected three responses per
summary pair and determined which system par-
ticipants preferred based on majority agreement.
Table 6 shows the proportion of times AMT
workers preferred each system according to the
criteria of Informativeness, Succinctness, Coher-
ence, and overall. As can be seen, participants
find DETRANK summaries more informative and
coherent. Although it is perhaps not surprising for
DETRANK to produce summaries which are domain
informative since it explicitly takes domain signals
into account, it is interesting to note that
focusing on a specific domain also helps discard
irrelevant information and produce more coherent
summaries. This, on the other hand, possibly renders
DETRANK’s summaries more verbose (see the
Succinctness ratings in Table 6).
Figure 4: Summaries for the Wikipedia article ‘‘Arms Industry’’. The red heat map is for MIL and the blue one
for BUS. Words with higher domain scores are highlighted with deeper color.
Figure 4 shows example summaries for the
Wikipedia article Arms Industry for domains
MIL and BUS.11 Both summaries begin with
a sentence that
introduces the arms industry
to the reader. When MIL is the domain of
interest, the summary focuses on military products
such as guns and missiles. When the domain
changes to BUS, the summary puts more emphasis
on trade—for example, market competition and
companies doing military business, such as Boeing
and Eurofighter.
11 https://en.wikipedia.org/wiki/Arms_industry
10 Conclusions
In this work, we proposed an encoder-detector
framework for domain detection. Leveraging only
weak domain supervision, our model achieves
results superior to competitive baselines across
different languages, segment granularities, and
text genres. Aside from identifying domain-specific
training data, we also show that our
model holds promise for other natural language
tasks, such as text summarization. Beyond domain
detection, we hope that some of the work described
here might be of relevance to other multilabel
classification problems such as sentiment analysis
(Angelidis and Lapata, 2018), relation extraction
(Surdeanu et al., 2012), and named entity recognition
(Tang et al., 2017). More generally, our
experiments show that the proposed framework
can be applied to textual data using minimal super-
vision, significantly alleviating the annotation
bottleneck for text classification problems.
A key feature in achieving performance superior
to competitive baselines is the hierarchical nature
of our model, where representations are encoded
step-by-step, first for words, then for sentences,
and finally for documents. The framework flexibly
integrates prior information which can be used to
enhance the otherwise weak supervision signal or
to render the model more robust across genres. In
the future, we would like to investigate semi-
supervised instantiations of MIL, where aside
from bag labels, small amounts of instance labels
are also available (Kotzias et al., 2015). It would
also be interesting to examine how the label space
influences model performance, especially because
in our scenario the labels are extrapolated from
Wikipedia and might be naturally noisy and/or
ambiguous.
Acknowledgments
The authors would like to thank the anonymous
reviewers and the action editor, Yusuke Miyao,
for their valuable feedback. We acknowledge
the financial support of the European Research
Council (Lapata; award number 681760). This
research is based upon work supported in part by
the Office of the Director of National Intelligence
(ODNI), Intelligence Advanced Research Projects
Activity (IARPA), via contract FA8650-17-C-
9118. The views and conclusions contained
herein are those of the authors and should not
be interpreted as necessarily representing the
official policies or endorsements, either expressed
or implied, of the ODNI, IARPA, or the U.S.
Government. The U.S. Government is authorized
to reproduce and distribute reprints for Govern-
mental purposes notwithstanding any copyright
annotation therein.
References
Stuart Andrews and Thomas Hofmann. 2004. Mul-
tiple instance learning via disjunctive program-
ming boosting. In Advances in Neural Information
Processing Systems, 17, pages 65–72. Curran
Associates, Inc.
Stefanos Angelidis and Mirella Lapata. 2018.
Multiple instance learning networks for fine-
grained sentiment analysis. Transactions of
the Association for Computational Linguistics,
6:17–31.
Andrew Arnold, Ramesh Nallapati, and William W
Cohen. 2007. A comparative study of methods
for transductive transfer learning. In
Proceedings of the 2007 IEEE International
Conference on Data Mining, pages 77–82.
Omaha, NE.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E
Hinton. 2016. Layer normalization. arXiv preprint
arXiv:1607.06450.
Federico Barrios, Federico López, Luis Argerich,
and Rosa Wachenchauzer. 2016. Variations of
the similarity function of textrank for auto-
mated summarization. arXiv preprint arXiv:
1602.03606.
David M Blei, Andrew Y Ng, and Michael I
Jordan. 2003. Latent dirichlet
allocation.
Journal of Machine Learning Research,
3(Jan):993–1022.
John Blitzer, Ryan McDonald, and Fernando
Pereira. 2006. Domain adaptation with structural
correspondence learning. In Proceedings of
the 2006 Conference on Empirical
Methods in Natural Language Processing,
pages 120–128. Sydney.
Konstantinos Bousmalis, George Trigeorgis, Nathan
Silberman, Dilip Krishnan, and Dumitru Erhan.
2016. Domain separation networks. In Advances
in Neural Information Processing Systems, 29,
pages 343–351.
Jose Camacho-Collados and Roberto Navigli.
2017. Babeldomains: Large-scale domain labeling
of lexical resources. In Proceedings of the
15th Conference of the European Chapter of
the Association for Computational Linguistics,
volume 2, pages 223–228. Valencia.
Peter Carbonetto, Gyuri Dorkó, Cordelia Schmid,
Hendrik Kück, and Nando De Freitas. 2008.
Learning to recognize objects with little supervision.
International Journal of Computer Vision,
77(1–3):219–237.
Xilun Chen and Claire Cardie. 2018. Multinomial
adversarial networks for multi-domain text classification.
In Proceedings of the 2018 Conference
of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 1226–1240.
New Orleans, LA.
Alexis Conneau, Holger Schwenk, Loïc Barrault,
and Yann Lecun. 2017. Very deep convolutional
networks for text classification. In
Proceedings of the 15th Conference of the
European Chapter of the Association for
Computational Linguistics, pages 1107–1116.
Valencia.
Hal Daumé III. 2007. Frustratingly easy domain
adaptation. In Proceedings of the 45th Annual
Meeting of the Association of Computational
Linguistics, pages 256–263. Prague.
Hal Daumé III and Daniel Marcu. 2006. Domain
adaptation for statistical classifiers. Journal of
Artificial Intelligence Research, 26:101–126.
Misha Denil, Alban Demiraj, Nal Kalchbrenner,
Phil Blunsom, and Nando de Freitas. 2014.
Modelling, visualising and summarising documents
with a single convolutional neural network,
University of Oxford.
Thomas G Dietterich, Richard H Lathrop, and
Tomás Lozano-Pérez. 1997. Solving the multiple
instance problem with axis-parallel rectangles.
Artificial Intelligence, 89(1–2):31–71.
Jenny Rose Finkel and Christopher D Manning.
2009. Hierarchical Bayesian domain adaptation.
In Proceedings of Human Language
Technologies: The 2009 Annual Conference of
the North American Chapter of the Association
for Computational Linguistics, pages 602–610.
Boulder, CO.
Xavier Glorot
and Yoshua Bengio. 2010.
Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of
the 13th International Conference on Artificial
Intelligence and Statistics, pages 249–256.
Sardinia.
Sergey Ioffe and Christian Szegedy. 2015. Batch nor-
malization: Accelerating deep network training
by reducing internal covariate shift. In Proceed-
ings of the 32nd International Conference on
Machine Learning, pages 448–456. Lille.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber,
and Hal Daumé III. 2015. Deep unordered
composition rivals syntactic methods
for text classification. In Proceedings of the
53rd Annual Meeting of the Association for
Computational Linguistics and the 7th Inter-
national Joint Conference on Natural Language
Processing, pages 1681–1691. Beijing.
Jing Jiang and ChengXiang Zhai. 2007. Instance
weighting for domain adaptation in NLP. In
Proceedings of the 45th Annual Meeting of
the Association of Computational Linguistics,
pages 264–271. Prague.
Jim Keeler and David E. Rumelhart. 1992. A
self-organizing integrated segmentation and
recognition neural net. In Advances in Neural
Information Processing Systems, pages 496–503.
Morgan-Kaufmann.
Yoon Kim. 2014. Convolutional neural networks
for sentence classification. In Proceedings of
the 2014 Conference on Empirical Methods in
Natural Language Processing, pages 1746–1751.
Doha.
Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. CoRR,
abs/1412.6980.
Dimitrios Kotzias, Misha Denil, Nando De Freitas,
and Padhraic Smyth. 2015. From group to in-
dividual labels using deep features. In Proceed-
ings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data
Mining, pages 597–606. New York, NY.
Simon Lacoste-Julien, Fei Sha, and Michael I.
Jordan. 2009. Disclda: Discriminative learning
for dimensionality reduction and classification.
In Advances in Neural Information Processing
Systems, pages 897–904. Curran Associates,
Inc.
Shoushan Li and Chengqing Zong. 2008. Multi-
domain sentiment classification. In Proceedings
of the 46th Annual Meeting of the Association
for Computational Linguistics: Human Lan-
guage Technologies, pages 257–260. Columbus,
OH.
Wei Lu, Hai Leong Chieu, and Jonathan Löfgren.
2016. A general regularization framework for
domain adaptation. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, pages 950–954. Austin,
TX.
Oded Maron and Aparna Lakshmi Ratan. 1998.
Multiple-instance learning for natural scene
classification. In Proceedings of the Annual
International Conference on Machine Learn-
ing, volume 98, pages 341–349.
Jon D. Mcauliffe and David M. Blei. 2008. Super-
vised topic models. In Advances in Neural In-
formation Processing Systems, pages 121–128.
Curran Associates, Inc.
Andrew McCallum and Kamal Nigam. 1998. A
comparison of event models for naive Bayes
text classification. In Proceedings of the AAAI-98
Workshop on Learning For Text Categorization,
volume 752, pages 41–48. Madison,
WI.
Rada Mihalcea and Paul Tarau. 2004. Textrank:
Bringing order into text. In Proceedings of
the 2004 conference on empirical methods in
natural language processing, pages 404–411.
Barcelona.
Eric W. Noreen. 1989. Computer-intensive methods
for testing hypotheses, Wiley New York.
Sinno Jialin Pan and Qiang Yang. 2010. A survey
on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10):
1345–1359.
Nikolaos Pappas and Andrei Popescu-Belis.
2014. Explaining the stars: Weighted multiple-
instance learning for aspect-based sentiment
analysis. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language
Processing, pages 455–466. Doha.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming
Lin, Alban Desmaison, Luca Antiga, and Adam
Lerer. 2017. Automatic differentiation in pytorch.
In Advances in Neural Information Processing
Systems. Curran Associates, Inc.
Daniel Ramage, David Hall, Ramesh Nallapati,
and Christopher D. Manning. 2009a. Labeled
lda: A supervised topic model for credit attribu-
tion in multi-labeled corpora. In Proceedings
of the 2009 Conference on Empirical Methods
in Natural Language Processing, pages 248–256.
Suntec.
Daniel Ramage, Paul Heymann, Christopher D.
Manning, and Hector Garcia-Molina. 2009b.
Clustering the tagged Web. In Proceedings of
the Second ACM International Conference on
Web Search and Data Mining, pages 54–63.
Barcelona.
Stephen E. Robertson, Steve Walker, Susan
Jones, Micheline M. Hancock-Beaulieu, and Mike
Gatford. 1995. Okapi at trec-3. Nist Special
Publication Sp, 109:109.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati,
and Christopher D. Manning. 2012. Multi-instance
multi-label learning for relation extraction. In
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing
and Computational Natural Language Learning,
pages 455–465. Jeju Island, Korea.
Siliang Tang, Ning Zhang, Jinjiang Zhang, Fei Wu,
and Yueting Zhuang. 2017. NITE: A neural
inductive teaching framework for domain specific
NER. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2652–2657. Copenhagen.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Atten-
tion is all you need. In Advances in Neural Infor-
mation Processing Systems, pages 6000–6010.
Curran Associates, Inc.
Xiu-Shen Wei, Jianxin Wu, and Zhi-Hua Zhou.
2014. Scalable multi-instance learning. In Proceedings
of the 2014 IEEE International Conference
on Data Mining, pages 1037–1042. Shenzhen.
Nils Weidmann, Eibe Frank, and Bernhard
Pfahringer. 2003. A two-level learning method
for generalized multi-instance problems. In
Proceedings of the European Conference on
Machine Learning, pages 468–479. Warsaw.
Fangzhao Wu and Yongfeng Huang. 2015. Collab-
orative multi-domain sentiment classification.
In Proceedings of 2015 International Conference
on Data Mining, pages 459–468. Atlantic
City, NJ.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong
He, Alex Smola, and Eduard Hovy. 2016. Hier-
archical attention networks for document classi-
fication. In Proceedings of the 2016 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, pages 1480–1489. San Diego, CA.
Adams Wei Yu, David Dohan, Minh-Thang
Luong, Rui Zhao, Kai Chen, Mohammad
Norouzi, and Quoc V. Le. 2018. QANet:
Combining local convolution with global self-
attention for reading comprehension. arXiv
preprint arXiv:1804.09541.
Min-Ling Zhang and Zhi-Hua Zhou. 2014. A
review on multi-label learning algorithms.
IEEE Transactions on Knowledge and Data
Engineering, 26(8):1819–1837.
Qi Zhang, Sally A. Goldman, Wei Yu, and
Jason E. Fritts. 2002. Content-based image
retrieval using multiple-instance learning.
In Proceedings of the 19th International
Conference on Machine Learning, volume 2,
pages 682–689. Sydney.
Ye Zhang, Iain Marshall, and Byron C. Wallace.
2016. Rationale-augmented convolutional neu-
ral networks for text classification. In Pro-
ceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing,
pages 795–804. Austin, TX.
Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li.
2009. Multi-instance learning by treating in-
stances as non-iid samples. In Proceedings of
the 26th Annual International Conference on
Machine Learning, pages 1249–1256. Montreal.