Weakly Supervised Domain Detection

Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
yumo.xu@ed.ac.uk, mlap@inf.ed.ac.uk

Abstract

In this paper we introduce domain detection
as a new natural language processing task.
We argue that the ability to detect textual seg-
ments that are domain-heavy (i.e., sentences or
phrases that are representative of and provide
evidence for a given domain) could enhance the
robustness and portability of various text clas-
sification applications. We propose an encoder-
detector framework for domain detection and
bootstrap classifiers with multiple instance learn-
ing. The model is hierarchically organized and
suited to multilabel classification. We demon-
strate that despite learning with minimal super-
vision, our model can be applied to text spans
of different granularities, languages, and gen-
res. We also showcase the potential of domain
detection for text summarization.

1 Introduction

Text classification is a fundamental task in
Natural Language Processing (NLP) that has been
found useful in a wide spectrum of applications
ranging from search engines enabling users to
identify content on Web sites, sentiment and
social media analysis, customer relationship man-
agement systems, and spam detection. Over the
past several years, text classification has been pre-
dominantly modeled as a supervised learning
problem (e.g., Kim, 2014; McCallum and Nigam,
1998; Iyyer et al., 2015) for which appropriately
labeled data must be collected. Such data are
often domain-dependent (i.e., covering specific
topics such as those relating to ‘‘Business’’ or
‘‘Medicine’’) and a classifier trained using data
from one domain is likely to perform poorly on
another. For example, the phrase ‘‘the mouse
died quickly’’ may indicate negative sentiment in
a customer review describing the hand-held
pointing device or positive sentiment when
describing a laboratory experiment performed on
a rodent. The ability to handle a wide variety of
domains1 has become more pertinent with the rise
of data-hungry machine learning techniques like
neural networks and their application to a plethora
of textual media ranging from news articles to
Twitter, blog posts, medical
journals, Reddit
comments, and parliamentary debates (Kim, 2014;
Yang et al., 2016; Conneau et al., 2017; Zhang
et al., 2016).

The question of how to best deal with multiple
domains when training data are available for one
or few of them has met with much interest in the
literature. The field of domain adaptation (Jiang
and Zhai, 2007; Blitzer et al., 2006; Daume III,
2007; Finkel and Manning, 2009; Lu et al., 2016)
aims at improving the learning of a predictive
function in a target domain where there is little or
no labeled data, using knowledge transferred from
a source domain where sufficient labeled data are
available. Another line of work (Li and Zong,
2008; Wu and Huang, 2015; Chen and Cardie,
2018) assumes that labeled data may exist for
multiple domains, but in insufficient amounts to
train classifiers for one or more of them. The aim of
multi-domain text classification is to leverage all
the available resources in order to improve system
performance across domains simultaneously.

In this paper we investigate the question of how
domain-specific data might be obtained in order
to enable the development of text classification
tools as well as more domain aware applications
such as summarization, question answering, and

1The term ‘‘domain’’ has been permissively used in the
literature to describe (a) a collection of documents related to a
particular topic such as user-reviews in Amazon for a product
category (e.g., books, movies), (b) a type of information
source (e.g., Twitter, news articles), and (c) various fields
of knowledge (e.g., Medicine, Law, Sport). In this paper we
adopt the last of these definitions, although nothing in
our approach precludes applying it to different domain labels.

Transactions of the Association for Computational Linguistics, vol. 7, pp. 581–596, 2019. https://doi.org/10.1162/tacl_a_00287
Action Editor: Yusuke Miyao. Submission batch: 2/2019; Revision batch: 5/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


information extraction. We refer to this task as
domain detection and assume a fairly common
setting where the domains of a corpus collection
are known and the aim is to identify textual
segments that are domain-heavy (i.e., documents,
sentences, or phrases providing evidence for a
given domain).

Domain detection can be formulated as a
multilabel classification problem, where a model
is trained to recognize domain evidence at the
sentence-, phrase-, or word-level. By definition
then, domain detection would require training
data with fine-grained domain labels, thereby
increasing the annotation burden; we must provide
labels for training domain detectors and for
modeling the task we care about in the first place.
In this paper we consider the problem of fine-
grained domain detection from the perspective
of Multiple Instance Learning (MIL; Keeler and
Rumelhart, 1992) and develop domain models
with very little human involvement. Instead of
learning from individually labeled segments, our
model only requires document-level supervision
and optionally prior domain knowledge and
learns to introspectively judge the domain of
constituent segments. Importantly, we do not
require document-level domain annotations either
because we obtain these via distant supervision
by leveraging information drawn from Wikipedia.
Our domain detection framework comprises
two neural network modules; an encoder learns
representations for words and sentences together
with prior domain information if the latter is
available (e.g., domain definitions), and a de-
tector generates domain-specific scores for words,
sentences, and documents. We obtain a segment-
level domain predictor that is trained end-to-end
on document-level labels using a hierarchical,
attention-based neural architecture (Vaswani et al.,
2017). We conduct domain detection experiments
on English and Chinese and measure system
performance using both automatic and human-
based evaluation. Experimental results show that
our model outperforms several strong baselines
and is robust across languages and text genres,
despite learning from weak supervision. We also
showcase our model’s application potential for text
summarization.

Our contributions in this work are threefold: we
propose domain detection as a new fine-grained
multilabel learning problem which we argue
would benefit the development of domain-aware
NLP tools; we introduce a weakly supervised
encoder-detector model within the context of
multiple instance learning; and we demonstrate
that it can be applied across languages and text
genres without modification.

2 Related Work

Our work lies at
the intersection of multiple
research areas, including domain adaptation, rep-
resentation learning, multiple instance learning,
and topic modeling. We review related work
below.

Domain adaptation A variety of domain
adaptation methods (Jiang and Zhai, 2007; Arnold
et al., 2007; Pan et al., 2010) have been pro-
posed to deal with the lack of annotated data
in novel domains faced by supervised models.
Daume and Marcu (2006) propose to learn three
separate models, one specific to the source do-
main, one specific to the target domain, and a third
one representing domain-general information.
A simple yet effective feature augmentation
technique is further introduced in Daume (2007)
which Finkel and Manning (2009) subsequently
recast within a hierarchical Bayesian framework.
More recently, Lu et al. (2016) present a gen-
eral regularization framework for domain ad-
aptation while Camacho-Collados and Navigli
(2017)
integrate domain information within
lexical resources. A popular approach within text
classification learns features that are invariant
across multiple domains while explicitly modeling
the individual characteristics of each domain
(Chen and Cardie, 2018; Wu and Huang, 2015;
Bousmalis et al., 2016).

Similar to domain adaptation, our detection task
also identifies the most discriminant features for
different domains. However, whereas adaptation
aims to render models more portable by trans-
ferring knowledge, detection focuses on the
domains themselves and identifies the textual
segments that provide the best evidence for
their semantics, allowing us to create data sets with
explicit domain labels to which domain adapta-
tion techniques can be further applied.

Multiple instance learning MIL handles
problems where labels are associated with groups
or bags of instances (documents in our case), while
instance labels (segment-level domain labels) are
unobserved. The task is then to make aggregate
instance-level predictions, by inferring labels
either for bags (Keeler and Rumelhart, 1992;
Dietterich et al., 1997; Maron and Ratan, 1998) or
jointly for instances and bags (Zhou et al., 2009;
Wei et al., 2014; Kotzias et al., 2015). Our domain
detection model is an example of the latter variant.
Initial MIL models adopted a relatively strong
consistency assumption between bag labels and
instance labels. For instance, in binary classi-
fication, a bag was considered positive only if
all its instances were positive (Dietterich et al.,
1997; Maron and Ratan, 1998; Zhang et al.,
2002; Andrews and Hofmann, 2004; Carbonetto
et al., 2008). The assumption was subsequently
relaxed by investigating prediction combinations
(Weidmann et al., 2003; Zhou et al., 2009).

Within NLP, multiple instance learning has
been predominantly applied to sentiment analysis.
Kotzias et al. (2015) use sentence vectors obtained
by a pre-trained hierarchical convolutional neu-
ral network (Denil et al., 2014) as features under
a MIL objective that simply averages instance
contributions towards bag classification (i.e., positive/
negative document sentiment). Pappas and Popescu-
Belis (2014) adopt a multiple instance regression
model
to assign sentiment scores to specific
product aspects, using a weighted summation of
predictions. More recently, Angelidis and Lapata
(2018) propose MILNET, a multiple instance learn-
ing network model for sentiment analysis. They
use an attention mechanism to flexibly weigh
predictions and recognize sentiment-heavy text
snippets (i.e., sentences or clauses).

We depart from previous MIL-based work in
devising an encoding module with self-attention
and non-recurrent structure, which is particularly
suitable for modeling long documents efficiently.
Compared with MILNET (Angelidis and Lapata,
2018), our approach generalizes to segments of
arbitrary granularity; it introduces an instance
scoring function that supports multilabel rather
than binary classification,
and takes prior
knowledge into account (e.g., domain definitions)
to better inform the model’s predictions.

Topic modeling Topic models are built around
the idea that the semantics of a document col-
lection is governed by latent variables. The aim
is therefore to uncover these latent variables—
topics—that shape the meaning of the document
collection. Latent Dirichlet Allocation (LDA; Blei
et al. 2003) is one of the best-known topic models.

In LDA, documents are generated probabilisti-
cally using a mixture over K topics that are in turn
characterized by a distribution over words. Words
in a document are generated by repeatedly
sampling a topic according to the topic distribution
and selecting a word given the chosen topic.

Although most topic models are unsupervised,
some variants can also accommodate document-
level supervision (Mcauliffe and Blei, 2008;
Lacoste-Julien et al., 2009). However,
these
models are not appropriate for analyzing multiply
labeled corpora because they limit documents
to being associated with a single label. Multi-
multinomial LDA (Ramage et al. 2009b) relaxes
this constraint by modeling each document as a
bag of words with a bag of labels, and topics
for each observation are drawn from a shared
topic distribution. Labeled LDA (L-LDA; Ramage
et al., 2009a) goes one step further by directly
associating labels with latent topics, thereby
learning label-word correspondences. L-LDA is
a natural extension of both LDA, by incorporating
supervision, and multinomial naive Bayes
(McCallum and Nigam, 1998), by incorporating
a mixture model (Ramage et al., 2009a).

Similar to L-LDA, DETNET is also designed
to perform learning and inference in multi-label
settings. Our model adopts a more general solu-
tion to the credit attribution problem (i.e., the
association of textual units in a document with
semantic tags or labels). Despite learning from a
weak and distant signal, our model can produce
domain scores for text spans of varying granular-
ity (e.g., sentences and phrases), not just words,
and achieves this with a hierarchically-organized
neural architecture. Aside from learning through
efficient backpropagation, the proposed frame-
work can incorporate useful prior information
(e.g., pertaining to the labels and their meaning).

3 Problem Formulation

We formulate domain detection as a multilabel
learning problem. Our model
is trained on
samples of document-label pairs. Each document
consists of s sentences x = {x1, . . . , xs}
and is associated with discrete labels y =
{y(c)|c ∈ [1, C]}. In this work, domain labels
are not annotated manually but extrapolated from
Wikipedia (see Section 6 for details). In a non-
MIL framework, a model typically learns to predict document labels by directly conditioning on its sentence representations h1, . . . , hs or their aggregate. In contrast, y under MIL is a learned function fθ of latent instance-level labels, that is, y = fθ(y1, . . . , ys). A MIL classifier will therefore first produce domain scores for all instances (aka sentences), and then learn to integrate instance scores into a bag (i.e., document) prediction.
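Schematically, this aggregation can be sketched as follows (our own illustration rather than the authors’ released code; the names and shapes are assumptions):

```python
import torch

def bag_prediction(instance_scores: torch.Tensor,
                   attention: torch.Tensor) -> torch.Tensor:
    """Combine instance-level domain scores into a bag-level prediction.

    instance_scores: (C, s) scores for s sentence instances over C domains.
    attention: (s,) salience weights summing to 1.
    Returns a (C,) document-level domain score vector.
    """
    return instance_scores @ attention

# Toy example: 3 domains, 4 sentences.
scores = torch.tanh(torch.randn(3, 4))       # instance scores in (-1, 1)
attn = torch.softmax(torch.randn(4), dim=0)  # sentence salience
doc_scores = bag_prediction(scores, attn)    # shape (3,)
```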

In this paper we further assume that the instance-bag relation applies not only to sentences and documents but also to words and sentences. In addition, we incorporate prior domain information to facilitate learning in a weakly supervised setting: each domain is associated with a definition U(c), namely, a few sentences providing a high-level description of the domain at hand. For example, the definition of the ‘‘Lifestyle’’ domain is ‘‘the interests, opinions, behaviors, and behavioral orientations of an individual, group, or culture’’.

Figure 1: Overview of DETNET. The encoder learns document representations in a hierarchical fashion and the detector generates domain scores, while selectively attending to previously encoded information. Prior information can be optionally incorporated when available at the encoding stage through parameter sharing.

Figure 1 provides an overview of our Domain Detection Network, which we call DETNET. The model includes two modules: an encoder learns representations for words and sentences while incorporating prior domain information; a detector generates domain scores for words, sentences, and documents by selectively attending to previously encoded information. We describe the two modules in more detail below.

4 The Encoder Module

We learn representations for words and sentences using identical encoders with separate learning parameters. Given a document, the two encoders implement the following steps:

Z, α = WORDENC(X)
G = [g1; . . . ; gs] where g = Zα
H, β = SENTENC(G)

For each sentence X = [x1; . . . ; xn], the word-level encoder yields contextualized word representations Z and their attention weights α. Sentence embeddings g are obtained via weighted averaging and are then provided as input to the sentence-level encoder, which outputs contextualized representations H and their attention weights β.

In this work we aim to model fairly long documents (e.g., Wikipedia articles; see Section 6 for details). For this reason, our encoder builds on the Transformer architecture (Vaswani et al., 2017), a recently proposed, highly efficient model that has achieved state-of-the-art performance in machine translation (Vaswani et al., 2017) and question answering (Yu et al., 2018). The Transformer aims at reducing the fundamental constraint of sequential computation that underlies most architectures based on recurrent neural networks. It eliminates recurrence in favor of a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their position.

Self-attentive encoder As shown in Figure 2, the Transformer is a non-recurrent framework comprising m identical layers. Information on the (relative or absolute) position of each token in a sequence is represented by the use of positional encodings which are added to input embeddings (see the bottom of Figure 2). We denote position-augmented inputs in a sentence with X. Our model uses four layers in both word and sentence encoders. The first three layers are identical to those in the Transformer (m = 3), comprising a multi-head self-attention sublayer and a position-wise fully connected feed-forward network. The last layer is simply a multi-head self-attention layer yielding attention weights for subsequent operations.

Figure 2: Self-attentive encoder in the Transformer (Vaswani et al., 2017), stacking m identical layers.


Single-head attention takes three parameters as input in the Transformer (Vaswani et al., 2017): a query matrix, a key matrix, and a value matrix. These three matrices are identical and equal to the inputs X at the first layer of the word encoder. The output of a single-head attention is calculated via:

head(X, X, X) = softmax(XX⊤/√d_x) X.    (1)

Multi-head attention allows us to jointly attend to information from different representation subspaces at different positions. This is done by first applying different linear projections to the inputs and then concatenating them:

head(k) = head(XW1(k), XW2(k), XW3(k))    (2)
multi-head = concat(head(1), . . . , head(r)) W4    (3)

where we adopt four heads (r = 4) for both word and sentence encoders. The second sublayer in the Transformer (see Figure 2) is a fully connected feed-forward network applied to each position separately and identically:2

FFN(x) = max(0, x W5) W6.    (4)

2 We omit here the bias term for the sake of simplicity.
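As a concrete reading of Equations (1)–(4), the following PyTorch sketch implements one encoder layer; it is our own minimal rendering (residual connections, layer normalization, and biases are omitted, and the weight shapes are assumptions):

```python
import torch

def head(Q, K, V):
    """Eq. (1): scaled dot-product self-attention for a single head."""
    d_x = Q.size(-1)
    return torch.softmax(Q @ K.transpose(-2, -1) / d_x ** 0.5, dim=-1) @ V

def multi_head(X, W1, W2, W3, W4):
    """Eqs. (2)-(3): r projected heads, concatenated and mixed by W4."""
    heads = [head(X @ w1, X @ w2, X @ w3) for w1, w2, w3 in zip(W1, W2, W3)]
    return torch.cat(heads, dim=-1) @ W4

def ffn(x, W5, W6):
    """Eq. (4): position-wise feed-forward network, max(0, xW5)W6."""
    return torch.clamp(x @ W5, min=0) @ W6

# Toy shapes: n words, model width d, r = 4 heads of width d // r.
n, d, r = 6, 16, 4
X = torch.randn(n, d)                          # position-augmented inputs
W1, W2, W3 = (torch.randn(r, d, d // r) for _ in range(3))
W4 = torch.randn(d, d)
W5, W6 = torch.randn(d, 4 * d), torch.randn(4 * d, d)
out = ffn(multi_head(X, W1, W2, W3, W4), W5, W6)   # (n, d)
```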

After sequentially encoding input embeddings through the first three layers, we obtain contextualized word representations Z ∈ R^{d_z×n}. Based on Z, the last multi-head attention layer in the word encoder yields a set of attention matrices A = {A(k)}_{k=1}^r for each sentence, where A(k) ∈ R^{n×n}. Therefore, when measuring the contributions from words to sentences (e.g., in terms of domain scores and representations) we can selectively focus on salient words within the set A = {A(k)}_{k=1}^r:

α = softmax((1/√(nr)) ∑_k^r ∑_ℓ^n A(k)_{∗,ℓ})    (5)

where the softmax function outputs the salience distribution over words:

softmax(a_ℓ) = e^{a_ℓ} / ∑_{a_ℓ′ ∈ a} e^{a_ℓ′}    (6)

and we obtain sentence embeddings g = Zα.

In the same vein, we adopt another self-attentive encoder to obtain contextualized sentence representations H ∈ R^{d_h×s}. The final layer outputs multi-head attention score matrices B = {B(k)}_{k=1}^r (with B(k) ∈ R^{s×s}), and we calculate sentence salience as:

β = softmax((1/√(sr)) ∑_k^r ∑_j^s B(k)_{∗,j}).    (7)
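The pooling in Equations (5)–(7) admits the following sketch (a schematic reading of the equations, not the released implementation; we interpret the pooled quantity as the attention mass each position receives):

```python
import torch

def salience(A: torch.Tensor) -> torch.Tensor:
    """Eqs. (5)/(7): collapse r attention matrices into a salience
    distribution over the n positions of a sequence.

    A: (r, n, n) multi-head self-attention matrices.
    Returns an (n,) salience vector (alpha for words, beta for sentences).
    """
    r, n, _ = A.shape
    pooled = A.sum(dim=(0, 1)) / (n * r) ** 0.5   # mass received per position
    return torch.softmax(pooled, dim=-1)          # normalize, Eq. (6)

# Sentence embedding g = Z @ alpha, with Z of shape (d_z, n).
r, n, d_z = 4, 6, 8
A = torch.softmax(torch.randn(r, n, n), dim=-1)
Z = torch.randn(d_z, n)
g = Z @ salience(A)    # (d_z,) sentence embedding
```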

Prior information In addition to documents (and their domain labels), we might have some prior knowledge about a domain, for example, its general semantic content and the various topics related to it. For instance, we might expect articles from the ‘‘Lifestyle’’ domain not to talk about missiles or warfare, as these are recurrent themes in the ‘‘Military’’ domain. As mentioned earlier, throughout this paper we assume we have domain definitions U, expressed in a few sentences, as prior knowledge. Domain definitions share parameters with WORDENC and SENTENC and are encoded in a definition matrix U ∈ R^{d_u×C}.

Intuitively, identifying the domain of a word might be harder than that of a sentence; on account of being longer and more expressive, sentences provide more domain-related cues than words, whose meaning often relies on supporting context. We thus inject domain definitions U into our word detector only.


5 The Detector Module

DETNET adopts three detectors corresponding to
words, sentences, and documents:

P = WORDDET(Z, U)
Qinstc = [qinstc_1; . . . ; qinstc_s] where qinstc = Pα
Q = SENTDET(Qinstc, H)
ỹ = DOCDET(Q, β)

WORDDET first produces word domain scores
using both lexical semantic information Z and
prior (domain) knowledge U ; SENTDET yields
domain scores for sentences while integrating
downstream instance signals Qinstc and sentence
semantics H; finally, DOCDET makes the final
document-level predictions based on sentence
scores.

Word detector Our first detector yields word
domain scores. For a sentence, we obtain a self-
scoring matrix P self using its own contextual
word semantic information:

P self = tanh(W z Z).    (8)

In contrast to the representations used in
Angelidis and Lapata (2018), we generate instance
scores from contextualized representations, that is,
Z. Because the softmax function normally favors
single-mode outputs, we adopt tanh(·) ∈ (−1, 1)
as our domain scoring function to tailor MIL to
our multilabel scenario.

As mentioned earlier, we use domain definitions
as prior information at the word level and compute
the prior score via:

P prior = tanh(max(0, U⊤ W u) Z)    (9)

where W u ∈ Rdu×dz projects prior information
U onto the input semantic space. The prior score
matrix P prior captures the interactions between
domain definitions and sentential contents.

In this work, we flexibly integrate scoring com-
ponents with gates, as shown in Figure 3. The key
idea is to learn a prior gate Γ balancing Equations (8) and (9) via:

Γ = γ σ(W g,p [Z, P self, P prior])    (10)
P = Γ ⊙ P prior + (J − Γ) ⊙ P self    (11)

where J is an all-ones matrix and P ∈ R^{C×n} is the final domain score matrix at the word level; ⊙ denotes element-wise multiplication and [·, ·] matrix concatenation. σ(·) ∈ (0, 1) is the sigmoid function and Γ ∈ (0, γ) the prior gate with scaling factor γ, a hyperparameter controlling the overall effect of prior information and instances.3

3 Initially, we expected to balance these effects by purely relying on the learned function without a scaling factor. This led to poor performance, however.

Figure 3: Domain predictions for words and sentences; the instance-bag relation applies to words-sentences (red shadow) and sentences-documents (green shadow). Squares denote representations of words or sentences, and circles are domain scores.
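A minimal sketch of the gated word detector of Equations (8)–(11) follows (our own rendering; weight shapes are assumptions and biases are omitted):

```python
import torch

def word_detector(Z, U, Wz, Wu, Wg, gamma=0.1):
    """Eqs. (8)-(11): fuse self and prior scores with a prior gate.

    Z: (d_z, n) contextualized word representations.
    U: (d_u, C) encoded domain definitions.
    Returns P: (C, n) word-level domain scores.
    """
    P_self = torch.tanh(Wz @ Z)                                # Eq. (8)
    P_prior = torch.tanh(torch.clamp(U.t() @ Wu, min=0) @ Z)   # Eq. (9)
    feats = torch.cat([Z, P_self, P_prior], dim=0)
    gate = gamma * torch.sigmoid(Wg @ feats)                   # Eq. (10)
    return gate * P_prior + (1 - gate) * P_self                # Eq. (11)

d_z, d_u, C, n = 8, 8, 7, 5
Z, U = torch.randn(d_z, n), torch.randn(d_u, C)
Wz, Wu = torch.randn(C, d_z), torch.randn(d_u, d_z)
Wg = torch.randn(C, d_z + 2 * C)
P = word_detector(Z, U, Wz, Wu, Wg)    # (C, n)
```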

Sentence detector The second detector iden-
tifies sentences with domain-heavy semantics
based on signals from the sentence encoder, prior
information, and word instances. Again, we obtain
a self-scoring matrix Qself via:

Qself = tanh(W h H).    (12)

After computing sentence scores from sentence-
level signals, we estimate domain scores from
individual words. We do this by reusing α in
Equation (5), qinstc = P α. After gathering qinstc
for each sentence, we obtain Qinstc ∈ RC×s as
the full instance score matrix.

Analogously to the word-level detector (see
Equation (10)), we use a sentence-level upward
gate Λ to dynamically propagate domain scores
from downstream word instances to sentence bags:

Λ = λ σ(W ℓ [H, Qinstc, Qself])    (13)
Q = Λ ⊙ Qinstc + (J − Λ) ⊙ Qself    (14)

where Q is the final sentence score matrix.
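The sentence detector mirrors the word-level gate; a sketch of Equations (12)–(14) under the same assumptions:

```python
import torch

def sentence_detector(H, Q_instc, Wh, Wl, lam=0.1):
    """Eqs. (12)-(14): gate word-level evidence into sentence scores.

    H: (d_h, s) sentence representations; Q_instc: (C, s) instance
    scores gathered from words (q_instc = P @ alpha per sentence).
    Returns Q: (C, s) final sentence-level domain scores.
    """
    Q_self = torch.tanh(Wh @ H)                       # Eq. (12)
    feats = torch.cat([H, Q_instc, Q_self], dim=0)
    gate = lam * torch.sigmoid(Wl @ feats)            # Eq. (13)
    return gate * Q_instc + (1 - gate) * Q_self       # Eq. (14)
```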

Document detector Document-level domain
scores are based on the sentence salience distri-
bution β (see Equation (7)) and are computed as

the weighted average of sentence scores:

ỹ = Qβ.    (15)

                         Wiki-en    Wiki-zh
All Documents             31,562     26,280
Training Documents        25,562     22,280
Development Documents      3,000      2,000
Test Documents             3,000      2,000
Multilabel Ratio          10.18%     29.73%
Average #Words          1,152.08     615.85
Vocabulary Size          175,555    169,179
Synthetic Documents          200        200
Synthetic Sentences       18,922     18,312

Table 1: Statistics of Wikipedia data sets; en and zh are shorthand for English and Chinese, respectively. Synthetic documents and sentences are used in our automatic evaluation experiments discussed in Section 7.

We use only document-level supervision for
multilabel learning in C domains. Formally, our
training objective is:

L = min (1/N) ∑_i^N ∑_c^C log(1 + e^(−ỹ_c^(i) y_c^(i)))    (16)

where N is the training set size. At test time,
we partition domains into a relevant set and an
irrelevant set for unseen samples. Because the
domain scoring function is tanh(·) ∈ (−1, 1), we
use a threshold of 0 against which ˜y is calibrated.4
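As we read Equation (16) and footnote 4, training and inference can be sketched as follows (labels are assumed to be encoded as ±1; the helper names are ours):

```python
import torch

def detection_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (16): multilabel logistic loss; y in {-1, +1}, y_hat in (-1, 1)."""
    return torch.log1p(torch.exp(-y_hat * y)).sum(dim=1).mean()

def predict(y_hat: torch.Tensor) -> torch.Tensor:
    """Threshold scores at 0; if no domain clears the threshold, force
    the highest-scoring one positive (footnote 4)."""
    pred = y_hat > 0
    empty = ~pred.any(dim=1)
    pred[empty, y_hat[empty].argmax(dim=1)] = True
    return pred

y_hat = torch.tanh(torch.randn(2, 7))                 # 2 documents, 7 domains
y = torch.where(torch.rand(2, 7) > 0.5, 1.0, -1.0)    # distant labels
loss = detection_loss(y_hat, y)
labels = predict(y_hat)
```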

6 Experimental Set-up

Data sets DETNET was trained on two data sets
created from Wikipedia5 for English and Chinese.6
Wikipedia articles are organized according to a
hierarchy of categories representing the defining
characteristics of a field of knowledge. We re-
cursively collect Wikipedia pages by first deter-
mining the root categories based on their match
with the domain name. We then obtain their
subcategories,
the subcategories of these sub-
categories, and so on. We treat all pages associated
with a category as representative of the domain of
its root category.
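A hedged sketch of this crawl (the depth cap and the category/page mappings are our assumptions, not details given in the paper):

```python
from collections import deque

def collect_pages(roots, subcats, pages, max_depth=3):
    """Walk subcategories breadth-first from each root category and
    label every page reached with the root's domain."""
    labeled = {}
    for root in roots:
        queue, seen = deque([(root, 0)]), {root}
        while queue:
            cat, depth = queue.popleft()
            for page in pages.get(cat, []):
                labeled.setdefault(page, set()).add(root)
            if depth < max_depth:
                for sub in subcats.get(cat, []):
                    if sub not in seen:
                        seen.add(sub)
                        queue.append((sub, depth + 1))
    return labeled

toy_subcats = {"Military": ["Battles"], "Battles": []}
toy_pages = {"Military": ["Arms industry"], "Battles": ["Battle of Hastings"]}
print(collect_pages(["Military"], toy_subcats, toy_pages))
```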

4 If ∀c ∈ [1, C] : ỹ_c < 0 holds, we set ỹ_c∗ = 1 with c∗ = arg max_c ỹ_c to produce a positive prediction.
5 http://static.wikipedia.org/downloads/2008-06/en
6 Available at https://github.com/yumoxu/detnet

Algorithm 1 Document Generation
Input: S = {S_d}_1^D : Label combinations
       O = {O_d}_1^D : Sentence subcorpora
       ℓ_max : Maximum document length
Output: A synthetic document
function GENERATE(S, O, ℓ_max)
    Generate a document domain set S^doc ∈ S
    S^sent ← S^doc ∪ {GEN}
    if |S^sent| < C then              ▷ C: number of domain labels
        Generate a noisy domain ε ∈ Y \ S^sent
        S^sent ← S^sent ∪ {ε}
    end if
    S^cdt ← ∅                         ▷ a set of candidate domain sets
    for S_d ∈ S do
        if S_d ∈ S^sent then
            S^cdt ← S^cdt ∪ {S_d}
        end if
    end for
    n_label ← |S^cdt|                 ▷ number of unused labels
    n_sent ← ℓ_max                    ▷ number of sentence blocks
    L ← ∅                             ▷ for generated sentences
    for S_d^cdt ∈ S^cdt do
        θ = min(|O_d|, n_sent + 1 − n_label, 2n_sent/n_label)
        Generate ℓ_d ∼ Uniform(1, θ)
        Generate ℓ_d sentences L_d ⊆ O_d
        L ← L ∪ L_d
        n_sent ← ℓ_max − |L|
        n_label ← n_label − 1
    end for
    L ← SHUFFLE(L)
    return L
end function

In our experiments we used seven target domains: ‘‘Business and Commerce’’ (BUS), ‘‘Government and Politics’’ (GOV), ‘‘Physical and Mental Health’’ (HEA), ‘‘Law and Order’’ (LAW), ‘‘Lifestyle’’ (LIF), ‘‘Military’’ (MIL), and ‘‘General Purpose’’ (GEN). Exceptionally, GEN does not have a natural root category. We leverage Wikipedia’s 12 Main Categories7 to ensure that GEN is genuinely different from the other six domains. We used 5,000 pages for each domain. Table 1 shows various statistics on our data set.

7 https://en.wikipedia.org/wiki/Portal:Contents/Categories

System comparisons We constructed three variants of DETNET to explore the contribution of different model components. DETNET1H has a single-level hierarchical structure, treating only sentences as instances and documents as bags, whereas DETNET2H has a two-level hierarchical structure (the instance-bag relation applies to words-sentences and sentences-documents); finally, DETNET∗ is our full model, which is fully hierarchical and equipped with prior information (i.e., domain definitions). We also compared DETNET to a variety of related systems, which include:

MAJOR: The majority domain label applies to all instances.

L-LDA: Labeled LDA (Ramage et al., 2009a) is a topic model that constrains LDA by defining a one-to-one correspondence between LDA’s latent topics and observed labels. This allows L-LDA to directly learn word-label correspondences. We obtain domain scores for words through the topic-word-count matrix M ∈ R^{C×V}, which is computed during training:

M̃ = (M⊤ + β) / (∑_c^C M⊤_{∗,c} + C ∗ β)    (17)

where C and V are the number of domain labels and the size of the vocabulary, respectively. Scalar β is a prior value set to 1/C, and matrix M̃ ∈ R^{V×C} consists of word scores over domains. Following the snippet extraction approach proposed in Ramage et al. (2009a), L-LDA can also be used to score sentences as the expected probability that the domain label had generated each word. For more details on L-LDA, we refer the interested reader to Ramage et al. (2009a).
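Equation (17) amounts to a smoothed normalization of topic-word counts; a small sketch (ours, using NumPy):

```python
import numpy as np

def lda_word_scores(M: np.ndarray, beta: float) -> np.ndarray:
    """Eq. (17): turn the topic-word count matrix M (C x V) into word
    scores over domains with a symmetric prior beta (set to 1/C)."""
    C = M.shape[0]
    return (M.T + beta) / (M.T.sum(axis=1, keepdims=True) + C * beta)

M = np.random.randint(0, 10, size=(7, 100)).astype(float)   # 7 domains
M_tilde = lda_word_scores(M, beta=1 / 7)                    # (V, C)
assert np.allclose(M_tilde.sum(axis=1), 1.0)                # rows normalize
```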
HIERNET: A hierarchical neural network model described in Angelidis and Lapata (2018) that produces document-level predictions by attentively integrating sentence representations. For this model we used word and sentence encoders identical to DETNET. HIERNET does not generate instance-level predictions; however, we assume that document-level predictions apply to all sentences.

MILNET: A variant of the MIL-based model introduced in Angelidis and Lapata (2018) that considers sentences as instances and documents as bags (whereas DETNET generalizes the instance-bag relationship to words and sentences). To make MILNET comparable to our system, we use an encoder identical to DETNET—that is, two Transformer encoders for words and sentences, respectively. Thus, MILNET differs from DETNET1H in two respects: (a) word representations are simply averaged without word-level attention to build sentence embeddings, and (b) context-free sentence embeddings generate sentence domain scores before being fed to the sentence encoder.

Implementation details We used 16 shuffled samples in a batch where the maximum document length was set to 100 sentences with the excess clipped. Word embeddings were initialized randomly with 256 dimensions. All weight matrices in the model were initialized with the fan-in trick (Glorot and Bengio, 2010) and biases were initialized with zero. Apart from using layer normalization (Ba et al., 2016) in the encoders, we applied batch normalization (Ioffe and Szegedy, 2015) and a dropout rate of 0.1 in the detectors to accelerate model training. We trained the model with the Adam optimizer (Kingma and Ba, 2014). We set all three gate scaling factors in our model to 0.1. Hyper-parameters were optimized on the development set. To make our experiments easy to replicate, we release our PyTorch (Paszke et al., 2017) source code.8

8 Available at https://github.com/yumoxu/detnet
9 https://en.wikipedia.org/wiki/Lead_paragraph

7 Automatic Evaluation

In this section we present the results of our automatic evaluation for sentence and document predictions. Problematically, for sentence predictions we do not have gold-standard domain labels (we have only extrapolated these from Wikipedia for documents). We therefore developed an automatic approach for creating silver-standard domain labels, which we describe below.

Test data generation In order to obtain sentences with domain labels, we exploit lead sentences in Wikipedia articles. Lead sentences typically define the article’s subject matter and emphasize its topics of interest.9 As most lead sentences contain domain-specific content, we can fairly confidently assume that document-level domain labels will apply. To validate this assumption, we randomly sampled 20 documents containing 220 lead sentences and asked two annotators to label these with domain labels. Annotators overwhelmingly agreed in their assignments with the document labels; the (average) agreement was K = 0.89 using Cohen’s Kappa coefficient.

We used the lead sentences to create pseudo-documents simulating real documents whose sentences cover multiple domains.
To ensure that sentence labels are combined reasonably (e.g., MIL is not likely to coexist with LIF), prior to generating synthetic documents we traverse the training set and acquire all domain combinations S, e.g., S = {{GOV}, {GOV, MIL}}. We then gather lead sentences representing the same domain combinations. We generate synthetic documents with a maximum length of 100 sentences (we also clip real documents to the same length).

Algorithm 1 shows the pseudocode for document generation. We first sample document labels, then derive candidate label sets for sentences by introducing GEN and a noisy label ε. After sampling sentences for each domain, we shuffle them to achieve domain-varied sentence contexts. We created two synthetic data sets for English and Chinese. Detailed statistics are shown in Table 1.

Evaluation metric We evaluate system performance automatically using label-based Macro-F1 (Zhang and Zhou, 2014), a widely used metric for multilabel classification. It measures model performance for each label specifically and then macro-averages the results. For each class, given a confusion matrix containing the number of samples classified as true positive (tp), false positive (fp), true negative (tn), and false negative (fn), Macro-F1 is calculated as (1/C) ∑_{c=1}^C 2tp_c / (2tp_c + fp_c + fn_c), where C is the number of domain labels.
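For reference, a minimal implementation of this metric (our own; a guard for empty labels is added):

```python
import numpy as np

def macro_f1(pred: np.ndarray, gold: np.ndarray) -> float:
    """Label-based Macro-F1; pred and gold are boolean (num_samples, C)."""
    tp = (pred & gold).sum(axis=0)
    fp = (pred & ~gold).sum(axis=0)
    fn = (~pred & gold).sum(axis=0)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # guard empty labels
    return f1.mean()

pred = np.array([[1, 0, 1], [0, 1, 0]], dtype=bool)
gold = np.array([[1, 0, 0], [0, 1, 0]], dtype=bool)
print(round(macro_f1(pred, gold), 3))   # 0.667
```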
Systems     Sentences         Documents
            en      zh        en      zh
MAJOR       2.81†   5.99†     3.81†   4.41†
L-LDA      38.52†  37.09†    63.10†  58.74†
HIERNET    30.01†  37.26†    75.00   68.56†
MILNET     37.12†  44.37†    50.90†  69.45†
DETNET1H   47.93†  51.31†    74.91   72.85
DETNET2H   47.89†  52.50†    75.47   71.96†
DETNET∗    54.37   55.88     76.48   74.24

Table 2: Performance using Macro-F1% on the automatically created Wikipedia test set; models with the symbol † are significantly (p < 0.05) different from the best system in each task using the approximate randomization test (Noreen, 1989).

Results Our results are summarized in Table 2. We first report domain detection results for documents, since reliable performance on this task is a prerequisite for more fine-grained domain detection. As shown in Table 2, DETNET does well on document-level domain detection, managing to outperform systems over which it has no clear advantage (such as HIERNET or MILNET). As far as sentence-level prediction is concerned, all DETNET variants significantly outperform all comparison systems. Overall, DETNET∗ is the best system, achieving 54.37% and 55.88% Macro-F1 on English (en) and Chinese (zh), respectively. It outperforms MILNET by 17.25% on English and 11.51% on Chinese. The performance of the fully hierarchical model DETNET2H is better than DETNET1H, showing positive effects of directly incorporating word-level domain signals. We also observe that prior information is generally helpful on both languages and both tasks.

8 Human Evaluation

Aside from automatic evaluation, we also assessed model performance against human-elicited domain labels for sentences and words. The purpose of this experiment was threefold: (a) to validate the results obtained from automatic evaluation; (b) to evaluate finer-grained model performance at the word level; and (c) to examine whether our model generalizes to non-Wikipedia articles. For this, we created a third test set from the New York Times,10 in addition to our Wikipedia-based English and Chinese data sets. For all three corpora, we randomly sampled two documents for each domain, and then from each document we sampled one long paragraph or a few consecutive short paragraphs containing 8–12 sentences.

10 https://catalog.ldc.upenn.edu/LDC2008T19

Amazon Mechanical Turk (AMT) workers were asked to read these sentences and assign a domain based on the seven labels used in this paper (multiple labels were allowed). Participants were provided with domain definitions. We obtained five annotations per sentence and adopted the majority label as the sentence’s domain label. We obtained two annotated data sets for English (Wiki-en and NYT-en) and one for Chinese (Wiki-zh), consisting of 122/14, 111/11, and 117/12 sentences/documents each.

Word-level domain evaluation is more challenging; taken out-of-context, individual words might be uninformative or carry meanings compatible with multiple domains. Expecting crowdworkers to annotate domain labels word-by-word with high confidence might therefore be problematic. In order to reduce annotation complexity, we opted for a retrieval-style task for word evaluation. Specifically, AMT workers were given a sentence and its domain label (obtained from the sentence-level elicitation study described above), and asked to highlight which words they considered consistent with the domain of the sentence. We used the same corpora/sentences as in our first AMT study. Analogously, words in each sentence were annotated by five participants and their labels were determined by majority agreement.

Systems      Sentences                   Words
           Wiki-en  Wiki-zh   NYT     Wiki-en  Wiki-zh   NYT
MAJOR       1.34†    6.14†    0.51†    1.39†   14.95†    0.39†
L-LDA      27.81†   28.94†   28.08†   24.58†   42.67    26.24
HIERNET    42.23†   29.93†   44.74†   15.57†   24.25†   18.27†
MILNET     39.30†   45.14†   29.31†   22.11†   33.10†   23.33†
DETNET1H   48.12†   51.76†   57.06†   16.21†   26.90†   21.61†
DETNET2H   54.70†   57.60    55.78†   27.06    43.82    26.52
DETNET∗    58.01    51.28†   60.62    26.08    43.18    27.03

Table 3: System performance using Macro-F1% (test set created via AMT); models with the symbol † are significantly (p < 0.05) different from the best system in each task using the approximate randomization test (Noreen, 1989).

Fully hierarchical variants of our model (i.e., DETNET2H, DETNET∗) and L-LDA are able to produce word-level predictions; we thus retrieved the words within a sentence whose domain score was above the threshold of 0 and compared them against the labels provided by crowdworkers. MILNET and DETNET1H can only make sentence-level predictions. In this case, we assume that the sentence domain applies to all words therein. HIERNET can only produce document-level predictions, based on which we generate sentence labels and further assume that these apply to sentence words too. Again, we report Macro-F1, which we compute as 2p∗r∗/(p∗ + r∗), where precision p∗ and recall r∗ are both averaged over all words.

We show model performance against AMT domain labels in Table 3. Consistent with the automatic evaluation results, DETNET variants are the best performing models on the sentence-level task. On the Wikipedia data sets, DETNET2H or DETNET∗ outperform all baselines and DETNET1H by a large margin, showing that word-level signals can indeed help detect sentence domains. Although statistical models are typically less accurate when they are applied to data that has a different distribution from the training data, DETNET∗ works surprisingly well on NYT, substantially outperforming all other systems.
Domains   BUS    HEA    GEN    GOV    LAW    LIF    MIL    Avg
Wiki-en   78.65  42.11  43.33  80.00  69.77  17.24  75.00  58.01
Wiki-zh   68.66  81.36  37.29  37.74  41.03  27.91  65.00  51.28
NYT       77.33  64.52  43.90  62.07  46.51  50.00  80.00  60.62

Table 4: Sentence-level DETNET∗ performance (Macro-F1%) across domains on three data sets.

We also notice that prior information is useful in making domain predictions for NYT sentences: because our models are trained on Wikipedia, prior domain definitions largely alleviate the genre shift to non-Wikipedia sentences. Table 4 provides a breakdown of the performance of DETNET∗ across domains. Overall, the model performs worst on the LIF and GEN domains (which are very broad) and best on BUS and MIL (which are very narrow).

With regard to word-level evaluation, DETNET2H and DETNET∗ are the best systems and are significantly better than all comparison models by a wide margin, except L-LDA. The latter is a strong domain detection system at the word level since it is able to directly associate words with domain labels (see Equation (17)) without resorting to document- or sentence-level predictions. However, our two-level hierarchical model is superior considering all-around performance across sentences and documents. The results here accord with our intuition from previous experiments: hierarchical models outperform simpler variants (including MILNET) because they are able to capture and exploit fine-grained domain signals relatively accurately. Interestingly, prior information does not seem to have an effect on the Wikipedia data sets, but is useful when transferring to NYT. We also observe that models trained on the Chinese data sets perform consistently better than English. Analysis of the annotations provided by crowdworkers revealed that the ratio of domain words in Chinese is higher compared with English (27.47% vs. 13.86% in Wikipedia and 16.42% in NYT), possibly rendering word retrieval in Chinese an easier task.
Table 5: Top 15 domain words in the Wiki-en development set according to DETNET∗ and L-LDA.

BUS
  DETNET∗: monopolization, enactment, panama, funding, arbitron, maturity, groceries, os, elevator, salary, organizations, pietism, contract, mercantilism, sectors
  L-LDA: also, business, company, used, one, management, may, business, united, 2007, time, first, new, market, new

HEA
  DETNET∗: psychology, divorce, residence, pilates, dorlands, culinary, technique, emotion, affiliation, seafood, famine, malaria, oceans, characters, pregnancy
  L-LDA: also, health, may, used, one, disease, medical, use, first, people, 1, many, time, water, care

GEN
  DETNET∗: gender, destruction, beliefs, schizophrenia, area, writers, armor, creativity, propagation, cheminformatics, overpopulation, deity, stimulation, mathematical, cosmology
  L-LDA: also, one, theory, 1, used, time, two, may, first, example, many, called, form, would, known

GOV
  DETNET∗: penology, tenure, governance, alloys, biosecurity, authoritarianism, burundi, motto, criticisms, imperium, mesopotamia, juche, 420, krytocracy, criticism
  L-LDA: also, government, political, state, united, party, one, minister, national, states, first, would, used, new, university

LAW
  DETNET∗: alloys, biosecurity, authoritarianism, mesopotamia, electronic, economical, pupil, pathophysiology, imperium, phonology, collusion, cantons, auctoritas, sigint, juche
  L-LDA: law, also, united, legal, may, act, states, court, rights, one, case, state, would, v, government

LIF
  DETNET∗: teacher, freight, career, agaricomycetes, casein, manga, diplogasteria, benefit, pteridophyta, basidiomycota, ascomycota, letters, eukaryota, carcinogens, lifespan
  L-LDA: also, used, may, often, one, made, water, food, many, use, usually, called, known, oil, time

MIL
  DETNET∗: battles, eads, insignia, commanders, artillery, width, episodes, neurasthenia, reconnaissance, elevation, freedom, length, patrol, manufacturer, demise
  L-LDA: military, war, army, also, air, united, force, states, one, used, forces, first, royal, british, world

Table 5 shows the 15 most representative domain words identified by our model (DETNET∗) on Wiki-en for our seven domains. We obtained this list by weighting word domain scores P with their attention scores:

P∗ = P ⊙ [α; . . . ; α]⊤    (18)

and ranking all words in the development set according to P∗, separately for each domain. Because words appearing in different contexts are usually associated with multiple domains, we determine a word’s ranking for a given domain based on its highest score. As shown in Table 5, biosecurity and authoritarianism are prevalent in both the GOV and LAW domains. Interestingly, with contextualized word representations, fairly general English words are recognized as domain-heavy. For example, technique is a strong domain word in HEA and 420 in GOV (the latter is slang for the consumption of cannabis and highly associated with government regulations).

For comparison, we also show the top domain words identified by L-LDA via matrix M̃ (see Equation (17)). To produce meaningful output, we have removed stop words and punctuation tokens, which are given very high domain scores by L-LDA (this is not entirely surprising since M̃ is based on simple co-occurrence). Notice that no such post-processing is necessary for our model. As shown in Table 5, the top domain words identified by L-LDA are more general and less informative than those from DETNET∗.
To enhance TEXTRANK with domain informa- tion, we first multiply sentence-level domain scores Q with their corresponding attention scores: Q∗ = Q (cid:6) [β; . . . ; β](cid:2). (19) and for a given domain c, we can extract a (domain) sentence score vector q∗ = Q∗ c,∗ ∈ R1×s. Then, from q∗, we produce vector ˜q ∈ R1×|V | representing a distribution of domain signals over sentences: ˆq = [q∗ (cid:8) i ]i∈V ˆq − ˆqmin ˆqmax − ˆqmin (cid:9) (20) (21) ˜q = softmax In order to render domain signals in different sentences more discernible, we scale all elements in ˆq to [0, 1] before obtaining a legitimate distribution with the softmax function. Finally, we integrate the domain component into the original transition matrix as: ˜E = φ ∗ ˆq + (1 − φ) ∗ E (22) where φ ∈ (0, 1) controls the extent to which domain-specific information influences sentence selection for the summarization task; higher φ will lead to summaries which are more domain- relevant. Here, we empirically set φ = 0.3. The main difference between DETRANK and TEXTRANK 592 Method TEXTRANK DETRANK Inf 45.45† 54.55 Succ 51.11 48.89 Coh 42.50† 57.50 All 46.35† 53.65 Table 6: Human evaluation results for summaries produced by TEXTRANK and DETRANK; proportion of times AMT workers found models Informative (Inf), Succinct (Succ), and Coherent (Coh); All is the average across ratings; symbol † denotes that differences between models are statistically significant (p < 0.05) using a pairwise t-test. is that TEXTRANK treats 1 − φ as a damping factor and a uniform probability distribution is applied to ˆq. In order to decide which sentence to include in the summary, a node’s centrality is measured using a graph-based ranking algorithm (Mihalcea and Tarau, 2004). Specifically, we run a Markov chain with ˜E on G until it converges to the stationary distribution e∗ where each element denotes the salience of a sentence. In the proposed DETRANK algorithm, e∗ jointly expresses the importance of a sentence in the document and its relevance to the given domain (controlled by φ). We rank sentences according to e∗ and select the top K ones, subject to a budget (e.g., 100 words). We ran a judgment elicitation study on sum- maries produced by TEXTRANK and DETRANK. Participants were provided with domain defini- tions and asked to decide which summary was best according to the criteria of: Informativeness (does the summary contain more information about a specific domain, e.g., ‘‘Government and Poli- tics’’?), Succinctness (does the summary avoid unnecessary detail and redundant information?), and Coherence (does the summary make logical sense?). AMT workers were allowed to answer ‘‘Both’’ or ‘‘Neither’’ in cases where they could not discriminate between summaries. We sampled 50 summary pairs from the English Wikipedia development set. We collected three responses per summary pair and determined which system par- ticipants preferred based on majority agreement. Table 6 shows the proportion of times AMT workers preferred each system according to the criteria of Informativeness, Succinctness, Coher- ence, and overall. As can be seen, participants find DETRANK summaries more informative and coherent. Although it is perhaps not surprising for l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 8 7 1 9 2 3 6 4 1 / / t l a c _ a _ 0 0 2 8 7 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . 
e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 2 8 7 1 9 2 3 6 4 1 / / t l a c _ a _ 0 0 2 8 7 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 4: Summaries for the Wikipedia article ‘‘Arms Industry’’. The red heat map is for MIL and the blue one for BUS. Words with higher domain scores are highlighted with deeper color. DETRANK to produce summaries which are domain informative since it explicitly takes domain sig- nals into account, it is interesting to note that focusing on a specific domain also helps discard irrelevant information and produce more coherent summaries. This, on the other hand, possibly ren- ders DETRANK’s summaries more verbose (see the Succinctness ratings in Table 6). Figure 4 shows example summaries for the Wikipedia article Arms Industry for domains MIL and BUS.11 Both summaries begin with a sentence that introduces the arms industry to the reader. When MIL is the domain of interest, the summary focuses on military products such as guns and missiles. When the domain changes to BUS, the summary puts more emphasis on trade—for example, market competition and companies doing military business, such as Boeing and Eurofighter. 10 Conclusions In this work, we proposed an encoder-detector framework for domain detection. Leveraging only weak domain supervision, our model achieves results superior to competitive baselines across different languages, segment granularities, and text genres. Aside from identifying domain- specific training data, we also show that our model holds promise for other natural language tasks, such as text summarization. Beyond domain detection, we hope that some of the work described here might be of relevance to other multilabel classification problems such as sentiment analysis (Angelidis and Lapata, 2018), relation extraction 11https://en.wikipedia.org/wiki/Arms industry (Surdeanu et al., 2012), and named entity recog- nition (Tang et al., 2017). More generally, our experiments show that the proposed framework can be applied to textual data using minimal super- vision, significantly alleviating the annotation bottleneck for text classification problems. A key feature in achieving performance superior to competitive baselines is the hierarchical nature of our model, where representations are encoded step-by-step, first for words, then for sentences, and finally for documents. The framework flexibly integrates prior information which can be used to enhance the otherwise weak supervision signal or to render the model more robust across genres. In the future, we would like to investigate semi- supervised instantiations of MIL, where aside from bag labels, small amounts of instance labels are also available (Kotzias et al., 2015). It would also be interesting to examine how the label space influences model performance, especially because in our scenario the labels are extrapolated from Wikipedia and might be naturally noisy and/or ambiguous. Acknowledgments The authors would like to thank the anonymous reviewers and the action editor, Yusuke Miyao, for their valuable feedback. We acknowledge the financial support of the European Research Council (Lapata; award number 681760). This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C- 9118. 
The views and conclusions contained herein are those of the authors and should not 593 be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Govern- mental purposes notwithstanding any copyright annotation therein. References Stuart Andrews and Thomas Hofmann. 2004. Mul- tiple instance learning via disjunctive program- ming boosting. In Advances in Neural Information Processing Systems, 17, pages 65–72. Curran Associates, Inc. Stefanos Angelidis and Mirella Lapata. 2018. Multiple instance learning networks for fine- grained sentiment analysis. Transactions of the Association for Computational Linguistics, 6:17–31. for Andrew Arnold, Ramesh Nallapati, and William W Cohen. 2007. A comparative study of meth- In ods Proceedings of the 2007 IEEE International Conference on Data Mining, pages 77–82. Omaha, NE. transductive transfer learning. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450. Federico Barrios, Federico L´opez, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of textrank for auto- mated summarization. arXiv preprint arXiv: 1602.03606. David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022. John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with struc- In Proceed- tural correspondence learning. the 2006 Conference on Empirical ings of Methods in Natural Language Processing, pages 120–128. Sydney. in Neural Information Processing Systems, 29, pages 343–351. Jose Camacho-Collados and Roberto Navigli. 2017. Babeldomains: Large-scale domain label- ing of lexical resources. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, pages 223–228. Valencia. Peter Carbonetto, Gyuri Dork´o, Cordelia Schmid, Hendrik K¨uck, and Nando De Freitas. 2008. Learning to recognize objects with little super- vision. International Journal of Computer Vision, 77(1–3):219–237. Xilun Chen and Claire Cardie. 2018. Multinomial adversarial networks for multi-domain text clas- sification. In Proceedings of the 2018 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, pages 1226–1240. New Orleans, LA. Alexis Conneau, Holger Schwenk, Lo¨ıc Barrault, and Yann Lecun. 2017. Very deep convo- lutional networks for text classification. In Proceedings of the the Association for European Chapter of Computational Linguistics, pages 1107–1116. Valencia. the 15th Conference of Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263. Prague. Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126. Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising doc- uments with a single convolutional neural net- work, University of Oxford. Thomas G Dietterich, Richard H Lathrop, and Tom´as Lozano-P´erez. 1997. Solving the multi- ple instance problem with axis-parallel rect- angles. Artificial Intelligence, 89(1–2):31–71. 
Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 602–610. Boulder, CO.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 249–256. Sardinia.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456. Lille.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1681–1691. Beijing.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271. Prague.

Jim Keeler and David E. Rumelhart. 1992. A self-organizing integrated segmentation and recognition neural net. In Advances in Neural Information Processing Systems, pages 496–503. Morgan-Kaufmann.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751. Doha.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. New York, NY.

Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. 2009. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, pages 897–904. Curran Associates, Inc.

Shoushan Li and Chengqing Zong. 2008. Multi-domain sentiment classification. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 257–260. Columbus, OH.

Wei Lu, Hai Leong Chieu, and Jonathan Löfgren. 2016. A general regularization framework for domain adaptation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 950–954. Austin, TX.

Oded Maron and Aparna Lakshmi Ratan. 1998. Multiple-instance learning for natural scene classification. In Proceedings of the Annual International Conference on Machine Learning, volume 98, pages 341–349.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128. Curran Associates, Inc.
Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Madison, WI.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411. Barcelona.

Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley, New York.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 455–466. Doha.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems. Curran Associates, Inc.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. 2009a. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256. Suntec.

Daniel Ramage, Paul Heymann, Christopher D. Manning, and Hector Garcia-Molina. 2009b. Clustering the tagged Web. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 54–63. Barcelona.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, and Mike Gatford. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Jeju Island, Korea.

Siliang Tang, Ning Zhang, Jinjiang Zhang, Fei Wu, and Yueting Zhuang. 2017. NITE: A neural inductive teaching framework for domain-specific NER. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2652–2657. Copenhagen.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010. Curran Associates, Inc.

Xiu-Shen Wei, Jianxin Wu, and Zhi-Hua Zhou. 2014. Scalable multi-instance learning. In Proceedings of the 2014 IEEE International Conference on Data Mining, pages 1037–1042. Shenzhen.

Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. 2003. A two-level learning method for generalized multi-instance problems. In Proceedings of the European Conference on Machine Learning, pages 468–479. Warsaw.

Fangzhao Wu and Yongfeng Huang. 2015. Collaborative multi-domain sentiment classification. In Proceedings of the 2015 International Conference on Data Mining, pages 459–468. Atlantic City, NJ.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489. San Diego, CA.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Min-Ling Zhang and Zhi-Hua Zhou. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837.

Qi Zhang, Sally A. Goldman, Wei Yu, and Jason E. Fritts. 2002. Content-based image retrieval using multiple-instance learning. In Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 682–689. Sydney.

Ye Zhang, Iain Marshall, and Byron C. Wallace. 2016. Rationale-augmented convolutional neural networks for text classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 795–804. Austin, TX.

Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. 2009. Multi-instance learning by treating instances as non-i.i.d. samples. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1249–1256. Montreal.