Domain-Specific Word Embeddings with Structure Prediction
David Lassner1,2∗ Stephanie Brandl1,2,3∗ Anne Baillot4 Shinichi Nakajima1,2,5
1TU Berlin, Deutschland
2BIFOLD, Deutschland
3University of Copenhagen, Denmark
4Le Mans Universit´e, Frankreich
5RIKEN Center for AIP, Japan
{lassner@tu-berlin.de,brandl@di.ku.dk}
∗Authors contributed equally.
Abstrakt
Complementary to finding good general word
embeddings, an important question for repre-
sentation learning is to find dynamic word em-
beddings, Zum Beispiel, across time or domain.
Current methods do not offer a way to use
or predict information on structure between
sub-corpora, time or domain and dynamic em-
beddings can only be compared after post-
Ausrichtung. We propose novel word embedding
methods that provide general word repre-
sentations for the whole corpus, Domain-
specific representations for each sub-corpus,
sub-corpus structure, and embedding align-
ment simultaneously. We present an empiri-
cal evaluation on New York Times articles
and two English Wikipedia datasets with arti-
cles on science and philosophy. Our method,
called Word2Vec with Structure Prediction
(W2VPred), provides better performance than
baselines in terms of the general analogy tests,
domain-specific analogy tests, and multiple
specific word embedding evaluations as well
as structure prediction performance when no
structure is given a priori. As a use case in the
field of Digital Humanities we demonstrate
how to raise novel research questions for high
literature from the German Text Archive.
1 Einführung
Word embeddings
(Mikolov et al., 2013B;
Pennington et al., 2014) are a powerful tool for
word-level representation in a vector space that
captures semantic and syntactic relations between
Wörter. They have been successfully used in many
applications such as text classification (Joulin
et al., 2016) and machine translation (Mikolov
et al., 2013A). Word embeddings highly depend
on their training corpus. Zum Beispiel, technical
terms used in scientific documents can have a
different meaning in other domains, and words
can change their meaning over time—‘‘apple’’
did not mean a tech company before Apple Inc.
320
was founded. Andererseits, such local or
domain-specific representations are also not in-
dependent of each other, because most words
are expected to have a similar meaning across
domains.
There are many situations where a given target
corpus is considered to have some structure. Für
Beispiel, when analyzing news articles, one can
expect that articles published in 2000 Und 2001 Sind
more similar to each other than the ones from 2000
Und 2010. When analyzing scientific articles, Verwendet
of technical terms are expected to be similar in
articles on similar fields of science. This implies
that the structure of a corpus can be a useful side
resource for obtaining better word representation.
Various approaches to analyze semantic shifts
in text have been proposed where typically first
individual static embeddings are trained and then
aligned afterwards (z.B., Kulkarni et al., 2015;
Hamilton et al., 2016; Kutuzov et al., 2018;
Tahmasebi et al., 2018). As most word embed-
dings are invariant with respect to rotation and
scaling, it is necessary to map word embeddings
from different training procedures into the same
vector space in order to compare them. Das
procedure is usually called alignment, for which
orthogonal Procrustes can be applied as has been
used in Hamilton et al. (2016).
Kürzlich, new methods to train diachronic word
embeddings have been proposed where the align-
ment process is integrated in the training process.
Bamler and Mandt (2017) propose a Bayesian
approach that extends the skip-gram model
(Mikolov et al., 2013B). Rudolph and Blei (2018)
analyze dynamic changes in word embeddings
based on exponential family embeddings. Yao
et al. (2018) propose Dynamic Word2Vec where
word embeddings for each year of the New York
Times corpus are trained based on individual pos-
itive point-wise information matrices and aligned
simultaneously.
Transactions of the Association for Computational Linguistics, Bd. 11, S. 320–335, 2023. https://doi.org/10.1162/tacl a 00538
Action Editor: Jacob Eisenstein. Submission batch: 2/2022; Revision batch: 8/2022; Published 3/2023.
C(cid:13) 2023 Verein für Computerlinguistik. Distributed under a CC-BY 4.0 Lizenz.
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
We argue that apart from diachronic word
embeddings there is a need to train dynamic
word embeddings that not only capture temporal
shifts in language but for instance also semantic
shifts between domains or regional differences.
that those embeddings can be
It is important
trained on small datasets. We therefore propose
two generalizations of Dynamic Word2Vec. Unser
first method is called Word2Vec with Structure
Constraint (W2VConstr), where domain-specific
embeddings are learned under regularization with
any kind of structure. This method performs
well when a respective graph structure is given
a priori. For more general cases where no
structure information is given, we propose our
second method, called Word2Vec with Structure
Prediction (W2VPred), where domain-specific
embeddings and sub-corpora structure are learned
gleichzeitig. W2VPred simultaneously solves
three central problems that arise with word
embedding representations:
1. Words in the sub-corpora are embedded in the
same vector space, and are therefore directly
comparable without post-alignment.
2. The different representations are trained si-
multaneously on the whole corpus as well as
on the sub-corpora, which makes embeddings
for both general and domain-specific words
robust, due to the information exchange
between sub-corpora.
3. The estimated graph structure can be used for
confirmatory evaluation when a reasonable
prior structure is given. W2VPred together
with W2VConstr identifies the cases where
the given structure is not ideal, and sug-
gests a refined structure which leads to an
improved embedding performance; we call
this method Word2Vec with Denoised Struc-
ture Constraint. When no structure is given,
W2VPred provides insights on the struc-
ture of sub-corpora, Zum Beispiel, Ähnlichkeit
between authors or scientific domains.
All our methods rely on static word embeddings
as opposed to currently often used contextualized
word embeddings. As we learn one representation
per slice such as year or author, thus considering a
much broader context than contextualized embed-
dings, we are able to find a meaningful structure-
between corresponding slices. Another main ad-
vantage comes from the fact that our methods do
not require any pre-training and can be run on a
single GPU.
We test our methods on 4 different datasets
with different structures (Sequenzen, trees, Und
general graphs), domains (news, wikipedia, hoch
Literatur), and languages (English and German).
We show on numerous established evaluation
methods that W2VConstr and W2VPred sig-
nificantly outperform baseline methods with
regard to general as well as domain-specific
embedding quality. We also show that W2VPred
is able to predict the structure of a given corpus,
outperforming all baselines. Zusätzlich, Wir
show robust heuristics to select hyperparameters
based on proxy measurements in a setting
where the true structure is not known. Endlich,
we show how W2VPred can be used in an
explorative setting to raise novel
Forschung
questions in the field of Digital Humanities.
Our code is available at https://github
.com/stephaniebrandl/domain-word
-embeddings.
2 Related Work
Various approaches to track, detect, and quan-
tify semantic shifts in text over time have been
proposed (Kim et al., 2014; Kulkarni et al., 2015;
Hamilton et al., 2016; Zhang et al., 2016; Marjanen
et al., 2019).
This research is driven by the hypothesis that
semantic shifts occur, Zum Beispiel, im Laufe der Zeit
(Bleich et al., 2016) and viewpoints (Azarbonyad
et al., 2017),
in political debates (Reese and
Lewis 2009), or caused by cultural developments
(Lansdall-Welfare et al., 2017). Analysing those
shifts can be crucial in political and social stud-
ies but also in literary studies, as we show in
Abschnitt 5.
Typically, methods first train individual static
embeddings for different timestamps, and then
align them afterwards (z.B., Kulkarni et al., 2015;
Hamilton et al., 2016; Kutuzov et al., 2018;
Devlin et al., 2019; Jawahar and Seddah, 2019;
Hofmann et al., 2020; and a comprehensive
survey by Tahmasebi et al., 2018). Other ap-
proaches, which deal with more general structure
(Azarbonyad et al., 2017; Gonen et al., 2020)
and more general applications (Zeng et al., 2017;
Shoemark et al., 2019), also rely on post-alignment
321
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
of static word embeddings (Grave et al., 2019).
With the rise of larger language models such as
BERT (Devlin et al., 2019) Und, with that, con-
textualized embeddings, a part of the research
question has shifted towards detecting language
change in contextualized word embeddings (z.B.,
Jawahar and Seddah, 2019; Hofmann et al., 2020).
Recent methods directly learn dynamic word
embeddings in a common vector space with-
out post-alignment: Bamler and Mandt (2017)
proposed a Bayesian probabilistic model that gen-
eralizes the skip-gram model (Mikolov et al.,
2013B) to learn dynamic word embeddings that
evolve over time. Rudolph and Blei (2018) ana-
lyzed dynamic changes in word embeddings based
on exponential family embeddings, a probabilis-
tic framework that generalizes the concept of
word embeddings to other types of data (Rudolf
et al., 2016). Yao et al. (2018) proposed Dynamic
Word2Vec (DW2V) to learn individual word em-
beddings for each year of the New York Times
dataset (1990-2016) while simultaneously align-
ing the embeddings in the same vector space.
Speziell, they solve the following problem for
each timepoint t = 1, . . . , T sequentially:
prior information is available but not necessarily
reliable.
3.1 Word2Vec with Structure Constraint
We reformulate the diachronic term in Eq. 1 als
T
W diac
T,t′ kUt − Ut′k2
F
LD =
t′=1
T,t′ = 1({|t − t′| = 1}),
X
with W diac
(3)
Wo 1(·) denotes the indicator function. Das
allows us to generalize DW2V for different neigh-
borhood structures: Instead of the chronological
sequence (3), we assume W ∈ RT ×T to be an ar-
bitrary affinity matrix representing the underlying
semantic structure, given as prior knowledge.
Let D ∈ RT ×T be the pairwise distance matrix
between embeddings such that
Dt,t′ = kUt − Ut′k2
F ,
(4)
and we impose regularization on the distance,
instead of the norm of each embeddings. Das
yields the following optimization problem:
LF + τ LRD + λLS, Wo
(5)
min
Ut
LF + τ LR + λLD, Wo
min
Ut
2
F , LR = kUtk2
Yt − UtU ⊤
LF =
F ,
T
F + kUt − Ut+1k2
LD = kUt−1 − Utk2
(cid:13)
F
(cid:13)
(cid:13)
(cid:13)
(1)
(2)
represent the losses for data fidelity, regulariza-
tion, and diachronic constraint, jeweils. Ut ∈
RV ×d is the matrix consisting of d-dimensional
embeddings for V words in the vocabulary, Und
Yt ∈ RV ×V represents the positive pointwise
mutual
Information (PPMI) Matrix (Levy and
Goldberg, 2014). The diachronic constraint LD
the word embed-
encourages alignment of
dings with the parameter λ controlling how
much the embeddings are allowed to be dy-
namisch (λ = 0: no alignment and λ → ∞: statisch
embeddings).
3 Methoden
By generalizing DW2V, we propose two meth-
Odds, one for the case where sub-corpora structure
is given as prior knowledge, and the other for
the case where no structure is given a priori.
We also argue that combining both methods can
improve the performance in cases where some
2
F , LRD = kDkF ,
LF =
Yt − UtU ⊤
T
T
t′=1 Wt,t′ Dt,t′.
LS =
(cid:13)
(cid:13)
We call this generalization of DW2V Word2Vec
with Structure Constraint (W2VConstr).
(cid:13)
(cid:13)
P
(6)
3.2 Word2Vec with Structure Prediction
When no structure information is given, we need
to estimate the similarity matrix W from the data.
We define W based on the similarity between em-
beddings. Speziell, we initialize (each entry
von) the embeddings {Ut}T
t=1 by independent uni-
form distribution in [0, 1). Dann, in each iteration,
we compute the distance matrix D by Eq. (4), Und
set
W to its (entry-wise) inverse, das ist,
F
Wt,t′ ←
D−1
T,t′
0
(
for t 6= t′,
for t = t′
(7)
F
and normalize it according to the corresponding
column and row:
Wt,t′ ←
Wt,t′
F
Wt,t′′+Pt′′ f
Wt′′,t′
Pt′′ f
.
(8)
The structure loss (6) with the similarity ma-
trix W updated by Eqs. 7 Und 8 constrains the
322
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
distances between embeddings according to the
similarity structure that is at the same time es-
timated from the distances between embeddings.
We call this variant Word2Vec with Structure
Prediction (W2VPred). Effectively, W serves as
a weighting factor that strengthens connections
between close embeddings.
3.3 Word2Vec with Denoised
Structure Constraint
We propose a third method that combines
W2VConstr and W2VPred for the scenario where
W2VConstr results in poor word embeddings be-
cause the a priori structure is not optimal. In diesem
Fall, we suggest applying W2VPred and consider
the resulting structure as an input for W2VConstr.
This procedure needs prior knowledge of the
dataset and a human-in-the-loop to interpret the
predicted structure by W2VPred in order to add
or remove specific edges in the new ground truth
Struktur. In the experiment section, we will con-
dense the predicted structure by W2VPred into
a sparse, denoised ground truth structure that is
meaningful. We call this method Word2Vec with
Denoised Structure Constraint (W2VDen).
3.4 Optimization
We solve the problem (5) iteratively for each em-
bedding Ut, given the other embeddings {Ut′}t′6=t
are fixed. We define one epoch as complete when
{Ut} has been updated for all t. We applied gra-
dient descent with Adam (Kingma and Ba, 2014)
with default values for the exponential decay rates
given in the original paper and a learning rate of
0.1. The learning rate has been reduced after 100
epochs to 0.05 and after 500 epochs to 0.01 mit
a total number of 1000 Epochen. Both models have
been implemented in PyTorch. W2VPred updates
W by Eqs. 7 Und 8 after every iteration.
4 Experiments on Benchmark Data
We conducted four experiments starting with
well-known settings and datasets and incremen-
tally moving to new datasets with different
structures. The first experiment focuses on the
general embedding quality,
the second one
presents results on domain-specific embeddings,
the third one evaluates the method’s ability to
predict structure and the fourth one shows the
method’s performance on various word similar-
ity tasks. In the following subsections, we will
Category
Natural Sciences
Chemistry
Computer Science
Biology
Maschinenbau & Technologie
Civil Engineering
Electrical & Electronic Engineering
Mechanical Engineering
Social Sciences
Business & Economics
Law
Psychologie
Humanities
Literatur & Languages
Geschichte & Archaeology
Religion & Philosophy & Ethics
#Articles
8536
19164
11201
10988
20091
17797
6809
4978
17347
14747
13265
5788
15066
24800
16453
19356
Tisch 1: Categories and the number of articles
in the WikiFoS dataset. One cluster contains 4
categories (rows): The top one is the main cate-
gory and the following 3 are subcategories. Fields
joined by & originate from 2 separate categories
in Wikipedia3 but were joined, according to the
OECD’s definition.2
first describe the data, preprocessing, and then
the results. Further details on implementation and
hyperparameters can be found in Appendix A.
4.1 Datasets
We evaluated our methods on the following three
benchmark datasets.
New York Times (NYT): The New York Times
dataset1 (NYT) contains headlines, lead texts, Und
paragraphs of English news articles published
online and offline between January 1990 and June
2016 with a total of 100,945 documents. Wir
grouped the dataset by years with 1990-1998 als
the train set and 1999-2016 as the test set.
Wikipedia Field of Science and Technology
(WikiFoS): We selected categories of
Die
OECD’s list of Fields of Science and Tech-
nology2 and downloaded the corresponding
1https://sites.google.com/site
/zijunyaorutgers/.
2http://www.oecd.org/science/inno
/38235147.pdf.
323
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Category
Logic
Concepts in Logic
History of Logic
Aesthetics
Philosophers of Art
Literary Criticism
Ethics
Moral Philosophers
Social Philosophy
Epistemology
Epistemologists
Cognition
Metaphysics
Ontology
Philosophy of Mind
#Articles
3394
1455
76
7349
30
3826
5842
170
3816
3218
372
8504
1779
796
976
Tisch 2: Categories and the number of articles
in the WikiPhil dataset. One cluster contains 3
categories: The top one is the main category and
the following are subcategories in Wikipedia.
articles from the English Wikipedia. The re-
sulting dataset Wikipedia Field of Science and
Technologie (WikiFoS) contains four clusters,
each of which consists of one main cate-
gory and three subcategories, mit 226,386
unique articles in total (siehe Tabelle 1). We pub-
lished the data set at https://huggingface
.co/datasets/millawell/wikipedia
field of science. The articles belonging
to multiple categories3 were randomly assigned
to a single category in order to avoid similarity
because of overlapping texts instead of structural
Ähnlichkeit. In each category, we randomly chose
1/3 of the articles for the train set, und das
remaining 2/3 were used as the test set.
Wikipedia Philosophy (WikiPhil): Based on
Wikipedia’s definition of categories in philosophy,
we selected 5 main categories and their 2 largest
subcategories each (siehe Tabelle 2). Categories and
subcategories are based on the definition given by
Wikipedia. We downloaded 41,603 unique articles
in total from the English Wikipedia. Ähnlich
to WikiFoS, the articles belonging to multiple
categories were randomly assigned to a single
category, and the articles in each category were
divided into a train set (1/3) and a test set (2/3).
3https://en.wikipedia.org/wiki
/Wikipedia:Contents/Categories.
324
Figur 1: Prior affinity matrix W used for W2VConstr
(upper), and the estimated affinity matrix by W2VPred
(lower) where the number indicates how close slices are
(1: identical, 0: very distant). The estimated affinity for
NYT implies the year 2006 is an outlier. We checked the
corresponding articles and found that many paragraphs
and tokens are missing in that year. Note that the
diagonal entries do not contribute to the loss for all
Methoden.
4.2 Vorverarbeitung
We lemmatized all tokens, das ist, assigned their
base forms with spaCy4 and grouped the data
by years (for NYT) or categories (for WikiPhil
and WikiFoS). For each dataset, we defined one
individual vocabulary where we considered the
20,000 most frequent (lemmatized) words of the
entire dataset that are also within the 20,000 most
frequent words in at least 3 independent slices, Das
Ist, years or categories. Hier entlang, we filtered out
‘‘trend’’ words that are of significance only within
a very short time period/only a few categories. Der
100 most frequent words were filtered out as stop
Wörter. We set the symmetric context window (Die
number of words before and after a specific word
considered as context for the PPMI matrix) Zu 5.
4.3 Ex1: General Embedding Performance
In our first experiment, we compare the quality of
the word embeddings trained by W2VConstr and
W2VPred with the embeddings trained by base-
line methods, GloVe, Skip-Gram, CBOW and
DW2V. For GloVe, Skip-Gram and CBOW, Wir
computed one set of embeddings on the entire
dataset. For DW2V, W2VConstr, and W2VPred,
4https://spacy.io.
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
domain-specific embeddings {Ut} were averaged
over all domains. We use the same vocabulary for
all methods. For W2VConstr, we set the affin-
ity matrix W as shown in the upper row of
Figur 1, based on the a priori known struc-
tur, das ist, diachronic structure for NYT, und das
category structure in Tables 1 Und 2 for WikiFoS
and WikiPhil. The lower row of Figure 1 zeigt an
the learned structure by W2VPred.
Speziell, we set the ground-truth affinity
T,t′ = 1 Wenn |t − t′| =
T,t′ as follows: for NYT, W ∗
W ∗
1, and W ∗
T,t′ = 0 ansonsten; for WikiFoS and
WikiPhil, W ∗
T,t′ = 1 if t is the parent category of
T,t′ = 0.5 if t and t′ are under
t′ or vice versa, W ∗
the same parent category, and W ∗
T,t′ = 0 ansonsten
(see Tables 1 Und 2 for the category structure
of WikiFoS and WikiPhil, jeweils, und das
top row of Figure 1 for the visualization of the
ground-truth affinity matrices).
We evaluate the embeddings on general analo-
gies (Mikolov et al., 2013B) to capture the general
meaning of a word. Tisch 3 shows the corre-
sponding accuracies averaged across 10 runs with
different random seeds.
For NYT, W2VConstr performs similarly to
DW2V, which has essentially the same constraint
term—LS in Eq. (6) for W2VConstr is the same as
LD in Eq. (2) for DW2V up to scaling when W
is set to the prior affinity matrix for NYT—
and significantly outperforms the other base-
lines. W2VPred performs slightly worse then
the best methods. For WikiFoS, W2VConstr and
W2VPred outperform all baselines by a large
margin. In WikiPhil, W2VConstr performs poorly
(worse than GloVe), while W2VPred outperforms
all other methods by a large margin. Standard
deviation across the 10 runs are less than one for
NYT (all methods and all n), slightly higher for
WikiFoS, and highest for WikiPhil W2VPred and
W2VConstr (0.28-3.17).
These different behaviors can be explained by
comparing the estimated (lower row) and the a pri-
ori given (upper row) affinity matrices shown in
Figur 1. In NYT, the estimated affinity decays
smoothly as the time difference between two slices
erhöht sich. This implies that the a priori given di-
achronic structure is good enough to enhance
the word embedding quality (by W2VConstr and
DW2V), and estimating the affinity matrix (von
W2VPred) slightly degrades the performance due
to the increased number of unknown parameters to
A
T
A
D
T
Y
N
S
Ö
F
ich
k
ich
W
l
ich
H
P
ich
k
ich
W
Method
GloVe
Skip-Gram
CBOW
DW2V
W2VConstr (unser)
W2VPred (unser)
GloVe
Skip-Gram
CBOW
W2VConstr (unser)
W2VPred (unser)
W2VDen (unser)
GloVe
Skip-Gram
CBOW
W2VConstr (unser)
W2VPred (unser)
W2VDen (unser)
general analogy tests
n=5
26.41
16.20
19.92
32.88
33.01
31.66
23.74
12.09
17.47
45.96
45.73
46.50*
17.45
10.18
6.61
10.37
31.99
36.21*
n=1
9.40
3.62
5.58
11.27
10.90
10.28
6.33
3.54
4.25
11.91
11.82
11.61
2.59
2.76
3.11
0.42
4.37
5.96*
n=10
33.58
25.61
27.60
42.97
43.12
41.88
32.58
15.77
26.21
56.88
56.40
57.08*
24.19
17.48
9.47
15.02
41.75
46.15*
Tisch 3: General analogy test performance for our
Methoden, W2VConstr and W2VPred, and baseline
Methoden, GloVe, Skip-Gram, CBOW, and DW2V
averaged across ten runs with different random
seeds. The best method and the methods that
are not significantly outperformed by the best is
marked with a gray background, according to the
Wilcoxon signed rank test for α = 0.05. W2VDen
is compared against the best method from the same
data set and if it is significantly better, it is marked
with an asterisk (*).
be estimated. In WikiFoS, although the estimated
affinity matrix shows somewhat similar structure
to the given one a priori, it is not as smooth as
the one in NYT and we can recognize two instead
of four clusters in the estimated affinity matrix
consisting of the first two main categories (Natu-
ral Sciences and Engineering & Technologie), Und
the last two (Social Sciences and Humanities),
which we find reasonable according to Table 1. In
summary, W2VConstr and W2VPred outperform
baseline methods when a suitable prior structure
is given. Results on the WikiPhil dataset show
a different tendency: The estimated affinity by
W2VPred is very different from the prior structure,
which implies that the corpus structure defined by
Wikipedia is not suitable for learning word em-
beddings. Infolge, W2VConstr performs even
poorer than GloVe. Gesamt, Tisch 3 zeigt, dass
our proposed W2VPred robustly performs well on
all datasets. In Section 4.5.3, we will further im-
prove the performance by denoising the estimated
325
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
GloVe
Skip-Gram
CBOW
Dynamic Word2Vecabbr
W2VConstr (unser)
W2VPred (unser)
n=1
7.72
10.49
6.35
39.47
38.23
41.87
n=5
14.39
19.89
11.36
61.94
57.73
64.60
n=10
17.87
24.78
14.59
67.35
64.54
69.67
Tisch 4: Accuracies for
(NYT).
temporal analogies
structure by W2VPred for the case where a prior
structure is not given or is unreliable.
4.4 Ex2: Domain-specific Embeddings
4.4.1 Quantitative Evaluation
(2018) introduced temporal anal-
Yao et al.
ogy tests that allow us to assess the quality of
word embeddings with respect to their tempo-
ral information. Bedauerlicherweise, domänenspezifisch
tests are only available for the NYT dataset.
Tisch 4 shows temporal analogy test accura-
cies on the NYT dataset. Wie erwartet, GloVe,
Skip-Gram, and CBOW perform poorly. We as-
sume this is because the individual slices are too
small to train reliable embeddings. The embed-
dings trained with DW2V and W2VConstr are
learned collaboratively between slices due to the
diachronic and structure terms and significantly
improve the performance. Vor allem, W2VPred fur-
ther improves the performance by learning a
more suitable structure from the data. In der Tat,
the learned affinity matrix by W2VPred (sehen
Figure 1a) suggests that not the diachronic struc-
ture used by DW2V but a smoother structure
is optimal.
4.4.2 Qualitative Evaluation
Since no domain-specific analogy test is available
for WikiFoS and WikiPhil, we qualitatively ana-
lyzed the domain-specific embeddings by check-
ing nearest neighboring words. Tisch 5 zeigt an
Die 5 nearest neighbors of the word ‘‘power’’ in
the embedded spaces for the 4 main categories of
WikiFoS trained by W2VPred, GloVe, and Skip-
Gram. We averaged the embeddings obtained by
W2VPred over the subcategories in each main cat-
egory. The distance between words are measured
by the cosine similarity.
We see that W2VPred correctly captured the
domain-specifc meaning of ‘‘power’’: In Natural
Sciences and Engineering & Technology the word
is used in a physical context, Zum Beispiel, In
combination with generators, which is the clos-
est word in both categories. In Social Sciences
and Humanities on the other hand, the nearest
words are ‘‘powerful’’ and ‘‘control’’, welche,
in combination, indicates that it refers to ‘‘the
ability to control something or someone’’.5 The
embedding trained by GloVe shows a very
general meaning of power with no clear ten-
dency towards a physical or political context,
whereas Skip-Gram shows a tendency towards
the physical meaning. We observed many simi-
lar examples, Zum Beispiel, charge:electrical-legal,
Leistung:quality-acting,
resistance:physical-
sozial, Wettrennen:championship-ethnicity.
As another example in the NYT corpus, Figur 2
shows the evolution of the word blackberry, welche
can either mean the fruit or the tech company. Wir
selected two slices (2000 & 2012) with the largest
pairwise distance for the blackberry, and chose
the top-5 neighboring words from each year. Der
figure plots the cosine similarities between black-
berry and the neighboring words. The time series
shows how the word blackberry evolved from
being mostly associated with the fruit towards as-
sociated with the company, and back to the fruit.
This can be connected to the release of their smart-
phone in 2002 and the decrease in sales number
nach 2011.6,7 Interessant, the word apple stays
relatively close during the entire time period as
its word vector also (as blackberry) reflects both
meanings, a fruit and a tech company.
4.5 Ex3: Structure Prediction
This subsection discusses the structure predic-
tion performance by W2VPred. We first evaluate
the prediction performance by using the a priori
affinity structure as the ground-truth structure.
The results of this experiment should be inter-
preted with care, because we have already seen
in Section 4.3 that the given a priori affinity
does not necessarily reflect the similarity struc-
ture of the slices in the corpus, in particular for
WikiPhil. We then analyze the correlation be-
tween the embedding quality and the structure
5https://www.oxfordlearnersdictionaries
.com/definition/english/power 1.
6https://www.businessinsider.com
/blackberry-smartphone-rise-fall-mobile
-failure-innovate-2019-11.
7https://www.businessinsider.com
/blackberry-phone-sales-decline-chart
-2016-9.
326
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Nat. Sci
generator
PV
thermoelectric
inverter
converter
Ing&Tech
generator
inverter
alternator
converter
elektrisch
Soc. Sci
mächtig
Kontrolle
wield
drive
generator
Hum
mächtig
Kontrolle
counterbalance
drive
supreme
GloVe
Kontrolle
supply
Kapazität
System
internal
Skip-Gram
Power
inverter
mover
Elektrizität
thermoelectric
Tisch 5: Five nearest neighbors to the word ‘‘power’’ in the domain-specific embedding space, gelernt
by W2VPred, of four main categories of WikiFoS (left four columns), and in the general embedding
space learned by GloVe and Skip-Gram on the entire dataset (right-most columns, jeweils).
Dataset
Method
GloVe
Skip-Gram
CBOW
W2VPred (unser)
Burrows’delta
NYT WikiFos WikiPhil
67.22
71.11
65.28
81.67
55.56
51.66
54.59
45.00
62.50
22.92
36.67
26.67
23.33
23.33
6.67
Tisch 6: Recall@k for structure prediction perfor-
mance evaluation with the prior structure (Figur 1
links) used as the ground-truth.
corpora, Zum Beispiel, for identifying the authors
of anonymously published documents. The base-
line methods based on GloVe, Skip-Gram, Und
CBOW simply learn the domain-specific embed-
dings separately, and the distances between the
slices are evaluated by Eq. 4.
Tisch 6 shows recall@k (averaged over ten tri-
als). As in the analogy tests, the best methods are in
gray cells according to the Wilcoxon test. We see
that W2VPred significantly outperforms the base-
line methods for NYT and WikiFoS. For WikiPhil,
we will further analyze the affinity structure in the
following section.
4.5.2 Assessment of Prior Structure
Im Folgenden, we reevaluate the aforemen-
tioned prior affinity matrix for WikiPhil (sehen
Figur 1). daher, we analyze the correlation
between embedding quality and structure perfor-
mance and find that a suitable ground truth affinity
matrix is necessary to train good word embeddings
with W2VConstr. We trained W2VPred with dif-
ferent parameter setting for (λ, τ ) on the train set,
and applied the global analogy tests and the struc-
ture prediction performance evaluation (mit dem
prior structure as the ground-truth). For λ and τ ,
we considered log-scaled parameters in the ranges
Figur 2: Evolution of the word blackberry in NYT.
Nearest neighbors of the word blackberry have been
selected in 2000 (blueish) Und 2011 (reddish), Und
the embeddings have been computed with W2VPred.
Cosine similarity between each neighboring word and
blackberry is plotted over time, showing the shift in
dominance between fruit and smartphone brand. Der
word apple also relates to both fruit and company, Und
therefore stays close during the entire time period.
prediction performance by W2VPred, um zu
evaluate the a priori affinity as the ground-truth in
each dataset. Endlich, we apply W2VDen which
combines the benefits of both W2VConstr and
W2VPred for the case where the prior structure
is not suitable.
4.5.1 Structure Prediction Performance
Hier, we evaluate the structure prediction accu-
racy by W2VPred with the a priori given affinity
matrix D ∈ RT ×T (shown in the upper row of
Figur 1) as the ground-truth. We report on
recall@k averaged over all domains.
We compare our W2VPred with Burrows’ Delta
(Burrows, 2002) and other baseline methods based
on the GloVe, Skip-Gram, and CBOW embed-
dings. Burrows’ Delta is a commonly used method
in stylometrics to analyze the similarity between
327
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
T
l
A
C
_
A
_
0
0
5
3
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Dataset
NYT
WikiFoS
WikiPhil
WikiPhil (denoised)
ρ
0.58
0.65
−0.19
−0.14
Tisch 7: Pearson correlation coefficients for per-
formance on analogy tests (n = 10) and structure
prediction evaluation (recall@k) by W2VPred for
the parameters applied in the grid search for hy-
perparameter tuning. Linear correlation indicates
that a good word embedding quality also leads to
an accurate structure prediction (und umgekehrt).
Significant correlation coefficients (P < 0.05) are
marked in gray.
[2−2 − 212] and [24 − 212], respectively, and dis-
play correlation values on NYT, WikiFoS, and
WikiPhil in Table 7.
In NYT and WikiFoS, we observe clear posi-
tive correlations between the embedding quality
and the structure prediction performance, which
implies that the estimated structure closer to the
ground truth enhances the embedding quality. The
Pearson correlation coefficients are 0.58 and 0.65,
respectively (both with p < 0.05).
Whereas Table 7 for WikiPhil does not show a
clear positive correlation. Indeed, the Pearson cor-
relation coefficient is even negative with −0.19,
which implies that the prior structure for WikiPhil
is not suitable and even harmful for the word
embedding performance. This result is consis-
tent with the bad performance of W2VConstr on
WikiPhil in Section 4.3.
4.5.3 Structure Discovery by W2VDen
The good performance of W2VPred on WikiPhil
in Section 4.3 suggests that W2VPred has captured
a suitable structure of WikiPhil. Here, we analyze
the learned structure, and polish it with additional
side information.
Figure 3 (left) shows the dendrogram of cat-
egories in WikiPhil obtained from the affinity
matrix W learned by W2VPred. We see that the
two pairs Ethics-Social Philosophy and Cognition-
Epistemology are grouped together, and both pairs
also belong to the same cluster in the original
structure. We also see the grouping of Episte-
mologists, Moral Philosophers, History of Logic,
and Philosophers of Art. This was at first glance
surprising because they belong to four different
328
Figure 3: Left: Dendrogram for categories in WikiPhil
learned by W2VPred based on the affinity matrix W .
Right: Denoised Affinity matrix built from the learned
structure by W2VPred. Newly formed Cluster includes
History of Logic, Moral Philosophers, Epistemologists,
and Philosophers of Art.
clusters in the prior structure. However, looking
into the articles revealed that this is a logical con-
sequence from the fact that the articles in those
categories are almost exclusively about biogra-
phies of philosophers, and are therefore written in
a distinctive style compared to all other slices.
To confirm that the discovered structure cap-
tures the semantic sub-corpora structure, we
defined a new structure for WikiPhil, which is
shown in Figure 3 (right), based on our findings
above and also define a new structure for Wiki-
FoS: A minor characteristic that we found in the
structure of the prediction of W2VPred in com-
parison with the assumed structure is that the two
sub-corpora Humanities and Social Sciences and
the two sub-corpora Natural Sciences and Engi-
neering are a bit closer than other combinations of
sub-corpora, which also intuitively makes sense.
We connected the two sub-corpora by connect-
ing their root node respectively and then apply
W2VDen. The general analogy tests performance
by W2VDen is given in Table 3. In WikiFoS,
the improvement is only slightly significant for
n = 5 and n = 10 and not significant for n = 1.
This implies that the structure that we previously
assumed for WikiFoS already works well. This
shows that applying W2VDen is in fact a general
purpose method that can be applied on any of the
data sets but it is especially useful when there is
a mismatch between the assumed structure and
the structure predicted by W2VPred. In WikiPhil,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
e
V
lo
G
.36
.41
.46
.44
.50
.34
.50
.38
.53
.55
.15
.26
–
G.
-
ip
k
S
.50
.54
.51
.56
.63
.48
.56
.37
.64
.64
.22
.32
2
NYT
W
O
B
C
.55
.52
.40
.56
.60
.47
.40
.31
.62
.64
.25
.33
3
V
2
W
D
.51
.58
.55
.57
.68
.50
.58
.39
.64
.66
.23
.30
7
C
V
2
W
.52
.59
.53
.56
.68
.49
.54
.40
.64
.67
.23
.30
5
P
V
2
W
.53
.59
.54
.57
.68
.51
.54
.40
.64
.66
.23
.30
7
WikiFos
W
O
B
C
.42
.59
.60
.63
.59
.57
.59
.43
.66
.64
.20
.29
–
G.
-
ip
k
S
.43
.60
.67
.65
.60
.59
.68
.41
.67
.62
.17
.28
3
C
V
2
W
.42
.62
.74
.62
.65
.55
.77
.44
.70
.68
.21
.29
8
e
V
lo
G
.38
.58
.60
.57
.66
.52
.71
.33
.63
.44
.17
.28
1
P
V
2
W
.42
.62
.72
.62
.65
.55
.75
.43
.70
.69
.21
.29
7
e
V
lo
G
.34
.49
.42
.53
.59
.51
.47
.31
.59
.44
.11
.23
–
WikiPhil
W
O
B
C
.38
.49
.38
.57
.48
.51
.54
.39
.61
.61
.17
.25
–
G.
-
ip
k
S
.43
.55
.50
.59
.56
.54
.62
.40
.64
.63
.19
.27
4
D
V
2
W
.38
.59
.63
.59
.62
.55
.66
.43
.67
.66
.19
.27
9
P
V
2
W
.34
.58
.55
.58
.60
.54
.58
.40
.64
.65
.18
.26
1
RW-STANFORD
MTurk-771
RG-65
WS-353-ALL
MTR-3k
WS-353-REL
MC-30
YP-13
WS-353-SIM
MTurk-287
SimVerb-350
SIMLEX-999
Count
Table 8: Correlation values from word similarity tests on different datasets (one per row). The best
method and the methods that are not significantly outperformed by the best is marked with gray
background, according to the Wilcoxon signed rank test for α = 0.05. In this table, we use a shorter
version of the method names (W2VC for W2VConstr, etc.)
GloVe
Skip-Gram
CBOW
Dynamic Word2Vecabbr
W2VConstr (our)
W2VDen (our)
W2VPred (our)
NYT WFos WPhil
0.27
0.29
0.26
0.29
0.30
0.28
0.29
0.31
0.29
—
—
0.28
—
0.32
0.29
0.30
—
—
0.29
0.32
0.28
Table 9: QVEC results: Correlation values of the
aligned dimension between word embeddings and
linguistic word vectors.
we see that W2VDen further improves the perfor-
mance by W2VPred, which already outperforms
all other methods with a large margin. The cor-
relation between the embedding quality and the
structure prediction performance—with the de-
noised estimated affinity matrix as the ground
truth—is shown in Table 7. The Pearson cor-
relation is still negative, −0.14, but no longer
statistically significant (p = 0.11).
4.6 Ex4: Evaluation in Word
Similarity Tasks
We further evaluate word embeddings on vari-
ous word similarity tasks where human-annotated
similarity between words is compared with the
cosine similarity in the embedding space, as
proposed in Faruqui and Dyer (2014). Table 8
shows the correlation coefficients between the
human-annotated similarity and the embedding
cosine similarity, where, again, the best method
and the runner-ups (if not significantly out-
performed) are highlighted.8 We observe that
W2VPred outperforms the other methods in 7 out
of 12 datasets for NYT, and W2VConstr in 8 out of
12 for WikiFoS. For WikiPhil, since we already
know that W2VConstr with the given affinity
matrix does not improve the embedding perfor-
mance, we instead evaluated W2VDen, which
outperforms 9 out of 12 datasets in WikiPhil. In
addition, W2VPred gives comparable perfor-
mance to the best method over all experiments.
We also apply QVEC, which measures
component-wise correlation between distributed
word embeddings, as we use them throughout
the paper, and linguistic word vectors based on
(Fellbaum, 1998). High correlation
WordNet
values indicate high saliency of linguistic prop-
erties and thus serve as an intrinsic evaluation
method that has been shown to highly correlate
with downstream task performance (Tsvetkov
et al., 2015). Results are shown in Table 9,
where we observe that W2VConstr (as well as
W2VDen for WikiPhil) outperforms all baseline
methods, except CBOW in NYT, on all datasets,
and W2VPred performs comparably with the
best method.
8We removed the dataset VERB-143 since we are using
lemmatized tokens and therefore catch only a very small part
of this corpus. We acknowledge that the human annotated
similarity is not domain-specific and therefore not optimal
for evaluating the domain-specific embeddings. However,
we expect that this experiment provides another aspect of the
embedding quality.
329
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
4.7 Summarizing Discussion
In this section, we have shown a good perfor-
mance of W2VConstr and W2VPred in terms
of global and domain-specific embedding qual-
ity on news articles (NYT) and articles from
Wikipedia (WikiFoS, WikiPhil). We have also
shown that W2VPred is able to extract the un-
derlying sub-corpora structure from NYT and
WikiFoS.
On the WikiPhil dataset, the following observa-
tions implied that the prior sub-corpora structure,
based on the Wikipedia’s definition, was not
suitable for analyzing semantic relations:
• Poor general analogy test performance by
W2VConstr (Table 3),
• Low structure prediction performance by all
methods (Table 6)
• Negative correlation between embedding
accuracy and structure score (Table 7).
Accordingly, we analyzed the learned structure
by W2VPred, and further refined it by denoising
with human intervention. Specifically, we ana-
lyzed the dendrogram from Figure 3, and found
that 4 categories are grouped together that we orig-
inally assumed to belong to 4 different clusters.
We further validated our reasoning by applying
W2VDen with the structure shown in Figure 3
resulting in the best embedding performance (see
Table 3).
This procedure poses an opportunity to obtain
good global and domain-specific embeddings and
extract, or validate if given a priori, the under-
lying sub-corpora structure by using W2VConstr
and W2VPred. Namely, we first train W2VPred,
and also W2VConstr if prior structure infor-
mation is available. If both methods similarly
improve the embeddings in comparison with the
methods without using any structure information,
we acknowledge that the prior structure is at
least useful for word embedding performance.
If W2VPred performs well, while W2VConstr
performs poorly, we doubt that the given prior
structure would be suitable, and update the learned
structure by W2VPred. When no prior strucuture
is given, we simply apply W2VPred to learn the
structure.
We can furthermore refine the learned structure
with side information, which results in a clean and
human interpretable structure. Here W2VDen is
used to validate the new structure, and to provide
enhanced word embeddings. In our experiment
on the WikiPhil dataset, the embeddings obtained
this way significantly outperformed all other meth-
ods. The improved performance from W2VPred
is probably due to the fewer degrees of freedom of
W2VConstr, that is, once we know a reasonable
structure, the embeddings can be more accurately
trained with the fixed affinity matrix.
5 Application on Digital Humanities
We propose an application of W2VPred to the
field of Digital Humanities, and develop an ex-
ample more specifically related to Computational
Literary Studies. In the renewal of literary studies
brought by the development and implementation
of computational methods, questions of author-
ship attribution and genre attribution are key to
formulating a structured critique of the classical
design of literary history, and of Cultural Heritage
approaches at large. In particular, the investigation
of historical person networks, knowledge distri-
bution, and intellectual circles has been shown
to benefit significantly from computational meth-
ods (Baillot, 2018; Moretti, 2005). Hence, our
method and its capability to reveal connections
between sub-corpora (such as authors’ works),
can be applied with success to these types of re-
search questions. Here, the use of quantitative and
statistical models can lead to new, hitherto unfath-
omed insights. A corpus-based statistical approach
to literature also entails a form of emancipation
from literary history in that it makes it possible to
shift perspectives, e.g., to reconsider established
author-based or genre-based approaches.
To this end, we applied W2VPred to high lit-
erature texts (Belletristik) from the lemmatized
versions of DTA (German Text Archive), a cor-
pus selection that contains the 20 most represented
authors of the DTA text collection for the period
1770-1900. We applied W2VPred in order to pre-
dict the connections between those authors with
λ = 512, τ = 1024 (same as WikiFoS).
As a measure of comparison, we extracted the
year of publication as established by DTA, and
identified the place of work for each author9 and
categorized each publication into one of three
genre categories (ego document, verse, and fic-
tion). Ego documents are texts written in the first
person that document personal experience in their
9via the German Integrated Authority Files Service
(GND) where available, adding missing data points manually.
330
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
them as longitude and latitude coordinates on
the earths surface. We compute the average
coordinates for each author by converting
the coordinates into the Cartesian system
and take the average on each dimension.
Then, we convert the averages back into
the latitude, longitude system. The spatial
distance between two authors is computed
by the geodesic distance as implemented in
GeoPy.10
3. Genre difference between authors. We man-
ually categorized each title in the corpus into
one of the three categories ego document,
verse, and fiction. A genre representation for
an author Ag = (Agego, Agverse, Agfiction) is the
relative frequency of the respective genre for
that author. The distance between one author
Ag1 and another author Ag2 is computed by
1 − Ag1·Ag2
||Ag1||·||Ag2|| , the cosine distance.
Calculating the Correlations For each author
t, we denote the predicted distance to all other
authors as Xt ∈ RT −1 where T is the number of
all authors. Yt ∈ R(T −1)×3 denotes the distances
from the author t to all other authors in the three
meta data dimensions: space, time, and genre. For
the visualization, we seek for the coefficients of
the linear combination of Y that has the high-
est correlation with X. For this, Non-Negative
Canonical Correlation Analysis with one compo-
nent is applied. The MIFSR algorithm is used as
described by Sigg et al. (2007).11 The coefficients
are normalized to comply with the sum-to-one
constraint for projection on the 2d simplex.
For many authors, the strongest correlation oc-
curs with a mostly temporal structure and fewer
correlate strongest with the spatial or the genre
model. B¨orne and Laukhard, who have a similar
spatial weight and thereby form a spatial cluster,
both resided in France at that time. The impact of
French literature and culture on Laukhard’s and
B¨orne’s writing deserves attention, as suggested
by our findings.
For Fontane, we do not observe a notable spa-
tial proportion, which is surprising because his
sub-corpus mostly consists of ego documents de-
scribing the history and geography of the area
surrounding Berlin, his workplace. However, in
10https://geopy.readthedocs.io/en
/stable/.
11We use ǫ = .00001.
Figure 4: Author’s points in a barycentric coordinates
triangle denote the mixture of the prior knowledge
that has the highest correlation (in parentheses) with
the predicted structure of W2VPred. The correlation
excludes the diagonal, meaning the correlation between
the author itself.
historical context. They include letters, diaries,
and memoirs and have gained momentum as a
primary source in historical research and literary
studies over the past decades. We created pair-
wise distance matrices for all authors based on the
spatial, temporal, and genre information. Tempo-
ral distance was defined as the absolute distance
between the average publication year, the spa-
tial distance as the geodesic distance between the
average coordinates of the work places for each
author and the genre difference as cosine distance
between the genre proportions for each author.
For each author, we correlated linear combina-
tions of this (normalized) spatio-temporal-genre
prior knowledge with the structure found by our
method, which we show in Figure 4.
Reference Dimensions
In this visualization we
want to compare the pairwise distance matrix that
our method predicted with the distance matrices
that can be obtained by meta data available in the
DTA corpus—the reference dimensions:
1. Temporal difference between authors. We
collect the publication year for each title in
the corpus and compute the average publi-
cation year for each author. The temporal
distance between one author At1 and another
author At2 is computed by |At1 −At2|, the ab-
solute difference of the average publication
year.
2. Spatial difference between authors. We query
the German Integrated Authority File for the
authors’ different work places and extract
331
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
contrast to the other authors residing in Berlin,
the style is much more similar to a travel story. In
W2VPred ’s predicted structure, the closest neigh-
bor of Fontane is, in fact, P¨uckler (with a distance
of .052), who also wrote travel stories.
In the case of Goethe, the maximum correla-
tion at the (solely spatio-temporal) resulting point
is relatively low and, interestingly, the highest
disagreement between W2VPred and the prior
knowledge is between Schiller and Goethe. The
spatio-temporal model represents a close prox-
imity; however, in W2VPred’s found structure,
the two authors are much more distant. In this
case, the spatio-temporal properties are not suffi-
cient to fully characterize an author’s writing and
the genre distribution may be skewed due to the
incomplete selection of works in the DTA and due
to the limitations of the labeling scheme, as in the
context of the 19th century, it is often difficult to
distinguish between ego documents and fiction.
Nonetheless, we want to stress the importance
of the analysis where linguistic representation and
structure, captured in W2VPred, is in line with
these properties and, also, where they disagree.
Both agreement and disagreement between the
prior knowledge and the linguistic representation
found by W2VPred can help identifying the appro-
priate ansatz for a literary analysis of an author.
6 Conclusion
We proposed novel methods to capture domain-
specific semantics, which is essential in many
NLP tasks: Word2Vec with Structure Constraint
(W2VConstr) trains domain-specific word embed-
dings based on prior information on the affinity
structure between sub-corpora; Word2Vec with
Structure Prediction (W2VPred) goes one step
further and predicts the structure while learn-
ing domain-specific embeddings simultaneously.
Both methods outperform baseline methods in
benchmark experiments with respect
to em-
bedding quality and the structure prediction
performance. Specifically, we showed that em-
beddings provided by our methods are superior in
terms of global and domain-specific analogy tests,
word similarity tasks, and the QVEC evaluation,
which is known to highly correlate with down-
stream performance. The predicted structure is
more accurate than the baseline methods including
Burrows’ Delta. We also proposed and success-
fully demonstrated a procedure, Word2Vec with
Denoised Structure Constraint (W2VDen), to cope
with the case where the prior structure information
is not suitable for enhancing embeddings, by us-
ing both W2VConstr and W2VPred. Overall, we
showed the benefits of our methods, regardless of
whether (reliable) structure information is given
or not. Finally, we were able to demonstrate how
to use W2VPred to gain insight into the relation
between 19th century authors from the German
Text Archive and also how to raise further research
questions for high literature.
Acknowledgments
We thank Gilles Blanchard for valuable com-
ments on the manuscript. We further thank Felix
Herron for his support in the data collection pro-
cess. DL and SN are supported by the German
Ministry for Education and Research (BMBF) as
BIFOLD - Berlin Institute for the Foundations
of Learning and Data under grants 01IS18025A
and 01IS18037A. SB was partially funded by the
Platform Intelligence in News project, which is
supported by Innovation Fund Denmark via the
Grand Solutions program and by the European
Union under the Grant Agreement no. 10106555,
FairER. Views and opinions expressed are those
of the author(s) only and do not necessarily re-
flect those of the European Union or European
Research Executive Agency (REA). Neither the
European Union nor REA can be held responsible
for them.
A Implementation Details
A.1 Ex1
All word embeddings were trained with d = 50.
GloVe We run GloVe experiments with α =
100 and minimum occurrence = 25.
Skip-Gram, CBOW We use the Gensim
( ˇReh˚uˇrek and Sojka, 2010) implementation of
Skip-Gram and CBOW with min alpha = 0.0001,
sample = 0.001 to reduce frequent words and
for Skip-Gram, we use 5 negative words and
ns component = 0.75.
Parameter Selection The parameters λ and τ
for DW2V, W2VConstr and W2VPred were se-
lected based on the performance in the analogy
tests on the train set. In order to flatten the
contributions from the n nearest neighbors (for
n = 1, 5, 10), we rescaled the accuracies: For
332
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
each n, accuracies are scaled so that the best and
the worst method is 1 and 0, respectively. Then,
we computed their average and maximum.
Analogies Each analogy consists of two word
pairs (e.g., countryA - capitalA; countryB - capi-
talB). We estimate the vector for the last word by
v = capitalA - countryA + countryB, and check
if capitalB is contained in the n nearest neighbors
v.
of the resulting vector
b
the preceding and the following years. For
WikiFoS and WikiPhil, we respectively chose
k = 3 and k = 2, which corresponds to the
number of subcategories that each main category
consists of.
W2VPred Hyperparameters
for W2VPred
were selected on the train set where we maxi-
mized the accuracy on the global analogy test as
before.
A.2 Ex2
b
Temporal Analogies Each of two word pairs
consists of a year and a corresponding term,
as for example, 2000 - Bush; 2008 - Obama,
and the inference accuracy of the last word by
vector operations on the former three tokens in
the embedded space is evaluated. To apply these
analogies, GloVe, Skip-Gram, and CBOW are
trained individually on each year on the same
vocabulary as W2VPred (same parameters for
GloVe as before, with minimum occurrence=10).
For the other methods, DW2V, W2VConstr, and
W2VPred, we can simply use the embedding ob-
tained in Section 4.3. Note that the parameters
τ and λ were optimized based on the general
analogy tests.
A.3 Ex3
Burrows
It compares normalized bag-of-words
features of documents and sub-corpora, and pro-
vides a distance measure between them. Its para-
meters specify which word frequencies are taken
into account. We found that considering the 100th
to the 300th most frequent words gives the best
structure prediction performance on the train set.
Recall@k Let ˆD ∈ RT ×T be the predicted
structure. We report on recall@k averaged over
all domains:
recall@k = 1
T
recall@kt =
TPt(k) =
FNt(k) =
b(x, i, k) =
T
t′
T
X
t′
X
(
T
t recall@kt,
TPt(k)
TPt(k) + FNt(k)
P
where
,
b(Dt, t′, k) & b( ˆDt, t′, k),
b(Dt, t′, k) & ¬b( ˆDt, t′, k), and
1 xi is one of the k smallest in x,
0 otherwise.
For NYT, we chose k = 2, which means rele-
vant nodes are the two next neighbors, that is,
333
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
References
Hosein
Azarbonyad, Mostafa
Dehghani,
Kaspar Beelen, Alexandra Arkut, Maarten
Marx, and Jaap Kamps. 2017. Words are
in
malleable: Computing semantic
political and media discourse. In Proceed-
ings of
the 2017 ACM on Conference on
Information and Knowledge Management,
pages 1509–1518. https://doi.org/10
.1145/3132847.3132878
shifts
Anne Baillot.
2018. Die Krux mit dem
Netz Verkn¨upfung und Visualisierung bei
digitalen Briefeditionen.
In Toni Bernhart,
Marcus Willand, Sandra Richter, and Andrea
Albrecht, editors, Quantitative Ans¨atze in den
Literatur- und Geisteswissenschaften. Sys-
tematische
historische Perspektiven,
pages 355–370. De Gruyter. https://doi
.org/10.1515/9783110523300-016
und
Robert Bamler and Stephan Mandt. 2017.
Dynamic word embeddings. arXiv preprint
arXiv:1702.08359.
Erik Bleich, Hasher Nisar, and Rana Abdelhamid.
2016. The effect of terrorist events on media
portrayals of Islam and Muslims: Evidence
from New York Times headlines, 1985–2013.
Ethnic and Racial Studies, 39(7):1109–1127.
https://doi.org/10.1080/01419870
.2015.1103886
John Burrows. 2002.
‘Delta’: A measure of
stylistic difference and a guide to likely au-
thorship. Literary and Linguistic Comput-
ing, 17(3):267–287. https://doi.org/10
.1093/llc/17.3.267
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional
transformers for lan-
guage understanding. In Proceedings of the
2019 Conference of the North American Chap-
ter of the Association for Computational Lin-
guistics: Human Language Technologies,
Volume
Short Papers),
pages 4171–4186, Minneapolis, Minnesota.
for Computational Linguis-
Association
tics. https://aclanthology.org/N19
-1423
(Long
and
1
Manaal Faruqui and Chris Dyer. 2014. Com-
munity evaluation and exchange of word
vectors at wordvectors.org. In Proceedings of
52nd Annual Meeting of the Association for
Computational Linguistics: System Demonstra-
tions, pages 19–24. https://doi.org/10
.3115/v1/P14-5004
Christiane Fellbaum. 1998. Wordnet: An elec-
tronic lexical database and some of its ap-
plications. https://doi.org/10.7551
/mitpress/7287.001.0001
Hila Gonen, Ganesh Jawahar, Djam´e Seddah,
and Yoav Goldberg. 2020. Simple,
inter-
pretable and stable method for detecting words
with usage change across corpora. In Pro-
ceedings of
the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 538–555. https://doi.org
/10.18653/v1/2020.acl-main.51
Edouard Grave, Armand Joulin, and Quentin
Berthet. 2019. Unsupervised alignment of em-
beddings with Wasserstein Procrustes. In The
22nd International Conference on Artificial
Intelligence and Statistics, pages 1880–1890.
PMLR.
William L. Hamilton, Jure Leskovec, and Dan
Jurafsky. 2016. Diachronic word embeddings
reveal statistical laws of semantic change. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1489–1501,
Berlin, Germany. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P16-1141
Valentin Hofmann,
Janet B. Pierrehumbert,
and Hinrich Sch¨utze. 2020. Dynamic con-
textualized word embeddings. arXiv preprint
arXiv:2010.12684. https://doi.org/10
.18653/v1/2021.acl-long.542
Ganesh Jawahar and Djam´e Seddah. 2019. Con-
textualized diachronic word representations. In
334
Proceedings of the 1st International Workshop
on Computational Approaches to Historical
Language Change, pages 35–47. https://
doi.org/10.18653/v1/W19-4705
Armand Joulin, Edouard Grave, Piotr Bojanowski,
and Tomas Mikolov. 2016. Bag of
tricks
for efficient text classification. arXiv preprint
arXiv:1607.01759. https://doi.org/10
.18653/v1/E17-2068
Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan
Hegde, and Slav Petrov. 2014. Temporal analy-
sis of language through neural language models.
arXiv preprint arXiv:1405.3515. https://
doi.org/10.3115/v1/W14-2517
Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi,
and Steven Skiena. 2015. Statistically signif-
icant detection of linguistic change. In Pro-
ceedings of the 24th International Conference
on World Wide Web, pages 625–635. Interna-
tional World Wide Web Conferences Steering
Committee. https://doi.org/10.1145
/2736277.2741627
Andrey Kutuzov, Lilja Øvrelid, Terrence
Szymanski, and Erik Velldal. 2018. Diachronic
word embeddings and semantic shifts: A sur-
vey. In Proceedings of the 27th International
Conference on Computational Linguistics,
pages 1384–1397.
Thomas Lansdall-Welfare, Saatviga Sudhahar,
James Thompson, Justin Lewis, FindMyPast
Newspaper Team, and Nello Cristianini. 2017.
Content analysis of 150 years of british period-
icals. Proceedings of the National Academy of
Sciences, 114(4):E457–E465. https://doi
.org/10.1073/pnas.1606380114,
PubMed: 28069962
Omer Levy and Yoav Goldberg. 2014. Neural
word embedding as implicit matrix factor-
ization. In Advances in neural
information
processing systems, pages 2177–2185.
Jani Marjanen, Lidia Pivovarova, Elaine Zosa,
and Jussi Kurunm¨aki. 2019. Clustering ideo-
logical terms in historical newspaper data with
diachronic word embeddings. In 5th Interna-
tional Workshop on Computational History,
HistoInformatics 2019. CEUR-WS.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
word embeddings. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference
on Natural Language
Processing (EMNLP-IJCNLP), pages 66–76.
https://doi.org/10.18653/v1/D19
-1007
C. Sigg, B. Fischer, B. Ommer, V. Roth, and
J. Buhmann. 2007. Non-negative CCA for
audio-visual source separation. In Proceedings
of the IEEE Workshop on Machine Learning for
Signal Processing. https://doi.org/10
.1109/MLSP.2007.4414315
Nina Tahmasebi, Lars Borin, and Adam Jatowt.
2018. Survey of computational approaches
to lexical semantic change. arXiv preprint
arXiv:1811.06278.
Yulia Tsvetkov, Manaal Faruqui, Wang Ling,
Guillaume Lample, and Chris Dyer. 2015.
Evaluation of word vector representations by
subspace alignment. In Proceedings of the 2015
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2049–2054.
https://doi.org/10.18653/v1/D15
-1243
Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao,
and Hui Xiong. 2018. Dynamic word em-
beddings for evolving semantic discovery. In
Proceedings of the Eleventh ACM International
Conference on Web Search and Data Mining,
pages 673–681. ACM. https://doi.org
/10.1145/3159652.3159703, PubMed:
30409101
Ziqian Zeng, Yichun Yin, Yangqiu Song, and
Ming Zhang. 2017. Socialized word embed-
dings. In IJCAI, pages 3915–3921. https://
doi.org/10.24963/ijcai.2017/547
Yating Zhang, Adam Jatowt, Sourav S.
Bhowmick, and Katsumi Tanaka. 2016. The
past is not a foreign country: Detecting se-
mantically similar terms across time. IEEE
Transactions on Knowledge and Data Engi-
neering, 28(10):2793–2807. https://doi
.org/10.1109/TKDE.2016.2591008
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever.
2013a. Exploiting similarities among languages
for machine translation. CoRR, abs/1309.4168.
Tomas Mikolov,
Ilya Sutskever, Kai Chen,
Greg S. Corrado, and Jeff Dean. 2013b.
Distributed representations of words and
phrases and their compositionality. In Advances
in Neural Information Processing Systems,
pages 3111–3119.
Franco Moretti. 2005. Graphs, Maps, Trees:
Abstract Models for a Literary History. Verso.
Jeffrey
Socher,
Pennington, Richard
and
Christopher D. Manning. 2014. GloVe: Global
vectors for word representation. In Proceedings
of the 2014 conference on empirical meth-
ods in natural language processing (EMNLP),
pages 1532–1543. https://doi.org/10
.3115/v1/D14-1162
Stephen D. Reese and Seth C. Lewis. 2009.
Framing the war on terror: The internal-
ization of policy in the US press. Jour-
nalism, 10(6):777–797. https://doi.org
/10.1177/1464884909344480
Radim ˇReh˚uˇrek and Petr Sojka. 2010. Soft-
ware framework for
topic modelling with
large corpora. In Proceedings of the LREC
2010 Workshop on New Challenges for NLP
Frameworks, pages 45–50, Valletta, Malta.
ELRA.
Maja Rudolph and David Blei. 2018. Dynamic
embeddings for language evolution. In Pro-
ceedings of the 2018 World Wide Web Confer-
ence on World Wide Web, pages 1003–1011.
International World Wide Web Conferences
Steering Committee. https://doi.org
/10.1145/3178876.3185999
Maja Rudolph, Francisco Ruiz, Stephan Mandt,
and David Blei. 2016. Exponential family em-
beddings. In Advances in Neural Information
Processing Systems, pages 478–486.
Philippa Shoemark, Farhana Ferdousi Liza, Dong
Nguyen, Scott Hale, and Barbara McGillivray.
2019. Room to glo: A systematic comparison
of semantic change detection approaches with
335
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
3
8
2
0
7
5
9
4
6
/
/
t
l
a
c
_
a
_
0
0
5
3
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3