A Computational Framework for Slang Generation
Zhewei Sun1, Richard Zemel1,2, Yang Xu1,2
1Department of Computer Science, University of Toronto, Toronto, Canada
2Vector Institute for Artificial Intelligence, Toronto, Canada
{zheweisun, zemel, yangxu}@cs.toronto.edu
Abstrait
lan-
Slang is a common type of informal
guage, but its flexible nature and paucity of
data resources present challenges for existing
natural language systems. We take an ini-
tial step toward machine generation of slang
by developing a framework that models the
speaker’s word choice in slang context. Notre
framework encodes novel slang meaning by
relating the conventional and slang senses
of a word while incorporating syntactic and
contextual knowledge in slang usage. We con-
struct the framework using a combination of
probabilistic inference and neural contrastive
learning. We perform rigorous evaluations on
three slang dictionaries and show that our
approach not only outperforms state-of-the-art
language models, but also better predicts the
historical emergence of slang word usages
from 1960s to 2000s. We interpret the pro-
posed models and find that the contrastively
learned semantic space is sensitive to the simi-
larities between slang and conventional senses
of words. Our work creates opportunities for
the automated generation and interpretation of
informal language.
1 Introduction
Slang is a common type of informal language that
appears frequently in daily conversations, sociale
media, and mobile platforms. The flexible and
ephemeral nature of slang (Eble, 1989; Landau,
1984) poses a fundamental challenge for computa-
tional representation of slang in natural language
systèmes. As of today, slang constitutes only a
small portion of text corpora used in the natural
language processing (NLP) community, and it
is severely under-represented in standard lexical
ressources (Michel et al., 2011). Here we propose
a novel framework for automated generation of
slang with a focus on generative modeling of
slang word meaning and choice.
Existing language models trained on large-scale
text corpora have shown success in a variety of
NLP tasks. Cependant, they are typically biased
toward formal
language and under-represent
slang. Consider the sentence ‘‘I have a feeling
himself someday’’. Directly apply-
he’s gonna
ing a state-of-the-art GPT-2 (Radford et al., 2019)
based language infilling model (par exemple., Donahue
et coll., 2020) would result in the retrieval of kill
as the most probable word choice (probability =
7.7%). Cependant, such a language model is lim-
ited and near-insensitive to slang usage, for exam-
ple, ice—a common slang alternative for kill—
received virtually 0 probability, suggesting that
existing models of distributional semantics, même
the transformer-type models, do not capture slang
effectively, if at all.
Our goal is to extend the capacity of NLP
systems toward slang in a principled framework.
As an initial step, we focus on modeling the gen-
erative process of slang, specifically the problem
of slang word choice that we illustrate in Figure 1.
Given an intended slang sense such as ‘‘to kill’’,
we ask how we can emulate the speaker’s choice
of slang word(s) in informal context.1 We are
particularly interested in how the speaker chooses
existing words from the lexicon and makes inno-
vative use of those words in novel slang context
(such as the use of ice in Figure 1).
Our basic premise is that sensible slang word
choice depends on linking conventional or estab-
lished senses of a word (such as ‘‘frozen water’’
for ice) to its emergent slang senses (such as ‘‘to
kill’’ for ice). Par exemple, the extended use of
ice to express killing could have emerged from
the use of cold ice to refrigerate one’s remains. UN
principled semantic representation should adapt to
such similarity relations. Our proposed framework
is aimed at encoding slang that relates informal
utiliser
le
terms meaning
et
sense
1We will
interchangeably.
462
Transactions of the Association for Computational Linguistics, vol. 9, pp. 462–478, 2021. https://doi.org/10.1162/tacl a 00378
Action Editor: Bo Pang. Submission batch: 10/2020; Revision batch: 1/2021; Published 4/2021.
c(cid:13) 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
for expressing a query slang meaning given its
contexte, 2) an encoder based on contrastive learn-
ing that captures slang meaning in a modified
embedding space, et 3) a prior that incorporates
different forms of context. Spécifiquement, the slang
encoder we propose transforms slang and con-
ventional senses of a word into a slang-sensitive
embedding space where they will lie in close
proximity. En tant que tel, senses like ‘‘frozen water’’,
‘‘unfriendliness’’ and ‘‘to kill’’ will be encour-
aged to be in close proximity in the learned embed-
ding space. En outre, the resulting embedding
space will also set apart slang senses of a word
from senses of other unrelated words, and hence
contrasting within-word senses from across-word
senses in the lexicon. A practical advantage of
this encoding method is that semantic similarities
pertinent
to slang can be extracted automati-
cally from a small amount of training data, et
the learned semantic space will be sensitive to
slang.
Our framework also captures the flexible nature
of slang usages in natural context. Ici, we focus
on syntax and linguistic context, although our
framework should allow for the incorporation
of social or extra-linguistic features as well.
Recent work has found that the flexibility of
slang is reflected prominently in syntactic shift
(Pei et al., 2019). Par exemple, ice—most com-
monly used as a noun—is used as a verb to
express ‘‘to kill’’ (in Figure 1). We build on
these findings by incorporating syntactic shift
as a prior in the probabilistic model, which is
integrated coherently with the contrastive neural
encoder that captures flexibility in slang sense
extension. We also show how a contextualized
language infilling model can provide addi-
tional prior information from linguistic context
(c.f. Erk, 2016).
To preview our results, we show that our
framework yields a substantial improvement on
the accuracy of slang generation against state-of-
the-art embedding methods including deep con-
textualized models, in both few-shot and zero-shot
settings. We evaluate our framework rigorously
on three datasets constructed from slang dictio-
naries and in a historical prediction task. We show
evidence that the learned slang embedding space
yields intuitive interpretation of slang and offers
future opportunities for informal natural language
traitement.
Chiffre 1: A slang generation framework that
models speaker’s choice of a slang term (ice)
based on the novel sense (‘‘to kill’’) in context and
relations with conventional senses (par exemple., ‘‘frozen
water’’).
and conventional word senses, hence capturing
semantic similarities beyond those from existing
language models. En particulier, contextualized
embedding models such as BERT would consider
‘‘frozen water’’ to be semantically distant or
irrelevant from ‘‘to kill’’, so they cannot predict
ice to be appropriate for expressing ‘‘to kill’’ in
slang context.
The capacity for generating novel slang word
usages will have several implications and appli-
cations. From a scientific view, modeling the
generative process of slang word choice will help
explain the emergence of novel slang usages over
time—we show how our framework can predict
the emergence of slang in the history of English.
From a practical perspective, automated slang
generation paves the way for automated slang
interpretation. Existing psycho-linguistic work
suggests that language generation and compre-
hension rely on similar cognitive processes (par exemple.,
Pickering and Garrod, 2013; Ferreira Pinto Jr. et
Xu, 2021). De la même manière, a generative model of slang
can be an integral component of slang comprehen-
sion that informs the relation between a candidate
sense and a query word, where the mapping can
be unseen during training. En outre, a gener-
ative approach to slang may also be applied to
downstream tasks such as naturalistic chatbots,
sentiment analysis, and sarcasm detection (voir
work by Aly and van der Haar [2020] and Wilson
et coll., [2020]).
We propose a neural-probabilistic framework
that involves three components: 1) a probabilis-
tic choice model that infers an appropriate word
463
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
2 Related Work
NLP for Non-Literal Language. Machine
processing of non-literal
language has been
explored in different linguistic phenomena includ-
ing metaphor (Shutova et al., 2013b; Veale et al.,
2016; Gao et al., 2018; Dankers et al., 2019),
metonymy (Lapata and Lascarides, 2003; Nissim
and Markert, 2003; Shutova et al., 2013un), irony
(Filatova, 2012), neologism (Cook, 2010), idiom
(Fazly et al., 2009; Liu and Hwa, 2018), vulgarity
(Holgate et al., 2018), and euphemism (Magu
and Luo, 2018). Non-literal usages are present in
slang, but these existing studies do not directly
model the semantics of slang. En outre, travail
in this area has typically focused on detection and
comprehension. In contrast, generation of novel
informal language use has been sparsely tackled.
Computational Studies of Slang. Slang has
been extensively studied as a social phenomenon
(Mattiello, 2005), where social variables such as
genre (Blodgett et al., 2016), ethnicity (Bamman
et coll., 2014), and social-economic status (Labov,
1972, 2006) have been shown to play important
roles in slang construction. Plus récemment, an anal-
ysis of social media text has shown that linguistic
features also correlate with the survival of slang
termes, where linguistically appropriate terms have
a higher likelihood of being popularized (Stewart
and Eisenstein, 2018).
Recent work in the NLP community has also
analyzed slang. Ni and Wang (2017) studied slang
comprehension as a translation task. In their study,
both the spelling of a word and its context are
provided as input to a translation model to decode
a definition sentence. Pei et al. (2019) proposed
end-to-end neural models to detect and identify
slang automatically in natural sentences.
Kulkarni and Wang (2018) have proposed
computational models that derive novel word
forms of slang from spellings of existing words.
Ici, we instead explore the generation of novel
slang usage from existing words and focus on
word sense extension toward slang context, based
on the premise that new slang senses are often
derived from existing conventional word senses.
Our work is inspired by previous research
suggesting that slang relies on reusing words in
the existing lexicon (Eble, 2012). Previous work
has applied cognitive models of categorization to
predict novel slang usage (Sun et al., 2019). Dans
that work, the generative model is motivated by
research on word sense extension (Ramiro et al.,
2018). En particulier, slang generation is opera-
tionalized by categorizing slang senses based on
their similarities to dictionary definitions of can-
didate words, enhanced by collaborative filtering
(Goldberg et al., 1992) to capture the fact that
words with similar senses are likely to extend to
similar novel senses (Lehrer, 1985). Cependant, ce
approach presupposes that slang senses are similar
to conventional senses of a word represented in
standard embedding space, an assumption that is
not warranted and yet to be addressed.
Our work goes beyond the existing work in
three important aspects: 1) We capture seman-
tic flexibility of slang usage by contributing a
novel method based on contrastive learning. Notre
method encodes slang meaning and conventional
meaning of a word under a common embedding
espace, thereby improving the inadequate existing
methodology for slang generation that uses com-
mon, slang-insensitive embeddings. 2) We cap-
ture syntactic and contextual flexibility of slang
usage in a coherent probabilistic framework, un
aspect that was ignored in previous work. 3) Nous
rigorously test our framework against slang sense
definition entries from three large slang dictionar-
ies and contribute a new dataset for slang research
to the community.
Contrastive Learning. Contrastive learning is a
semi-supervised learning technique used to extract
semantic representations in data-scarce situations.
It can be incorporated into neural networks in the
form of twin networks, where two exact copies of
an encoder network are applied to two different
examples. The encoded representations are then
compared and back-propagated. Alternative loss
schemes such as Triplet (Weinberger and Saul,
2009; Wang et al., 2014) and Quadruplet loss (Loi
et coll., 2013) have also been proposed to enhance
stability in training. In NLP, contrastive learning
has been applied to learn similarities between text
(Mueller and Thyagarajan, 2016; Neculoiu et al.,
2016) and speech utterances (Kamper et al., 2016)
with recurrent neural networks.
The contrastive learning method we develop
has two main differences: 1) We do not use recur-
rent encoders because they perform poorly on
dictionary-based definitions; 2) We propose a joint
neural-probabilistic framework on the learned
embedding space instead of resorting to methods
such as nearest-neighbor search for generation.
464
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
3 Computational Framework
Our computational framework for slang genera-
tion comprises three interrelated components: 1) UN
probabilistic formulation of word choice, extend-
ing that in Sun et al. (2019) to leverage encap-
sulated slang senses from a modified embedding
espace; 2) A contrastive encoder—inspired by vari-
ants of twin network (Baldi and Chauvin, 1993;
Bromley et al., 1994)—that constructs a modified
embedding space for slang by adapting the con-
ventional embeddings to incorporate new senses
of slang words; 3) A contextually informed prior
for capturing flexible uses of naturalistic slang.
3.1 Probabilistic Slang Choice Model
Given a query slang sense MS and its context
CS, we cast the problem of slang generation
as inference over candidate words w in our
vocabulary. Assuming all candidate words w are
drawn from a fixed vocabulary V , the posterior is
as follows:
P. (w|MS , CS) ∝ P (MS|w, CS)P. (w|CS)
∝ P (MS|w)P. (w|CS)
(1)
Ici, we define the prior P (w|CS) based on regu-
larities of syntax and/or linguistic context in slang
usage (described in Section 3.4). We formulate the
likelihood P (MS|w)2 by specifying the relations
between conventional senses of word w (denoted
by Mw = {Mw1, Mw2, · · · , Mwm}, c'est à dire., the set
of senses drawn from a standard dictionary) et
the query MS (c'est à dire., slang sense that is outside the
standard dictionary). Spécifiquement, we model the
likelihood by a similarity function that measures
the proximity between the slang sense MS and
the set of conventional senses Mw of word w in
a continuous, embedded semantic space:
P. (MS|w) = P (MS|Mw)
∝ f ({sim(ES, Ewi); Ewi ∈ Ew})
(2)
Ici, F (·) is a similarity function in range [0, 1],
while ES and Ew represent semantic embeddings
of the slang sense MS and the set of conventional
senses Mw. We derive these embeddings from
contrastive learning which we describe in detail in
Section 3.2, and we compare this proposed method
2Ici, we only consider linguistically motivated context
as CS and assume the semantic shift patterns of slang are
universal across all such contexts.
with baseline methods that draw embeddings from
existing sentence embedding models.
Our choice of the similarity function f (·) est
motivated by prior work on few-shot classi-
fication. Spécifiquement, we consider variants of
two established methods: One Nearest Neighbor
(1NN) matching (Koch et al., 2015; Vinyals et al.,
2016) and Prototypical
learning (Snell et al.,
2017).
The 1NN model postulates that a candidate
word should be chosen according to the similarity
between the query slang sense and the closest
conventional sense:
f1nn(ES, Ew) = max
Ewi ∈Ew
sim(ES, Ewi)
(3)
In contrast, the prototype model postulates that a
candidate word should be chosen if its aggregate
(or average) sense is in close proximity of the
query slang sense:
fprototype(ES, Ew) = sim(ES, Eprototype
w
)
Eprototype
w
=
1
|Ew| X
Ewi ∈Ew
Ewi
(4)
In both cases, the similarity between two senses
is defined by the exponentiated negative squared
Euclidean distance in semantic embedding space:
sim(ES, Ew) = exp(−
||ES − Ew||2
2
hs
)
(5)
Ici, hs is a learned kernel width parameter.
We also consider an enhanced version of the
posterior using collaborative filtering (Goldberg
et coll., 1992), where words with similar meaning are
predicted to shift to similar novel slang meanings.
We operationalize this by summing over the close
neighborhood of candidate word L(w):
P. (w|MS , CS) =
X
w′∈L(w)
P. (w|w′)P. (w′|MS , CS)
(6)
Ici, P. (w′|MS, CS) is a fixed term calculated
identically as in Equation (1) and P (w|w′) is the
weighting of words in the close neighborhood of
a candidate word w. This weighting probability
is set proportional to the exponentiated negative
cosine distance between w and its neighbor w′
defined in pre-trained word embedding space, et
the kernel parameter hcf is also estimated from
the training data:
P. (w|w′) ∝ sim(w, w′) = exp(−
d(w, w′)
hcf
) (7)
465
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Ici, d(w, w′) is the cosine distance between two
words in a word embedding space.
to be closer than that of the negative pair with a
margin m:
3.2 Contrastive Semantic Encoding (CSE)
We develop a contrastive semantic encoder for
constructing a new embedding space represent-
ing slang and conventional word senses that do
not bear surface similarities. Par exemple, le
conventional sense of kick such as ‘‘propel with
foot’’ can hardly be related to the slang sense of
kick such as ‘‘a strong flavor’’. The contrastive
embedding space we construct seeks to rede-
fine or warp similarities, such that the otherwise
unrelated senses will be in closer proximity than
they are under existing embedding methods. Pour
example, two metaphorically related senses can
bear strong similarity in slang usage, even though
they may be far apart in a literal sense.
We sample triplets of word senses as input to
contrastive learning, following work on twin net-
travaux (Baldi and Chauvin, 1993; Bromley et al.,
1994; Chopra et al., 2005; Koch et al., 2015).
We use dictionary definitions of conventional and
slang senses to obtain the initial sense embeddings
(See Section 4.4 for details). Each triplet consists
de 1) an anchor slang sense MS, 2) a positive
conventional sense MP , et 3) a negative con-
ventional sense MN . The positive sense should
ideally be encouraged to lie closely to the anchor
slang sense (in the resulting embedding space),
whereas the negative sense should ideally be
further away from both the positive conventional
and anchor slang senses. Section 3.3 describes the
detailed sampling procedures.
Our triplet network uses a single neural encoder
g to project each word sense representation into a
joint embedding space in Rd.
ES = g(MS); EP = g(MP ); EN = g(MN )
Ltriplet = hm + kES − EP k2
2 − kES − EN k2
2i+
(9)
3.3 Triplet Sampling
To train the triplet network, we build data triplets
from every slang lexical entry in our training set.
For each slang sense MS of word w, we create a
positive pair with each conventional sense Mwi of
the same word w. Then for each positive pair, nous
sample a negative example every training epoch by
randomly selecting a conventional sense Mw′ from
a word w′ that is sufficiently different from w, tel
that the corresponding definition sentence Dw′ has
less than 20% overlap in the set of content words
compared to MS and any conventional definition
sentence Dwi of word w. We rank all candidate
words in our vocabulary against w by computing
cosine distances from pre-trained word embed-
dings and consider a word w′ to be sufficiently
different if it is not in the top 20 pour cent.
Neighborhood Sampling (NS). In addition to
using conventional senses of the matching word
w for constructing positive pairs, we also sample
positive senses from a small neighborhood L(w)
of similar words. This sampling strategy pro-
vides linguistic knowledge from parallel semantic
change to encourage neighborhood structure in
the learned embedding space. Sampling from
neighboring words also augments the size of the
training data considerably in this data-scarce task.
We sample negative senses in a similar way,
except that we also consider all conventional def-
inition sentences from neighboring words when
checking for overlapping senses.
(8)
3.4 Contextual Prior
We choose a 1-layer fully connected network with
ReLU(Nair and Hinton, 2010) as the encoder g
for pre-trained word vectors (par exemple., fastText). Pour
contextualized embedding models we consider, g
will be a Transformer encoder (Vaswani et al.,
2017). In both cases, we apply the same encoder
network g to each of the three inputs. We train the
triplet network using the max-margin triplet loss
(Weinberger and Saul, 2009), where the squared
distance between the positive pair is constrained
The final component of our framework is the prior
P. (w|CS) (see Equation (1)) that captures flexible
use of slang words with regard to syntax and dis-
tributional semantics. Par exemple, slang exhibits
flexible Part-of-Speech (POS) shift, Par exemple,
noun→verb transition as in the example ice, et
surprisals in linguistic context, Par exemple, ice in
‘‘I have a feeling he’s gonna [blank] himself some-
day.’’ Here, we formulate the context CS in two
forms: 1) a syntactic-shift prior, namely, the POS
466
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
information PS to capture syntactic regularities in
slang, and/or 2) a linguistic context prior, namely,
the linguistic context KS to capture distributional
semantic context when this is available in the data.
Syntactic-Shift Prior (SSP). Given a query
POS tag PS, we construct the syntactic prior by
comparing POS distribution Pw from literal natu-
ral usage of a candidate word w with a smoothed
POS distribution PS centered at PS. Cependant, nous
cannot directly compare PS to Pw because slang
usage often involves shifting POS (Eble, 2012;
Pei et al., 2019). To account for this, we apply a
transformation T by counting the number of POS
transitions for each slang-conventional definition
pair in the training data. Each column of the
transformation matrix T is then normalized, donc
column i of T can be interpreted as the expected
slang-informed POS distribution given the i’th
POS tag in conventional context (par exemple., the noun
column gives the expected slang POS distribution
of a word that is used exclusively as a noun
in conventional usage). The slang-contextualized
POS distribution P ∗
S can then be computed by
applying T on PS: P ∗
S = T ×PS. The prior can be
estimated by comparing the POS distributions Pw
and P ∗
S via Kullback-Leibler (KL) divergence:
P. (w|CS) = P (w|PS) ∝ exp (cid:16)−KL(Pw, P ∗
1
2
S)(cid:17)
(10)
Intuitively, this prior captures the regularities of
syntactic shift in slang usage, and it favors candi-
date words with POS characteristics that fits well
with the queried POS tag in a slang context.
Linguistic Context Prior (LCP). We use a
language model PLM to a given linguistic context
KS to estimate the probability of each candidate
word:
P. (w|CS) = P (w|KS) ∝ PLM (w|KS)+un (11)
Ici, α is a Laplace smoothing constant. We use
the GPT-2 based language infilling model from
Donahue et al. (2020) as PLM and discuss the
implementation in Section 4.3.
4 Experimental Setup
4.1 Lexical Resources
We collected lexical entries of slang and conven-
tional words/phrases from three separate online
467
dictionaries:3 1) Online Slang Dictionary (OSD),4
2) Green’s Dictionary of Slang (GDoS) (Vert,
2010),5 et 3) an open source subset of Urban
Dictionary (UD) data from Kaggle.6 In addition,
we gathered dictionary definitions of conventional
senses of words from the online version of Oxford
Dictionary (OD).7
Slang Dictionary. Both slang dictionaries
(OSD and GDoS) are freely accessible online and
contain slang definitions with meta-data such as
Part-of-Speech tags. Each data entry contains the
word, its slang definition, and its part-of-speech
(POS) tag. En particulier, OSD includes example
sentence(s) for each slang entry which we leverage
as linguistic context, and GDoS contains time-
tagged references that allow us to perform his-
torical prediction (described later). We removed
all acronyms (c'est à dire., fully capitalized words) as they
generally do not extend meaning, and slang def-
initions that share more than 50% content words
with any of their corresponding conventional def-
initions to account for conventionalized slang. Nous
also removed slang with novel word forms where
no conventional sense definitions are available.
Slang phrases were treated as unigrams because
our task only concerns the association between
senses and lexical items. Each sense definition was
considered a data point during both learning and
prediction. We later partitioned definition entries
from each dataset to be used for training, vali-
dation, and testing. Note that a word may appear
in both training and testing but the pairing bet-
ween word senses are unique (See Section 5.3 pour
discussion).
Conventional Word Senses. We focused on
the subset of OD containing word forms that are
also available in the slang datasets described. Pour
each word entry, we removed all definitions that
have been tagged as informal because these are
likely to represent slang senses. This results in
10,091 et 29,640 conventional sense definitions
corresponding to the OSD and GDoS datasets,
respectivement.
3We obtained written permissions from all authors for the
datasets that we use for this work.
4OSD: http://onlineslangdictionary.com.
5GDoS: https://greensdictofslang.com.
6UD: https://www.kaggle.com/therohk/urban
-dictionary-words-dataset.
7OD: https://en.oxforddictionaries.com.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Data Split. We used all definition entries from
the slang resources such that the corresponding
slang word also exists in the collected OD subset.
The resulting datasets (OSD and GDS) had 2,979
et 29,300 definition entries, respectivement, depuis
1,635 et 6,540 unique slang words, of which
1,253 are shared across both dictionaries. Pour
each dataset, the slang definition entries were
partitioned into a 90% training set and a 10% test
ensemble. Five percent of the data in the training set
were set aside for validation when training the
contrastive encoder.
Urban Dictionary.
In addition to the two
datasets described above, we provide a third
dataset based on Urban Dictionary (UD) that are
made available via Kaggle. Unlike the previous
two datasets, we are able to make this one pub-
licly available without requiring one to obtain
prior permission from the data owners.8 To guard
against the crowd-sourced and noisy nature of UD,
we ensure quality by keeping definition entries
such that 1) it has at least 10 more upvotes than
downvotes, 2) the word entry exists in one of OSD
or GDoS, et 3) at least one of the corresponding
definition sentences in these dictionaries have a
20% or greater overlap in the set of content words
with the UD definition sentence. We also remove
entries with more than 50% overlap in content
words with any other UD slang definitions under
the same word to remove duplicated senses. Ce
results in 2,631 definitions entries from 1,464
unique slang words. The corresponding OD sub-
set has 10,357 conventional sense entries. We find
entries from UD to be more stylistically variable
and lengthier, with a mean entry length of 9.73 dans
comparison to 7.54 et 6.48 for OSD and GDoS,
respectivement.
4.2 Part-of-Speech Data
The natural POS distribution Pw for each candi-
date word w is obtained using POS counts from
the most recent available decade of the HistWords
project (Hamilton et al., 2016). For word entries
that are not available, mostly phrases, we estimate
Pw by counting POS tags from Oxford Dictionary
(OD) entries of w.
When estimating the slang POS transformation
for the syntactic prior, we mapped all POS tags
into one of the following six categories: {verb,
8Code and data available at: https://github.com
other, adv, noun, interj, adj} for the OSD experi-
ments. For GDS, the tag ‘interj’ was excluded as
it is not present in the dataset.
4.3 Contextualized Language Model Baseline
We considered a state-of-the-art GPT-2 based
language infilling model from Donahue et al.
(2020) as both a baseline model and a prior to
our framework (on the OSD data where context
sentences are available for the slang entries). Pour
each entry, we blanked out the corresponding
slang word in the example sentence, effectively
treating our task as a cloze task. We applied
the infilling model to obtain probability scores
for each of the candidate words and apply a
Laplace smoothing of 0.001. We fine-tuned the
LM infilling model using all example sentences
in the OSD training set until convergence. Nous
also experiment with a combined prior where
the two priors are combined using element-wise
multiplication and re-normalization.
4.4 Baseline Embedding Methods
To compare with and compute the baseline
embedding methods M for definition sentences,
we used 300-dimensional fastText embeddings
(Bojanowski et al., 2017) pre-trained with sub-
word information on 600 billion tokens from
Common Crawl9 as well as 768-dimensional
Sentence-Bert (SBERT) (Reimers and Gurevych,
2019) encoders pretrained on Wikipedia and fine-
tuned on NLI datasets (Bowman et al., 2015;
Williams et al., 2018). The fastText embed-
dings were also used to compute cosine distances
d(w, w′) in Equation (7). Embeddings for phrases
and the fastText-based sentence embeddings were
both computed by applying average pooling to
normalized word-level embeddings of all content
words. In the case of SBERT, we fed in the
original definition sentence.
4.5 Training Procedures
We trained the triplet networks for a maximum
de 20 epochs using Adam (Kingma and Ba, 2015)
with a learning rate of 10−4 for fastText and 2−5
for SBERT based models. We preserved dimen-
sions of the input sense vectors for the contrastive
embeddings learned by the triplet network (que
est, 300 for fastText and 768 for SBERT). Nous avons utilisé
1,000 fully-connected units in the contrastive
/zhewei-sun/slanggen.
9http://commoncrawl.org.
468
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
encoder’s hidden layer for fastText based models.
Triplet margins of 0.1 et 1.0 were used with
fastText and SBERT embeddings respectively.
We trained the probabilistic classification
framework by minimizing negative log likeli-
hood of the posterior P (w∗|MS, CS) on the
ground-truth words for all definition entries in the
training set. We jointly optimized kernel width
parameters using L-BFGS-B (Byrd et al., 1995).
To construct a word w’s neighborhood L(w) dans
both collaborative filtering and triplet sampling,
we considered the 5 closest words in cosine
distances of their fastText embeddings.
5 Results
5.1 Model Evaluation
We first evaluated our models quantitatively by
predicting slang word choices: Given a novel
slang sense (a definition taken from a slang
dictionary) and its part-of-speech, how likely
is the model to predict the ground-truth slang
recorded in the dictionary? To assess model per-
formance, we allowed each model to make up to
|V | ranked predictions where V is the vocabulary
of the dataset being evaluated, and we used stan-
dard Area-Under-Curve (AUC) percentage from
Receiver-Operator Characteristic (ROC) curves
to assess overall performance.
We show the ROC curves for the OSD eval-
uation in Figure 2 as an illustration. The AUC
metric is similar to and a continuous extension
to an F1 score by comprehensively sweeping
through the number of candidate words a model
is allowed to predict. We find this metric to be the
most appropriate because multiple words may be
appropriate to express a probe slang sense.
To examine the effectiveness of the contrastive
embedding method, we varied the semantic rep-
resentation as input to the models by considering
both fastText and SBERT (described in Sec 4.4).
For both embeddings, we experimented with the
baseline variant without the contrastive encod-
ing (par exemple., vanilla embeddings from fastText and
SBERT). We then augmented the models incre-
mentally with the contrastive encoder and the
priors whenever applicable to examine their
respective and joint effects on model performance
in slang word choice prediction. We observed that,
under both datasets, models leveraging the con-
trastively learned sense embeddings more reliably
predict the ground-truth slang words, indicated by
469
Chiffre 2: ROC curves for slang generation in
OSD test set. Collaborative-filtering prototype
model was used for prediction. Ticks on the y-axis
indicate median precision of the models.
both higher AUC scores and consistent improve-
ment in precision over all retrieval ranks. Note
that the vanilla SBERT model, despite being a
much larger model trained on more data, only pre-
sented minor performance gains when compared
with the plain fastText model. This suggests that
simply training larger models on more data does
not better encapsulate slang semantics.
We also analyzed whether the contrastive em-
beddings are robust under different choices of the
probabilistic models. Spécifiquement, we considered
the following four variants of the models: 1)
1-Nearest Neighbor (1NN), 2) Prototype, 3) 1NN
with collaborative filtering (CF), et 4) Proto-
type with CF. Our results show that applying
contrastively learned semantic embeddings con-
sistently improves predictive accuracy across all
probabilistic choice models. The complete set of
results for all 3 datasets is summarized in Table 1.
We noted that the syntactic information from
the prior improves predictive accuracy in all set-
tings, while by itself predicting significantly better
than chance. On OSD, we used the context sen-
tences alone in a contextualized language infilling
model for prediction and also incorporating it as a
prior. Encore, while the prior consistently improves
model prediction, both by itself and when paired
with the syntactic-shift prior, the language model
alone is not sufficient.
We found the syntactic-shift prior and linguistic
context prior to be capturing complementary infor-
mation (mean Spearman correlation of 0.054 ±
0.003 across all examples), resulting in improved
performance when they are combined together.
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Model
1NN Prototype
1NN+CF
Proto+CF
Dataset 1: Online Slang Dictionary (OSD)
Prior Baseline – Uniform
Prior Baseline – Syntactic-shift
Prior Baseline – Linguistic Context(Donahue et al., 2020)
Prior Baseline – Syntactic-shift + Linguistic Context
FastText Baseline(Sun et al., 2019)
FastText + Contrastive Semantic Encoding (CSE)
FastText + CSE + Syntactic-shift Prior (SSP)
FastText + CSE + Linguistic Context Prior (LCP)
FastText + CSE + SSP + LCP
SBERT Baseline
SBERT + CSE
SBERT + CSE + SSP
SBERT + CSE + LCP
SBERT + CSE + SSP + LCP
Dataset 2: Green’s Dictionary of Slang (GDoS)
Prior Baseline – Uniform
Prior Baseline – Syntactic-shift
FastText Baseline(Sun et al., 2019)
FastText + Contrastive Semantic Encoding (CSE)
FastText + CSE + Syntactic-shift Prior (SSP)
SBERT Baseline
SBERT + CSE
SBERT + CSE + SSP
Dataset 3: Urban Dictionary (UD)
Prior Baseline – Uniform
FastText Baseline(Sun et al., 2019)
FastText + Contrastive Semantic Encoding (CSE)
SBERT Baseline
SBERT + CSE
63.2
71.7
73.8
73.6
75.4
67.4
76.6
77.6
77.8
78.5
68.2
73.4
74.5
67.1
77.8
78.5
65.2
71.0
72.4
76.2
51.9
60.6
61.8
67.3
51.5
61.0
52.3
65.2
71.6
73.4
73.2
74.9
68.1
77.4
78.0
78.4
79.0
69.9
74.0
74.8
68.0
78.2
78.7
68.8
72.2
71.7
76.6
66.0
73.0
75.2
74.7
76.5
69.5
77.4
78.8
78.1
79.4
67.8
74.1
75.2
66.8
77.4
78.3
67.6
71.5
74.0
77.2
68.7
72.6
74.4
73.9
75.6
72.0
78.0
78.9
78.7
79.5
69.7
74.8
75.8
67.5
77.9
78.6
70.9
73.7
74.4
78.8
Tableau 1: Summary of model AUC scores (%) for slang generation in 3 slang datasets.
Cependant, the majority of the performance gain
is attributed to the augmented contrastive em-
beddings, which highlights the importance and
supports our premise that encoding of slang and
conventional senses is crucial
to slang word
choice.
d and assessed whether our model can predict
the choices of slang words for slang senses that
emerged in the future decade. We scored the
models on slang words that emerged during each
subsequent decade, simulating a scenario where
future slang usages are incrementally predicted.
5.2 Historical Analysis of Slang Emergence
We next performed a temporal analysis to evaluate
whether our model explains slang emergence over
temps. We used the time tags available in the
GDoS dataset and predicted historically emerged
slang from the past 50 années (1960s–2000s). For a
given slang entry recorded in history, we tagged its
emergent decade using the earliest dated reference
available in the dictionary. For each future decade
d, we trained our model using all entries before
470
the result
Tableau 2 summarizes
from the
historical analysis for the non-contrastive SBERT
baseline and our full model (with contrastive
embeddings), based on the GDoS data. AUC
scores are similar to the previous findings but
slightly lower for both models in this historical
setting. Dans l'ensemble, we find the full model to improve
the baseline consistently over the course of history
examined and achieve similar performance as in
the synchronic evaluation. This provides strong
evidence that our framework is robust and has
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Decade
# Test Baseline
SBERT+CSE+SSP
(un) Online Slang Dictionary (OSD)
1960s
1970s
1980s
1990s
2000s
2010
1757
1655
1605
1374
67.5
66.3
66.3
66.2
65.9
77.4
77.9
78.6
75.4
77.0
Tableau 2: Summary of model AUC scores
in historical prediction of slang emergence
(1960s-2000s). The non-contrastive SBERT
baseline and the proposed full model (avec
contrastive embedding, CSE, and syntactic
prior, SSP) are compared using collaborative-
filtering Prototype. Models were trained and
tested incrementally through time (test sizes
shown) and trained initially on 20,899 Green’s
Dictionary definitions prior to the 1960s.
explanatory power over the historical emergence
of slang.
5.3 Model Error Analysis and Interpretation
Few-shot vs Zero-shot Prediction. We analyze
our model errors and note that one source of error
stems from whether the probe slang word has
appeared during training versus not. Ici, chaque
candidate word is treated as a class and each
slang sense of a word seen in the training set is
considered a ‘shot’. In the few-shot case, bien que
the slang sense in question was not observed in
prediction, the model has some a priori knowl-
edge about its target word and how it has been
used in slang context (because a word may have
multiple slang senses), thus allowing the model to
generalize toward novel slang usage of that word.
In the zero-shot case, the model needs to select
a novel slang word (c'est à dire., one that never appeared
in training) and hence has no direct knowledge
about how that word should be extended in a
slang context. Such knowledge must be inferred
indirectly, and in this case, from the conventional
senses of the candidate words. The model can
then infer how words with similar conventional
senses might extend to slang context.
Tableau 3 outlines the AUC scores of the collabo-
ratively filtered prototype models under few-shot
and zero-shot settings. For each dataset, we par-
titioned the corresponding test set by whether the
target word appears at least once within another
definition entry in the training data. This results
dans 179, 2,661, et 165 few-shot definitions in
Model
Few-shot Zero-shot
Prior – Uniform
Prior – Syntactic-shift
Prior – Linguistic Context
Prior – SSP + LCP
FT Baseline
FT + CSE
FT + CSE + SSP
FT + CSE + LCP
FT + CSE + SSP + LCP
SBERT Baseline
SBERT + CSE
SBERT + CSE + SSP
SBERT + CSE + LCP
SBERT + CSE + SSP + LCP
55.1
63.4
72.4
74.7
68.3
74.8
76.8
76.7
78.7
72.2
78.3
79.3
79.8
80.7
47.1
56.4
45.8
56.4
69.2
69.4
70.9
69.5
70.9
71.6
77.5
78.3
77.1
77.8
(b) Green’s Dictionary of Slang (GDoS)
Model
Few-shot
Zero-shot
Prior – Uniform
Prior – Syntactic-shift
FT Baseline
FT + CSE
FT + CSE + SSP
SBERT Baseline
SBERT + CSE
SBERT + CSE + SSP
51.8
61.6
70.6
76.3
77.3
68.3
79.0
79.7
48.1
54.8
61.3
59.2
60.7
59.6
66.8
67.7
(c) Urban Dictionary (UD)
Model
Prior – Uniform
FT Baseline
FT + CSE
SBERT Baseline
SBERT + CSE
Few-shot Zero-shot
54.2
68.6
76.2
73.0
80.6
49.1
75.0
69.4
76.8
75.6
Tableau 3: Model AUC scores (%) for Few-shot and
Zero-shot test sets (‘‘CSE’’ for contrastive embed-
ding, ‘‘SSP’’ for syntactic prior, ‘‘LCP’’ for con-
textual prior, and ‘‘FT’’ for fastText).
Chiffre 3: Degree of synonymy in the test examples
relative to training data in each of the 3 datasets.
471
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
(un) Online Slang Dictionary (OSD)
Training
Testing
Model
FT Baseline
FT + CSE
SBERT Baseline
SBERT + CSE
0.35 ± 0.033
0.28 ± 0.030
0.32 ± 0.033
0.23 ± 0.029
0.33 ± 0.011
0.15 ± 0.0083
0.34 ± 0.011
0.097 ± 0.0069
(b) Green’s Dictionary of Slang (GDoS)
Training
0.30 ± 0.0034
0.19 ± 0.0028
0.32 ± 0.0035
0.10 ± 0.0019
(c) Urban Dictionary (UD)
Training
0.34 ± 0.012
0.20 ± 0.010
0.34 ± 0.012
0.10 ± 0.0075
Testing
0.31 ± 0.037
0.28 ± 0.033
0.28 ± 0.034
0.23 ± 0.031
Testing
0.30 ± 0.010
0.26 ± 0.0097
0.32 ± 0.010
0.22 ± 0.0089
Model
FT Baseline
FT + CSE
SBERT Baseline
SBERT + CSE
Model
FT Baseline
FT + CSE
SBERT Baseline
SBERT + CSE
Tableau 4: Mean Euclidean distance from slang senses
to prototypical conventional senses.
the OSD and UD data). This issue caused the mod-
els to be more biased towards generalizing usage
patterns from more commonly observed words.
Enfin, the SBERT-based models tend to be more
robust towards unseen word-forms, potentiellement
benefiting from their contextualized properties.
Synonymous Slang Senses. We also examined
the influence of synonymy (or sense overlap) dans le
slang datasets. We quantified the degree of sense
synonymy by checking each test sense against all
training senses and computing the edit distance
between the corresponding sets of constituent
content words of the sense definitions.
Chiffre 3 shows the distribution of degree of
synonymy across all test examples where the edit
distance to the closest training example is con-
sidered. We perform our evaluation by binning
based on the degree of synonymy and summa-
rize the results in Figure 4. We do not observe
any substantial changes in performance when
controlling for the degree of synonymy, et en
fact, the highly synonymous definitions appear
to be more difficult (as opposed to easier) pour
the models. Dans l'ensemble, we find the models to yield
consistent improvement across different degrees
of synonymy, particularly with the SBERT based
full model, which offers improvement
in all
cases.
Chiffre 4: Model AUC scores (%) under test sets
with different degrees of synonymy present in
entraînement, for the baselines and the best performing
models (under collaborative-filtering prototype).
OSD, GDoS, and UD, respectivement, along with
120, 269, 96 zero-shot definitions. From our
résultats, we observed that it is more challeng-
ing for the model to generalize usage patterns
to unseen words, with AUC scores often being
higher in the few-shot case. Dans l'ensemble, we found the
model to have the most issues handling zero-shot
cases from GDoS due to the fine-grained senses
recorded in this dictionary, where a word has
more slang senses on average (in comparison to
472
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
e
d
toi
/
t
un
c
je
/
je
un
r
t
je
c
e
–
p
d
F
/
d
o
je
/
.
1
0
1
1
6
2
/
t
je
un
c
_
un
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
je
un
c
_
un
_
0
0
3
7
8
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Model
1. True slang: kick; Slang sense: ‘‘a thrill, amusement or excitement’’
Sample usage: I got a huge kick when things were close to out of hand.
Top-5 slang words predicted by model
Predicted rank of the true slang
SBERT Baseline
Full model
2. True slang: whiff; Slang sense: ‘‘to kill, to murder, [play on SE, to blow away]’’
thrill, pleasure, frolic, yahoo, sparkle
twist, spin, trick, crank, punch
3496 / 6540
96 / 6540
Sample usage: The trouble is he wasn’t alone when you whiffed him.
SBERT Baseline
Full model
3. True slang: chirp; Slang sense: ‘‘an act of informing, a betrayal’’
suicide, homicide, murder, killing, rape
spill, swallow, blow, flare, dash
2735 / 6540
296 / 6540
Sample usage: Once we’re sure there’s no back-fire anywhere, the Sparrow will chirp his last chirp.
SBERT Baseline
Full model
4. True slang: red; Slang sense: ‘‘a communist, a socialist or anyone considered to have left-wing leanings’’
dupe, sin, scam, humbug, hocus
chirp, squeal, squawk, fib, chat
2431 / 6540
1 / 6540
Sample usage: Why the hell would I bed a red?
SBERT Baseline
Full model
5. True slang: team; Slang sense: ‘‘a gang of criminals’’
leveller, wildcat, mole, pawn, domino
orange, bluey, black and tan, violet, shadow
Sample usage: And a little team to follow me – all wanted up the yard.
SBERT Baseline
Full model
gangster, hoodlum, thug, mob, gangsta
brigade, mob, entreprise, gang, school
1744 / 6540
164 / 6540
826 / 6540
15 / 6540
Tableau 5: Example slang word predictions from the contrastively learned full model and SBERT baseline
(with no contrastive embedding) on slang usage from the Green’s Dictionary. Each example shows the
true slang, the probe slang sense, a sample usage, the alternative slang words predicted by each model,
and the predicted rank (colored bars indicate inverse rank) of the true slang from a lexicon of 6,540
words.
Semantic Distance. To understand the conse-
quence of contrastive embedding, we examine
the relative distance between conventional and
slang senses of a word in embedding space and
the extent to which the learned semantic relations
might generalize. We measured the Euclidean
distance between each slang embedding with the
prototype sense vector of all candidate words,
without applying the probabilistic choice models.
Tableau 4 shows the ranks of the corresponding
candidate words, averaged over all slang sense
embeddings considered and normalized between
0 et 1. We observed that contrastive learning
indeed brings closer slang and conventional senses
(from the same word), as indicated by lower mean
semantic distance after the embedding procedure
is applied. Under both fastText and SBERT,
we obtained significant
improvement on both
the OSD and GDoS test sets (p < 0.001). On
UD, the improvement is significant for SBERT
(p = 0.018) but marginal for fastText (p = 0.087).
Examples of Model Prediction. Table 5 shows
5 example slangs from the GDoS test set and the
top words predicted by both the baseline SBERT
model and the full SBERT-based model with
contrastive learning. The full model exhibits a
greater tendency to choose words that appear
remotely related to the queried sense (e.g., spill,
swallow for the act of killing), while the baseline
model favors words that share only surface seman-
tic similarity (e.g., retrieving murder and homi-
cide directly). We found cases where the model
extends meaning metaphorically (e.g., animal to
action, in the case of chirp), euphemistically (e.g.,
spill and swallow for kill), and generalization of
a concept (e.g., brigade and mob for gang), all
of which are commonly attested in slang usage
(Eble, 2012).
We found the full model to achieve better
retrieval accuracy in cases where the queried
slang undergoes a non-literal sense extension,
whereas the baseline model is situated at retriev-
ing candidate words with incremental or literal
changes in meaning. We also noted many cases
where the true slang word is difficult to predict
without appropriate background knowledge. For
instance, the full-model suggested words such as
orange and bluey to mean ‘‘a communist’’ but
could not pinpoint the color red without knowing
its cultural association to communism. Finally,
we observed that our model to perform generally
worse when the target slang sense can hardly
be related to conventional senses of the target
473
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
word, suggesting that cultural knowledge may be
important to consider in the future.
6 Conclusion
We have presented a framework that combines
probabilistic inference with neural contrastive
learning to generate novel slang word usages. Our
results suggest that capturing semantic and con-
textual flexibility simultaneously helps to improve
the automated generation of slang word choices
with limited training data. To our knowledge
this work constitutes the first formal computa-
tional approach to modeling slang generation,
and we have shown the promise of the learned
semantic space for representing slang senses. Our
framework will provide opportunities for future
research in the natural language processing of
informal
language, particularly the automated
interpretation of slang.
Acknowledgments
We thank the anonymous TACL reviewers and
action editors for their constructive and detailed
comments. We thank Walter Rader and Jonathon
Green respectively for their permissions to use The
Online Slang Dictionary and Green’s Dictionary
of Slang for our research. We thank Graeme Hirst,
Ella Rabinovich, and members of the Language,
Cognition, and Computation (LCC) Group at the
University of Toronto for offering thoughtful
feedback to this work. We also thank Dan Jurafsky
and Derek Denis for stimulating discussion. This
work was supported by a NSERC Discovery
Grant RGPIN-2018-05872 and a Connaught New
Researcher Award to YX.
References
Elton Shah Aly and Dustin Terence van der Haar.
2020. Slang-based text sentiment analysis in
instagram. In Fourth International Congress on
Information and Communication Technology,
pages 321–329, Singapore. Springer Singapore.
DOI: https://doi.org/10.1007/978
-981-32-9343-4 25
Pierre Baldi and Yves Chauvin. 1993. Neural
networks for fingerprint recognition. Neural
Computation, 5(3):402–418. DOI: https://
doi.org/10.1162/neco.1993.5.3.402
David Bamman, Jacob Eisenstein, and Tyler
Schnoebelen. 2014. Gender identity and lexical
variation in social media. Journal of Socio-
linguistics, 18(2):135–160. DOI: https://
doi.org/10.1111/josl.12080
Su Lin Blodgett, Lisa Green, and Brendan
O’Connor. 2016. Demographic dialectal varia-
tion in social media: A case study of African-
American English. In Proceedings of the 2016
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1119–1130,
Austin, Texas. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/D16-1120
Piotr Bojanowski, Edouard Grave, Armand Joulin,
and Tomas Mikolov. 2017. Enriching word vec-
tors with subword information. Transactions
of the Association for Computational Linguis-
tics, 5:135–146. DOI: https://doi.org
/10.1162/tacl a 00051
Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural
the
language inference. In Proceedings of
2015 Conference on Empirical Methods in
Natural Language Processing, pages 632–642,
Lisbon, Portugal. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/D15-1075
Jane Bromley, Isabelle Guyon, Yann LeCun,
Eduard S¨ackinger, and Roopak Shah. 1994.
Signature verification using a ‘‘siamese’’ time
delay neural network. In J. D. Cowan, G.
Tesauro, and J. Alspector, editors, Advances in
Information Processing Systems 6,
Neural
pages 737–744, Morgan-Kaufmann. DOI:
https://doi.org/10.1142/9789812797926
0003
Richard H. Byrd, Peihuang Lu, Jorge Nocedal,
and Ciyou Zhu. 1995. A limited memory
algorithm for bound constrained optimiza-
tion. SIAM Journal on Scientific Computing,
16:1190–1208. DOI: https://doi.org
/10.1137/0916069
Sumit Chopra, Raia Hadsell, and Yann LeCun.
2005. Learning a similarity metric discrimi-
natively, with application to face verification.
In Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and
Pattern Recognition (CVPR’05) - Volume 1
474
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
- Volume 01, CVPR ’05, pages 539–546,
IEEE Computer
Washington, DC, USA.
Society.
C. Paul Cook. 2010. Exploiting Linguistic
Knowledge to Infer Properties of Neologisms.
PhD thesis, University of Toronto, Toronto,
Canada.
Verna Dankers, Marek Rei, Martha Lewis, and
Ekaterina Shutova. 2019. Modelling the inter-
play of metaphor and emotion through multitask
learning. In Proceedings of
the 2019 Con-
ference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2218–2229,
Hong Kong, China. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/D19-1227
Chris Donahue, Mina Lee, and Percy Liang.
2020. Enabling language models to fill in the
blanks. In Proceedings of
the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 2492–2501, Online. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.225
Connie C. Eble. 1989. The ephemerality of
American college slang. In The Fifteenth Lacus
Forum, 15, pages 457–469.
Connie C. Eble. 2012. Slang & Sociability:
In-group Language Among College Students.
University of North Carolina Press, Chapel
Hill, NC.
Katrin Erk. 2016. What do you know about an
alligator when you know the company it keeps?
Semantics and Pragmatics, 9(17):1–63. DOI:
https://doi.org/10.3765/sp.9.17
Afsaneh Fazly, Paul Cook,
and Suzanne
Stevenson. 2009. Unsupervised type and token
identification of idiomatic expressions. Com-
putational Linguistics, 35(1):61–103. DOI:
https://doi.org/10.1162/coli.08
-010-R1-07-048
Renato Ferreira Pinto Jr. and Yang Xu. 2021. A com-
putational theory of child overextension. Cog-
nition, 206:104472. DOI: https://doi.org
/10.1016/j.cognition.2020.104472
475
Elena Filatova. 2012. Irony and sarcasm: Corpus
generation and analysis using crowdsourcing.
In Proceedings of
the Eighth International
Conference on Language Resources and Eval-
uation (LREC-2012), pages 392–398, Istanbul,
Turkey. European Language Resources Asso-
ciation (ELRA).
Ge Gao, Eunsol Choi, Yejin Choi, and Luke
Zettlemoyer. 2018. Neural metaphor detec-
tion in context. In Proceedings of the 2018
Conference on Empirical Methods in Nat-
ural Language Processing, pages 607–613,
Brussels, Belgium. Association for Compu-
tational Linguistics. DOI: https://doi
.org/10.18653/v1/D18-1060
David Goldberg, David Nichols, Brian M. Oki,
and Douglas Terry. 1992. Using collaborative
filtering to weave an information tapestry.
Comunications of ACM, 35:61–70. DOI:
https://doi.org/10.1145/138859
.138867
Jonathan Green. 2010. Greens Dictionary of
Slang. Chambers, London.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
reveal
statistical
William L. Hamilton, Jure Leskovec, and Dan
Jurafsky. 2016. Diachronic word embed-
dings
semantic
change. In Proceedings of the 54th Annual
the Association for Computa-
Meeting of
tional Linguistics (Volume 1: Long Papers),
pages 1489–1501, Berlin, Germany. Associa-
tion for Computational Linguistics.
laws of
Eric Holgate, Isabel Cachola, Daniel Preot¸iuc-
Pietro, and Junyi Jessy Li. 2018. Why swear?
Analyzing and inferring the intentions of vul-
gar expressions. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 4405–4414,
Brussels, Belgium. Association for Compu-
tational Linguistics. DOI: https://doi
.org/10.18653/v1/D18-1471
Herman Kamper, Weiran Wang, and Karen
Livescu. 2016. Deep convolutional acoustic
word embeddings using word-pair side infor-
mation. In 2016 IEEE International Conference
on Acoustics, Speech and Signal Processing
(ICASSP), pages 4950–4954. DOI: https://
doi.org/10.1109/ICASSP.2016.7472619
Diederik P. Kingma and Jimmy Ba. 2015.
Adam: A method for stochastic optimization.
In 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track
Proceedings.
Gregory Koch, Richard Zemel, and Ruslan
Salakhutdinov. 2015. Siamese neural networks
In Deep
for one-shot
Learning Workshop at
International
Conference on Machine Learning, volume 2.
image recognition.
the
Vivek Kulkarni and William Yang Wang. 2018.
Simple models for word formation in slang.
In Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 1424–1434, New Orleans, Louisiana.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/N18-1129
William Labov. 1972. Language in the Inner
City: Studies in the Black English Vernacular.
University of Pennsylvania Press.
William Labov. 2006. The Social Stratification
of English in New York City. Cambridge
University Press. DOI: https://doi.org
/10.1017/CBO9780511618208
Sidney Landau. 1984. Dictionaries: The Art and
Craft of Lexicography, Charles Scribner’s Sons,
New York, NY.
Maria Lapata
and Alex Lascarides. 2003.
logical meton-
A probabilistic account of
ymy. Computational
29(2):
261–315. DOI: https://doi.org/10.1162
/089120103322145324
Linguistics,
Marc T. Law, Nicolas Thome, and Matthieu
Cord. 2013. Quadruplet-wise image similarity
learning. In 2013 IEEE International Con-
ference on Computer Vision, pages 249–256.
DOI: https://doi.org/10.1109/ICCV
.2013.38
Adrienne Lehrer. 1985. The
influence of
semantic fields on semantic change. His-
torical Semantics: Historical Word For-
mation, 29:283–296. DOI: https://doi
.org/10.1515/9783110850178.283
476
Changsheng Liu and Rebecca Hwa. 2018.
Heuristically informed unsupervised idiom us-
age recognition. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1723–1731,
Brussels, Belgium. Association for Computa-
tional Linguistics.
Rijul Magu and Jiebo Luo. 2018. Determin-
ing code words in euphemistic hate speech
In Pro-
using word embedding networks.
ceedings of
the 2nd Workshop on Abusive
Language Online (ALW2), pages 93–100,
Brussels, Belgium. Association for Computa-
tional Linguistics. DOI: https://doi.org
/10.18653/v1/W18-5112
Elisa Mattiello. 2005. The pervasiveness of slang
in standard and non-standard English. Mots
Palabras Words, 6:7–41.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva
Presser Aiden, Adrian Veres, Matthew K.
Gray, Joseph P. Pickett, Dale Hoiberg, Dan
Clancy, Peter Norvig, Jon Orwant, Steven
Pinker, Martin A. Nowak, and Erez Lieberman
Aiden. 2011. Quantitative analysis of culture
using millions of digitized books. Science,
331:176–182. DOI: https://doi.org/10
.1126/science.1199644
Jonas Mueller and Aditya Thyagarajan. 2016.
Siamese recurrent architectures for learning
the
sentence similarity.
Thirtieth AAAI Conference on Artificial
Intelligence, AAAI’16, pages 2786–2792.
AAAI Press.
In Proceedings of
Vinod Nair and Geoffrey E. Hinton. 2010.
restricted
Rectified linear units
boltzmann machines. In Proceedings of
the
27th International Conference on International
Conference on Machine Learning, ICML’10,
pages 807–814, USA. Omnipress.
improve
Paul Neculoiu, Maarten Versteegh, and Mihai
Rotaru. 2016. Learning text similarity with
In Proceed-
Siamese recurrent networks.
ings of
the 1st Workshop on Represen-
tation Learning for NLP, pages 148–157,
Berlin, Germany. Association for Compu-
tational Linguistics. DOI: https://doi
.org/10.18653/v1/W16-1617
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Ke Ni
and William Yang Wang. 2017.
Learning to explain non-standard English
words and phrases. In Proceedings of
the
Eighth International Joint Conference on
Natural Language Processing (Volume 2: Short
Papers), pages 413–417, Taipei, Taiwan. Asian
Federation of Natural Language Processing.
Malvina Nissim and Katja Markert. 2003. Syntac-
tic features and word similarity for supervised
metonymy resolution. In Proceedings of the
41st Annual Meeting on Association for Com-
putational Linguistics - Volume 1, ACL ’03,
pages 56–63, Stroudsburg, PA, USA. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.3115/1075096
.1075104
Zhengqi Pei, Zhewei Sun, and Yang Xu.
2019. Slang detection and identification. In
the 23rd Conference on
Proceedings of
Computational Natural Language Learning
(CoNLL), pages 881–889, Hong Kong, China.
Association for Computational Linguistics.
Martin J. Pickering and Simon Garrod. 2013.
Forward models and their implications for
and dialogue.
production,
Behavioral
36(4):
377–392. DOI: https://doi.org/10
.1017/S0140525X12003238
comprehension,
and Brain
Sciences,
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.
Christian Ramiro, Mahesh Srinivasan, Barbara C.
Malt, and Yang Xu. 2018. Algorithms in
the historical emergence of word senses.
Proceedings of
the National Academy of
Sciences, 115:2323–2328. DOI: https://
doi.org/10.1073/pnas.1714730115
Nils Reimers
and Iryna Gurevych. 2019.
Sentence-BERT: Sentence embeddings using
Siamese BERT-networks. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
(EMNLP-IJCNLP),
Language
pages
3982–3992, Hong Kong, China.
Association for Computational Linguistics.
Processing
DOI: https://doi.org/10.18653/v1
/D19-1410
Ekaterina Shutova, Jakub Kaplan, Simone Teufel,
and Anna Korhonen. 2013a. A computational
model of logical metonymy. ACM Transac-
tions on Speech and Language Processing,
10(3):11:1–11:28. DOI: https://doi.org
/10.1145/2483969.2483973
Ekaterina Shutova, Simone Teufel, and Anna
Korhonen. 2013b. Statistical metaphor process-
ing. Computational Linguistics, 39(2):301–353.
DOI: https://doi.org/10.1162/COLI
a 00124
for
Jake Snell, Kevin Swersky, and Richard S. Zemel.
few-shot
2017. Prototypical networks
learning. In Advances in Neural Information
Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017,
4-9 December 2017, Long Beach, CA, USA,
pages 4080–4090.
Ian Stewart and Jacob Eisenstein. 2018. Making
‘‘fetch’’ happen: The influence of
social
and linguistic context on nonstandard word
growth and decline. In Proceedings of
the
2018 Conference on Empirical Methods
in Natural Language Processing, pages
4360–4370, Brussels, Belgium. Association for
Computational Linguistics. DOI: https://
doi.org/10.18653/v1/D18-1467
Zhewei Sun, Richard Zemel, and Yang Xu.
2019. Slang generation as categorization.
the 41st Annual Con-
In Proceedings of
ference of
the Cognitive Science Society,
pages 2898–2904. Cognitive Science Society.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
undefinedukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need. In Proceedings
of the 31st International Conference on Neural
Information Processing Systems, NIPS’17,
pages 6000–6010, Red Hook, NY, USA.
Curran Associates Inc.
Tony Veale, Ekaterina Shutova, and Beata
Beigman Klebanov. 2016. Metaphor: A Com-
putational Perspective. Morgan & Claypool
Publishers. DOI: https://doi.org/10
.2200/S00694ED1V01Y201601HLT031
477
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Oriol Vinyals, Charles Blundell, Timothy
Lillicrap, Koray Kavukcuoglu, and Daan
for
Wierstra.
the
one shot
30th International Conference on Neural
Information Processing Systems, NIPS’16,
pages 3637–3645, USA. Curran Associates Inc.
networks
In Proceedings of
2016. Matching
learning.
Jiang Wang, Yang Song, Thomas Leung, Chuck
Rosenberg, Jingbin Wang, James Philbin,
Bo Chen, and Ying Wu. 2014. Learning fine-
grained image similarity with deep ranking.
In Proceedings of the 2014 IEEE Conference
on Computer Vision and Pattern Recognition,
CVPR ’14, pages 1386–1393, Washington, DC,
USA. IEEE Computer Society. DOI: https://
doi.org/10.1109/CVPR.2014.180
Kilian Q. Weinberger and Lawrence K. Saul.
2009. Distance metric learning for large margin
nearest neighbor classification. Journal of
Machine Learning Research, 10:207–244.
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through in-
ference. In Proceedings of the 2018 Confer-
ence of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122, New Orleans,
Louisiana. Association for Computational Lin-
guistics. DOI: https://doi.org/10.18653
/v1/N18-1101
Steven Wilson, Walid Magdy, Barbara
McGillivray, Kiran Garimella, and Gareth
Tyson. 2020. Urban dictionary embeddings
for slang NLP applications. In Proceedings of
the 12th Language Resources and Evaluation
Conference, pages 4764–4773, Marseille,
France.
Language Resources
Association.
European
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
3
7
8
1
9
2
4
2
5
6
/
/
t
l
a
c
_
a
_
0
0
3
7
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
478