Deep Learning for Text Style Transfer:
A Survey

Di Jin∗
Amazon
Alexa AI
djinamzn@amazon.com

Zhijing Jin∗
Max Planck Institute for
Intelligent Systems
Empirical Inference Department
and ETH Zürich Department of
Computer Science
zjin@tue.mpg.de

Zhiting Hu
UC San Diego
Halıcıoğlu Data Science Institute (HDSI)
zhh019@ucsd.edu

Olga Vechtomova
University of Waterloo
Faculty of Engineering
ovechtom@uwaterloo.ca

Rada Mihalcea
University of Michigan
EECS, College of Engineering
mihalcea@umich.edu

Text style transfer is an important task in natural language generation, which aims to control
certain attributes in the generated text, such as politeness, emotion, humor, and many others.
It has a long history in the field of natural language processing, and recently has re-gained
significant attention thanks to the promising performance brought by deep neural models. In this
article, we present a systematic survey of the research on neural text style transfer, spanning over
100 representative articles since the first neural text style transfer work in 2017. We discuss the

task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies
in the presence of parallel and non-parallel data. We also provide discussions on a variety of
important topics regarding the future development of this task.1

∗ Equal contribution.

Submission received: 25 April 2021; revised version received: 30 August 2021; accepted for publication:
4 December 2021.

https://doi.org/10.1162/coli_a_00426

© 2022 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license

1. Introduction

Language is situational. Every utterance fits in a specific time, place, and scenario,
conveys specific characteristics of the speaker, and typically has a well-defined intent.
For example, someone who is uncertain is more likely to use tag questions (e.g., “This
is true, isn’t it?”) than declarative sentences (e.g., “This is definitely true.”). Similarly, a
professional setting is more likely to include formal statements (e.g., “Please consider
taking a seat.”) as compared to an informal situation (e.g., “Come and sit!”). For artificial
intelligence systems to accurately understand and generate language, it is necessary
to model language with style/attribute,2 which goes beyond merely verbalizing the
semantics in a non-stylized way. The values of the attributes can be drawn from a wide
range of choices depending on pragmatics, such as the extent of formality, politeness,
simplicity, personality, emotion, partner effect (e.g., reader awareness), genre of writing
(e.g., fiction or non-fiction), and so on.

The goal of text style transfer (TST) is to automatically control the style attributes of
text while preserving the content. TST has a wide range of applications, as outlined by
McDonald and Pustejovsky (1985) and Hovy (1987). The style of language is crucial
because it makes natural language processing more user-centered, and TST has many
immediate applications. One such application is intelligent bots, for which users prefer
a distinct and consistent persona (e.g., empathetic) over an emotionless or inconsistent one.
Another application is the development of intelligent writing assistants: non-expert
writers often need to polish their writing to better fit their purpose, for example, by
making it more professional, polite, objective, or humorous, or to meet other advanced
writing requirements, which may take years of experience to master. Other applications include
automatic text simplification (where the target style is “simple”), debiasing online text
(where the target style is “objective”), fighting against offensive language (where the
target style is “non-offensive”), and so on.

To formally define TST, let us denote the target utterance as x′ and the target
discourse style attribute as a′. TST aims to model p(x′ | a′, x), where x is a given text
carrying a source attribute value a. Consider the previous example of text expressed
with two different extents of formality:

Source sentence x: “Come and sit!”
Source attribute a: Informal
Target sentence x′: “Please consider taking a seat.”
Target attribute a′: Formal

In this case, a TST model should be able to modify the formality and generate the
formal sentence x′ = “Please consider taking a seat.” given the informal input x = “Come
and sit!”. Note that the key difference between TST and another NLP task, style-conditioned
language modeling, is that the latter is conditioned on only a style token, whereas TST
takes as input both the target style attribute a′ and a source sentence x that constrains
the content.
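To make this formulation concrete, below is a minimal sketch of the TST interface p(x′ | a′, x). It assumes a Hugging Face-style sequence-to-sequence model that has been fine-tuned to condition on an attribute control token; the control-token prompt scheme and the transfer helper are illustrative assumptions of ours, not the method of any specific paper surveyed here.

def transfer(model, tokenizer, x: str, target_attribute: str) -> str:
    """Generate x' carrying target_attribute while preserving the content of x."""
    # Condition on the target attribute a' via a (hypothetical) control token.
    prompt = f"<{target_attribute}> {x}"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (the output depends on the fine-tuned model):
# transfer(model, tokenizer, "Come and sit!", "formal")
# -> "Please consider taking a seat."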


1 Our curated paper list is at https://github.com/zhijing-jin/Text_Style_Transfer_Survey.
2 Note that we interchangeably use the terms style and attribute in this survey. Attribute is a broader
term that can include content preferences, e.g., sentiment, topic, and so on. This survey uses style
in the same broad way, following the common practice in recent papers (see Section 2.1).

Crucial to the definition of style transfer is the distinction between “style” and “content,”
for which there are two common practices. The first is the linguistic definition, which
classifies non-functional linguistic features (e.g., formality) as style, and the semantics
as content. In contrast, the second practice is data-driven: given two corpora (e.g., a
positive review set and a negative review set), the invariance between the two corpora
is the content, whereas the variance is the style (e.g., sentiment, topic) (Mou and
Vechtomova 2020).

Driven by the growing need for TST, active research in this field has emerged,
from traditional linguistic approaches to the more recent neural network-based
approaches. Traditional approaches rely on term replacement and templates. For example,
early work in NLG for weather forecasts builds domain-specific templates to
express different types of weather with different levels of uncertainty for different users
(Sripada et al. 2004; Reiter et al. 2005; Belz 2008; Gkatzia, Lemon, and Rieser 2017).
Research that more explicitly focuses on TST starts from the frame language-based
systems (McDonald and Pustejovsky 1985) and schema-based NLG systems (Hovy
1987, 1990), which generate text with pragmatic constraints such as formality under
small-scale, well-defined schemata. Most of this earlier work required domain-specific
templates, hand-crafted phrase sets that express a certain attribute (e.g., friendly), and
sometimes a look-up table of expressions with the same meaning but multiple different
attributes (Bateman and Paris 1989; Stamatatos et al. 1997; Power, Scott, and Bouayad-
Agha 2003; Reiter, Robertson, and Osman 2003; Sheikha and Inkpen 2011; Mairesse and
Walker 2011).

With the success of deep learning in the last decade, a variety of neural methods
have recently been proposed for TST. If parallel data are provided, standard sequence-
to-sequence models are often directly applied (Rao and Tetreault 2018) (see Section 4).
However, most use cases do not have parallel data, so TST on non-parallel corpora has
become a prolific research area (see Section 5). The first line of approaches disentangles
text into its content and attribute in the latent space and applies generative modeling
(Hu et al. 2017; Shen et al. 2017). This trend was then joined by another distinctive line
of approaches, prototype editing (Li et al. 2018), which extracts a sentence template and
its attribute markers to generate the text; a sketch of its first step is given below. Another
paradigm soon followed, namely, pseudo-parallel corpus construction, which builds
pseudo-parallel data so that the model can be trained as if in a supervised setting
(Zhang et al. 2018d; Jin et al. 2019). These three directions, (1)
disentanglement, (2) prototype editing, and (3) pseudo-parallel corpus construction,
are further advanced with the emergence of Transformer-based models (Sudhakar,
Upadhyay, and Maheswaran 2019; Malmi, Severyn, and Rothe 2020).
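As a concrete illustration of prototype editing, the following is a hedged sketch of its “delete” step in the spirit of Li et al. (2018): words that are far more frequent in one style corpus than in the other are treated as attribute markers and stripped from the sentence, leaving a content template to be re-filled. Restricting to unigrams and using a fixed salience threshold are simplifications of ours, not the authors' exact procedure.

from collections import Counter

def attribute_markers(corpus_a, corpus_b, threshold=5.0):
    """Words far more frequent under attribute a than under a' count as markers."""
    counts_a = Counter(w for s in corpus_a for w in s.lower().split())
    counts_b = Counter(w for s in corpus_b for w in s.lower().split())
    # Smoothed relative-frequency ratio as a salience score.
    return {w for w, c in counts_a.items()
            if (c + 1) / (counts_b[w] + 1) >= threshold}

def delete_markers(sentence, markers):
    """Strip attribute markers, leaving a content template."""
    return " ".join(w for w in sentence.split() if w.lower() not in markers)

# With toy review corpora:
# markers = attribute_markers(negative_corpus, positive_corpus)
# delete_markers("the food was terrible", markers)  # -> e.g., "the food was"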

Given the advances in TST methodologies, the field now starts to expand its impact to
downstream applications, such as persona-based dialog generation (Niu and Bansal
2018; Huang et al. 2018), stylistic summarization (Jin et al. 2020a), stylized language
modeling to imitate specific authors (Syed et al. 2020), online text debiasing (Pryzant
et al. 2020; Ma et al. 2020), simile generation (Chakrabarty, Muresan, and Peng 2020),
and many others.

Motivation of a Survey on TST. The increasing interest in modeling the style of text can
be regarded as a trend reflecting the fact that NLP researchers are starting to focus more
on user-centeredness and personalization. However, despite the growing interest in TST,
the existing literature shows a large diversity in the selection of benchmark datasets,


methodological frameworks, and evaluation metrics. Thus, the aim of this survey is
to provide summaries and potential standardizations of some important aspects of
TST, such as the terminology, problem definition, benchmark datasets, and evaluation
metrics. We also aim to provide different perspectives on the methodology of TST,
and suggest some potential cross-cutting research questions for our proposed research
agenda for the field. As shown in Table 1, the key contributions targeted by this survey
are as follows:

Table 1
Overview of the survey.

Motivation:
• Artistic writing
• Communication
• Mitigating social issues

Data:
Tasks: • Formality • Politeness • Authorship • Gender • Humor • Romance • Biasedness • Political slant • Toxicity • Simplicity • Sentiment • Topic
Key Properties: • Parallel vs. non-parallel • Uni- vs. bi-directional • Dataset size • Large vs. small word overlap

Method:
On Parallel Data: • Multi-tasking • Inference techniques • Data augmentation
On Non-Parallel Data: • Disentanglement • Prototype editing • Pseudo data construction

Extended Applications:
Helping Other NLP Tasks: • Paraphrasing • Data augmentation • Adversarial robustness • Persona-consistent dialog • Anonymization • Summarization • Style-specific MT

1. We conduct the first comprehensive review that covers most existing
work (more than 100 papers) on deep learning-based TST.

2. We provide an overview of the task setting, terminology definition, and
benchmark datasets (Section 2), as well as evaluation metrics, for which we
propose standard practices that can be helpful for future work
(Section 3).

3. We categorize the existing approaches on parallel data (Section 4) and
non-parallel data (Section 5), for which we distill some unified
methodological frameworks.

4. We discuss a potential research agenda for TST (Section 6), including
expanding the scope of styles, improving the methodology, loosening
dataset assumptions, and improving evaluation metrics.

5. We provide a vision for how to broaden the impact of TST (Section 7),
including connecting it to more NLP tasks and more specialized
downstream applications, as well as considering some important ethical
impacts.

Paper Selection. The neural TST papers reviewed in this survey are mainly from top
conferences in NLP and artificial intelligence (AI), including ACL, EMNLP, NAACL,
COLING, CoNLL, NeurIPS, ICML, ICLR, AAAI, and IJCAI. Other than conference
papers, we also include some non-peer-reviewed preprint papers that can offer some


insightful information about the field. The major factors for selecting non-peer-reviewed
preprint papers include novelty and completeness, among others.

2. What Is Text Style Transfer?

This section provides an overview of the style transfer task. Section 2.1 goes through
the definition of styles and the scope of this survey. Section 2.2 gives a task formulation
and introduces the notations that will be used throughout the survey. Finally, Section 2.3
lists the common subtasks for neural TST, which can save literature review effort for
future researchers.

2.1 How to Define Style?

Linguistic Definition of Style. An intuitive notion of style refers to the manner in which
the semantics is expressed (McDonald and Pustejovsky 1985). Just as everyone has
their own signatures, style originates as the characteristics inherent to every person’s
utterance, which can be expressed through the use of certain stylistic devices such as
metaphors, as well as choice of words, syntactic structures, and so on. Style can also
go beyond the sentence level to the discourse level, such as the stylistic structure of the
entire piece of the work, for example, stream of consciousness, or flashbacks.

Beyond the intrinsic personal styles, for pragmatic uses, style further becomes a
protocol to regularize the manner of communication. For example, for academic writ-
ing, the protocol requires formality and professionalism. Hovy (1987) defines style by its
pragmatic aspects, including both personal (e.g., personality, gender) and interpersonal
(e.g., humor, romance) aspects. Most existing literature also adopts these well-defined
categories of style.

Data-Driven Definition of Style as the Scope of this Survey. This survey aims to provide an
overview of existing neural TST approaches. To be concise, we will limit the scope to
the most common settings of existing literature. Specifically, most deep learning work
on TST adopts a data-driven definition of style, and the scope of this survey covers the
styles in currently available TST datasets. The data-driven definition of style is different
from the linguistic or rule-based definition of style, which theoretically constrains what
constitutes a style and what not, such as a style guide (e.g., American Psychological
Association 2020) that requires that formal text not include any contraction, e.g., “isn’t.”
The distinction of the two defintions of style is shown in Figure 1.


Figure 1
Venn diagram of the linguistic definition of style and the data-driven definition of style.
Linguistic styles without existing large datasets to match the style (e.g., a cheerful style) fall
only under the linguistic definition; attributes from datasets that do not match existing
linguistic styles but can be used for deep learning-based TST models (e.g., the Yelp dataset) fall
only under the data-driven definition; attributes from datasets that correspond to linguistic
styles, often by human annotation (e.g., the formality dataset), fall in the intersection.


With the rise of deep learning methods of TST, the data-driven definition of style
extends the linguistic style to a broader concept—the general attributes in text. It regards
“style” as the attributes that vary across datasets, as opposed to the characteristics that
stay invariant (Mou and Vechtomova 2020). The reason is that deep learning models
(which are the focus of this survey) need large corpora to learn the style from, but not
all styles have well-matched large corpora. Therefore, apart from the very few manually
annotated datasets with linguistic style definitions, such as formality (Rao and Tetreault
2018) and humor & romance (Gan et al. 2017), many recent dataset collection works
automatically look for meta-information to link a corpus to a certain attribute. A typical
example is the widely used Yelp review dataset (Shen et al. 2017), where reviews with
low ratings are put into the negative corpus and reviews with high ratings into the
positive corpus, although negative vs. positive opinion is not a style under the linguistic
definition but rather a content-related attribute.
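A minimal sketch of this meta-information-based corpus construction is given below. The rating thresholds (at most two stars for negative, at least four for positive) follow common practice on Yelp-style data but are an assumption here rather than a fixed standard.

def split_by_rating(reviews):
    """reviews: iterable of (text, stars) pairs -> (negative, positive) corpora."""
    negative, positive = [], []
    for text, stars in reviews:
        if stars <= 2:
            negative.append(text)   # low ratings -> negative corpus
        elif stars >= 4:
            positive.append(text)   # high ratings -> positive corpus
        # Mid-range (3-star) reviews are ambiguous and typically discarded.
    return negative, positive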

Most methods mentioned in this survey can be applied to scenarios that follow
this data-driven definition of style. As a double-edged sword, the prerequisite for most
methods is that there exist style-specific corpora for each style of interest, either parallel
or non-parallel. Note that future work may drop this assumption, which will be
discussed in Section 6.3.

Comparison of the Two Definitions. There are two phenomena arising from the data-driven
definition of style as opposed to the linguistic style. One is that the data-driven
definition of style can include a broader range of attributes, including content and topic
preferences of the text. The other is that data-driven styles, if collected through automatic
classification by meta-information such as ratings, user information, and source
of text, can be more ambiguous than linguistically defined styles. As shown in Jin
et al. (2019, Section 4.1.1), some automatically collected datasets have a concerningly
high undecidable rate and inter-annotator disagreement rate when the annotators are
asked to associate the dataset with human-defined styles such as political slant and
gender-specific tones.

The advantage of the data-driven style is that it marries well with deep learning
methods, because most neural models learn the concept of style by learning to distinguish
the multiple style corpora. For the (non-data-driven) linguistic style, although it
is under-explored in existing deep learning work on TST, we provide in Section 6.3
a discussion of how potential future work can learn TST of linguistic styles with no
matched data.

2.2 Task Formulation

We define the main notations used in this survey in Table 2.

As mentioned previously in Section 2.1, most neural approaches assume a given
set of attribute values A, and each attribute value has its own corpus. For example, if
the task is about formality transfer, then for the attribute of text formality, there are
two attribute values, a = “formal” and a′ = “informal,” corresponding to a corpus X1
of formal sentences and another corpus X2 of informal sentences. The style corpora can
be parallel or non-parallel. Parallel data means that each sentence with the attribute a
is paired with a counterpart sentence with another attribute a′. In contrast, non-parallel
data only assumes mono-style corpora.
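The two data settings can be sketched as follows, continuing the formality example (the sentences are illustrative, not drawn from a real dataset).

# Parallel data: each sentence with attribute a is paired with a rewrite
# carrying attribute a'.
parallel_data = [
    ("Come and sit!", "Please consider taking a seat."),  # (informal, formal)
]

# Non-parallel data: mono-style corpora X1 and X2 with no sentence-level pairing.
X1_formal = ["Please consider taking a seat.", "I would appreciate a prompt reply."]
X2_informal = ["Come and sit!", "Get back to me ASAP!"]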


Table 2
Notation of each variable and its corresponding meaning.

Attribute:
a: An attribute value, e.g., the formal style
a′: An attribute value different from a
A: A predefined set of attribute values
ai: The i-th attribute value in A

Sentence:
x: A sentence with attribute value a
x′: A sentence with attribute value a′
Xi: A corpus of sentences with attribute value ai
xi: A sentence from the corpus Xi
x̂′: Attribute-transferred sentence of x learned by the model

Model:
E: Encoder of a TST model
G: Generator of a TST model
fc: Attribute classifier
θE: Parameters of the encoder
θG: Parameters of the generator
θfc: Parameters of the attribute classifier

Embedding:
z: Latent representation of the text, i.e., z ≜ E(x)
a: Latent representation of the attribute value in text


2.3 Existing Subtasks with Datasets

We list the common subtasks and their corresponding datasets for neural TST in Table 3.
The attributes of interest vary from style features (e.g., formality and politeness) to
content preferences (e.g., sentiment and topics). Each of these tasks is elaborated below.

Formality. Adjusting the extent of formality in text was first proposed by Hovy (1987).
It is one of the most distinctive stylistic aspects that can be observed through many
linguistic phenomena, such as more full names (e.g., “television”) instead of abbrevia-
tions (e.g., “TV”), and more nouns (e.g., “solicitation”) instead of verbs (e.g., “request”).
The formality dataset, Grammarly’s Yahoo Answers Formality Corpus (GYAFC) (Rao
and Tetreault 2018), contains 50K formal-informal pairs, built by first collecting 50K
informal sentences from the Yahoo Answers corpus and then recruiting crowd
workers to rewrite them in a formal way. Briakou et al. (2021b) extend the formality
dataset to a multilingual version with three more languages: Brazilian Portuguese,
French, and Italian.

Politeness. Politeness transfer (Madaan et al. 2020) aims to control the politeness in
text. For example, “Could you please send me the data?” is a more polite expression
than “send me the data!”. Madaan et al. (2020) compiled a dataset of 1.39 million
automatically labeled instances from the raw Enron corpus (Shetty and Adibi 2004).
As politeness is culture-dependent, this dataset mainly focuses on politeness in North
American English.


Table 3
List of common subtasks of TST and their corresponding attribute values and datasets. For
datasets with multiple attribute-specific corpora, we report the size as the number of sentences
in the smallest of the corpora. We also report whether each dataset is parallel.

Style Features

Formality (Informal↔Formal):
• GYAFC (Rao and Tetreault 2018): 50K, parallel
• XFORMAL (Briakou et al. 2021b): 1K, parallel

Politeness (Impolite→Polite):
• Politeness (Madaan et al. 2020): 1M, non-parallel

Gender (Masculine↔Feminine):
• Yelp Gender (Prabhumoye et al. 2018): 2.5M, non-parallel

Humor & Romance (Factual↔Humorous↔Romantic):
• FlickrStyle (Gan et al. 2017): 5K, parallel

Biasedness (Biased→Neutral):
• Wiki Neutrality (Pryzant et al. 2020): 181K, parallel

Toxicity (Offensive→Non-offensive):
• Twitter (dos Santos, Melnyk, and Padhi 2018): 58K, non-parallel
• Reddit (dos Santos, Melnyk, and Padhi 2018): 224K, non-parallel
• Reddit Politics (Tran, Zhang, and Soleymani 2020): 350K, non-parallel

Authorship:
• Shakespearean↔Modern: Shakespeare (Xu et al. 2012): 18K, parallel
• Different Bible translators: Bible (Carlson, Riddell, and Rockmore 2018): 28M, parallel

Simplicity (Complicated→Simple):
• PWKP (Zhu, Bernhard, and Gurevych 2010): 108K, parallel
• Expert (den Bercken, Sips, and Lofi 2019): 2.2K, parallel
• MIMIC-III (Weng, Chung, and Szolovits 2019): 59K, non-parallel
• MSD (Cao et al. 2020): 114K, parallel

Engagingness (Plain→Attractive):
• Math (Koncel-Kedziorski et al. 2016): <1K, parallel
• TitleStylist (Jin et al. 2020a): 146K, non-parallel

Content Preferences

Sentiment (Positive↔Negative):
• Yelp (Shen et al. 2017): 250K, non-parallel
• Amazon (He and McAuley 2016): 277K, non-parallel

Topic (Entertainment↔Politics):
• Yahoo! Answers (Huang et al. 2020): 153K, non-parallel

Politics (Democratic↔Republican):
• Political (Voigt et al. 2018): 540K, non-parallel