Deep Learning for Text Style Transfer:
A Survey

Di Jin∗
Amazon
Alexa AI
djinamzn@amazon.com

Zhijing Jin∗
Max Planck Institute for
Intelligent Systems
Empirical Inference Department
and ETH Zürich Department of
Computer Science
zjin@tue.mpg.de

Zhiting Hu
UC San Diego
Halıcıoğlu Data Science Institute (HDSI)
zhh019@ucsd.edu

Olga Vechtomova
University of Waterloo
Faculty of Engineering
ovechtom@uwaterloo.ca

Rada Mihalcea
University of Michigan
EECS, College of Engineering
mihalcea@umich.edu

Text style transfer is an important task in natural language generation, which aims to control
certain attributes in the generated text, such as politeness, emotion, humor, and many others.
It has a long history in the field of natural language processing, and recently has re-gained
significant attention thanks to the promising performance brought by deep neural models. In this
article, we present a systematic survey of the research on neural text style transfer, spanning over
100 representative articles since the first neural text style transfer work in 2017. We discuss the

task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies
in the presence of parallel and non-parallel data. We also provide discussions on a variety of
important topics regarding the future development of this task.1

∗ Equal contribution.

Submission received: 25 April 2021; revised version received: 30 August 2021; accepted for publication:
4 December 2021.

https://doi.org/10.1162/coli_a_00426

© 2022 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license

1. Introduction

Language is situational. Every utterance fits in a specific time, place, and scenario,
conveys specific characteristics of the speaker, and typically has a well-defined intent.
For example, someone who is uncertain is more likely to use tag questions (e.g., “This
is true, isn’t it?”) than declarative sentences (e.g., “This is definitely true.”). Similarly, a
professional setting is more likely to include formal statements (e.g., “Please consider
taking a seat.”) as compared to an informal situation (e.g., “Come and sit!”). For artificial
intelligence systems to accurately understand and generate language, it is necessary
to model language with style/attribute,2 which goes beyond merely verbalizing the
semantics in a non-stylized way. The values of the attributes can be drawn from a wide
range of choices depending on pragmatics, such as the extent of formality, politeness,
simplicity, personality, emotion, partner effect (e.g., reader awareness), genre of writing
(e.g., fiction or non-fiction), and so on.

The goal of text style transfer (TST) is to automatically control the style attributes of
text while preserving the content. TST has a wide range of applications, as outlined by
McDonald and Pustejovsky (1985) and Hovy (1987). The style of language is crucial
because it makes natural language processing more user-centered, and TST has many
immediate applications. One such application is intelligent bots, for which users prefer
a distinct and consistent persona (e.g., empathetic) over an emotionless or inconsistent one.
Another application is the development of intelligent writing assistants: non-expert
writers often need to polish their writing to better fit their purpose, for example, by
making it more professional, polite, objective, or humorous, or to meet other advanced
writing requirements, which may take years of experience to master. Other applications include
automatic text simplification (where the target style is “simple”), debiasing online text
(where the target style is “objective”), fighting against offensive language (where the
target style is “non-offensive”), and so on.

To formally define TST, let us denote the target utterance as x′ and the target
discourse style attribute as a′. TST aims to model p(x′ | a′, x), where x is a given text
carrying a source attribute value a. Consider the previous example of text expressed
with two different extents of formality:

Source sentence x: “Come and sit!”
Source attribute a: Informal
Target sentence x′: “Please consider taking a seat.”
Target attribute a′: Formal

In this case, a TST model should be able to modify the formality and generate the
formal sentence x′ = “Please consider taking a seat.” given the informal input x = “Come
and sit!”. Note that the key difference between TST and another NLP task, style-conditioned
language modeling, is that the latter is conditioned on only a style token, whereas TST
takes as input both the target style attribute a′ and a source sentence x that constrains
the content.
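To make this formulation concrete, below is a minimal sketch of the TST interface p(x′ | a′, x). It assumes a Hugging Face-style sequence-to-sequence model that has been fine-tuned to condition on an attribute control token; the control-token prompt scheme and the transfer helper are illustrative assumptions of ours, not the method of any specific paper surveyed here.

def transfer(model, tokenizer, x: str, target_attribute: str) -> str:
    """Generate x' carrying target_attribute while preserving the content of x."""
    # Condition on the target attribute a' via a (hypothetical) control token.
    prompt = f"<{target_attribute}> {x}"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (the output depends on the fine-tuned model):
# transfer(model, tokenizer, "Come and sit!", "formal")
# -> "Please consider taking a seat."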


1 Our curated paper list is at https://github.com/zhijing-jin/Text_Style_Transfer_Survey.
2 Note that we interchangeably use the terms style and attribute in this survey. Attribute is a broader
term that can include content preferences, e.g., sentiment, topic, and so on. This survey uses style
in the same broad way, following the common practice in recent papers (see Section 2.1).

Crucial to the definition of style transfer is the distinction between “style” and “content,”
for which there are two common practices. The first is the linguistic definition, which
classifies non-functional linguistic features (e.g., formality) as style, and the semantics
as content. In contrast, the second practice is data-driven: given two corpora (e.g., a
positive review set and a negative review set), the invariance between the two corpora
is the content, whereas the variance is the style (e.g., sentiment, topic) (Mou and
Vechtomova 2020).

Driven by the growing need for TST, active research in this field has emerged,
from traditional linguistic approaches to the more recent neural network-based
approaches. Traditional approaches rely on term replacement and templates. For example,
early work in NLG for weather forecasts builds domain-specific templates to
express different types of weather with different levels of uncertainty for different users
(Sripada et al. 2004; Reiter et al. 2005; Belz 2008; Gkatzia, Lemon, and Rieser 2017).
Research that more explicitly focuses on TST starts from the frame language-based
systems (McDonald and Pustejovsky 1985) and schema-based NLG systems (Hovy
1987, 1990), which generate text with pragmatic constraints such as formality under
small-scale, well-defined schemata. Most of this earlier work required domain-specific
templates, hand-crafted phrase sets that express a certain attribute (e.g., friendly), and
sometimes a look-up table of expressions with the same meaning but multiple different
attributes (Bateman and Paris 1989; Stamatatos et al. 1997; Power, Scott, and Bouayad-
Agha 2003; Reiter, Robertson, and Osman 2003; Sheikha and Inkpen 2011; Mairesse and
Walker 2011).

With the success of deep learning in the last decade, a variety of neural methods
have recently been proposed for TST. If parallel data are provided, standard sequence-
to-sequence models are often directly applied (Rao and Tetreault 2018) (see Section 4).
However, most use cases do not have parallel data, so TST on non-parallel corpora has
become a prolific research area (see Section 5). The first line of approaches disentangles
text into its content and attribute in the latent space and applies generative modeling
(Hu et al. 2017; Shen et al. 2017). This trend was then joined by another distinctive line
of approaches, prototype editing (Li et al. 2018), which extracts a sentence template and
its attribute markers to generate the text; a sketch of its first step is given below. Another
paradigm soon followed, namely, pseudo-parallel corpus construction, which builds
pseudo-parallel data so that the model can be trained as if in a supervised setting
(Zhang et al. 2018d; Jin et al. 2019). These three directions, (1)
disentanglement, (2) prototype editing, and (3) pseudo-parallel corpus construction,
are further advanced with the emergence of Transformer-based models (Sudhakar,
Upadhyay, and Maheswaran 2019; Malmi, Severyn, and Rothe 2020).
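As a concrete illustration of prototype editing, the following is a hedged sketch of its “delete” step in the spirit of Li et al. (2018): words that are far more frequent in one style corpus than in the other are treated as attribute markers and stripped from the sentence, leaving a content template to be re-filled. Restricting to unigrams and using a fixed salience threshold are simplifications of ours, not the authors' exact procedure.

from collections import Counter

def attribute_markers(corpus_a, corpus_b, threshold=5.0):
    """Words far more frequent under attribute a than under a' count as markers."""
    counts_a = Counter(w for s in corpus_a for w in s.lower().split())
    counts_b = Counter(w for s in corpus_b for w in s.lower().split())
    # Smoothed relative-frequency ratio as a salience score.
    return {w for w, c in counts_a.items()
            if (c + 1) / (counts_b[w] + 1) >= threshold}

def delete_markers(sentence, markers):
    """Strip attribute markers, leaving a content template."""
    return " ".join(w for w in sentence.split() if w.lower() not in markers)

# With toy review corpora:
# markers = attribute_markers(negative_corpus, positive_corpus)
# delete_markers("the food was terrible", markers)  # -> e.g., "the food was"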

Given the advances in TST methodologies, the field now starts to expand its impact to
downstream applications, such as persona-based dialog generation (Niu and Bansal
2018; Huang et al. 2018), stylistic summarization (Jin et al. 2020a), stylized language
modeling to imitate specific authors (Syed et al. 2020), online text debiasing (Pryzant
et al. 2020; Ma et al. 2020), simile generation (Chakrabarty, Muresan, and Peng 2020),
and many others.

Motivation of a Survey on TST. The increasing interest in modeling the style of text can
be regarded as a trend reflecting the fact that NLP researchers are starting to focus more
on user-centeredness and personalization. However, despite the growing interest in TST,
the existing literature shows a large diversity in the selection of benchmark datasets,


methodological frameworks, and evaluation metrics. Thus, the aim of this survey is
to provide summaries and potential standardizations of some important aspects of
TST, such as the terminology, problem definition, benchmark datasets, and evaluation
metrics. We also aim to provide different perspectives on the methodology of TST,
and suggest some potential cross-cutting research questions for our proposed research
agenda for the field. As shown in Table 1, the key contributions targeted by this survey
are as follows:

Table 1
Overview of the survey.

Motivation:
• Artistic writing
• Communication
• Mitigating social issues

Data:
Tasks: • Formality • Politeness • Authorship • Gender • Humor • Romance • Biasedness • Political slant • Toxicity • Simplicity • Sentiment • Topic
Key Properties: • Parallel vs. non-parallel • Uni- vs. bi-directional • Dataset size • Large vs. small word overlap

Method:
On Parallel Data: • Multi-tasking • Inference techniques • Data augmentation
On Non-Parallel Data: • Disentanglement • Prototype editing • Pseudo data construction

Extended Applications:
Helping Other NLP Tasks: • Paraphrasing • Data augmentation • Adversarial robustness • Persona-consistent dialog • Anonymization • Summarization • Style-specific MT

1. We conduct the first comprehensive review that covers most existing
work (more than 100 papers) on deep learning-based TST.

2. We provide an overview of the task setting, terminology definition, and
benchmark datasets (Section 2), as well as evaluation metrics, for which we
propose standard practices that can be helpful for future work
(Section 3).

3. We categorize the existing approaches on parallel data (Section 4) and
non-parallel data (Section 5), for which we distill some unified
methodological frameworks.

4. We discuss a potential research agenda for TST (Section 6), including
expanding the scope of styles, improving the methodology, loosening
dataset assumptions, and improving evaluation metrics.

5. We provide a vision for how to broaden the impact of TST (Section 7),
including connecting it to more NLP tasks and more specialized
downstream applications, as well as considering some important ethical
impacts.

Paper Selection. The neural TST papers reviewed in this survey are mainly from top
conferences in NLP and artificial intelligence (AI), including ACL, EMNLP, NAACL,
COLING, CoNLL, NeurIPS, ICML, ICLR, AAAI, and IJCAI. Other than conference
papers, we also include some non-peer-reviewed preprint papers that can offer some


insightful information about the field. The major factors for selecting non-peer-reviewed
preprint papers include novelty and completeness, among others.

2. What Is Text Style Transfer?

This section provides an overview of the style transfer task. Section 2.1 goes through
the definition of styles and the scope of this survey. Section 2.2 gives a task formulation
and introduces the notations that will be used throughout the survey. Finally, Section 2.3
lists the common subtasks for neural TST, which can save literature review effort for
future researchers.

2.1 How to Define Style?

Linguistic Definition of Style. An intuitive notion of style refers to the manner in which
the semantics is expressed (McDonald and Pustejovsky 1985). Just as everyone has
their own signatures, style originates as the characteristics inherent to every person’s
utterance, which can be expressed through the use of certain stylistic devices such as
metaphors, as well as choice of words, syntactic structures, and so on. Style can also
go beyond the sentence level to the discourse level, such as the stylistic structure of the
entire piece of the work, for example, stream of consciousness, or flashbacks.

Beyond the intrinsic personal styles, for pragmatic uses, style further becomes a
protocol to regularize the manner of communication. For example, for academic writ-
ing, the protocol requires formality and professionalism. Hovy (1987) defines style by its
pragmatic aspects, including both personal (e.g., personality, gender) and interpersonal
(e.g., humor, romance) aspects. Most existing literature also adopts these well-defined
categories of style.

Data-Driven Definition of Style as the Scope of this Survey. This survey aims to provide an
overview of existing neural TST approaches. To be concise, we will limit the scope to
the most common settings of existing literature. Specifically, most deep learning work
on TST adopts a data-driven definition of style, and the scope of this survey covers the
styles in currently available TST datasets. The data-driven definition of style is different
from the linguistic or rule-based definition of style, which theoretically constrains what
constitutes a style and what not, such as a style guide (e.g., American Psychological
Association 2020) that requires that formal text not include any contraction, e.g., “isn’t.”
The distinction of the two defintions of style is shown in Figure 1.


Figure 1
Venn diagram of the linguistic definition of style and the data-driven definition of style.
Linguistic styles without existing large datasets to match the style (e.g., a cheerful style) fall
only under the linguistic definition; attributes from datasets that do not match existing
linguistic styles but can be used for deep learning-based TST models (e.g., the Yelp dataset) fall
only under the data-driven definition; attributes from datasets that correspond to linguistic
styles, often by human annotation (e.g., the formality dataset), fall in the intersection.


With the rise of deep learning methods of TST, the data-driven definition of style
extends the linguistic style to a broader concept—the general attributes in text. It regards
“style” as the attributes that vary across datasets, as opposed to the characteristics that
stay invariant (Mou and Vechtomova 2020). The reason is that deep learning models
(which are the focus of this survey) need large corpora to learn the style from, but not
all styles have well-matched large corpora. Therefore, apart from the very few manually
annotated datasets with linguistic style definitions, such as formality (Rao and Tetreault
2018) and humor & romance (Gan et al. 2017), many recent dataset collection works
automatically look for meta-information to link a corpus to a certain attribute. A typical
example is the widely used Yelp review dataset (Shen et al. 2017), where reviews with
low ratings are put into the negative corpus and reviews with high ratings into the
positive corpus, although negative vs. positive opinion is not a style under the linguistic
definition but rather a content-related attribute.
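A minimal sketch of this meta-information-based corpus construction is given below. The rating thresholds (at most two stars for negative, at least four for positive) follow common practice on Yelp-style data but are an assumption here rather than a fixed standard.

def split_by_rating(reviews):
    """reviews: iterable of (text, stars) pairs -> (negative, positive) corpora."""
    negative, positive = [], []
    for text, stars in reviews:
        if stars <= 2:
            negative.append(text)   # low ratings -> negative corpus
        elif stars >= 4:
            positive.append(text)   # high ratings -> positive corpus
        # Mid-range (3-star) reviews are ambiguous and typically discarded.
    return negative, positive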

Most methods mentioned in this survey can be applied to scenarios that follow
this data-driven definition of style. As a double-edged sword, the prerequisite for most
methods is that there exist style-specific corpora for each style of interest, either parallel
or non-parallel. Note that future work may drop this assumption, which will be
discussed in Section 6.3.

Comparison of the Two Definitions. There are two phenomena arising from the data-driven
definition of style as opposed to the linguistic style. One is that the data-driven
definition of style can include a broader range of attributes, including content and topic
preferences of the text. The other is that data-driven styles, if collected through automatic
classification by meta-information such as ratings, user information, and source
of text, can be more ambiguous than linguistically defined styles. As shown in Jin
et al. (2019, Section 4.1.1), some automatically collected datasets have a concerningly
high undecidable rate and inter-annotator disagreement rate when the annotators are
asked to associate the dataset with human-defined styles such as political slant and
gender-specific tones.

The advantage of the data-driven style is that it marries well with deep learning
methods, because most neural models learn the concept of style by learning to distinguish
the multiple style corpora. For the (non-data-driven) linguistic style, although it
is under-explored in existing deep learning work on TST, we provide in Section 6.3
a discussion of how potential future work can learn TST of linguistic styles with no
matched data.

2.2 Task Formulation

We define the main notations used in this survey in Table 2.

As mentioned previously in Section 2.1, most neural approaches assume a given
set of attribute values A, and each attribute value has its own corpus. For example, if
the task is about formality transfer, then for the attribute of text formality, there are
two attribute values, a = “formal” and a′ = “informal,” corresponding to a corpus X1
of formal sentences and another corpus X2 of informal sentences. The style corpora can
be parallel or non-parallel. Parallel data means that each sentence with the attribute a
is paired with a counterpart sentence with another attribute a′. In contrast, non-parallel
data only assumes mono-style corpora.
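The two data settings can be sketched as follows, continuing the formality example (the sentences are illustrative, not drawn from a real dataset).

# Parallel data: each sentence with attribute a is paired with a rewrite
# carrying attribute a'.
parallel_data = [
    ("Come and sit!", "Please consider taking a seat."),  # (informal, formal)
]

# Non-parallel data: mono-style corpora X1 and X2 with no sentence-level pairing.
X1_formal = ["Please consider taking a seat.", "I would appreciate a prompt reply."]
X2_informal = ["Come and sit!", "Get back to me ASAP!"]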


Table 2
Notation of each variable and its corresponding meaning.

Attribute:
a: An attribute value, e.g., the formal style
a′: An attribute value different from a
A: A predefined set of attribute values
ai: The i-th attribute value in A

Sentence:
x: A sentence with attribute value a
x′: A sentence with attribute value a′
Xi: A corpus of sentences with attribute value ai
xi: A sentence from the corpus Xi
x̂′: Attribute-transferred sentence of x learned by the model

Model:
E: Encoder of a TST model
G: Generator of a TST model
fc: Attribute classifier
θE: Parameters of the encoder
θG: Parameters of the generator
θfc: Parameters of the attribute classifier

Embedding:
z: Latent representation of the text, i.e., z ≜ E(x)
a: Latent representation of the attribute value in text


2.3 Existing Subtasks with Datasets

We list the common subtasks and their corresponding datasets for neural TST in Table 3.
The attributes of interest vary from style features (e.g., formality and politeness) to
content preferences (e.g., sentiment and topics). Each of these tasks is elaborated below.

Formality. Adjusting the extent of formality in text was first proposed by Hovy (1987).
It is one of the most distinctive stylistic aspects that can be observed through many
linguistic phenomena, such as more full names (e.g., “television”) instead of abbrevia-
tions (e.g., “TV”), and more nouns (e.g., “solicitation”) instead of verbs (e.g., “request”).
The formality dataset, Grammarly’s Yahoo Answers Formality Corpus (GYAFC) (Rao
and Tetreault 2018), contains 50K formal-informal pairs, built by first collecting 50K
informal sentences from the Yahoo Answers corpus and then recruiting crowd
workers to rewrite them in a formal way. Briakou et al. (2021b) extend the formality
dataset to a multilingual version with three more languages: Brazilian Portuguese,
French, and Italian.

Politeness. Politeness transfer (Madaan et al. 2020) aims to control the politeness in
text. For example, “Could you please send me the data?” is a more polite expression
than “send me the data!”. Madaan et al. (2020) compiled a dataset of 1.39 million
automatically labeled instances from the raw Enron corpus (Shetty and Adibi 2004).
As politeness is culture-dependent, this dataset mainly focuses on politeness in North
American English.


Table 3
List of common subtasks of TST and their corresponding attribute values and datasets. For
datasets with multiple attribute-specific corpora, we report the size as the number of sentences
in the smallest of the corpora. We also report whether each dataset is parallel.

Style Features

Formality (Informal↔Formal):
• GYAFC (Rao and Tetreault 2018): 50K, parallel
• XFORMAL (Briakou et al. 2021b): 1K, parallel

Politeness (Impolite→Polite):
• Politeness (Madaan et al. 2020): 1M, non-parallel

Gender (Masculine↔Feminine):
• Yelp Gender (Prabhumoye et al. 2018): 2.5M, non-parallel

Humor & Romance (Factual↔Humorous↔Romantic):
• FlickrStyle (Gan et al. 2017): 5K, parallel

Biasedness (Biased→Neutral):
• Wiki Neutrality (Pryzant et al. 2020): 181K, parallel

Toxicity (Offensive→Non-offensive):
• Twitter (dos Santos, Melnyk, and Padhi 2018): 58K, non-parallel
• Reddit (dos Santos, Melnyk, and Padhi 2018): 224K, non-parallel
• Reddit Politics (Tran, Zhang, and Soleymani 2020): 350K, non-parallel

Authorship:
• Shakespearean↔Modern: Shakespeare (Xu et al. 2012): 18K, parallel
• Different Bible translators: Bible (Carlson, Riddell, and Rockmore 2018): 28M, parallel

Simplicity (Complicated→Simple):
• PWKP (Zhu, Bernhard, and Gurevych 2010): 108K, parallel
• Expert (den Bercken, Sips, and Lofi 2019): 2.2K, parallel
• MIMIC-III (Weng, Chung, and Szolovits 2019): 59K, non-parallel
• MSD (Cao et al. 2020): 114K, parallel

Engagingness (Plain→Attractive):
• Math (Koncel-Kedziorski et al. 2016): <1K, parallel
• TitleStylist (Jin et al. 2020a): 146K, non-parallel

Content Preferences

Sentiment (Positive↔Negative):
• Yelp (Shen et al. 2017): 250K, non-parallel
• Amazon (He and McAuley 2016): 277K, non-parallel

Topic (Entertainment↔Politics):
• Yahoo! Answers (Huang et al. 2020): 153K, non-parallel

Politics (Democratic↔Republican):
• Political (Voigt et al. 2018): 540K, non-parallel