Syntax-Guided Controlled Generation of Paraphrases

Ashutosh Kumar1 Kabir Ahuja2∗ Raghuram Vadapalli3∗ Partha Talukdar1

1Indian Institute of Science, Bangalore
2Microsoft Research, Bangalore
3Google, London
ashutosh@iisc.ac.in, kabirahuja2431@gmail.com
raghuram.4350@gmail.com, ppt@iisc.ac.in

Abstract

Given a sentence (e.g., ‘‘I like mangoes’’) and
a constraint (e.g., sentiment flip), the goal of
controlled text generation is to produce a
sentence that adapts the input sentence to meet
the requirements of the constraint (e.g., ‘‘I
hate mangoes’’). Going beyond such simple
constraints, recent work has started exploring
the incorporation of complex syntactic-guidance
as constraints in the task of controlled
paraphrase generation. In these methods,
syntactic-guidance is sourced from a separate
exemplar sentence. However, these prior
works have only utilized limited syntactic
information available in the parse tree of the
exemplar sentence. We address this limitation
in the paper and propose Syntax Guided
Controlled Paraphraser (SGCP), an end-to-end
framework for syntactic paraphrase generation.
We find that SGCP can generate syntax-conforming
sentences while not compromising
on relevance. We perform extensive automated
and human evaluations over multiple real-world
English language datasets to demonstrate
the efficacy of SGCP over state-of-the-art
baselines. To drive future research, we have
made SGCP’s source code available.1

1 Introduction

Controlled text generation is the task of producing
a sequence of coherent words based on given
constraints. These constraints can range from
simple attributes like tense, sentiment polarity,
and word-reordering (Hu et al., 2017; Shen et al.,
2017; Yang et al., 2018) to more complex syntactic
information. For example, given a sentence ‘‘The
movie is awful!’’ and a simple constraint like flip
sentiment to positive, a controlled text generator
is expected to produce the sentence ‘‘The movie is
fantastic!’’.

∗ This research was conducted during the authors'
internship at Indian Institute of Science.

1https://github.com/malllabiisc/SGCP.

These constraints are important in not only
providing information about what to say but
also how to say it. Without any constraint, the
ubiquitous sequence-to-sequence neural models
often tend to produce degenerate outputs and
favor generic utterances (Vinyals and Le, 2015; Li
et al., 2016). Although simple attributes are helpful
in addressing what to say, they provide very
little information about how to say it. Syntactic
control over generation helps in filling this gap by
providing that missing information.

Incorporating complex syntactic information
has shown promising results in neural machine
translation (Stahlberg et al., 2016; Aharoni and
Goldberg, 2017; Yang et al., 2019), data-to-text
generation (Peng et al., 2019), abstractive text-
summarization (Cao et al., 2018), and adversarial
text generation (Iyyer et al., 2018). Additionally,
recent work (Iyyer et al., 2018; Kumar et al., 2019)
has shown that augmenting lexical and syntactical
variations in the training set can help in building
better performing and more robust models.

In this paper, we focus on the task of syn-
tactically controlled paraphrase generation, that
is, given an input sentence and a syntactic
exemplar, produce a sentence that conforms to
the syntax of the exemplar while retaining the
meaning of the original input sentence. While
syntactically controlled generation of paraphrases
finds applications in multiple domains like data-
augmentation and text passivization, we highlight
its importance in the particular task of text
simplification. As pointed out in Siddharthan
(2014), depending on the literacy skill of an
individual, certain syntactical forms of English
sentences are easier to comprehend than others. As
an example, consider the following two sentences:

S1 Because it is raining today, you should carry
an umbrella.

S2 You should carry an umbrella today, because
it is raining.

Connectives that permit pre-posed adverbial
clauses have been found to be difficult for third
to fifth grade readers, even when the order of
mention coincides with the causal (and temporal)
order (Anderson and Davison, 1986; Levy, 2003).
Hence, they prefer sentence S2. However, various
other studies (Clark and Clark, 1968; Katz and
Brent, 1968; Irwin, 1980) have suggested that
for older school children, college students, and
adults, comprehension is better for the cause-effect
presentation, hence sentence S1. Thus, modifying
a sentence, syntactically, would help in better
comprehension based on literacy skills.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 330–345, 2020. https://doi.org/10.1162/tacl_a_00318
Action Editor: Asli Celikyilmaz. Submission batch: 12/2019; Revision batch: 2/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Prior work in syntactically controlled paraphrase
generation addressed this task by conditioning
the semantic input on either the features
learned from a linearized constituency-based parse
tree (Iyyer et al., 2018), or the latent syntactic
information (Chen et al., 2019a) learned from
exemplars through variational auto-encoders.
Linearizing parse trees typically results in loss of
essential dependency information. On the other
hand, as noted in Shi et al. (2016), an autoencoder-
based approach might not offer rich enough
syntactic information as guaranteed by actual
constituency parse trees. Moreover, as noted
in Chen et al. (2019a), SCPN (Iyyer et al.,
2018), and CGEN (Chen et al., 2019a) tend to
generate sentences of the same length as the
exemplar. This is an undesirable characteristic
because it often results in producing sentences that
end abruptly, thereby compromising on grammaticality
and semantics. Please see Table 1 for
sample generations using each of the models.

To address these gaps, we propose Syntax
Guided Controlled Paraphraser (SGCP) which
uses full exemplar syntactic tree information.
Additionally, our model provides an easy
mechanism to incorporate different levels of
syntactic control (granularity) based on the height
of the tree being considered. The decoder in
our framework is augmented with rich enough
syntactical information to be able to produce
syntax conforming sentences while not losing out
on semantics and grammaticality.

SOURCE       – how do i predict the stock market ?
EXEMPLAR     – can a brain transplant be done ?
SCPN         – how can the stock and start ?
CGEN         – can the stock market actually happen ?
SGCP (Ours)  – can i predict the stock market ?

SOURCE       – what are some of the mobile apps you ca n’t live without and why ?
EXEMPLAR     – which is the best resume you have come across ?
SCPN         – what are the best ways to lose weight ?
CGEN         – which is the best mobile app you ca n’t ?
SGCP (Ours)  – which is the best app you ca n’t live without and why ?

Table 1: Sample syntactic paraphrases generated
by SCPN (Iyyer et al., 2018), CGEN (Chen et al.,
2019a), SGCP (Ours). We observe that SGCP is able
to generate syntax conforming paraphrases without
compromising much on relevance.

The main contributions of this work are as follows:

1. We propose SGCP, an end-to-end model to
generate syntactically controlled paraphrases
at different
levels of granularity using a
parsed exemplar.

2. We provide a new decoding mechanism to
incorporate syntactic information from the
exemplar sentence’s syntactic parse.

3. We provide a dataset formed from Quora
Question Pairs2 for evaluating the models.
We also perform extensive experiments to
demonstrate the efficacy of our model using
multiple automated metrics as well as human
evaluations.

2 Related Work

Controllable Text Generation is an important
problem in NLP that has received significant attention
in recent times. Prior work includes generating
text using models conditioned on attributes like
formality, sentiment, or tense (Hu et al., 2017;
Shen et al., 2017; Yang et al., 2018) as well as
on syntactical templates (Iyyer et al., 2018; Chen
et al., 2019a). These systems find applications in
adversarial sample generation (Iyyer et al., 2018),
text summarization, and table-to-text generation
(Peng et al., 2019). While achieving state-of-the-
art in their respective domains, these systems
typically rely on a known finite set of attributes
thereby making them quite restrictive in terms of
the styles they can offer.

2https://www.kaggle.com/c/quora-question-pairs.

Paraphrase Generation. While generation of
paraphrases has been addressed in the past using
traditional methods (McKeown, 1983; Barzilay
and Lee, 2003; Quirk et al., 2004; Hassan et al.,
2007; Zhao et al., 2008; Madnani and Dorr, 2010;
Wubben et al., 2010), they have recently been
superseded by deep learning-based approaches
(Prakash et al., 2016; Gupta et al., 2018; Li
et al., 2019, 2018; Kumar et al., 2019). The
primary task of all these methods (Prakash et al.,
2016; Gupta et al., 2018; Li et al., 2018) is to
generate the most semantically similar sentence
and they typically rely on beam search to obtain
any kind of lexical diversity. Kumar et al. (2019)
try to tackle the problem of achieving lexical,
and limited syntactical diversity using submodular
optimization but do not provide any syntactic
control over the type of utterance that might be
desired. These methods are therefore restrictive
in terms of the syntactical diversity that they can
offer.

Controlled Paraphrase Generation. Our task
is similar in spirit to Iyyer et al. (2018) and
Chen et al. (2019a), which also deal with the
task of syntactic paraphrase generation. However,
the approach taken by them is different from
ours in at least two aspects. Firstly, SCPN (Iyyer
et al., 2018) uses an attention-based (Bahdanau
et al., 2014) pointer-generator network (See et al.,
2017) to encode input sentences and a linearized
constituency tree to produce paraphrases. Because
of the linearization of syntactic tree, considerable
dependency-based information is generally lost.
Our model,
instead, directly encodes the tree
structure to produce a paraphrase. Secondly,
the inference (or generation) process in SCPN
is computationally very expensive, because it
involves a two-stage generation process. In the
first stage, they generate full parse trees from
incomplete templates, and then from full parse
trees to final generations. In contrast, the inference
in our method involves a single-stage process,
wherein our model takes as input a semantic
source, a syntactic tree and the level of syntactic

style that needs to be transferred, to obtain the
generations. Additionally, we also observed that
SCPN does not perform well in low-resource
settings. This, again, can be attributed to the
compounding implicit noise in the training due
to linearized trees and generation of full linearized
trees before obtaining the final paraphrases.

Chen et al. (2019a) propose a syntactic
exemplar-based method for controlled paraphrase
generation using an approach based on latent
variable probabilistic modeling, neural variational
inference, and multi-task learning. This, in
principle, is very similar to Chen et al. (2019b). As
opposed to our model, which provides different
levels of syntactic control of the exemplar-based
generation, this approach is restrictive in
terms of the flexibility it can offer. Also, as
noted in Shi et al. (2016), an autoencoder-based
approach might not offer rich enough syntactic
information as offered by actual constituency
parse trees. Additionally, VAEs (Kingma and
Welling, 2014) are generally unstable and harder
to train (Bowman et al., 2016; Gupta et al., 2018)
than seq2seq-based approaches.

3 SGCP: Proposed Method

In this section, we describe the inputs and various
architectural components essential for building
SGCP, an end-to-end trainable model. Our model,
as shown in Figure 1, comprises a sentence
encoder (3.2), syntactic tree encoder (3.3), and
a syntactic-paraphrase-decoder (3.4).

3.1 Inputs

Given an input sentence X and a syntactic
exemplar Y , our goal is to generate a sentence
Z that conforms to the syntax of Y while retaining
the meaning of X.

The semantic encoder (Section 3.2) works
on a sequence of input tokens, and the syntactic
encoder (Section 3.3) operates on constituency-
based parse trees. We parse the syntactic exemplar
Y 3 to obtain its constituency-based parse tree. The
leaf nodes of the constituency-based parse tree
consist of the tokens of the sentence Y. These tokens,
in some sense, carry the semantic information of
sentence Y, which we do not need for generating
paraphrases. In order to prevent any meaning
propagation from exemplar sentence Y into the
generation, we remove these leaf/terminal nodes
from its constituency parse. The tree thus obtained
is denoted as CY.

3Obtained using the Stanford CoreNLP toolkit (Manning
et al., 2014).

Figure 1: Architecture of SGCP (proposed method). SGCP aims to paraphrase an input sentence, while conforming
to the syntax of an exemplar sentence (provided along with the input). The input sentence is encoded using
the Sentence Encoder (Section 3.2) to obtain a semantic signal ct. The Syntactic Encoder (Section 3.3) takes a
constituency parse tree (pruned at height H) of the exemplar sentence as an input, and produces representations for
all the nodes in the pruned tree. Once both of these are encoded, the Syntactic Paraphrase Decoder (Section 3.4)
uses a pointer-generator network, and at each time step takes the semantic signal ct, the decoder recurrent state st,
the embedding of the previous token, and the syntactic signal $h^Y_t$ to generate a new token. Note that the syntactic signal
remains the same for each token in a span (shown in figure above curly braces; please see Figure 2 for more
details). The gray shaded region (not part of the model) illustrates a qualitative comparison of the exemplar syntax
tree and the syntax tree obtained from the generated paraphrase. Please refer to Section 3 for details.

The syntactic encoder, additionally, takes as
input H, which governs the level of syntactic
control needed to be induced. The utility of H will
be described in Section 3.3.
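
The leaf-removal step described above can be illustrated with a small sketch. It assumes the parse is available as a Penn-style bracketed string (the usual textual output of constituency parsers); the helper names `parse_brackets`, `strip_words`, and `to_brackets`, as well as the example parse itself, are hypothetical and only meant to show the idea, not the authors' preprocessing code.

```python
def parse_brackets(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (terminal token)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def strip_words(node):
    """Drop word leaves so that only syntactic labels remain (the tree C_Y)."""
    label, children = node
    return (label, [strip_words(c) for c in children if not isinstance(c, str)])


def to_brackets(node):
    """Serialize a (label, children) tree back to a bracketed string."""
    label, children = node
    if not children:
        return f"({label})"
    return f"({label} " + " ".join(to_brackets(c) for c in children) + ")"


# Hypothetical parse of the exemplar "can a brain transplant be done ?"
exemplar_parse = ("(ROOT (SQ (MD can) (NP (DT a) (NN brain) (NN transplant)) "
                  "(VP (VB be) (VP (VBN done))) (. ?)))")
print(to_brackets(strip_words(parse_brackets(exemplar_parse))))
```

After stripping, no word of the exemplar survives in the tree, so the decoder can only receive structural information from it.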

3.2 Semantic Encoder

The semantic encoder, a multilayered Gated
Recurrent Unit (GRU), receives tokenized sentence
X = {x1, . . . , xTX } as input and computes
the contextualized hidden state representation
$h^X_t$ for each token using:

\[ h^X_t = \mathrm{GRU}(h^X_{t-1}, e(x_t)), \tag{1} \]

where e(xt) represents the learnable embedding
of the token xt and t ∈ {1, . . . , TX}. Note that
we use byte-pair encoding (Sennrich et al., 2016)
for word/token segmentation.

3.3 Syntactic Encoder

This encoder provides the necessary syntactic
guidance for
the generation of paraphrases.
Formally, let constituency tree CY = {V, E, Y},
where V is the set of nodes, E the set of edges,
and Y the labels associated with each node.

We calculate the hidden-state representation
$h^Y_v$ of each node v ∈ V using the hidden-state
representation of its parent node pa(v) and the
embedding associated with its label yv as follows:

\[ h^Y_v = \mathrm{GeLU}(W_{pa} h^Y_{pa(v)} + W_v e(y_v) + b_v), \tag{2} \]

where e(yv) is the embedding of the node label
yv, and Wpa, Wv, bv are learnable parameters. This
approach can be considered similar to TreeLSTM
(Tai et al., 2015). We use GeLU activation func-
tion (Hendrycks and Gimpel, 2016) rather than
the standard tanh or relu, because of superior
empirical performance.
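
As a rough illustration of Equation 2, the following sketch computes node representations top-down over a label-only tree. It is written in plain Python for readability rather than speed. The zero vector used as the parent state of the root, the tiny dimensionality, and the names `encode_tree`, `matvec`, and `gelu` are assumptions made for this example; the paper does not state how the root's parent state is initialized.

```python
import math


def gelu(x):
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def encode_tree(node, h_parent, params, out=None):
    """Apply h_v = GeLU(W_pa h_pa(v) + W_v e(y_v) + b_v) to every node.

    node   : (label, children) tuple
    params : (W_pa, W_v, b_v, emb) where emb maps a label to its embedding
    Returns a list of (label, h_v) pairs in pre-order.
    """
    if out is None:
        out = []
    label, children = node
    W_pa, W_v, b_v, emb = params
    pre = [p + q + r for p, q, r in
           zip(matvec(W_pa, h_parent), matvec(W_v, emb[label]), b_v)]
    h_v = [gelu(x) for x in pre]
    out.append((label, h_v))
    for child in children:
        encode_tree(child, h_v, params, out)  # the child sees its parent's state
    return out


# Toy usage with 2-dimensional identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_params = (identity, identity, [0.0, 0.0],
              {"S": [1.0, 0.0], "NP": [0.0, 1.0]})
states = encode_tree(("S", [("NP", [])]), [0.0, 0.0], toy_params)
```

Because each node depends only on its parent, information flows from the root towards the leaves, which is why the frontier nodes of a pruned tree still summarize the path above them.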

As indicated in Section 3.1, the syntactic encoder
takes as input the height H, which governs the
level of syntactic control. We randomly prune the
tree CY to height H ∈ {3, . . . , Hmax}, where Hmax
is the height of the full constituency tree CY . As an
example, in Figure 2b, we prune the constituency-
based parse tree of the exemplar sentence to
height H = 3. The leaf nodes for this tree have the
labels WP, VBZ, NP, and '.'. Although we
calculate the hidden-state representation of all the
nodes, only the terminal nodes are responsible
for providing the syntactic signal to the decoder
(Section 3.4).

Figure 2: The constituency parse tree serves as an input to the syntactic encoder (Section 3.3). The first step is
to remove the leaf nodes which contain meaning representative tokens (Here: What is the best language . . . ).
H denotes the height to which the tree can be pruned and is an input to the model. Figure 2(a) shows the full
constituency parse tree annotated with vector a for different heights. Figure 2(b) shows the same tree pruned at
height H = 3 with its corresponding a vector. The vector a serves as a signalling vector (Section 3.4.2) which
helps in deciding the syntactic signal to be passed on to the decoder. Please refer to Section 3 for details.

We maintain a queue $L^Y_H$ of such terminal node
representations where elements are inserted from
left to right for a given H. Specifically, for the
particular example given in Figure 2b,

\[ L^Y_H = [h^Y_{\mathrm{WP}}, h^Y_{\mathrm{VBZ}}, h^Y_{\mathrm{NP}}, h^Y_{.}] \]

We emphasize the fact that the length of the queue
$|L^Y_H|$ is a function of height H.
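
The pruning step and the left-to-right queue can be sketched as follows. The tree below is a hypothetical, simplified stand-in for the one in Figure 2, and the height convention (root counted as height 1) is an assumption chosen so that pruning at H = 3 gives the frontier WP, VBZ, NP, '.' described in the text. The function name `frontier` is also illustrative.

```python
from collections import deque


def frontier(node, H, height=1):
    """Left-to-right terminal labels of a (label, children) tree pruned at height H.

    A node becomes terminal either because it sits at height H or because it
    has no children left after the words were removed.
    """
    label, children = node
    if height == H or not children:
        return [label]
    labels = []
    for child in children:
        labels.extend(frontier(child, H, height + 1))
    return labels


# Hypothetical label-only tree (words already stripped).
tree = ("SBARQ", [
    ("WHNP", [("WP", [])]),
    ("SQ", [("VBZ", []), ("NP", [("DT", []), ("JJS", []), ("NN", [])])]),
    (".", []),
])

queue_h3 = deque(frontier(tree, 3))   # terminal nodes at H = 3, left to right
queue_h4 = deque(frontier(tree, 4))   # a finer-grained frontier
```

In a full model the queue would hold the hidden states of these nodes rather than their labels; labels are used here only to keep the example readable. A larger H yields a longer queue, which is the sense in which the queue length is a function of H.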

3.4 Syntactic Paraphrase Decoder

Having obtained the semantic and syntactic
representations, the decoder is tasked with the
generation of syntactic paraphrases. This can
be modeled as finding the best Z = Z∗ that
maximizes the probability P(Z|X, Y ), which can
further be factorized as:

\[ Z^* = \arg\max_{z} \prod_{t=1}^{T_Z} P(z_t \mid z_1, \ldots, z_{t-1}, X, Y), \tag{3} \]

where TZ is the maximum length up to which
decoding is required.

In the subsequent sections, we use t to denote
the decoder time step.

3.4.1 Using Semantic Information
At each decoder time step t, the attention
distribution $\alpha^t$ is calculated over the encoder
hidden states $h^X_i$, obtained using Equation 1, as:

\[ e^t_i = v^\top \tanh(W_h h^X_i + W_s s_t + b_{attn}), \qquad \alpha^t = \mathrm{softmax}(e^t), \tag{4} \]

where st is the decoder cell-state and v, Wh, Ws,
battn are learnable parameters.

The attention distribution provides a way to
jointly align and train sequence to sequence
models by producing a weighted sum of the
semantic encoder hidden states, known as
context-vector ct, given by:

\[ c_t = \sum_i \alpha^t_i h^X_i \tag{5} \]

ct serves as the semantic signal which is essential
for generating meaning preserving sentences.
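
A minimal sketch of the attention computation in Equations 4 and 5 is given below, using plain Python lists so that each term of the formulas is visible. The function name `attention`, the argument order, and the toy shapes are assumptions made for illustration; an actual implementation would use a tensor library and batched operations.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def attention(enc_states, s_t, W_h, W_s, b_attn, v):
    """Return (alpha, c_t) for one decoder step.

    enc_states : list of encoder hidden states h^X_i
    s_t        : decoder state at the current step
    """
    proj_s = matvec(W_s, s_t)  # W_s s_t is shared by all source positions
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + c + d) for a, c, d in zip(proj_h, proj_s, b_attn)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))  # e^t_i
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]
    # Context vector: attention-weighted sum of encoder states.
    dim = len(enc_states[0])
    c_t = [sum(a * h[k] for a, h in zip(alpha, enc_states)) for k in range(dim)]
    return alpha, c_t
```

The weights in `alpha` are reused later by the copy mechanism, so it is convenient to return them together with the context vector.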

3.4.2 Using Syntactic Information
During training, each terminal node in the tree CY,
pruned at H, is equipped with information about
the span of words it needs to generate. At each
time step t, only one terminal node representation
$h^Y_v \in L^Y_H$ is responsible for providing the
syntactic signal, which we call $h^Y_t$. The hidden-
state representation to be used is governed
through a signalling vector a = (a1, . . . , aTz ),
where each ai ∈ {0, 1}. 0 indicates that the
decoder should keep on using the same hidden-
representation $h^Y_v \in L^Y_H$ that is currently being
used, and 1 indicates that the next element (hidden-
representation) in the queue $L^Y_H$ should be used
for decoding.
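
The way the signalling vector a drives the queue during training can be sketched as follows. The function name `syntactic_signals` is hypothetical, and labels stand in for the hidden-state vectors. The sketch also assumes that a starts with 1 and contains exactly as many 1s as there are queue elements.

```python
from collections import deque


def syntactic_signals(queue_items, a):
    """Expand a queue of terminal-node signals into one signal per decoder step.

    a_t == 1 pops the next element; a_t == 0 keeps using the last popped one.
    """
    queue = deque(queue_items)
    current = None
    signals = []
    for a_t in a:
        if a_t == 1:
            current = queue.popleft()  # raises IndexError if a has too many 1s
        if current is None:
            raise ValueError("a must start with 1 so that a signal is available")
        signals.append(current)
    return signals


# Queue and signalling vector for the tree pruned at H = 3 (Figure 2b).
signals = syntactic_signals(["WP", "VBZ", "NP", "."], (1, 1, 1, 0, 0, 0, 0, 0, 1))
```

Each terminal node thus guides a contiguous span of output tokens, and the number of 1s in a equals the length of the queue.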

The utility of a can be best understood through
Figure 2b. Consider the syntactic tree pruned at
height H = 3. For this example,

\[ L^Y_H = [h^Y_{\mathrm{WP}}, h^Y_{\mathrm{VBZ}}, h^Y_{\mathrm{NP}}, h^Y_{.}] \]

and

\[ a = (1, 1, 1, 0, 0, 0, 0, 0, 1) \]

ai = 1 provides a signal to pop an element
from the queue $L^Y_H$ while ai = 0 provides a
signal to keep on using the last popped element.
This element is then used to guide the decoder
syntactically by providing a signal in the form of
hidden-state representation (Equation 8).

Specifically, in this example, the a1 = 1 signals
$L^Y_H$ to pop $h^Y_{\mathrm{WP}}$ to provide syntactic guidance
to the decoder for generating the first token.
a2 = 1 signals $L^Y_H$ to pop $h^Y_{\mathrm{VBZ}}$ to provide
syntactic guidance to the decoder for generating
the second token. a3 = 1 helps in obtaining $h^Y_{\mathrm{NP}}$
from $L^Y_H$ to provide guidance to generate the
third token. As described earlier, a4, . . . , a8 = 0
indicates, that the same representation $h^Y_{\mathrm{NP}}$ should
be used for syntactically guiding tokens z4, . . . , z8.
Finally a9 = 1 helps in retrieving $h^Y_{.}$ for
guiding decoder to generate token z9. Note that
$|L^Y_H| = \sum_{i=1}^{T_z} a_i$.

Although a is provided to the model during
training, this information might not be available
during inference. Providing a during generation
makes the model restrictive and might result
in producing ungrammatical sentences. SGCP is
tasked to learn a proxy for the signalling vector a,
using transition probability vector p.

At each time step t, we calculate pt ∈ (0, 1),
which determines the probability of changing the
syntactic signal using:

\[ p_t = \sigma(W_{bop}([c_t; h^Y_t; s_t; e(z'_t)]) + b_{bop}), \tag{6} \]

\[ h^Y_{t+1} = \begin{cases} h^Y_t & p_t < 0.5 \\ \mathrm{pop}(L^Y_H) & \text{otherwise} \end{cases} \tag{7} \]

where pop removes and returns the next element
in the queue, st is the decoder state, and $e(z'_t)$ is
the embedding of the input token at time t during
decoding.

3.4.3 Overall
The semantic signal ct, together with decoder
state st, embedding of the input token $e(z'_t)$ and
the syntactic signal $h^Y_t$ is fed through a GRU
followed by softmax of the output to produce a
vocabulary distribution as:

\[ P_{vocab} = \mathrm{softmax}(W([c_t; h^Y_t; s_t; e(z'_t)]) + b), \tag{8} \]

where [; ] represents concatenation of constituent
elements, and W, b are trainable parameters.

We augment this with the copying mechanism
(Vinyals et al., 2015) as in the pointer-generator
network (See et al., 2017). Usage of such a
mechanism offers a probability distribution over
the extended vocabulary (the union of vocabulary
words and words present in the source sentence)
as follows:

\[ P(z) = p_{gen} P_{vocab}(z) + (1 - p_{gen}) \sum_{i : z_i = z} \alpha^t_i \]
\[ p_{gen} = \sigma(w_c^\top c_t + w_s^\top s_t + w_x^\top e(z'_t) + b_{gen}) \tag{9} \]

where wc, ws, wx and bgen are learnable parameters,
$e(z'_t)$ is the input token embedding to the
decoder at time step t, and $\alpha^t_i$ is the element
corresponding to the ith co-ordinate in the attention
distribution as defined in Equation 4.

The overall objective can be obtained by
taking negative log-likelihood of the distributions
obtained in Equation 6 and Equation 9.

\[ \mathcal{L} = -\frac{1}{T} \sum_{t=0}^{T} \left[ \log P(z^*_t) + a_t \log(p_t) + (1 - a_t) \log(1 - p_t) \right] \tag{10} \]

where at is the tth element of the vector a.

4 Experiments

Our experiments are geared towards answering
the following questions:

Q1. Is SGCP able to generate syntax conforming sentences without losing out on meaning? (Section 5.1, 5.4)

Q2. What level of syntactic control does SGCP offer? (Section 5.2, 5.3)

Q3. How does SGCP compare against prior models, qualitatively? (Section 5.4)

Q4. Are the improvements achieved by SGCP statistically significant? (Section 5.1)

Based on these questions, we outline the methods compared (Section 4.1), along with the datasets (Section 4.2) used, evaluation criteria (Section 4.3) and the experimental setup (Section 4.4).

4.1 Methods Compared

As in Chen et al. (2019a), we first highlight the results of the two direct return-input baselines.

1. Source-as-Output: Baseline where the output is the semantic input.

2. Exemplar-as-Output: Baseline where the output is the syntactic exemplar.

We compare the following competitive methods:

3. SCPN (Iyyer et al., 2018) is a sequence-to-sequence based model comprising two encoders built with LSTM (Hochreiter and Schmidhuber, 1997) to encode semantics and syntax respectively. Once the encoding is obtained, it serves as an input to the LSTM-based decoder, which is augmented with soft-attention (Bahdanau et al., 2014) over encoded states as well as a copying mechanism (See et al., 2017) to deal with out-of-vocabulary tokens.4

4. CGEN (Chen et al., 2019a) is a VAE (Kingma and Welling, 2014) model with two encoders to project semantic input and syntactic input to a latent space. They obtain a syntactic embedding from one encoder, using a standard Gaussian prior. To obtain the semantic representation, they use von Mises-Fisher prior, which can be thought of as a Gaussian distribution on a hypersphere.
They train the model using a multi-task paradigm, incorporating paraphrase generation loss and word position loss. We considered their best model, VGVAE + LC + WN + WPL, which incorporates the above objectives.

5. SGCP (Section 3) is a sequence-and-tree-to-sequence based model that encodes semantics and tree-level syntax to produce paraphrases. It uses a GRU-based (Chung et al., 2014) decoder with soft-attention on semantic encodings and a begin of phrase (bop) gate to select a leaf node in the exemplar syntax tree. We compare the following two variants of SGCP:

(a) SGCP-F: Uses full constituency parse tree information of the exemplar for generating paraphrases.

(b) SGCP-R: SGCP can produce multiple paraphrases by pruning the exemplar tree at various heights. This variant first generates five candidate generations, corresponding to five different heights of the exemplar tree, namely, {Hmax, Hmax − 1, Hmax − 2, Hmax − 3, Hmax − 4}, for each (source, exemplar) pair. From these candidates, the one with the highest ROUGE-1 score with the source sentence is selected as the final generation.

Note that, except for the return-input baselines, all methods use beam search during inference.

4Note that the results for SCPN differ from the ones shown in Iyyer et al. (2018). This is because the dataset used in Iyyer et al. (2018) is at least 50 times larger than the largest dataset (ParaNMT-small) in this work.

4.2 Datasets

We train the models and evaluate them on the following datasets:

(1) ParaNMT-small (Chen et al., 2019a) contains 500K sentence-paraphrase pairs for training, and 1,300 manually labeled sentence-exemplar-reference triples, which are further split into 800 test data points and 500 dev. data points, respectively.

As in Chen et al. (2019a), our model uses only (sentence, paraphrase) during training. The paraphrase itself serves as the exemplar input during training.

This dataset is a subset of the original ParaNMT-50M dataset (Wieting and Gimpel, 2018).
ParaNMT-50M is a data set generated automatically through backtranslation of original English sentences. It is inherently noisy because of imperfect neural machine translation quality, with many sentences being non-grammatical and some even being non-English sentences. Because of such noisy data points, it is optimistic to assume that the corresponding constituency parse tree would be well aligned. To that end, we propose to use the following additional dataset, which is more well-formed and has more human intervention than the ParaNMT-50M dataset.

(2) QQP-Pos: The original Quora Question Pairs (QQP) dataset contains about 400K sentence pairs labeled positive if they are duplicates of each other and negative otherwise. The dataset is composed of about 150K positive and 250K negative pairs. We select those positive pairs that contain both sentences with a maximum token length of 30, leaving us with ∼146K pairs. We call this dataset QQP-Pos.

Similar to ParaNMT-small, we use only the sentence-paraphrase pairs as training set and sentence-exemplar-reference triples for testing and validation. We randomly choose 140K sentence-paraphrase pairs as the training set Ttrain, and the remaining 6K pairs Teval are used to form the evaluation set E. Additionally, let Teset = ⋃ {{X, Z} : (X, Z) ∈ Teval}. Note that Teset is a set of sentences while Teval is a set of sentence-paraphrase pairs.

Let E = φ be the initial evaluation set. For selecting an exemplar for each sentence-paraphrase pair (X, Z) ∈ Teval, we adopt the following procedure:

Step 1: For a given (X, Z) ∈ Teval, construct an exemplar candidate set $\mathcal{C}$ = Teset − {X, Z}. $|\mathcal{C}|$ ≈ 12,000.

Step 2: Retain only those sentences C ∈ $\mathcal{C}$ whose sentence length (= number of tokens) differs by at most two when compared to the paraphrase Z. This is done since sentences with similar constituency-based parse tree structures tend to have similar token lengths.

Step 3: Remove those candidates C ∈ $\mathcal{C}$, which are very similar to the source sentence X, that is, BLEU(X, C) > 0.6.

Step 4: From the remaining instances in $\mathcal{C}$,
choose that sentence C as the exemplar Y
which has the least Tree-Edit distance with
the paraphrase Z of the selected pair, namely,
$Y = \arg\min_{C \in \mathcal{C}} \mathrm{TED}(Z, C)$. This ensures that
the constituency-based parse tree of the
exemplar Y is quite similar to that of Z,
in terms of Tree-Edit distance.

Step 5: E := E ∪ {(X, Y, Z)}.

Step 6: Repeat procedure for all other pairs in Teval.

From the obtained evaluation set E, we
randomly choose 3K triplets for the test set Ttest,
and remaining 3K for the validation set V.
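
The exemplar selection procedure (Steps 1 to 6) can be sketched as below. This is only an outline under stated assumptions: `token_overlap` is a toy stand-in for BLEU, `position_mismatch` is a toy stand-in for the tree-edit distance between parse trees (a real pipeline would parse both sentences and compare trees), and skipping a pair when no candidate survives the filters is a choice made here, since the text does not specify that case. All function names are hypothetical.

```python
def token_overlap(a, b):
    """Jaccard overlap of token sets; a toy stand-in for BLEU(X, C)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def position_mismatch(a, b):
    """Toy stand-in for the tree-edit distance TED(Z, C)."""
    ta, tb = a.split(), b.split()
    return sum(x != y for x, y in zip(ta, tb)) + abs(len(ta) - len(tb))


def select_exemplar(x, z, sentences, similarity=token_overlap,
                    distance=position_mismatch, max_len_diff=2, max_sim=0.6):
    """Pick an exemplar Y for the pair (X, Z) from a pool of sentences."""
    n_z = len(z.split())
    candidates = [c for c in sentences if c not in (x, z)]               # Step 1
    candidates = [c for c in candidates
                  if abs(len(c.split()) - n_z) <= max_len_diff]          # Step 2
    candidates = [c for c in candidates if similarity(x, c) <= max_sim]  # Step 3
    if not candidates:
        return None  # assumption: no exemplar is assigned to this pair
    return min(candidates, key=lambda c: distance(z, c))                 # Step 4


def build_eval_set(pairs):
    """Build (X, Y, Z) triples from sentence-paraphrase pairs."""
    sentences = sorted({s for pair in pairs for s in pair})  # plays the role of T_eset
    triples = []
    for x, z in pairs:                                       # Step 6
        y = select_exemplar(x, z, sentences)
        if y is not None:
            triples.append((x, y, z))                        # Step 5
    return triples
```

Filtering by length and similarity before computing distances keeps the expensive tree comparison to a small candidate set, which matters when the pool holds roughly 12,000 sentences per pair.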

4.3 Evaluation

It should be noted that there is no single fully
reliable metric for evaluating syntactic paraphrase
generation. Therefore, we evaluate on the follow-
ing metrics to showcase the efficacy of syntactic
paraphrasing models.

1. Automated Evaluation.

(i) Alignment based metrics: We compute
BLEU (Papineni et al., 2002), METEOR
(Banerjee and Lavie, 2005), ROUGE-1,
ROUGE-2, and ROUGE-L (Lin, 2004)
scores between the generated and the refer-
ence paraphrases in the test set.

(ii) Syntactic Transfer: We evaluate the
syntactic transfer using Tree-edit distance
(Zhang and Shasha, 1989) between the parse
trees of:

(a) the generated and the syntactic exemplar
in the test set – TED-E, and

(b) the generated and the reference
paraphrase in the test set – TED-R

(iii) Model-based evaluation: Because our
goal is to generate paraphrases of the input
sentences, we need some measure to deter-
mine if the generations indeed convey the
same meaning as the original text. To achieve
this, we adopt a model-based evaluation
metric as used by Shen et al. (2017) for
Text Style Transfer and Isola et al. (2017) for
Image Transfer. Specifically, classifiers are
trained on the task of Paraphrase Detection
and then used as Oracles to evaluate the
generations of our model and the baselines.
We fine-tune two RoBERTa (Liu et al., 2019)
based sentence pair classifiers, one on Quora
Question Pairs (Classifier-1) and the other on
ParaNMT + PAWS5 datasets (Classifier-2),
which achieve accuracies of 90.2% and
94.0% on their respective test sets.6

5Because the ParaNMT dataset only contains paraphrase
pairs, we augment it with the PAWS (Zhang et al., 2019)
dataset to acquire negative samples.

Model                        BLEU↑  METEOR↑  ROUGE-1↑  ROUGE-2↑  ROUGE-L↑  TED-R↓  TED-E↓  PDS↑

QQP-Pos
Source-as-Output              17.2    31.1     51.9      26.2      52.9     16.2    16.6   99.8
Exemplar-as-Output            16.8    17.6     38.2      20.5      43.2      4.8     0.0   10.7
SCPN (Iyyer et al., 2018)     15.6    19.6     40.6      20.5      44.6      9.1     8.0   27.0
CGEN (Chen et al., 2019a)     34.9    37.4     62.6      42.7      65.4      6.7     6.0   65.4
SGCP-F                        36.7    39.8     66.9      45.0      69.6      4.8     1.8   75.0
SGCP-R                        38.0    41.3     68.1      45.7      70.2      6.8     5.9   87.7

ParaNMT-small
Source-as-Output              18.5    28.8     50.6      23.1      47.7     12.0    13.0   99.0
Exemplar-as-Output             3.3    12.1     24.4       7.5      29.1      5.9     0.0   14.0
SCPN (Iyyer et al., 2018)      6.4    14.6     30.3      11.2      34.6      6.2     1.4   15.4
CGEN (Chen et al., 2019a)     13.6    24.8     44.8      21.0      48.3      6.7     3.3   70.2
SGCP-F                        15.3    25.9     46.6      21.8      49.7      6.1     1.4   76.6
SGCP-R                        16.4    27.2     49.6      22.9      50.5      8.7     7.0   83.5

Table 2: Results on QQP and ParaNMT-small dataset. Higher↑ BLEU, METEOR, ROUGE, and PDS
is better whereas lower↓ TED score is better. SGCP-R selects the best candidate out of many, resulting
in performance boost for semantic preservation (shown in box). We bold the statistically significant
results of SGCP-F, only, for a fair comparison with the baselines. Note that Source-as-Output, and
Exemplar-as-Output are only dataset quality indicators and not the competitive baselines. Please see
Section 5 for details.

Once trained, we use Classifier-1 to evaluate
generations on QQP-Pos and Classifier-2 on
ParaNMT-small.

We first generate syntactic paraphrases using
all the models (Section 4.1) on the test splits
of QQP-Pos and ParaNMT-small datasets.
We then pair
the source sentence with
their corresponding generated paraphrases
and send them as input to the classifiers.
The Paraphrase Detection score, denoted as
PDS in Table 2, is defined as, the ratio
of the number of generations predicted as
paraphrases of their corresponding source
sentences by the classifier to the total number
of generations.
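The PDS definition above amounts to a simple ratio, sketched below. The function name `paraphrase_detection_score` and the `is_paraphrase` callable are hypothetical; in the paper the decision comes from a fine-tuned RoBERTa classifier, which is replaced here by an exact-match stand-in purely so the example runs on its own.

```python
from typing import Callable, Sequence


def paraphrase_detection_score(
    sources: Sequence[str],
    generations: Sequence[str],
    is_paraphrase: Callable[[str, str], bool],
) -> float:
    """Fraction of generations the classifier accepts as paraphrases."""
    if len(sources) != len(generations):
        raise ValueError("sources and generations must be aligned")
    if not generations:
        raise ValueError("need at least one generation")
    accepted = sum(
        1 for src, gen in zip(sources, generations) if is_paraphrase(src, gen)
    )
    return accepted / len(generations)


def exact_match(src: str, gen: str) -> bool:
    """Stand-in oracle (assumption): a real setup would call a classifier."""
    return src == gen


if __name__ == "__main__":
    srcs = ["a", "b", "c", "d"]
    gens = ["a", "b", "x", "y"]
    # Table 2 reports the score as a percentage, hence the factor of 100.
    print(100 * paraphrase_detection_score(srcs, gens, exact_match))
```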

2. Human Evaluation.

Although TED is sufficient to highlight syntactic transfer,
there has been some scepticism regarding automated metrics for
paraphrase quality (Reiter, 2018). To address this issue, we
perform human evaluation on 100 randomly selected data points
from the test set. In the evaluation, three judges
(non-researchers proficient in the English language) were asked
to assign scores to generated sentences based on the semantic
similarity with the given source sentence. The annotators were
shown a source sentence and the corresponding outputs of the
systems in random order. The scores ranged from 1 (doesn’t
capture meaning at all) to 4 (perfectly captures the meaning of
the source sentence).

6Because the test set of QQP is not public, the 90.2%
number was computed on the available dev set (not used for
model selection).

4.4 Setup

(a) Pre-processing. Because our model needs
access to constituency parse trees, we tokenize
and parse all our data points using the fully
parallelizable Stanford CoreNLP Parser (Manning
et al., 2014) to obtain their respective parse trees.
This is done prior to training in order to prevent
any additional computational costs that might be
incurred because of repeated parsing of the same
data points during different epochs.

(b) Implementation Details. We train both our
models using the Adam Optimizer (Kingma and
Ba, 2014) with an initial learning rate of 7e-5.
We use a bidirectional three-layered GRU
for encoding the tokenized semantic input and a
standard pointer-generator network with GRU for
decoding. The token embedding is learnable with
dimension 300. To reduce the training complexity
of the model, the maximum sequence length is
kept at 60. The vocabulary size is kept at 24K for
QQP and 50K for ParaNMT-small.

| Example | System | Sentence |
|---|---|---|
| 1 | Source | what should be done to get rid of laziness ? |
| 1 | Template Exemplar | how can i manage my anger ? |
| 1 | SCPN (Iyyer et al., 2018) | how can i get rid ? |
| 1 | CGEN (Chen et al., 2019a) | how can i get rid of ? |
| 1 | SGCP-F (Ours) | how can i stop my laziness ? |
| 1 | SGCP-R (Ours) | how do i get rid of laziness ? |
| 2 | Source | what books should entrepreneurs read on entrepreneurship ? |
| 2 | Template Exemplar | what is the best programming language for beginners to learn ? |
| 2 | SCPN (Iyyer et al., 2018) | what are the best books books to read to read ? |
| 2 | CGEN (Chen et al., 2019a) | what ’s the best book for entrepreneurs read to entrepreneurs ? |
| 2 | SGCP-F (Ours) | what is a best book idea that entrepreneurs to read ? |
| 2 | SGCP-R (Ours) | what is a good book that entrepreneurs should read ? |
| 3 | Source | how do i get on the board of directors of a non profit or a for profit organisation ? |
| 3 | Template Exemplar | what is the best way to travel around the world for free ? |
| 3 | SCPN (Iyyer et al., 2018) | what is the best way to prepare for a girl of a ? |
| 3 | CGEN (Chen et al., 2019a) | what is the best way to get a non profit on directors ? |
| 3 | SGCP-F (Ours) | what is the best way to get on the board of directors ? |
| 3 | SGCP-R (Ours) | what is the best way to get on the board of directors of a non profit or a for profit organisation ? |

Table 3: Sample generations of the competitive models. Please refer to Section 5.5 for details.

SGCP needs access to the level of syntactic
granularity for decoding, depicted as H in
Figure 2. During training, we vary H randomly
between 3 and H_max, changing it with each
training epoch. This acts as an implicit
regularizer and helps the model generalize across
granularity levels. At each time-step of
the decoding process, we keep a teacher forcing
ratio of 0.9.
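The training-time schedule just described can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, H is assumed to be drawn uniformly from the integers 3 to H_max once per epoch, and the teacher-forcing decision is assumed to be an independent draw at every decoding step. The source does not specify either sampling scheme beyond "randomly" and "ratio of 0.9".

```python
import random


def sample_height(h_max: int, rng: random.Random, h_min: int = 3) -> int:
    """Draw the syntactic granularity H for one epoch (uniform, assumed)."""
    if h_max < h_min:
        raise ValueError("h_max must be at least h_min")
    return rng.randint(h_min, h_max)  # both ends inclusive


def use_teacher_forcing(rng: random.Random, ratio: float = 0.9) -> bool:
    """Decide, for one decoding step, whether to feed the gold token."""
    return rng.random() < ratio


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in range(3):
        height = sample_height(h_max=7, rng=rng)
        forced = sum(use_teacher_forcing(rng) for _ in range(60))
        print(f"epoch {epoch}: H={height}, teacher-forced steps={forced}/60")
```

The value `h_max=7` and the 60 decoding steps in the demo are placeholders (60 matches the maximum sequence length quoted above).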

5 Results

5.1 Semantic Preservation and

Syntactic Transfer

1. Automated Metrics: As can be observed in
Table 2, our methods (SGCP-F/R (Section 4.1))
outperform the existing baselines on
both datasets. Source-as-Output is independent
of the exemplar sentence being used and, since a
sentence is a paraphrase of itself, its paraphrastic
scores are generally high while its syntactic
scores are below par. The opposite is true
for Exemplar-as-Output. These baselines also
serve as dataset quality indicators. It can be
seen that the source is semantically similar to, while
being syntactically different from, the target sentence,
whereas the opposite is true when the exemplar is
compared to target sentences. Additionally, source
sentences are syntactically and semantically
different from exemplar sentences, as can be
observed from the TED-E and PDS scores. This
helps in showing that the dataset has rich enough
syntactic diversity to learn from.

Through TED-E scores it can be seen that
SGCP-F is able to adhere to the syntax of the
exemplar template to a much larger degree than
the baseline models. This verifies that our model
is able to generate meaning preserving sentences
while conforming to the syntax of the exemplars
when measured using standard metrics.

It can also be seen that SGCP-R tends to perform
better than SGCP-F in terms of paraphrastic scores
while taking a hit on the syntactic scores. This
makes sense, intuitively, because in some cases
SGCP-R tends to select lower H values for syntactic
granularity. This can also be observed from the
example given in Table 6 where H = 6 is more
favorable than H = 7, because of better meaning
retention.

Although CGEN performs close to our model in
terms of BLEU, ROUGE, and METEOR scores
on the ParaNMT-small dataset, its PDS is still much
lower than that of our model, suggesting that our
model is better at capturing the original meaning
of the source sentence. In order to show that the
results are not coincidental, we test the statistical
significance of our model’s results. We follow the
non-parametric Pitman’s permutation test (Dror et al.,
2018) and observe that the results of our model are
statistically significant when the significance level (α) is
taken to be 0.05. Note that this holds true for
all metrics on both datasets except ROUGE-2
on ParaNMT-small.
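For readers unfamiliar with permutation testing, the sketch below shows a paired, two-sided sign-flip permutation test on per-sentence scores of two systems. It is an approximation by random resampling and is offered as an assumption about how such a test can be run, not as the exact procedure used in the paper; the function name and the toy scores are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10000,
    seed: int = 0,
) -> float:
    """Approximate two-sided p-value for the mean paired difference."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two aligned, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the system labels are exchangeable,
        # so each paired difference may have its sign flipped at random.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    system_a = [1.0] * 20   # toy per-sentence scores
    system_b = [0.5] * 20
    p_value = paired_permutation_test(system_a, system_b)
    print("significant at 0.05:", p_value < 0.05)
```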

| Dataset | SCPN | CGEN | SGCP-F | SGCP-R |
|---|---|---|---|---|
| QQP-Pos | 1.63 | 2.47 | 2.70 | 2.99 |
| ParaNMT-small | 1.24 | 1.89 | 2.07 | 2.26 |

Table 4: A comparison of human evaluation
scores for comparing quality of paraphrases
generated using all models. Higher score is better.
Please refer to Section 5.1 for details.

2. Human Evaluation: Table 4 shows the
results of human assessment. It can be seen that
annotators generally tend to rate SGCP-F and SGCP-R
(Section 4.1) higher than the baseline models,
thereby highlighting the efficacy of our models.
This evaluation additionally shows that automated
metrics are somewhat consistent with the human
evaluation scores.

5.2 Syntactic Control

1. Syntactical Granularity: Our model can
work with different levels of granularity for the
exemplar syntax, namely, different tree heights of
the exemplar tree can be used for decoding the
output.

As can be seen in Table 6, at height 4 the
syntax tree provided to the model is not enough
to generate the full sentence that captures the
meaning of the original sentence. As we increase
the height to 5, it is able to capture the semantics
better by predicting ‘‘some of’’ in the sentence.
We see that at heights 6 and 7 SGCP is able
to capture both semantics and syntax of the
source and exemplar, respectively. However, as
we provide the complete height of the tree (i.e.,
7), it further tries to follow the syntactic input
more closely, leading to a sacrifice in overall
relevance, since the original sentence is about pure
substances and not a pure substance. It can be
inferred from this example that because a source
sentence and exemplar’s syntax might not be fully
compatible with each other, using the complete
syntax tree can potentially lead to loss of relevance
and grammaticality. Hence by choosing different
levels of syntactic granularity, one can address the
issue of compatibility to a certain extent.
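One way to picture "providing the tree up to height H" is to prune the exemplar's parse tree below a given level, as in the sketch below. The tree encoding, the helper names, and the convention that the root sits at level 1 are assumptions for illustration; how SGCP consumes the pruned tree internally is not shown here.

```python
def node(label, *children):
    """Build a (label, children) tree; leaves have an empty child tuple."""
    return (label, tuple(children))


def tree_height(tree):
    """Number of levels in the tree, counting the root as level 1."""
    _, children = tree
    return 1 + max((tree_height(child) for child in children), default=0)


def truncate(tree, height):
    """Keep only the top `height` levels of the tree."""
    if height < 1:
        raise ValueError("height must be at least 1")
    label, children = tree
    if height == 1:
        return (label, ())
    return (label, tuple(truncate(child, height - 1) for child in children))


if __name__ == "__main__":
    # Hand-written toy skeleton (hypothetical, not parser output).
    exemplar = node("S",
                    node("NP", node("PRP")),
                    node("VP", node("VBP"), node("NP", node("NNS"))))
    for h in range(1, tree_height(exemplar) + 1):
        print(h, truncate(exemplar, h))
```

Lower values of H leave the decoder more freedom, which is consistent with the trade-off between relevance and syntactic conformity discussed above.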

2. Syntactic Variety: Table 5 shows sample
generations of our model on multiple exemplars
for a given source sentence. It can be observed
that SGCP can generate high-quality outputs for a
variety of different template exemplars, even
those that differ a lot from the original sentence
in terms of their syntax. A particularly interesting
exemplar is ‘‘what is chromosomal mutation ?
what are some examples ?’’. Here, SGCP is able to
generate a sentence with two question marks while
preserving the essence of the source sentence. It
should also be noted that the exemplars used in
Table 5 were selected manually from the test sets,
considering only their qualitative compatibility
with the source sentence. Unlike the procedure
used for the creation of the QQP-Pos dataset, the final
paraphrases were not kept in hand while selecting
the exemplars. In real-world settings, where a
gold paraphrase won’t be present, these results
are indicative of the qualitative efficacy of our
method.

5.3 SGCP-R Analysis

ROUGE-based selection from the candidates
favors paraphrases that have higher n-gram
overlap with their respective source sentences,
and hence may capture the source’s meaning better.
This hypothesis can be directly observed from
the results in Tables 2 and 4, where we see
higher values on automated semantic and human
evaluation scores. Although this helps in obtaining
better semantic generation, it tends to result
in higher TED values. One possible reason is
that, when provided with the complete tree, fine-grained
information is available to the model for
decoding and it forces the generations to adhere
to the syntactic structure. In contrast, at lower
heights, the model is provided with less syntactic
information but equivalent semantic information.
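The candidate selection discussed above can be illustrated with a simplified overlap score. The sketch below uses a whitespace-tokenised unigram F1 as a stand-in for ROUGE; which ROUGE variant SGCP-R actually uses is not stated in this section, so both the scoring function and the names `rouge1_f1` and `select_by_rouge` are assumptions. The candidates are the H = 4 to H = 7 generations from Table 6.

```python
from collections import Counter
from typing import Sequence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenised sentences."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_by_rouge(source: str, candidates: Sequence[str]) -> str:
    """Pick the candidate with the highest overlap with the source."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: rouge1_f1(c, source))


if __name__ == "__main__":
    source = "what are pure substances ? what are some examples ?"
    candidates = [
        "what are pure substances ?",                    # H = 4
        "what are some of pure substances ?",            # H = 5
        "what are some examples of pure substances ?",   # H = 6
        "what are some examples of a pure substance ?",  # H = 7
    ]
    print(select_by_rouge(source, candidates))
```

Under this simplified score the H = 6 generation is selected, in line with the observation that a lower height can retain meaning better than the full tree.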

5.4 Qualitative Analysis

As can be seen from Table 7, SGCP not only
incorporates the best aspects of both the prior
models, namely SCPN and CGEN, but also utilizes
the complete syntactic information obtained using
the constituency-based parse trees of the exemplar.
From the generations in Table 3, we can see that
our model is able to capture both the semantics of
the source text and the syntax of the template.
SCPN, evidently, can produce outputs with the
template syntax, but it does so at the cost of the
semantics of the source sentence. This can also be
verified from the results in Table 2, where SCPN
performs poorly on PDS as compared with other
models. In contrast, CGEN and SGCP retain much
better semantic information, as is desirable. While
generating sentences, CGEN often abruptly ends the
sentence, as in example 1 in Table 3, where the
output stops with ‘‘of’’ as the penultimate token. The
problem of abrupt ending due to insufficient syntactic
input length was highlighted in Chen et al. (2019a)
and we observe similar trends. SGCP, on the other
hand, generates more relevant and grammatical
sentences.

Based on empirical evidence, SGCP alleviates
this shortcoming, possibly due to dynamic
syntactic control and decoding. This can be
seen in, for example, example 3 in Table 3,
where CGEN truncates the sentence abruptly
(penultimate token = directors) but SGCP is able to
generate a relevant sentence without compromising
on grammaticality.

SOURCE: how do i develop my career in software ?

| Syntactic Exemplar | SGCP Generations |
|---|---|
| how can i get a domain for free ? | how can i develop a career in software ? |
| what is the best way to register a company ? | what is the best way to develop career in software ? |
| what are good places to visit in new york ? | what are good ways to develop my career in software ? |
| can i make 800,000 a month betting on horses ? | can i develop my career in software ? |
| what is chromosomal mutation ? what are some examples ? | what is good career ? what are some of the ways to develop my career in software ? |
| is delivery free on quikr ? | is career useful in software ? |
| is it possible to mute a question on quora ? | is it possible to develop my career in software ? |

Table 5: Sample SGCP-R generations with a single source sentence and multiple syntactic exemplars.
Please refer to Section 5.4 for details.

| | Sentence |
|---|---|
| S | what are pure substances ? what are some examples ? |
| E | what are the characteristics of the elizabethan theater ? |
| H = 4 | what are pure substances ? |
| H = 5 | what are some of pure substances ? |
| H = 6 | what are some examples of pure substances ? |
| H = 7 | what are some examples of a pure substance ? |

Table 6: Sample generations with different levels
of syntactic control. S and E stand for source and
exemplar, respectively. Please refer to Section 5.2
for details.

| Model | Single-Pass | Syntactic Signal | Granularity |
|---|---|---|---|
| SCPN | [mark missing] | Linearized Tree | [mark missing] |
| CGEN | [mark missing] | POS Tags (During training) | [mark missing] |
| SGCP | [mark missing] | Constituency Parse Tree | [mark missing] |

Table 7: Comparison of different syntactically
controlled paraphrasing methods. Please refer to
Section 5.4 for details.

5.5 Limitations and Future Directions

Not every natural language English sentence can
necessarily be converted to any desirable syntax.
We note that SGCP does not take into account the
compatibility of source sentence and template
exemplars and can freely generate syntax
conforming paraphrases. This, at times, leads to
imperfect paraphrase conversion and nonsensical
sentences like example 6 in Table 5 (is career
useful in software ?). Identifying compatible
exemplars is an important but separate task in
itself, which we defer to future work.

Another important aspect is that the task
of paraphrase generation is inherently domain
agnostic. It is easy for humans to adapt to new
domains for paraphrasing. However, because of
the nature of the formulation of the problem in
NLP, all the baselines, as well as our model(s),
suffer from dataset bias and are not directly
applicable to new domains. A prospective future
direction can be to explore it from the lens of
domain independence.

Analyzing the utility of controlled paraphrase
generations for the task of data augmentation is
another interesting possible direction.

6 Conclusion

In this paper we proposed SGCP, an end-to-end
framework for the task of syntactically
controlled paraphrase generation. SGCP generates
a paraphrase of an input sentence while conforming
to the syntax of an exemplar sentence provided
along with the input. SGCP comprises a GRU-based
sentence encoder, a modified RNN-based
tree encoder, and a pointer-generator–based
novel decoder. In contrast to previous work
that focuses on a limited amount of syntactic
control, our model can generate paraphrases at
different levels of granularity of syntactic control
without compromising on relevance. Through
extensive evaluations on real-world datasets, we
demonstrate SGCP’s efficacy over state-of-the-art
baselines.

We believe that the above approach can be
useful for a variety of text generation tasks
including syntactic exemplar-based abstractive
summarization, text simplification and data-to-
text generation.

Acknowledgments

This research is supported in part by the Ministry
of Human Resource Development (Government
of India). We thank the action editor Asli
Celikyilmaz and the three anonymous reviewers
for their helpful suggestions in preparing the
manuscript. We also thank Chandrahas for his
indispensable comments on earlier drafts of this
paper.

References

Roee Aharoni and Yoav Goldberg. 2017. Towards
string-to-tree neural machine translation. In
Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics
(Volume 2: Short Papers), pages 132–140. Van-
couver, Canada. Association for Computational
Linguistics.

Richard Chase Anderson and Alice Davison. 1986.
Conceptual and empirical bases of readability
formulas. Center for the Study of Reading
Technical Report; no. 392.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2014. Neural machine translation by
jointly learning to align and translate. arXiv
preprint arXiv:1409.0473.

Satanjeev Banerjee and Alon Lavie. 2005.
METEOR: An automatic metric for MT evalua-
tion with improved correlation with human
judgments. In Proceedings of the ACL Work-
shop on Intrinsic and Extrinsic Evaluation
Measures for Machine Translation and/or
Summarization, pages 65–72.

Regina Barzilay and Lillian Lee. 2003. Learning
to paraphrase: An unsupervised approach using
multiple-sequence alignment. In Proceedings
of the 2003 Conference of the North American
Chapter of the Association for Computational
Linguistics on Human Language Technology-
Volume 1, pages 16–23. Association for Com-
putational Linguistics.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals,
Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. 2016. Generating sentences from a
continuous space. In Proceedings of The 20th
SIGNLL Conference on Computational Natural
Language Learning, pages 10–21, Berlin,
Germany. Association
for Computational
Linguistics.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu
Wei. 2018. Retrieve,
rerank and rewrite:
Soft template based neural summarization. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 152–161.

Mingda Chen, Qingming Tang, Sam Wiseman,
and Kevin Gimpel. 2019a. Controllable para-
phrase generation with a syntactic exemplar.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 5972–5984, Florence, Italy. Association
for Computational Linguistics.

Mingda Chen, Qingming Tang, Sam Wiseman,
and Kevin Gimpel. 2019b. A multi-task
approach for disentangling syntax and semantics
in sentence representations. In Proceedings
of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 2453–2464, Minneapolis, Minnesota.
Association for Computational Linguistics.

Junyoung Chung, Caglar Gulcehre, Kyunghyun
Cho, and Yoshua Bengio. 2014. Empirical
evaluation of gated recurrent neural networks
on sequence modeling. In NIPS 2014 Workshop
on Deep Learning, December 2014.

Herbert H. Clark and Eve V. Clark. 1968.
Semantic distinctions and memory for complex
sentences. Quarterly Journal of Experimental
Psychology, 20(2):129–138.

Rotem Dror, Gili Baumer, Segev Shlomov, and
Roi Reichart. 2018. The hitchhiker’s guide
to testing statistical significance in natural
language processing. In Proceedings of the 56th
Annual Meeting of the Association for Comput-
ational Linguistics (Volume 1: Long Papers),
pages 1383–1392, Melbourne, Australia. Asso-
ciation for Computational Linguistics.

Ankush Gupta, Arvind Agarwal, Prawaan Singh,
and Piyush Rai. 2018. A deep generative
framework for paraphrase generation.
In
Thirty-Second AAAI Conference on Artificial
Intelligence.

Samer Hassan, Andras Csomai, Carmen Banea,
Ravi Sinha, and Rada Mihalcea. 2007. Unt:
Subfinder: Combining knowledge sources for
automatic lexical substitution. In Proceedings
of the 4th International Workshop on Semantic
Evaluations, pages 410–413. Association for
Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2016.
Gaussian error linear units (gelus). arXiv
preprint arXiv:1606.08415.

Sepp Hochreiter and J¨urgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang,
Ruslan Salakhutdinov, and Eric P Xing. 2017.
Toward controlled generation of text. In
Proceedings of the 34th International Conference
on Machine Learning-Volume 70,
pages 1587–1596. JMLR.org.

Judith W. Irwin. 1980. The effects of explicitness
and clause order on the comprehension of
reversible causal relationships. Reading Re-
search Quarterly, pages 477–488.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou,
and Alexei A. Efros. 2017. Image-to-image
translation with conditional adversarial networks.
In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition,
pages 1125–1134.

Mohit Iyyer, John Wieting, Kevin Gimpel, and
Luke Zettlemoyer. 2018. Adversarial example
generation with syntactically controlled paraphrase
networks. In Proceedings of the 2018
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 1875–1885, New
Orleans, Louisiana. Association for Computational
Linguistics.

Evelyn Walker Katz and Sandor B. Brent. 1968.
Understanding connectives. Journal of Memory
and Language, 7(2):501.

Diederik P. Kingma and Jimmy Ba. 2014. Adam:
A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.

Diederik P. Kingma and Max Welling. 2014.
Auto-encoding variational bayes. In Proceed-
ings of ICLR.

Ashutosh Kumar, Satwik Bhattamishra, Manik
Bhandari, and Partha Talukdar. 2019. Submod-
ular optimization-based diverse paraphrasing
and its effectiveness in data augmentation.
In Proceedings of
the 2019 Conference of
the North American Chapter of the Associ-
ation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 3609–3619, Minneapo-
lis, Minnesota. Association for Computational
Linguistics.

Elena T. Levy. 2003. The roots of coherence
in discourse. Human Development, 46(4):
169–188.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng
Gao, and Bill Dolan. 2016. A diversity-promoting
objective function for neural
conversation models. In Proceedings of the
2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 110–119.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang
Li. 2018. Paraphrase generation with deep
reinforcement learning. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 3865–3878,
Brussels, Belgium. Association for Computa-
tional Linguistics.

Zichao Li, Xin Jiang, Lifeng Shang, and Qun
Liu. 2019. Decomposable neural paraphrase
generation. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 3403–3414, Florence, Italy.
Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Sum-
marization Branches Out. https://www.
aclweb.org/anthology/W04-1013.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly optimized
bert pretraining approach. arXiv preprint arXiv:
1907.11692.

Nitin Madnani and Bonnie J. Dorr. 2010. Gen-
erating phrasal and sentential paraphrases: A
survey of data-driven methods. Computational
Linguistics, 36(3):341–387.

Christopher D. Manning, Mihai Surdeanu, John
Bauer, Jenny Finkel, Steven J. Bethard, and
David McClosky. 2014. The Stanford CoreNLP
natural language processing toolkit. In Asso-
ciation for Computational Linguistics (ACL)
System Demonstrations, pages 55–60.

Kathleen R. McKeown. 1983. Paraphrasing
questions using given and new information.
Computational Linguistics, 9(1):1–10.

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method
for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania,
USA. Association for Computational Linguistics.

Hao Peng, Ankur Parikh, Manaal Faruqui,
Bhuwan Dhingra, and Dipanjan Das. 2019.
Text generation with exemplar-based adaptive
decoding. In Proceedings of the 2019 Confer-
ence of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 2555–2565,
Minneapolis, Minnesota. Association
for
Computational Linguistics.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee,
Vivek Datla, Ashequl Qadir, Joey Liu,
and Oladimeji Farri. 2016. Neural paraphrase
generation with stacked residual LSTM networks.
In Proceedings of COLING 2016, the
26th International Conference on Computational
Linguistics: Technical Papers,
pages 2923–2934, Osaka, Japan. The COLING
2016 Organizing Committee.

Chris Quirk, Chris Brockett, and William Dolan.
2004. Monolingual machine translation for
paraphrase generation. In Proceedings of the
2004 Conference on Empirical Methods in
Natural Language Processing, pages 142–149.

Ehud Reiter. 2018. A structured review of the
validity of BLEU. Computational Linguistics,
44(3):393–401.

Abigail See, Peter J. Liu, and Christopher D.
Manning. 2017, Jul. Get to the point: Sum-
marization with pointer-generator networks. In
Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1073–1083,
Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1715–1725, Berlin, Germany.

Tianxiao Shen, Tao Lei, Regina Barzilay, and
Tommi Jaakkola. 2017. Style transfer from non-
parallel text by cross-alignment. In Advances
in Neural Information Processing Systems,
pages 6830–6841.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016.
Does string-based neural MT learn source
syntax? In Proceedings of the 2016 Conference
on Empirical Methods in Natural Language
Processing, pages 1526–1534, Austin, Texas.
Association for Computational Linguistics.

Advaith Siddharthan. 2014. A survey of research
on text simplification. ITL-International Jour-
nal of Applied Linguistics, 165(2):259–298.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and
Bill Byrne. 2016. Syntactically guided neural
machine translation. In Proceedings of the
54th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 299–305.

Kai Sheng Tai, Richard Socher, and Christopher
D. Manning. 2015. Improved semantic representations
from tree-structured long short-term
memory networks. In Proceedings of the 53rd
Annual Meeting of the Association for Computational
Linguistics and the 7th International
Joint Conference on Natural Language
Processing (Volume 1: Long Papers),
pages 1556–1566.

Oriol Vinyals, Meire Fortunato, and Navdeep
Jaitly. 2015. Pointer networks. In Advances
in Neural Information Processing Systems,
pages 2692–2700.

Oriol Vinyals and Quoc Le. 2015. A neural
conversational model. arXiv preprint arXiv:
1506.05869.

John Wieting and Kevin Gimpel. 2018. Paranmt-50m:
Pushing the limits of paraphrastic sentence
embeddings with millions of machine
translations. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 451–462.

Sander Wubben, Antal Van Den Bosch, and
Emiel Krahmer. 2010. Paraphrase generation as
monolingual translation: Data and evaluation.
In Proceedings of the 6th International
Natural Language Generation Conference,
pages 203–207. Association for Computational
Linguistics.

Xuewen Yang, Yingru Liu, Dongliang Xie, Xin
Wang, and Niranjan Balasubramanian. 2019.
Latent part-of-speech sequences for neural
machine translation.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P.
Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised
text style transfer using language
models as discriminators. In Advances in
Neural Information Processing Systems,
pages 7287–7298.

Kaizhong Zhang and Dennis Shasha. 1989. Simple
fast algorithms for the editing distance between
trees and related problems. SIAM Journal on
Computing, 18(6):1245–1262.

Yuan Zhang, Jason Baldridge, and Luheng He.
2019. PAWS: Paraphrase adversaries from
word scrambling. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 1298–1308.

Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu,
and Sheng Li. 2008. Combining multiple
resources to improve SMT-based paraphrasing
model. Proceedings of ACL-08: HLT,
pages 1021–1029.