Syntax-Guided Controlled Generation of Paraphrases
Ashutosh Kumar1 Kabir Ahuja2∗ Raghuram Vadapalli3∗ Partha Talukdar1

1Indian Institute of Science, Bangalore
2Microsoft Research, Bangalore
3Google, London
ashutosh@iisc.ac.in, kabirahuja2431@gmail.com
raghuram.4350@gmail.com, ppt@iisc.ac.in

Abstract

Given a sentence (e.g., ‘‘I like mangoes’’) and a constraint (e.g., sentiment flip), the goal of controlled text generation is to produce a sentence that adapts the input sentence to meet the requirements of the constraint (e.g., ‘‘I hate mangoes’’). Going beyond such simple constraints, recent work has started exploring the incorporation of complex syntactic guidance as constraints in the task of controlled paraphrase generation. In these methods, syntactic guidance is sourced from a separate exemplar sentence. However, these prior works have only utilized limited syntactic information available in the parse tree of the exemplar sentence. We address this limitation in the paper and propose Syntax Guided Controlled Paraphraser (SGCP), an end-to-end framework for syntactic paraphrase generation. We find that SGCP can generate syntax-conforming sentences while not compromising on relevance. We perform extensive automated and human evaluations over multiple real-world English language datasets to demonstrate the efficacy of SGCP over state-of-the-art baselines. To drive future research, we have made SGCP's source code available.1

1 Introduction

Controlled text generation is the task of producing a sequence of coherent words based on given constraints. These constraints can range from simple attributes like tense, sentiment polarity, and word-reordering (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018) to more complex syntactic information. For example, given a sentence ‘‘The movie is awful!’’ and a simple constraint like flip sentiment to positive, a controlled text generator is expected to produce the sentence ‘‘The movie is fantastic!’’.

∗ This research was conducted during the authors' internship at Indian Institute of Science.
1https://github.com/malllabiisc/SGCP.

These constraints are important in not only providing information about what to say but also how to say it. Without any constraint, the ubiquitous sequence-to-sequence neural models often tend to produce degenerate outputs and favor generic utterances (Vinyals and Le, 2015; Li et al., 2016). Although simple attributes are helpful in addressing what to say, they provide very little information about how to say it. Syntactic control over generation helps in filling this gap by providing that missing information.

Incorporating complex syntactic information has shown promising results in neural machine translation (Stahlberg et al., 2016; Aharoni and Goldberg, 2017; Yang et al., 2019), data-to-text generation (Peng et al., 2019), abstractive text summarization (Cao et al., 2018), and adversarial text generation (Iyyer et al., 2018). Moreover, recent work (Iyyer et al., 2018; Kumar et al., 2019) has shown that augmenting lexical and syntactical variations in the training set can help in building better performing and more robust models.

In this paper, we focus on the task of syntactically controlled paraphrase generation, that is, given an input sentence and a syntactic exemplar, produce a sentence that conforms to the syntax of the exemplar while retaining the meaning of the original input sentence. Although syntactically controlled generation of paraphrases finds applications in multiple domains like data augmentation and text passivization, we highlight its importance in the particular task of text simplification. As pointed out in Siddharthan (2014), depending on the literacy skill of an individual, certain syntactical forms of English
Transactions of the Association for Computational Linguistics, vol. 8, pp. 330–345, 2020. https://doi.org/10.1162/tacl_a_00318
Action Editor: Asli Celikyilmaz. Submission batch: 12/2019; Revision batch: 2/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

sentences are easier to comprehend than others. As an example, consider the following two sentences:

S1 Because it is raining today, you should carry an umbrella.

S2 You should carry an umbrella today, because it is raining.

Connectives that permit pre-posed adverbial clauses have been found to be difficult for third to fifth grade readers, even when the order of mention coincides with the causal (and temporal) order (Anderson and Davison, 1986; Levy, 2003). Hence, they prefer sentence S2. However, various other studies (Clark and Clark, 1968; Katz and Brent, 1968; Irwin, 1980) have suggested that for older school children, college students, and adults, comprehension is better for the cause-effect presentation, hence sentence S1. Thus, modifying a sentence syntactically would help in better comprehension based on literacy skills.

Prior work in syntactically controlled paraphrase generation addressed this task by conditioning the semantic input on either the features learned from a linearized constituency-based parse tree (Iyyer et al., 2018), or the latent syntactic information (Chen et al., 2019a) learned from exemplars through variational auto-encoders. Linearizing parse trees typically results in loss of essential dependency information. On the other hand, as noted in Shi et al. (2016), an autoencoder-based approach might not offer rich enough syntactic information as guaranteed by actual constituency parse trees. Moreover, as noted in Chen et al. (2019a), SCPN (Iyyer et al., 2018) and CGEN (Chen et al., 2019a) tend to generate sentences of the same length as the exemplar. This is an undesirable characteristic because it often results in producing sentences that end abruptly, thereby compromising on grammaticality and semantics. Please see Table 1 for sample generations using each of the models.

To address these gaps, we propose Syntax Guided Controlled Paraphraser (SGCP), which uses full exemplar syntactic tree information. Additionally, our model provides an easy mechanism to incorporate different levels of syntactic control (granularity) based on the height of the tree being considered. The decoder in our framework is augmented with rich enough syntactic information to be able to produce syntax-conforming sentences while not losing out on semantics and grammaticality.

(1)
Source: how do i predict the stock market ?
Exemplar: can a brain transplant be done ?
SCPN: how can the stock and start ?
CGEN: can the stock market actually happen ?
SGCP (Ours): can i predict the stock market ?

(2)
Source: what are some of the mobile apps you ca n't live without and why ?
Exemplar: which is the best resume you have come across ?
SCPN: what are the best ways to lose weight ?
CGEN: which is the best mobile app you ca n't ?
SGCP (Ours): which is the best app you ca n't live without and why ?

Table 1: Sample syntactic paraphrases generated by SCPN (Iyyer et al., 2018), CGEN (Chen et al., 2019a), and SGCP (Ours). We observe that SGCP is able to generate syntax-conforming paraphrases without compromising much on relevance.

The main contributions of this work are as follows:

1. We propose SGCP, an end-to-end model to
generate syntactically controlled paraphrases
at different
levels of granularity using a
parsed exemplar.

2. We provide a new decoding mechanism to
incorporate syntactic information from the
exemplar sentence’s syntactic parse.

3. We provide a dataset formed from Quora
Question Pairs2 for evaluating the models.
We also perform extensive experiments to
demonstrate the efficacy of our model using
multiple automated metrics as well as human
evaluations.

2 Related Work

Controllable Text Generation. This is an important problem in NLP that has received significant attention in recent times. Prior work includes generating text using models conditioned on attributes like formality, sentiment, or tense (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018) as well as on syntactical templates (Iyyer et al., 2018; Chen et al., 2019a). These systems find applications in adversarial sample generation (Iyyer et al., 2018),

2https://www.kaggle.com/c/quora-

question-pairs.



text summarization, and table-to-text generation (Peng et al., 2019). While achieving state-of-the-art results in their respective domains, these systems typically rely on a known finite set of attributes, thereby making them quite restrictive in terms of the styles they can offer.

Paraphrase Generation. While generation of paraphrases has been addressed in the past using traditional methods (McKeown, 1983; Barzilay and Lee, 2003; Quirk et al., 2004; Hassan et al., 2007; Zhao et al., 2008; Madnani and Dorr, 2010; Wubben et al., 2010), they have recently been superseded by deep learning-based approaches (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2019, 2018; Kumar et al., 2019). The primary task of all these methods (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2018) is to generate the most semantically similar sentence, and they typically rely on beam search to obtain any kind of lexical diversity. Kumar et al. (2019) try to tackle the problem of achieving lexical, and limited syntactical, diversity using submodular optimization but do not provide any syntactic control over the type of utterance that might be desired. These methods are therefore restrictive in terms of the syntactical diversity that they can offer.

Controlled Paraphrase Generation. Our task is similar in spirit to Iyyer et al. (2018) and Chen et al. (2019a), which also deal with the task of syntactic paraphrase generation. However, the approach taken by them is different from ours in at least two aspects. Firstly, SCPN (Iyyer et al., 2018) uses an attention-based (Bahdanau et al., 2014) pointer-generator network (See et al., 2017) to encode input sentences and a linearized constituency tree to produce paraphrases. Because of the linearization of the syntactic tree, considerable dependency-based information is generally lost. Our model, instead, directly encodes the tree structure to produce a paraphrase. Secondly, the inference (or generation) process in SCPN is computationally very expensive, because it involves a two-stage generation process. In the first stage, they generate full parse trees from incomplete templates, and then go from full parse trees to final generations. In contrast, the inference in our method involves a single-stage process, wherein our model takes as input a semantic source, a syntactic tree, and the level of syntactic style that needs to be transferred, to obtain the generations. Furthermore, we also observed that the model does not perform well in low resource settings. This, again, can be attributed to the compounding implicit noise in the training due to linearized trees and generation of full linearized trees before obtaining the final paraphrases.

Chen et al. (2019a) propose a syntactic exemplar-based method for controlled paraphrase generation using an approach based on latent variable probabilistic modeling, neural variational inference, and multi-task learning. This, in principle, is very similar to Chen et al. (2019b). As opposed to our model, which provides different levels of syntactic control over the exemplar-based generation, this approach is restrictive in terms of the flexibility it can offer. Also, as noted in Shi et al. (2016), an autoencoder-based approach might not offer rich enough syntactic information as offered by actual constituency parse trees. Furthermore, VAEs (Kingma and Welling, 2014) are generally unstable and harder to train (Bowman et al., 2016; Gupta et al., 2018) than seq2seq-based approaches.

3 SGCP: Proposed Method

In this section, we describe the inputs and various architectural components essential for building SGCP, an end-to-end trainable model. Our model, shown in Figure 1, comprises a sentence encoder (3.2), a syntactic tree encoder (3.3), and a syntactic-paraphrase decoder (3.4).

3.1 Inputs

Given an input sentence X and a syntactic exemplar Y, our goal is to generate a sentence Z that conforms to the syntax of Y while retaining the meaning of X.

The semantic encoder (Section 3.2) operates on the sequence of input tokens, and the syntactic encoder (Section 3.3) operates on constituency-based parse trees. We parse the syntactic exemplar Y3 to obtain its constituency-based parse tree. The leaf nodes of the constituency-based parse tree consist of the tokens of the sentence Y. These tokens, in some sense, carry the semantic information of sentence Y, which we do not need for generating paraphrases. In order to prevent any meaning

3Obtained using the Stanford CoreNLP toolkit (Manning et al., 2014).


Figure 1: Architecture of SGCP (proposed method). SGCP aims to paraphrase an input sentence while conforming to the syntax of an exemplar sentence (provided along with the input). The input sentence is encoded using the Sentence Encoder (Section 3.2) to obtain a semantic signal c_t. The Syntactic Encoder (Section 3.3) takes a constituency parse tree (pruned at height H) of the exemplar sentence as an input, and produces representations for all the nodes in the pruned tree. Once both of these are encoded, the Syntactic Paraphrase Decoder (Section 3.4) uses a pointer-generator network, and at each time step takes the semantic signal c_t, the decoder recurrent state s_t, the embedding of the previous token, and the syntactic signal h^Y_t to generate a new token. Note that the syntactic signal remains the same for each token in a span (shown in the figure above the curly braces; please see Figure 2 for more details). The gray shaded region (not part of the model) illustrates a qualitative comparison of the exemplar syntax tree and the syntax tree obtained from the generated paraphrase. Please refer to Section 3 for details.

propagation from the exemplar sentence Y into the generation, we remove these leaf/terminal nodes from its constituency parse. The tree thus obtained is denoted as C_Y.

The syntactic encoder, additionally, takes as
input H, which governs the level of syntactic
control needed to be induced. The utility of H will
be described in Section 3.3.

3.2 Semantic Encoder

The semantic encoder, a multilayered Gated Recurrent Unit (GRU), receives a tokenized sentence X = {x_1, . . . , x_{T_X}} as input and computes the contextualized hidden state representation h^X_t for each token using:

h^X_t = GRU(h^X_{t-1}, e(x_t)),   (1)

where e(x_t) represents the learnable embedding of the token x_t and t ∈ {1, . . . , T_X}. Note that we use byte-pair encoding (Sennrich et al., 2016) for word/token segmentation.
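As a concrete illustration, the following is a minimal PyTorch sketch of the semantic encoder of Equation 1; the class name and dimensions (300-dimensional embeddings and a three-layered bidirectional GRU, the configuration reported in Section 4.4) are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Multi-layered GRU over byte-pair-encoded token ids (Equation 1)."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)      # e(x_t)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers,
                          bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, T_X) integer ids produced by a BPE tokenizer
        emb = self.embedding(token_ids)                          # (batch, T_X, emb_dim)
        hidden_states, _ = self.gru(emb)                         # h^X_t for every position t
        return hidden_states                                     # (batch, T_X, 2 * hidden_dim)
```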

3.3 Syntactic Encoder

This encoder provides the necessary syntactic guidance for the generation of paraphrases. Formally, let the constituency tree be C_Y = {V, E, Y}, where V is the set of nodes, E the set of edges, and Y the labels associated with each node.

We calculate the hidden-state representation h^Y_v of each node v ∈ V using the hidden-state representation of its parent node pa(v) and the embedding associated with its label y_v as follows:

h^Y_v = GeLU(W_{pa} h^Y_{pa(v)} + W_v e(y_v) + b_v),   (2)

where e(y_v) is the embedding of the node label y_v, and W_{pa}, W_v, b_v are learnable parameters. This approach can be considered similar to TreeLSTM (Tai et al., 2015). We use the GeLU activation function (Hendrycks and Gimpel, 2016) instead of the standard tanh or relu, because of superior empirical performance.

As indicated in Section 3.1, syntactic encoder
takes as input the height H, which governs the
level of syntactic control. We randomly prune the


Figure 2: The constituency parse tree serves as an input to the syntactic encoder (Section 3.3). The first step is to remove the leaf nodes, which contain the meaning-representative tokens (here: What is the best language . . . ). H denotes the height to which the tree can be pruned and is an input to the model. Figure 2(a) shows the full constituency parse tree annotated with the vector a for different heights. Figure 2(b) shows the same tree pruned at height H = 3 with its corresponding a vector. The vector a serves as a signalling vector (Section 3.4.2) which helps in deciding the syntactic signal to be passed on to the decoder. Please refer to Section 3 for details.

tree C_Y to height H ∈ {3, . . . , H_max}, where H_max is the height of the full constituency tree C_Y. As an example, in Figure 2b, we prune the constituency-based parse tree of the exemplar sentence to height H = 3. The leaf nodes of this tree have the labels WP, VBZ, NP, and '.'. Although we calculate the hidden-state representation of all the nodes, only the terminal nodes are responsible for providing the syntactic signal to the decoder (Section 3.4).

We maintain a queue L^Y_H of such terminal node representations, where elements are inserted from left to right for a given H. Specifically, for the particular example given in Figure 2b,

L^Y_H = [h^Y_{WP}, h^Y_{VBZ}, h^Y_{NP}, h^Y_{.}]

We emphasize the fact that the length of the queue |L^Y_H| is a function of the height H.
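The computation of Equation 2, together with the pruning and queue construction, can be sketched as follows. TreeNode, the label vocabulary, and the dimensions are assumptions made for illustration, and the recursion simply treats nodes at depth H in the pruned tree as its terminals; for the root node, parent_state can be a zero vector of size hidden_dim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeNode:
    """A constituency-tree node after the word-level leaves have been removed."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

class SyntacticEncoder(nn.Module):
    """Top-down node encoder of Equation 2: h_v = GeLU(W_pa h_pa(v) + W_v e(y_v) + b_v)."""
    def __init__(self, num_labels, label_dim=128, hidden_dim=256):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, label_dim)      # e(y_v)
        self.W_pa = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_v = nn.Linear(label_dim, hidden_dim)               # its bias plays the role of b_v

    def encode(self, node, parent_state, label2id, height):
        """Return the queue L^Y_H: left-to-right terminal-node states of the tree pruned at `height`."""
        label_id = torch.tensor(label2id[node.label])
        state = F.gelu(self.W_pa(parent_state) + self.W_v(self.label_emb(label_id)))
        if height <= 1 or not node.children:                      # terminal node of the pruned tree
            return [state]
        queue = []
        for child in node.children:                               # insertion order as in Figure 2b
            queue.extend(self.encode(child, state, label2id, height - 1))
        return queue
```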

3.4 Syntactic Paraphrase Decoder

Having obtained the semantic and syntactic representations, the decoder is tasked with the generation of syntactic paraphrases. This can be modeled as finding the best Z = Z^* that maximizes the probability P(Z|X, Y), which can further be factorized as:

Z^* = arg max_Z \prod_{t=1}^{T_Z} P(z_t | z_1, . . . , z_{t-1}, X, Y),   (3)


where TZ is the maximum length up to which
decoding is required.

In the subsequent sections, we use t to denote

the decoder time step.

3.4.1 Using Semantic Information

At each decoder time step t, the attention distribution α^t is calculated over the encoder hidden states h^X_i, obtained using Equation 1, as:

e^t_i = v^⊺ tanh(W_h h^X_i + W_s s_t + b_attn),
α^t = softmax(e^t),   (4)

where s_t is the decoder cell-state and v, W_h, W_s, b_attn are learnable parameters.

The attention distribution provides a way to jointly align and train sequence to sequence models by producing a weighted sum of the semantic encoder hidden states, known as the context vector c_t, given by:

c_t = \sum_i α^t_i h^X_i   (5)

c_t serves as the semantic signal which is essential for generating meaning preserving sentences.
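A small PyTorch sketch of the additive attention of Equations 4–5 follows; module and dimension names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Additive attention (Eq. 4) and context vector c_t (Eq. 5) over encoder states."""
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim)                  # its bias acts as b_attn
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, T_X, enc_dim); dec_state s_t: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_h(enc_states) +
                                   self.W_s(dec_state).unsqueeze(1))).squeeze(-1)  # e^t_i
        alpha = torch.softmax(scores, dim=-1)                                      # Eq. 4
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)                 # Eq. 5
        return c_t, alpha
```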

3.4.2 Using Syntactic Information

During training, each terminal node in the tree C_Y, pruned at H, is equipped with information about the span of words it needs to generate. At each time step t, only one terminal node representation h^Y_v, v ∈ L^Y_H, is responsible for providing the syntactic signal, which we call h^Y_t. The hidden-state representation to be used is governed through a signalling vector a = (a_1, . . . , a_{T_z}), where each a_i ∈ {0, 1}. A 0 indicates that the decoder should keep on using the same hidden representation h^Y_v ∈ L^Y_H that is currently being used, and a 1 indicates that the next element (hidden representation) in the queue L^Y_H should be used for decoding.

The utility of a can be best understood through Figure 2b. Consider the syntactic tree pruned at height H = 3. For this example,

L^Y_H = [h^Y_{WP}, h^Y_{VBZ}, h^Y_{NP}, h^Y_{.}]
a = (1, 1, 1, 0, 0, 0, 0, 0, 1)

a_i = 1 provides a signal to pop an element from the queue L^Y_H, while a_i = 0 provides a signal to keep on using the last popped element. This element is then used to guide the decoder syntactically by providing a signal in the form of a hidden-state representation (Equation 8).

Specifically, in this example, a_1 = 1 signals L^Y_H to pop h^Y_{WP} to provide syntactic guidance to the decoder for generating the first token. a_2 = 1 signals L^Y_H to pop h^Y_{VBZ} to provide syntactic guidance to the decoder for generating the second token. a_3 = 1 helps in obtaining h^Y_{NP} from L^Y_H to provide guidance to generate the third token. As described earlier, a_4, . . . , a_8 = 0 indicates that the same representation h^Y_{NP} should be used for syntactically guiding tokens z_4, . . . , z_8. Finally, a_9 = 1 helps in retrieving h^Y_{.} for guiding the decoder to generate token z_9. Note that |L^Y_H| = \sum_{i=1}^{T_z} a_i.

Although a is provided to the model during training, this information might not be available during inference. Providing a during generation makes the model restrictive and might result in producing ungrammatical sentences. SGCP is therefore tasked to learn a proxy for the signalling vector a, using a transition probability vector p.

At each time step t, we calculate p_t ∈ (0, 1), which determines the probability of changing the syntactic signal, using:

p_t = σ(W_bop([c_t; h^Y_t; s_t; e(z'_t)]) + b_bop),   (6)

The syntactic signal for the next time step is then obtained as:

h^Y_{t+1} = { h^Y_t         if p_t < 0.5
            { pop(L^Y_H)    otherwise          (7)

where pop removes and returns the next element in the queue, s_t is the decoder state, and e(z'_t) is the embedding of the input token at time t during decoding.

3.4.3 Overall

The semantic signal c_t, together with the decoder state s_t, the embedding of the input token e(z'_t), and the syntactic signal h^Y_t, is fed through a GRU followed by a softmax over the output to produce a vocabulary distribution as:

P_vocab = softmax(W([c_t; h^Y_t; s_t; e(z'_t)]) + b),   (8)

where [;] represents concatenation of the constituent elements, and W, b are trainable parameters.

We augment this with the copying mechanism (Vinyals et al., 2015) as in the pointer-generator network (See et al., 2017). Usage of such a mechanism offers a probability distribution over the extended vocabulary (the union of vocabulary words and words present in the source sentence) as follows:

P(z) = p_gen P_vocab(z) + (1 − p_gen) \sum_{i: z_i = z} α^t_i
p_gen = σ(w_c^⊺ c_t + w_s^⊺ s_t + w_x^⊺ e(z'_t) + b_gen)   (9)

where w_c, w_s, w_x and b_gen are learnable parameters, e(z'_t) is the input token embedding to the decoder at time step t, and α^t_i is the element corresponding to the i-th coordinate of the attention distribution as defined in Equation 4.

The overall objective can be obtained by taking the negative log-likelihood of the distributions obtained in Equation 6 and Equation 9:

L = − (1/T) \sum_{t=0}^{T} [ log P(z^*_t) + a_t log(p_t) + (1 − a_t) log(1 − p_t) ]   (10)

where a_t is the t-th element of the vector a.
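To tie Equations 6–10 together, the following is a simplified, batch-size-one PyTorch sketch of a single decoding step; all names and dimensions are illustrative assumptions rather than the authors' code, and during training the gold signalling bit a_t would replace the p_t-based switch shown here.

```python
import torch
import torch.nn as nn

class SyntacticDecoderStep(nn.Module):
    """One decoding step: switch the syntactic signal (Eqs. 6-7), then mix the
    generation and copy distributions (Eqs. 8-9)."""
    def __init__(self, emb_dim, enc_dim, syn_dim, dec_dim, vocab_size):
        super().__init__()
        feat_dim = enc_dim + syn_dim + dec_dim + emb_dim           # [c_t; h^Y_t; s_t; e(z'_t)]
        self.cell = nn.GRUCell(emb_dim + enc_dim + syn_dim, dec_dim)
        self.bop = nn.Linear(feat_dim, 1)                           # Eq. 6: transition probability p_t
        self.out = nn.Linear(feat_dim, vocab_size)                  # Eq. 8: vocabulary logits
        self.gen = nn.Linear(feat_dim, 1)                           # Eq. 9: p_gen

    def forward(self, prev_emb, c_t, h_syn, s_t, syn_queue, alpha, src_ids):
        p_t = torch.sigmoid(self.bop(torch.cat([c_t, h_syn, s_t, prev_emb], dim=-1)))
        if p_t.item() >= 0.5 and syn_queue:                         # Eq. 7: move to the next terminal node
            h_syn = syn_queue.pop(0)
        s_t = self.cell(torch.cat([prev_emb, c_t, h_syn], dim=-1), s_t)
        feats = torch.cat([c_t, h_syn, s_t, prev_emb], dim=-1)
        p_vocab = torch.softmax(self.out(feats), dim=-1)            # Eq. 8
        p_gen = torch.sigmoid(self.gen(feats))                      # Eq. 9
        p_final = p_gen * p_vocab
        p_final = p_final.scatter_add(-1, src_ids, (1.0 - p_gen) * alpha)   # copy term of Eq. 9
        # Per-step loss (Eq. 10): -log p_final[z*_t] - a_t*log(p_t) - (1 - a_t)*log(1 - p_t)
        return p_final, p_t, h_syn, s_t
```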

4 Experiments

Our experiments are geared towards answering the following questions:
Q1. Is SGCP able to generate syntax conforming sentences without losing out on meaning? (Sections 5.1, 5.4)

Q2. What level of syntactic control does SGCP offer? (Sections 5.2, 5.3)

Q3. How does SGCP compare against prior models, qualitatively? (Section 5.4)

Q4. Are the improvements achieved by SGCP statistically significant? (Section 5.1)

Based on these questions, we outline the methods compared (Section 4.1), along with the datasets (Section 4.2) used, the evaluation criteria (Section 4.3), and the experimental setup (Section 4.4).

4.1 Methods Compared

As in Chen et al. (2019a), we first highlight the results of the two direct return-input baselines.

1. Source-as-Output: Baseline where the output is the semantic input.

2. Exemplar-as-Output: Baseline where the output is the syntactic exemplar.

We compare the following competitive methods:

3. SCPN (Iyyer et al., 2018) is a sequence-to-sequence based model comprising two encoders built with LSTMs (Hochreiter and Schmidhuber, 1997) to encode semantics and syntax, respectively. Once the encoding is obtained, it serves as an input to the LSTM-based decoder, which is augmented with soft-attention (Bahdanau et al., 2014) over the encoded states as well as a copying mechanism (See et al., 2017) to deal with out-of-vocabulary tokens.4

4. CGEN (Chen et al., 2019a) is a VAE (Kingma and Welling, 2014) model with two encoders to project the semantic input and the syntactic input to a latent space. They obtain a syntactic embedding from one encoder, using a standard Gaussian prior. To obtain the semantic representation, they use a von Mises-Fisher prior, which can be thought of as a Gaussian distribution on a hypersphere. They train the model using a multi-task paradigm, incorporating a paraphrase generation loss and a word position loss. We considered their best model, VGVAE + LC + WN + WPL, which incorporates the above objectives.

5. SGCP (Section 3) is a sequence-and-tree-to-sequence based model that encodes semantics and tree-level syntax to produce paraphrases. It uses a GRU-based (Chung et al., 2014) decoder with soft-attention on the semantic encodings and a begin-of-phrase (bop) gate to select a leaf node in the exemplar syntax tree. We compare the following two variants of SGCP:

(a) SGCP-F: Uses the full constituency parse tree information of the exemplar for generating paraphrases.

(b) SGCP-R: SGCP can produce multiple paraphrases by pruning the exemplar tree at various heights. This variant first generates five candidate generations, corresponding to five different heights of the exemplar tree, namely {Hmax, Hmax − 1, Hmax − 2, Hmax − 3, Hmax − 4}, for each (source, exemplar) pair. From these candidates, the one with the highest ROUGE-1 score with the source sentence is selected as the final generation.

Note that, except for the return-input baselines, all methods use beam search during inference.

4Note that the results for SCPN differ from the ones shown in Iyyer et al. (2018). This is because the dataset used in Iyyer et al. (2018) is at least 50 times larger than the largest dataset (ParaNMT-small) in this work.
4.2 Datasets

We train the models and evaluate them on the following datasets:

(1) ParaNMT-small (Chen et al., 2019a) contains 500K sentence-paraphrase pairs for training, and 1,300 manually labeled sentence-exemplar-reference triples, which are further split into 800 test data points and 500 dev data points, respectively. As in Chen et al. (2019a), our model uses only (sentence, paraphrase) pairs during training. The paraphrase itself serves as the exemplar input during training. This dataset is a subset of the original ParaNMT-50M dataset (Wieting and Gimpel, 2018). ParaNMT-50M is a dataset generated automatically through backtranslation of original English sentences. It is inherently noisy because of imperfect neural machine translation quality, with many sentences being non-grammatical and some even being non-English sentences. Because of such noisy data points, it is optimistic to assume that the corresponding constituency parse trees would be well aligned. To that end, we propose to use the following additional dataset, which is more well-formed and has more human intervention than the ParaNMT-50M dataset.

(2) QQP-Pos: The original Quora Question Pairs (QQP) dataset contains about 400K sentence pairs labeled positive if they are duplicates of each other and negative otherwise. The dataset is composed of about 150K positive and 250K negative pairs. We select those positive pairs that contain both sentences with a maximum token length of 30, leaving us with ∼146K pairs. We call this dataset QQP-Pos. Similar to ParaNMT-small, we use only the sentence-paraphrase pairs as the training set and sentence-exemplar-reference triples for testing and validation. We randomly choose 140K sentence-paraphrase pairs as the training set Ttrain, and the remaining 6K pairs Teval are used to form the evaluation set E. Additionally, let Teset = {{X, Z} : (X, Z) ∈ Teval}. Note that Teset is a set of sentences while Teval is a set of sentence-paraphrase pairs.

Let E = φ be the initial evaluation set. For selecting an exemplar for each sentence-paraphrase pair (X, Z) ∈ Teval, we adopt the following procedure:

Step 1: For a given (X, Z) ∈ Teval, construct an exemplar candidate set C = Teset − {X, Z}. |C| ≈ 12,000.

Step 2: Retain only those sentences C ∈ C whose sentence length (= number of tokens) differs by at most two when compared to the paraphrase Z. This is done since sentences with similar constituency-based parse tree structures tend to have similar token lengths.

Step 3: Remove those candidates C ∈ C which are very similar to the source sentence X, that is, BLEU(X, C) > 0.6.

Step 4: From the remaining instances in C, choose that sentence C as the exemplar Y which has the least Tree-Edit distance with the paraphrase Z of the selected pair, i.e., Y = arg min_{C ∈ C} TED(Z, C). This ensures that the constituency-based parse tree of the exemplar Y is quite similar to that of Z, in terms of Tree-Edit distance.

Step 5: E := E ∪ (X, Y, Z).

Step 6: Repeat the procedure for all other pairs in Teval.

From the obtained evaluation set E, we randomly choose 3K triplets for the test set Ttest, and the remaining 3K for the validation set V.
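The exemplar selection procedure above (Steps 1–4) can be sketched as follows. sentence_bleu comes from NLTK, and tree edit distance is computed here with the zss package, one available implementation of the Zhang–Shasha algorithm; to_zss_tree is an assumed helper that converts a sentence into a zss-compatible constituency tree.

```python
from nltk.translate.bleu_score import sentence_bleu
from zss import simple_distance

def select_exemplar(X, Z, candidate_sents, to_zss_tree, max_len_diff=2, bleu_cutoff=0.6):
    """Pick the candidate whose parse tree is closest to the paraphrase Z (Steps 1-4)."""
    pool = [c for c in candidate_sents if c not in (X, Z)]                       # Step 1
    pool = [c for c in pool
            if abs(len(c.split()) - len(Z.split())) <= max_len_diff]             # Step 2
    pool = [c for c in pool
            if sentence_bleu([X.split()], c.split()) <= bleu_cutoff]             # Step 3
    z_tree = to_zss_tree(Z)
    return min(pool, key=lambda c: simple_distance(z_tree, to_zss_tree(c)))      # Step 4
```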

4.3 Evaluation

It should be noted that there is no single fully reliable metric for evaluating syntactic paraphrase generation. Therefore, we evaluate on the following metrics to showcase the efficacy of syntactic paraphrasing models.

1. Automated Evaluation.

(i) Alignment-based metrics: We compute BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) scores between the generated and the reference paraphrases in the test set.

(ii) Syntactic Transfer: We evaluate the syntactic transfer using the Tree-edit distance (Zhang and Shasha, 1989) between the parse trees of:

(a) the generated paraphrase and the syntactic exemplar in the test set (TED-E)

(b) the generated paraphrase and the reference paraphrase in the test set (TED-R)

(iii) Model-based evaluation: Because our goal is to generate paraphrases of the input sentence, we need some measure to determine if the generations indeed convey the same meaning as the original text. To achieve this, we adopt a model-based evaluation metric as used by Shen et al. (2017) for Text Style Transfer and Isola et al. (2017) for Image Transfer. Specifically, classifiers are trained on the task of Paraphrase Detection and then used as Oracles to evaluate the generations of our model and the baselines. We fine-tune two RoBERTa (Liu et al., 2019) based sentence pair classifiers, one on Quora Question Pairs (Classifier-1) and the other on ParaNMT + PAWS5 datasets (Classifier-2),

5Because the ParaNMT dataset only contains paraphrase pairs, we augment it with the PAWS (Zhang et al., 2019) dataset to acquire negative samples.



QQP-Pos

| Model | BLEU↑ | METEOR↑ | ROUGE-1↑ | ROUGE-2↑ | ROUGE-L↑ | TED-R↓ | TED-E↓ | PDS↑ |
|---|---|---|---|---|---|---|---|---|
| Source-as-Output | 17.2 | 31.1 | 51.9 | 26.2 | 52.9 | 16.2 | 16.6 | 99.8 |
| Exemplar-as-Output | 16.8 | 17.6 | 38.2 | 20.5 | 43.2 | 4.8 | 0.0 | 10.7 |
| SCPN (Iyyer et al., 2018) | 15.6 | 19.6 | 40.6 | 20.5 | 44.6 | 9.1 | 8.0 | 27.0 |
| CGEN (Chen et al., 2019a) | 34.9 | 37.4 | 62.6 | 42.7 | 65.4 | 6.7 | 6.0 | 65.4 |
| SGCP-F | 36.7 | 39.8 | 66.9 | 45.0 | 69.6 | 4.8 | 1.8 | 75.0 |
| SGCP-R | 38.0 | 41.3 | 68.1 | 45.7 | 70.2 | 6.8 | 5.9 | 87.7 |

ParaNMT-small

| Model | BLEU↑ | METEOR↑ | ROUGE-1↑ | ROUGE-2↑ | ROUGE-L↑ | TED-R↓ | TED-E↓ | PDS↑ |
|---|---|---|---|---|---|---|---|---|
| Source-as-Output | 18.5 | 28.8 | 50.6 | 23.1 | 47.7 | 12.0 | 13.0 | 99.0 |
| Exemplar-as-Output | 3.3 | 12.1 | 24.4 | 7.5 | 29.1 | 5.9 | 0.0 | 14.0 |
| SCPN (Iyyer et al., 2018) | 6.4 | 14.6 | 30.3 | 11.2 | 34.6 | 6.2 | 1.4 | 15.4 |
| CGEN (Chen et al., 2019a) | 13.6 | 24.8 | 44.8 | 21.0 | 48.3 | 6.7 | 3.3 | 70.2 |
| SGCP-F | 15.3 | 25.9 | 46.6 | 21.8 | 49.7 | 6.1 | 1.4 | 76.6 |
| SGCP-R | 16.4 | 27.2 | 49.6 | 22.9 | 50.5 | 8.7 | 7.0 | 83.5 |

Table 2: Results on the QQP-Pos and ParaNMT-small datasets. Higher↑ BLEU, METEOR, ROUGE, and PDS is better, whereas a lower↓ TED score is better. SGCP-R selects the best candidate out of many, resulting in a performance boost for semantic preservation. We bold the statistically significant results of SGCP-F only, for a fair comparison with the baselines. Note that Source-as-Output and Exemplar-as-Output are only dataset quality indicators and not competitive baselines. Please see Section 5 for details.

which achieve accuracies of 90.2% and 94.0% on their respective test sets.6

Once trained, we use Classifier-1 to evaluate generations on QQP-Pos and Classifier-2 on ParaNMT-small.

We first generate syntactic paraphrases using all the models (Section 4.1) on the test splits of the QQP-Pos and ParaNMT-small datasets. We then pair the source sentences with their corresponding generated paraphrases and send them as input to the classifiers. The Paraphrase Detection Score, denoted as PDS in Table 2, is defined as the ratio of the number of generations predicted as paraphrases of their corresponding source sentences by the classifier to the total number of generations.

2. Human Evaluation.

Although TED is sufficient to highlight syntactic transfer, there has been some scepticism regarding automated metrics for paraphrase quality (Reiter, 2018). To address this issue, we perform human evaluation on 100 randomly selected data points from the test set. In the evaluation, three judges (non-researchers proficient in the English language) were asked to assign scores to generated sentences based on the semantic similarity with the given source sentence. The annotators were shown a source sentence and the corresponding outputs of the systems in random order. The scores ranged from 1 (doesn't capture meaning at all) to 4 (perfectly captures the meaning of the source sentence).

6Because the test set of QQP is not public, the 90.2% number was computed on the available dev set (not used for model selection).
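The model-based PDS metric described in (iii) above can be computed along the following lines. This is only a sketch: ''paraphrase-oracle'' is a placeholder for a RoBERTa checkpoint fine-tuned on QQP (Classifier-1) or ParaNMT + PAWS (Classifier-2), not a published model name, and the convention that label 1 means ''paraphrase'' is an assumption that depends on how the classifier was trained.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "paraphrase-oracle" is a hypothetical path to a fine-tuned RoBERTa sentence-pair classifier.
tokenizer = AutoTokenizer.from_pretrained("paraphrase-oracle")
oracle = AutoModelForSequenceClassification.from_pretrained("paraphrase-oracle")

def paraphrase_detection_score(sources, generations):
    """PDS: fraction of generations that the oracle labels as paraphrases of their sources."""
    hits = 0
    for src, gen in zip(sources, generations):
        inputs = tokenizer(src, gen, return_tensors="pt", truncation=True)
        with torch.no_grad():
            pred = oracle(**inputs).logits.argmax(dim=-1).item()
        hits += int(pred == 1)          # assumed convention: label 1 = "is a paraphrase"
    return hits / len(generations)
```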

4.4 Setup

(A) Pre-processing. Because our model needs
access to constituency parse trees, we tokenize
and parse all our data points using the fully
parallelizable Stanford CoreNLP Parser (Manning et al., 2014) to obtain their respective parse trees.
This is done prior to training in order to prevent
any additional computational costs that might be
incurred because of repeated parsing of the same
data points during different epochs.

(乙) Implementation Details. We train both our
models using the Adam Optimizer (Kingma and
Ba, 2014) with an initial learning rate of 7e-
5. We use a bidirectional
three-layered GRU
for encoding the tokenized semantic input and a
standard pointer-generator network with GRU for
decoding. The token embedding is learnable with
dimension 300. To reduce the training complexity



(1)
Source: what should be done to get rid of laziness ?
Template Exemplar: how can i manage my anger ?
SCPN (Iyyer et al., 2018): how can i get rid ?
CGEN (Chen et al., 2019a): how can i get rid of ?
SGCP-F (Ours): how can i stop my laziness ?
SGCP-R (Ours): how do i get rid of laziness ?

(2)
Source: what books should entrepreneurs read on entrepreneurship ?
Template Exemplar: what is the best programming language for beginners to learn ?
SCPN (Iyyer et al., 2018): what are the best books books to read to read ?
CGEN (Chen et al., 2019a): what 's the best book for entrepreneurs read to entrepreneurs ?
SGCP-F (Ours): what is a best book idea that entrepreneurs to read ?
SGCP-R (Ours): what is a good book that entrepreneurs should read ?

(3)
Source: how do i get on the board of directors of a non profit or a for profit organisation ?
Template Exemplar: what is the best way to travel around the world for free ?
SCPN (Iyyer et al., 2018): what is the best way to prepare for a girl of a ?
CGEN (Chen et al., 2019a): what is the best way to get a non profit on directors ?
SGCP-F (Ours): what is the best way to get on the board of directors ?
SGCP-R (Ours): what is the best way to get on the board of directors of a non profit or a for profit organisation ?

Table 3: Sample generations of the competitive models. Please refer to Section 5.5 for details.

of the model, the maximum sequence length is
kept at 60. The vocabulary size is kept at 24K for
QQP and 50K for ParaNMT-small.

SGCP needs access to the level of syntactic granularity for decoding, depicted as H in Figure 2. During training, we keep varying it randomly from 3 to Hmax, changing it with each training epoch. This ensures that our model is able to generalize because of an implicit regularization attained using this procedure. At each time-step of the decoding process, we keep a teacher forcing ratio of 0.9.
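A minimal sketch of this training-time schedule (per-epoch granularity sampling and per-step teacher forcing) is shown below; the function names are illustrative assumptions.

```python
import random

def training_schedule(full_tree_height, teacher_forcing_ratio=0.9, h_min=3):
    """Sample the pruning height H in [3, H_max] for the epoch and expose a
    per-step teacher-forcing decision, as described above."""
    H = random.randint(h_min, full_tree_height)           # exemplar trees are pruned to H this epoch
    def feed_gold_token():
        return random.random() < teacher_forcing_ratio    # use the gold token 90% of the time
    return H, feed_gold_token
```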

5 Results

5.1 Semantic Preservation and Syntactic Transfer

1. Automated Metrics: As can be observed in Table 2, our methods (SGCP-F/R, Section 4.1) are able to outperform the existing baselines on both the datasets. Source-as-Output is independent of the exemplar sentence being used, and since a sentence is a paraphrase of itself, the paraphrastic scores are generally high while the syntactic scores are below par. The opposite is true for Exemplar-as-Output. These baselines also serve as dataset quality indicators. It can be seen that the source is semantically similar to the target sentence while being syntactically different from it, whereas the opposite is true when the exemplar is compared to the target sentences. Moreover, the source sentences are syntactically and semantically different from the exemplar sentences, as can be observed from the TED-E and PDS scores. This helps in showing that the dataset has rich enough syntactic diversity to learn from.

Through TED-E scores it can be seen that
SGCP-F is able to adhere to the syntax of the
exemplar template to a much larger degree than
the baseline models. This verifies that our model
is able to generate meaning preserving sentences
while conforming to the syntax of the exemplars
when measured using standard metrics.

It can also be seen that SGCP-R tends to perform
better than SGCP-F in terms of paraphrastic scores
while taking a hit on the syntactic scores. 这
makes sense, intuitively, because in some cases
SGCP-R tends to select lower H values for syntactic
granularity. This can also be observed from the
example given in Table 6 where H = 6 is more
favorable than H = 7, because of better meaning
retention.

Although CGEN performs close to our model in
terms of BLEU, ROUGE, and METEOR scores
on ParaNMT-small dataset, its PDS is still much
lower than that of our model, suggesting that our
model is better at capturing the original meaning
of the source sentence. In order to show that the
results are not coincidental, we test the statistical
significance of our model. We follow the non-
parametric Pitman’s permutation test (Dror et al.,
2018) and observe that our model is statistically
significant when the significance level (A) 是
taken to be 0.05. Note that this holds true for
all metric on both the datasets except ROUGE-2
on ParaNMT-small.



| Dataset | SCPN | CGEN | SGCP-F | SGCP-R |
|---|---|---|---|---|
| QQP-Pos | 1.63 | 2.47 | 2.70 | 2.99 |
| ParaNMT-small | 1.24 | 1.89 | 2.07 | 2.26 |

Table 4: A comparison of human evaluation scores for the quality of paraphrases generated using all models. A higher score is better. Please refer to Section 5.1 for details.

2. Human Evaluation: Table 4 shows the results of the human assessment. It can be seen that annotators generally tend to rate SGCP-F and SGCP-R (Section 4.1) higher than the baseline models, thereby highlighting the efficacy of our models. This evaluation additionally shows that the automated metrics are somewhat consistent with the human evaluation scores.

5.2 Syntactic Control

1. Syntactical Granularity: Our model can work with different levels of granularity for the exemplar syntax, that is, different tree heights of the exemplar tree can be used for decoding the output.

As can be seen in Table 6, at height 4 the syntax tree provided to the model is not enough to generate the full sentence that captures the meaning of the original sentence. As we increase the height to 5, it is able to capture the semantics better by predicting ''some of'' in the sentence. We see that at heights 6 and 7 SGCP is able to capture both the semantics and syntax of the source and exemplar, respectively. However, as we provide the complete height of the tree (i.e., 7), it further tries to follow the syntactic input more closely, leading to a sacrifice in the overall relevance, since the original sentence is about pure substances and not a pure substance. It can be inferred from this example that, because a source sentence and an exemplar's syntax might not be fully compatible with each other, using the complete syntax tree can potentially lead to a loss of relevance and grammaticality. Hence, by choosing different levels of syntactic granularity, one can address the issue of compatibility to a certain extent.

2. Syntactic Variety: Table 5 shows sample generations of our model on multiple exemplars for a given source sentence. It can be observed that SGCP can generate high-quality outputs for a variety of different template exemplars, even the ones which differ a lot from the original sentence in terms of their syntax. A particularly interesting exemplar is what is chromosomal mutation ? what are some examples ?. Here, SGCP is able to generate a sentence with two question marks while preserving the essence of the source sentence. It should also be noted that the exemplars used in Table 5 were selected manually from the test sets, considering only their qualitative compatibility with the source sentence. Unlike the procedure used for the creation of the QQP-Pos dataset, the final paraphrases were not kept in hand while selecting the exemplars. In real-world settings, where a gold paraphrase won't be present, these results are indicative of the qualitative efficacy of our method.

5.3 SGCP-R Analysis

ROUGE-based selection from the candidates favors paraphrases that have higher n-gram overlap with their respective source sentences, and hence may capture the source's meaning better. This hypothesis can be directly observed from the results in Tables 2 and 4, where we see higher values on the automated semantic and human evaluation scores. Although this helps in obtaining better semantic generation, it tends to result in higher TED values. One possible reason is that, when provided with the complete tree, fine-grained information is available to the model for decoding and it forces the generations to adhere to the syntactic structure. In contrast, at lower heights, the model is provided with lesser syntactic information but equivalent semantic information.

5.4 Qualitative Analysis

As can be seen from Table 7, SGCP not only
incorporates the best aspects of both the prior
型号, namely SCPN and CGEN, but also utilizes
the complete syntactic information obtained using
the constituency-based parse trees of the exemplar.
From the generations in Table 3, we can see that
our model is able to capture both the semantics of
the source text as well as the syntax of template.
SCPN, evidently, can produce outputs with the
template syntax, but it does so at the cost of
semantics of the source sentence. This can also be
verified from the results in Table 2, where SCPN
performs poorly on PDS as compared with other
型号. 相比之下, CGEN and SGCP retain much
better semantic information, as is desirable. 尽管
generating sentences, CGEN often abruptly ends the



Source: how do i develop my career in software ?

| # | Syntactic Exemplar | SGCP Generation |
|---|---|---|
| 1 | how can i get a domain for free ? | how can i develop a career in software ? |
| 2 | what is the best way to register a company ? | what is the best way to develop career in software ? |
| 3 | what are good places to visit in new york ? | what are good ways to develop my career in software ? |
| 4 | can i make 800,000 a month betting on horses ? | can i develop my career in software ? |
| 5 | what is chromosomal mutation ? what are some examples ? | what is good career ? what are some of the ways to develop my career in software ? |
| 6 | is delivery free on quikr ? | is career useful in software ? |
| 7 | is it possible to mute a question on quora ? | is it possible to develop my career in software ? |

Table 5: Sample SGCP-R generations with a single source sentence and multiple syntactic exemplars. Please refer to Section 5.4 for details.

S: what are pure substances ? what are some examples ?
E: what are the characteristics of the elizabethan theater ?

H = 4: what are pure substances ?
H = 5: what are some of pure substances ?
H = 6: what are some examples of pure substances ?
H = 7: what are some examples of a pure substance ?

Table 6: Sample generations with different levels of syntactic control. S and E stand for source and exemplar, respectively. Please refer to Section 5.2 for details.

| Method | Single-Pass | Syntactic Signal | Granularity |
|---|---|---|---|
| SCPN | No | Linearized Tree | No |
| CGEN | Yes | POS Tags (during training) | No |
| SGCP | Yes | Constituency Parse Tree | Yes |

Table 7: Comparison of different syntactically controlled paraphrasing methods. Please refer to Section 5.4 for details.

5.5 Limitations and Future Directions

All natural language English sentences cannot necessarily be converted to any desirable syntax. We note that SGCP does not take into account the compatibility of the source sentence and the template exemplars, and can freely generate syntax-conforming paraphrases. This, at times, leads to imperfect paraphrase conversion and nonsensical sentences like example 6 in Table 5 (is career useful in software ?). Identifying compatible exemplars is an important but separate task in itself, which we defer to future work.

Another important aspect is that the task of paraphrase generation is inherently domain agnostic. It is easy for humans to adapt to new domains for paraphrasing. However, because of the nature of the formulation of the problem in NLP, all the baselines, as well as our model(s), suffer from dataset bias and are not directly applicable to new domains. A prospective future direction can be to explore it from the lens of domain independence.

Analyzing the utility of controlled paraphrase generations for the task of data augmentation is another interesting possible direction.

6 Conclusion

In this paper we proposed SGCP, an end-to-end framework for the task of syntactically controlled paraphrase generation. SGCP generates a paraphrase of an input sentence while conforming to the syntax of an exemplar sentence provided along with the input. SGCP comprises a GRU-based sentence encoder, a modified RNN-based tree encoder, and a novel pointer-generator-based decoder. In contrast to previous work








that focuses on a limited amount of syntactic
控制, our model can generate paraphrases at
different levels of granularity of syntactic control
without compromising on relevance. Through
extensive evaluations on real-world datasets, 我们
demonstrate SGCP’s efficacy over state-of-the-art
基线.

We believe that the above approach can be
useful for a variety of text generation tasks
including syntactic exemplar-based abstractive
summarization, text simplification and data-to-
text generation.

Acknowledgments

This research is supported in part by the Ministry of Human Resource Development (Government of India). We thank the action editor Asli Celikyilmaz and the three anonymous reviewers for their helpful suggestions in preparing the manuscript. We also thank Chandrahas for his indispensable comments on earlier drafts of this paper.

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 132–140, Vancouver, Canada. Association for Computational Linguistics.

Richard Chase Anderson and Alice Davison. 1986. Conceptual and empirical bases of readability formulas. Center for the Study of Reading Technical Report, no. 392.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 16–23. Association for Computational Linguistics.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152–161.

Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019a. Controllable paraphrase generation with a syntactic exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5972–5984, Florence, Italy. Association for Computational Linguistics.

Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019b. A multi-task approach for disentangling syntax and semantics in sentence representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2453–2464, Minneapolis, Minnesota. Association for Computational Linguistics.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Herbert H. Clark and Eve V. Clark. 1968. Semantic distinctions and memory for complex sentences. Quarterly Journal of Experimental Psychology, 20(2):129–138.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Samer Hassan, Andras Csomai, Carmen Banea, Ravi Sinha, and Rada Mihalcea. 2007. UNT: SubFinder: Combining knowledge sources for automatic lexical substitution. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 410–413. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1587–1596. JMLR.org.

Judith W. Irwin. 1980. The effects of explicitness and clause order on the comprehension of reversible causal relationships. Reading Research Quarterly, pages 477–488.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Evelyn Walker Katz and Sandor B. Brent. 1968. Understanding connectives. Journal of Memory and Language, 7(2):501.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of ICLR.

Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. 2019. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3609–3619, Minneapolis, Minnesota. Association for Computational Linguistics.

Elena T. Levy. 2003. The roots of coherence in discourse. Human Development, 46(4):169–188.

Jiwei Li, Michelle Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, Brussels, Belgium. Association for Computational Linguistics.

Zichao Li, Xin Jiang, Lifeng Shang, and Qun Liu. 2019. Decomposable neural paraphrase generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3403–3414, Florence, Italy. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. https://www.aclweb.org/anthology/W04-1013.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Hao Peng, Ankur Parikh, Manaal Faruqui, Bhuwan Dhingra, and Dipanjan Das. 2019. Text generation with exemplar-based adaptive decoding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2555–2565, Minneapolis, Minnesota. Association for Computational Linguistics.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual LSTM networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2923–2934, Osaka, Japan. The COLING 2016 Organizing Committee.

Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase generation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 142–149.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

Advaith Siddharthan. 2014. A survey of research on text simplification. ITL-International Journal of Applied Linguistics, 165(2):259–298.

Felix Stahlberg, Eva Hasler, Aurelien Waite, and Bill Byrne. 2016. Syntactically guided neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 299–305.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462.

Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2010. Paraphrase generation as monolingual translation: Data and evaluation. In Proceedings of the 6th International Natural Language Generation Conference, pages 203–207. Association for Computational Linguistics.

Xuewen Yang, Yingru Liu, Dongliang Xie, Xin Wang, and Niranjan Balasubramanian. 2019. Latent part-of-speech sequences for neural machine translation.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In Advances in Neural Information Processing Systems, pages 7287–7298.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.

Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308.

Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu, and Sheng Li. 2008. Combining multiple resources to improve SMT-based paraphrasing model. In Proceedings of ACL-08: HLT, pages 1021–1029.