Directed Acyclic Transformer Pre-training for High-quality
Non-autoregressive Text Generation

Fei Huang Pei Ke Minlie Huang∗
The CoAI Group, Tsinghua University, Beijing, China
Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems,
Beijing National Research Center for Information Science and Technology,
Department of Computer Science and Technology, Tsinghua University, Beijing, China
f-huang18@mails.tsinghua.edu.cn, kepei1106@outlook.com,

aihuang@tsinghua.edu.cn

Abstract

Non-AutoRegressive (NAR) text generation
models have drawn much attention because
of their significantly faster decoding speed and
good generation quality in machine transla-
tion. However, in a wider range of text gener-
ation tasks, existing NAR models lack proper
pre-training, making them still far behind
the pre-trained autoregressive models. In this
paper, we propose Pre-trained Directed Acy-
clic Transformer (PreDAT) and a novel pre-
training task to promote prediction consistency
in NAR generation. Experiments on five text
generation tasks show that our PreDAT re-
markably outperforms existing pre-trained
NAR models (+4.2 score on average) and even
achieves better results than pre-trained autore-
gressive baselines in n-gram-based metrics,
along with 17 times speedup in throughput.
Further analysis shows that PreDAT benefits
from the unbiased prediction order that alle-
viates the error accumulation problem in au-
toregressive generation, which provides new
insights into the advantages of NAR generation.1

1 Introduction

Pre-trained language models have been widely
applied in text generation (Radford et al., 2019;
Song et al., 2019; Lewis et al., 2020; Raffel et al.,
2020), which can effectively improve the perfor-
mance of downstream generation tasks, especially
in low-resource scenarios (Brown et al., 2020).
Most of these pre-trained language models are
based on AutoRegressive (AR) generation, which

∗ Corresponding author: Minlie Huang.
1Our code and pre-trained models are available at
https://github.com/thu-coai/DA-Transformer.


produces high-quality texts by predicting each
token one by one. However, such a sequential
generation process suffers from high latency and
low throughput in inference, thereby largely lim-
iting the use of AR models in scenarios with
real-time requirements.

Non-AutoRegressive (NAR) generation is an
alternative text generation paradigm (Gu et al.,
2018). Unlike sequential generation in AR mod-
els, NAR models predict all tokens in parallel,
which largely accelerates the decoding process.
Although early NAR models suffer from serious
quality degradation due to the independent token
prediction, recent NAR studies have made much
progress on some generation tasks, such as ma-
chine translation (Qian et al., 2021; Gu and Kong,
2021; Huang et al., 2022a). Notably, Huang et al.
(2022c) propose Directed Acyclic Transformer,
which incorporates a directed acyclic graph to re-
duce the conflicts in capturing possible outputs,
achieving a comparable translation quality to the
AR models.

Despite the success of NAR generation in ma-
chine translation, it is still challenging to apply
NAR models to a wider range of generation tasks,
mainly due to the lack of appropriate pre-training.
Although some previous studies have explored
pre-training methods such as directly fine-tuning
BERT for NAR generation (Guo et al., 2020b;
Su et al., 2021; Jiang et al., 2021) or pre-training
NAR models from scratch (Qi et al., 2021; Li
et al., 2022), their models still have a significant
quality gap compared with AR ones. We argue
that these methods do not fully exploit the char-
acteristic of NAR generation, thereby restricting
downstream performance. Specifically, we dis-
cuss two main issues: (1) Previous pre-training

Transactions of the Association for Computational Linguistics, vol. 11, pp. 941–959, 2023. https://doi.org/10.1162/tacl_a_00582
Action Editor: Alexander Rush. Submission batch: 11/2022; Revision batch: 2/2023; Published 8/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


tasks are ineffective in promoting sentence-level
prediction consistency, making it hard for their
models to predict a whole sentence simultane-
ously while preserving the fluency in down-
stream NAR generation. (2) Previous pre-training
tasks fail to address the multi-modality problem
(Gu et al., 2018), which has proved to be a funda-
mental and important challenge in training NAR
models (Huang et al., 2022b).

In this paper, we introduce PreDAT, a Pre-
trained Directed Acyclic Transformer for high-
quality non-autoregressive text generation. We
utilize the architecture of Directed Acyclic Trans-
former and further propose a novel pre-training
task, Double-Source Text Infilling (DSTI), aiming
to address the above issues in pre-trained NAR
models. Specifically, DSTI contains two steps: It
corrupts a sentence and scatters the tokens into
two sequences, which are fed into the encoder and
decoder as two sources of information; then the
model is trained to recover the corrupted fragments
non-autoregressively. During the pre-training, our
model predicts long sentence fragments (about
15 tokens) from nearby contexts, which promotes
prediction consistency and bidirectional depen-
dencies. Moreover, DSTI designs a strategy for
creating pre-training data pairs that allow the out-
put sequences to have flexible lengths, which well
incorporates various alignment-based NAR train-
ing objectives to alleviate the multi-modality prob-
lem (Libovický and Helcl, 2018; Ghazvininejad
et al., 2020; Du et al., 2021; Huang et al., 2022c).
Automatic evaluation shows that PreDAT is
effective and efficient on five text generation
tasks. It remarkably outperforms previous pre-
trained NAR models (+4.2 score on average) and
even achieves better results than pre-trained AR
baselines in n-gram-based metrics (+0.7 score on
average), along with a 17x speedup in through-
put. To our knowledge, PreDAT is the first NAR
model that outperforms pre-trained AR models
on various generation tasks in automatic evalua-
tion. Further ablation studies verify that our pre-
training task designs, including the long fragment
prediction and alignment-based training objec-
tives, are crucial for success.

To better understand the advantages and weak-
nesses of NAR generation, we use automatic and
manual methods to investigate the generated texts
in downstream tasks. We find that PreDAT can
alleviate the error accumulation in AR generation
and improve the relevance to the input, thereby

leading to a better performance in n-gram-based
metrics. However, we also find that NAR mod-
els, including PreDAT, are still weaker than AR
models in preserving the consistency among gen-
erated tokens, leading to grammatical errors such
as wrong word choices. We believe that these
findings can provide novel insights for future
NAR studies.

2 Related Work

Pre-trained Language Models (PLM)
In re-
cent years, PLMs have made significant progress
in natural language generation (Radford et al.,
2019; Song et al., 2019; Lewis et al., 2020;
Raffel et al., 2020). These PLMs are pre-trained on
a large corpus of unlabeled data, where the knowl-
edge can be transferred to downstream tasks, re-
sulting in improved generation quality.

Non-Autoregressive Generation Although NAR
generation (Gu et al., 2018) remarkably speeds up
the inference, Huang et al. (2022b) point out that
it theoretically suffers from serious information
dropping, previously known as the multi-modality
problem. To alleviate the problem, previous stud-
ies propose methods including (1) iterative re-
finement (Lee et al., 2018; Gu et al., 2019;
Ghazvininejad et al., 2019; Guo et al., 2020a;
Huang et al., 2022d); (2) knowledge distillation
(Kim and Rush, 2016; Ding et al., 2022, 2021a,b;
Shao et al., 2022); (3) dependency enhancement
(Sun et al., 2019; Qian et al., 2021; Huang et al.,
2022a; Bao et al., 2022); or (4) alignment-based
objectives (Ghazvininejad et al., 2020; Du et al.,
2021; Libovický and Helcl, 2018; Huang et al.,
2022c).

There are also studies combining PLMs and
NAR generation. For example, some methods
fine-tune existing pre-trained models directly
(Jiang et al., 2021) or with an adapter (Guo et al.,
2020b; Su et al., 2021). Some others combine AR
and NAR prediction (Qi et al., 2021) or involve
an early exiting mechanism (Le et al., 2022) in
pre-training.

Compared with these studies, our method has
two significant differences: (1) Previous methods
either predict short spans (e.g., BERT) or incor-
porate unidirectional AR prediction (Qi et al.,
2021), which hardly contribute to NAR genera-
tion that predicts a whole sentence with bidirec-
tional attention. In contrast, we train our model


to predict long fragments simultaneously, lead-
ing to better consistency among generated tokens.
(2) Previous methods use a token-level loss that
forces the model to predict a same-length se-
quence to match the target, which over-penalizes
the position shift error (Ghazvininejad et al., 2020)
and worsens the multi-modality problem. We in-
troduce an up-sampling strategy to obtain longer
output sequences, which well incorporates previ-
ous alignment-based NAR losses to address the
above problems.

3 Preliminaries: Directed
Acyclic Transformer

Directed Acyclic Transformer (DAT, Huang et al.,
2022c; see also Figure 1) is an NAR model that
effectively alleviates the multi-modality problem.
It introduces a longer decoding sequence and an
alignment-based objective to reduce the conflicts
in capturing multiple possible outputs. Specifi-
cally, given the input X = {x1, · · · , xM} and
the target sequence Y = {y1, · · · , yN}, DAT pro-
duces a feature sequence V = {v1, v2, · · · , vL}
organized in a Directed Acyclic Graph (DAG),
where Y is aligned to a sub-sequence of V (equiv-
alently, assigned to a path of the DAG). Notably,
L is usually much larger than N to allow for
more flexible alignments. In DAT training, the
alignment-based objective marginalizes the prob-
abilities of all possible alignments that produce
the target Y , formally as

$$\mathcal{L}_{\mathrm{DAT}}(V, Y) = -\log P_\theta(Y|X) = -\log \sum_{A \in \Gamma} P_\theta(Y|A, X)\, P_\theta(A|X), \quad (1)$$
$$P_\theta(Y|A, X) = \prod_{i=1}^{N} P_\theta(y_i \mid v_{a_i}), \qquad P_\theta(A|X) = \prod_{i=1}^{N-1} P_\theta(a_{i+1} \mid a_i),$$

Figure 1: Preliminaries: Directed Acyclic Transformer
(DAT). To alleviate the multi-modality problem, DAT
predicts a feature sequence V organized in a directed
acyclic graph (DAG) and then adopts an alignment-
based objective that aligns the target Y to the feature
sequence V , represented by LDAT(V, Y ).

Compared with previous NAR models, DAT
explicitly models the dependencies between tok-
ens by the position transitions and is able to
store multiple modalities on different paths of the
DAG, thereby remarkably improving the gener-
ation performance. Moreover, various decoding
algorithms such as beam search and Nucleus sam-
pling (Holtzman et al., 2020) can be utilized to
boost the generation quality or diversity.

Besides LDAT, there are other alignment-based
objectives that succeed in alleviating the
multi-modality problem in NAR generation, such
as AXE (Ghazvininejad et al., 2020), OaXE
(Du et al., 2021), and CTC (Graves et al., 2006;
Libovický and Helcl, 2018). In general, these
objectives are also obtained by aligning the target
Y with the feature sequence V, thus denoted by
L(V, Y).


where A = {a1, · · · , aN} is the feature indexes on
the aligned path, and Γ contains all possible
paths, whose number is $\binom{L}{N}$. Pθ(yi|vai) represents
token probabilities predicted from the feature vai,
and Pθ(ai+1|ai) represents transition probabili-
ties revealing how likely ai+1 follows ai in a
path. Since it is impossible to enumerate the huge
number of paths in Equation 1, a dynamic pro-
gramming algorithm can be adopted to address
the problem, whose details can be found in the
original paper (Huang et al., 2022c).
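To make the marginalization concrete, the following is a minimal log-space forward-algorithm sketch of the DAT loss in PyTorch. It assumes the per-position token log-probabilities and the transition log-probabilities are already computed, that transitions only go forward (entries with j ≤ i set to −inf), and that paths start at the first and end at the last feature; it is an illustrative reimplementation, not the authors' optimized CUDA kernel.

```python
import torch

def dat_loss(token_logp, trans_logp, target):
    """Log-space DP for Equation 1 (a sketch).

    token_logp: [L, V] log P(y | v_i) for every feature position i
    trans_logp: [L, L] log P(a_{k+1} = j | a_k = i); -inf where j <= i
    target:     [N]    target token ids (N <= L)
    Returns -log P(Y|X), marginalized over all monotone paths A.
    """
    L = token_logp.size(0)
    N = target.size(0)
    emit = token_logp[:, target]              # emit[i, k] = log P(y_k | v_i)
    f = torch.full((L,), float("-inf"))
    f[0] = emit[0, 0]                         # a_1 = 1: paths start at v_1
    for k in range(1, N):
        # f_new[j] = logsumexp_i( f[i] + trans[i, j] ) + emit[j, k]
        f = torch.logsumexp(f.unsqueeze(1) + trans_logp, dim=0) + emit[:, k]
    return -f[-1]                             # a_N = L: paths end at v_L
```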

4 Proposed Method

In this section, we introduce PreDAT, Pretrained
Directed Acyclic Transformer. We first propose
the pre-training task (Section 4.1) and then de-
scribe the fine-tuning and inference strategies
(Section 4.2).

4.1 Pre-training Task

Our pre-training task, Double-Source Text In-
filling (DSTI), is a self-supervised pre-training
task that aims to promote prediction consistency
and bidirectional dependencies for NAR mod-
els. Our task scatters part of a sentence into two


Figure 2: An overview of Double-Source Text Infilling (DSTI). (a) Data Preparation: DSTI first creates the en-
coder input X and the target Y by span masking, and then obtains the decoder input Z by up-sampling, assigning,
and re-masking. (b) Pre-training Process: The NAR model is trained to predict the unseen fragments in Y in
parallel, with X and Z as inputs. The training objective is the sum of alignment-based NAR losses, which
are obtained by aligning each target fragment (e.g., Y1:4) to the feature sequence on the corresponding masked
segments (e.g., V1:6).

sequences, feeds them into the encoder and de-
coder as two sources of information, and then
trains the model to predict long unseen fragments
in a non-autoregressive fashion. Although DSTI
is compatible with various NAR architectures
and losses, we mainly focus on DAT due to its
superior performance.

As shown in Figure 2, our task takes a
piece of text from the pre-training corpus and
decomposes it into a triple (X, Z, Y), where
X = {x1, · · · , xM} is the encoder input, Z =
{z1, · · · , zL} is the decoder input, and Y =
{y1, · · · , yN} is the target. The data preparation
consists of two stages.

Stage 1: Creating Encoder Input We utilize
span masking (Raffel et al., 2020) to obtain the
encoder input X and the target Y . Specifically,
we randomly mask tokens in the original sen-
tence, and then replace consecutive masks into
a single special token representing the span ID.
Then the prediction target Y is constructed by
concatenating the masked spans with the span IDs
as delimiters.

Specially, we force each masked span to be
long enough (about 15 tokens) because the NAR
model has to generate a whole sentence simul-
taneously in inference, where predicting short
spans is unhelpful in preserving sentence-level
consistency.
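As an illustration of Stage 1, the toy sketch below masks a few long spans and builds the encoder input and target; the chunking scheme and the special-token names are assumptions made for the example, not the authors' exact preprocessing code.

```python
import random

def make_stage1_pair(tokens, n_spans=6, mask_ratio=0.15):
    """Toy Stage 1: mask n_spans long spans (about mask_ratio of all tokens)
    and build the encoder input X and the target Y, with one span-ID token
    per masked span used as a delimiter in Y."""
    span_len = max(1, int(len(tokens) * mask_ratio / n_spans))
    chunk = len(tokens) // n_spans
    X, Y, cursor = [], [], 0
    for s in range(n_spans):
        # pick a span inside the s-th equal-sized chunk so spans never overlap
        lo = s * chunk
        hi = min((s + 1) * chunk, len(tokens)) - span_len
        start = random.randint(lo, max(lo, hi))
        X.extend(tokens[cursor:start])
        X.append(f"[SPAN{s}]")                 # one special token per span
        Y.append(f"[SPAN{s}]")                 # span IDs delimit the target
        Y.extend(tokens[start:start + span_len])
        cursor = start + span_len
    X.extend(tokens[cursor:])
    return X, Y
```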

Stage 2: Creating Decoder Input The decoder
input Z plays two roles in our pre-training: (1)

It reveals some target tokens to promote bidirec-
tional dependencies in the decoder. (2) It deter-
mines the length of the predicted feature sequence.
To incorporate the alignment-based NAR losses
that require a longer feature sequence than the
target (such as DAT and CTC), we create the
decoder input Z by an up-sampling step. Then we
assign a part of the target tokens to appropriate
positions in Z, where the unseen tokens will be
used as prediction targets. Specifically, creating Z
follows three steps: up-sampling, assigning, and
re-masking.

For up-sampling, we decide the length of Z
based on the target length. Formally, we have
L := λN , where λ is an up-sampling ratio. In
DAT, varying L can bring different DAG sizes
and structures, where we sample λ from a uni-
form distribution to diversify the DAG structures
in pre-training. After determining the length, the
span IDs are put into Z according to the up-
sampling ratio, which will not be modified in the
later steps.

For assigning, we distribute the target tokens
in Z, regardless of whether the token will appear
in the final input. Formally, we use an assignment
sequence {ai}1≤i≤N indicating that zai := yi. All
other positions in Z are masked. For obtaining
the sequence {ai}, a straightforward strategy is
to use uniform assignment, such that every two
consecutive target tokens are separated by a con-
stant number of [Mask]. In the pilot experiment,
we find it better to use the strategy of glancing
training (Huang et al., 2022c; Qian et al., 2021),


which first predicts a DAG with a fully masked
Z and then assigns the target tokens on the posi-
tions that form the most probable path of the DAG.
For re-masking, we determine the tokens fi-
nally appearing in Z and then mask the remaining
ones. Formally, we randomly sample a fixed pro-
portion of tokens to form a set R, where zai := yi
if i ∈ R, and all the other tokens in Z are masked.
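A toy rendering of the three Stage-2 steps is sketched below with the simpler uniform assignment (the paper's preferred variant instead places tokens on the most probable DAG path); the 10% keep ratio mirrors the 90% re-masking rate, and span-ID handling is omitted for brevity.

```python
import random

MASK = "[MASK]"

def make_decoder_input(target, lam_range=(4, 8), keep_ratio=0.10):
    """Toy Stage 2: up-sample, assign, re-mask (uniform-assignment variant)."""
    lam = random.uniform(*lam_range)
    L = int(lam * len(target))                         # up-sampling
    Z = [MASK] * L
    # assigning: spread the N target tokens evenly over the L positions
    assign = [int((i + 0.5) * L / len(target)) for i in range(len(target))]
    # re-masking: only a small fraction of the assigned tokens stays visible
    visible = set(random.sample(range(len(target)),
                                int(keep_ratio * len(target))))
    for i, pos in enumerate(assign):
        if i in visible:
            Z[pos] = target[i]
    return Z, assign
```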

Training Objective Our objective is to recon-
struct the unseen target fragments according to
the given context, similar to masked language
modelling (Devlin et al., 2019) but with a sig-
nificant difference. Instead of using a token-level
loss that forces each masked position to predict
the corresponding target token, we obtain the sum
of alignment-based losses that aligns each unseen
target fragment to the feature sequence predicted
on the corresponding masked segments. Note
that the feature sequence is longer than the tar-
get fragment, which brings a larger DAG with a
higher capacity to capture multiple possible in-
filling results.

Specifically, the decoder input consists of sev-
eral consecutive masked segments segmented by
the observed token or span IDs. Each masked
segment will produce a feature sequence Vai:aj ,
which is then aligned to the corresponding target
fragments Yi:j for the DAT loss. The final loss
is equal to the sum of the DAT loss of all frag-
ments. Formally,

$$V = \left[v_1, \cdots, v_{|Z|}\right] = f_\theta(X, Z),$$
$$\mathcal{L} = \sum_{(i,j) \in \mathrm{frag}(R)} \mathcal{L}_{\mathrm{DAT}}(V_{a_i:a_j}, Y_{i:j}),$$

where frag(R) consists of all pairs (i, j) repre-
senting the start and end position of unseen frag-
ments, and LDAT is defined in Equation (1).
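The fragment-wise summation can be written down in a few lines, as in the sketch below; `loss_fn` stands for any alignment-based loss (e.g., the DAT loss above or CTC), and the segment boundaries follow the assigned positions of each fragment's first and last tokens, as in the formula. This is a simplified illustration rather than the released training code.

```python
def dsti_loss(features, target, assign, observed, loss_fn):
    """Sum alignment-based losses over unseen target fragments (a sketch).

    features: [|Z|, H] decoder outputs, one per decoder-input position
    target:   [N]      full target token ids
    assign:   list[int] position in Z assigned to each target token
    observed: set[int]  indices of target tokens left visible in Z
    loss_fn:  callable  takes (segment_features, target_fragment)
    """
    total, start = 0.0, 0
    boundaries = sorted(observed) + [len(target)]
    for end in boundaries:
        if end > start:                     # Y_{start:end} is an unseen fragment
            seg_lo = assign[start]          # a_i: position of the first token
            seg_hi = assign[end - 1] + 1    # one past a_j, the last token
            total = total + loss_fn(features[seg_lo:seg_hi], target[start:end])
        start = end + 1                     # skip the observed token
    return total
```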

Notably, our idea can be applied to other
alignment-based NAR losses, such as CTC loss
(Graves et al., 2006), which also trains the model
by aligning the target fragment to a longer pre-
dicted feature sequence. We verify the generality
of DSTI with various loss functions in Section 5.4.

4.2 Fine-tuning and Inference

We generally follow the original training method
(Huang et al., 2022c) to fine-tune our PreDAT on
the downstream datasets while introducing some
improvements: We add a target length predictor
for better adaption to tasks with various ratios of


Figure 3: Illustrations of (a) vanilla decoding and (b)
overlapped decoding. Overlapped decoding reduces the
GPU idle time, leading to higher decoding throughput.

input and target lengths, and further propose a
trick to improve the decoding throughput.

Length Prediction The original DAT simply
sets the feature length L to be a constant multiple
of the input length, which in most cases of ma-
chine translation, satisfies the constraint that the
feature sequence should be longer than the target
length. However, the targets in our downstream
tasks can be arbitrarily long, making this strategy
improper.

To better apply PreDAT to various genera-
tion tasks without the constraint of input and tar-
get length, we introduce a length predictor during
fine-tuning and inference. Specifically, in fine-
tuning, we use a similar up-sampling strategy as
the pre-training to obtain the decoder input length,
i.e., λ times the target length. Then we adopt a
length predictor on the top of the encoder and train
it to predict the target length as a classification.
In inference, we obtain the predicted length from
the predictor, and then multiply it with ˆλ to ob-
tain the decoder input length, where ˆλ is a hyper-
parameter tuned on the validation set that controls
the length of generated sentences.
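A minimal sketch of such a length head is given below; the mean-pooling over encoder states and the maximum length are assumptions made for the example, since the paper only specifies that the head sits on top of the encoder and treats length prediction as classification.

```python
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    """Sketch of a length head: classify the target length from mean-pooled
    encoder states, then scale by a tuned ratio at inference time."""
    def __init__(self, hidden=768, max_len=512):
        super().__init__()
        self.proj = nn.Linear(hidden, max_len)

    def forward(self, enc_out, enc_mask):
        # enc_out: [B, M, H]; enc_mask: [B, M] with 1 for real tokens, 0 for pad
        denom = enc_mask.sum(1, keepdim=True).clamp(min=1)
        pooled = (enc_out * enc_mask.unsqueeze(-1)).sum(1) / denom
        return self.proj(pooled)              # [B, max_len] length logits

def decoder_input_length(length_logits, lam_hat=6.0):
    pred_len = length_logits.argmax(-1)       # predicted target length
    return (lam_hat * pred_len.float()).long()  # feature-sequence length
```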

Overlapped Decoding PreDAT predicts the
DAG in parallel on GPU, and then executes a de-
coding algorithm (e.g., beam search; Huang et al.,
2022c) on CPUs to obtain the most likely output
from the DAG. As shown in Figure 3, we overlap
the GPU and CPU execution, which reduces the
GPU idle time and utilizes multiple CPU cores to
parallelly process the batches, leading to remark-
ably higher decoding throughput while not affect-
ing the latency.
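The overlap can be reproduced with a simple producer–consumer pattern, as in the sketch below: the GPU keeps predicting DAGs batch by batch while a CPU thread pool runs beam search on the DAGs already produced. The function names are placeholders, not APIs from the released code.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_overlapped(batches, predict_dag_on_gpu, beam_search_on_cpu,
                        workers=8):
    """Overlap GPU DAG prediction with CPU beam-search decoding (a sketch)."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = []
        for batch in batches:
            dag = predict_dag_on_gpu(batch)                       # GPU-bound
            pending.append(pool.submit(beam_search_on_cpu, dag))  # CPU-bound
        for fut in pending:
            results.append(fut.result())
    return results
```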

5 Experiments

5.1 Implementation Details

Model Configurations Our PreDAT is based on
a 6-layer encoder-decoder Transformer (Vaswani


et al., 2017) with a hidden size of 768, following
the base version of AR and NAR baselines.

Pre-Training We pretrain PreDAT with DSTI
on 16GB English corpus from Wikipedia and
BookCorpus (Zhu et al., 2015), with the vocab-
ulary of bert-base-uncased. In stage 1, we take
a sequence with about 600 tokens and mask 6
equal-length spans that account for 15% tokens.
In stage 2, we sample λ uniformly from [4, 8]
and mask 90% tokens in the re-masking step. Un-
less otherwise specified, we pre-train PreDAT for
500k update steps with a batch size of 256 samples
and use Adam optimizer (Kingma and Ba, 2015)
with a learning rate of 2e-4. We utilize Light-
Seq (Wang et al., 2022) to accelerate the training
(not used in inference), and the pre-training lasts
approximately 72 hours on 8 Nvidia A100-40G
GPUs.

Fine-Tuning We fine-tune PreDAT on down-
stream datasets with the DAT loss and glancing
training (Qian et al., 2021; Huang et al., 2022c)
without knowledge distillation. According to the
average sample lengths of each dataset, each
mini-batch has approximately 4k target tokens
for PersonaChat, XSUM, SQuAD1.1, and 8k tar-
get tokens for ROCStory and Quora. We use the
early-stop trick according to the performance on
the validation set. It usually takes less than 60k
steps on SQuAD1.1, Quora, and PersonaChat, and
100k steps on XSUM and ROCStory. We tune the
glancing ratio from {0.3, 0.5}, and learning rate
from {1e-5, 2e-5, 5e-5, 1e-4, 2e-4}. We evaluate
the model every 5k steps on the validation set
and obtain the final model by averaging the five
best checkpoints.

Inference We utilize lookahead decoding (de-
fault unless otherwise specified) and beamsearch
(Huang et al., 2022c) to decode a sequence from
predicted DAG. We use a beam size of 200 and
incorporate a 5-gram LM in the beam search. For
open-ended generation, we further employ Nu-
cleus sampling (Holtzman et al., 2020).

For these three decoding strategies, we prevent
any repeated tri-gram in expanding the decod-
ing path on the DAG, which is inspired by a
similar strategy used in autoregressive decod-
ing (Paulus et al., 2018). Besides, we also
prevent consecutive uni-gram and bi-gram rep-
etitions, which are common errors in PreDAT’s
outputs.

Dataset        Task                    # Samples       Length
SQuAD1.1♠      Question Generation     75k/10k/12k     149.4/11.5
XSUM♠          Summarization           204k/11k/11k    358.5/21.2
Quora♥         Paraphrase Generation   138k/5k/4k      11.5/11.5
PersonaChat♠   Dialog Generation       122k/15k/14k    120.8/11.8
ROCStory♣      Story Generation        88k/5k/5k       9.2/41.6
Table 1: Dataset statistics. # Samples shows the
number of samples in training/validation/test set.
Length shows the average length of input/target.
We use the processed datasets and evaluation
metrics from ♠ Liu et al. (2021), ♥ Jiang et al.
(2021), ♣ Guan et al. (2020).

5.2 Experiment Settings

Datasets and Metrics We utilize five data-
sets: SQuAD1.1 (Rajpurkar et al., 2016), XSUM
(Narayan et al., 2018), Quora2, PersonaChat (Zhang
et al., 2018), and ROCStory (Mostafazadeh et al.,
2016). We use the processed datasets and the
evaluation metrics from previous work, as shown
in Table 1. Note that we use corpus BLEU
(Papineni et al., 2002) on all datasets because
the sentence BLEU may unreasonably prefer very
long outputs due to the smoothing method.3

To evaluate the decoding speedup, we use two
metrics: Latency measures the average time of
processing a single sample, and throughput mea-
sures the average speed in processing the whole
test set, where we tune the batch size to maximize
the throughput. All models except MIST are im-
plemented with Fairseq (Ott et al., 2019) + Apex,
where MIST is implemented with HuggingFace’s
Transformers (Wolf et al., 2019). For the beam
search algorithm on DAG, we adopt the C++ im-
plementation provided by Huang et al. (2022c).
The C++ optimization only affects the extra de-
coding step on the CPU, but does not speedup the
transformer model. All results of speed are eval-
uated on a workstation with an Nvidia V100-32G
GPU and 2 Intel Xeon Gold 6226R CPUs with
32 cores.

Baselines Our baselines include autoregressive
Transformer (Vaswani et al., 2017), pre-trained
AR models (MASS, Song et al., 2019; BART,
Lewis et al., 2020; ProphetNet, Qi et al.,
2020), non-pretrained NAR models (Vanilla NAT,

2https://quoradata.quora.com/First-Quora

-Dataset-Release-Question-Pairs.

3Some previous work (Liu et al., 2021) utilize nltk’s

sentence BLEU with SmoothingFunction().method7.


Model               Pre-trained?  SQuAD1.1               XSUM                   Quora                  Avg.    Latency      Throughput
                                  R-L    B-4    MTR      R-1    R-2    R-L      B-1    B-4    MTR              ms/sample    samples/s

Autoregressive Text Generation Models
Transformer         N             29.43  4.61   9.86     30.66  10.80  24.48    58.57  30.14  31.79    25.59   –            –
MASS                Y             49.48  20.16  24.41    39.70  17.24  31.91    60.56  32.39  32.92    34.31   353 (1.0X)   12 (1.0X)
BART                Y             42.55  17.06  23.19    38.79  16.16  30.61    61.56  31.57  32.42    32.66
ProphetNet          Y             48.00  19.58  23.94    39.89  17.12  32.07    62.59  33.80  33.95    34.55

Non-autoregressive Text Generation Models
Vanilla NAT         N             31.51  2.46   8.86     24.04  3.88   20.32    39.85  9.33   18.90    17.68   24 (14.7X)   267 (21.5X)
GLAT+CTC            N             30.31  3.21   10.21    31.34  9.06   24.68    58.96  26.67  30.55    25.00   24 (14.7X)   265 (21.4X)
DSLP+CTC            N             28.70  3.00   10.59    28.75  7.35   22.73    61.12  29.70  32.37    24.92   28 (12.8X)   334 (27.0X)
LatentGLAT          N             28.28  2.38   10.43    28.44  7.00   22.66    59.78  28.30  31.26    24.28
BANG                Y             44.07  12.75  18.99    32.59  8.98   27.41    55.18  24.97  25.99    27.88   18 (19.6X)   360 (29.0X)
MIST                Y             47.13  16.00  21.10    34.63  11.29  28.70    59.65  29.00  31.56    31.01   22 (15.9X)   159 (12.8X)

PreDAT (Ours)       Y             49.78  21.74  24.58    38.80  16.07  31.78    62.63  32.59  33.37    34.59   26 (13.8X)   278 (22.5X)
  w/ BeamSearch     Y             50.41  22.66  25.11    39.79  17.38  32.71    62.62  33.18  33.52    35.26   63 (5.7X)    214 (17.3X)
  w/o Pre-training  N             30.11  3.30   10.32    32.56  11.17  26.21    59.82  28.17  31.10    25.86   25 (14.3X)   272 (21.9X)

Table 2: Performance on closed-ended text generation datasets. Bold and underlined values indicate
the best methods in NAR models and all models, respectively. Latency measures the average time of
processing samples with a batch size of 1, and Throughput measures the speed of processing samples
with a large batch size (tuned to maximize the throughput), which are evaluated on the test set of XSUM.
The metrics include ROUGE-1/2/L (R-1/2/L), BLEU-1/4 (B-1/4), and METEOR (MTR).

Gu et al., 2018); GLAT+CTC, Qian et al., 2021;
DSLP+CTC, Huang et al., 2022a; LatentGLAT,
Bao et al., 2022), and pre-trained NAR mod-
els (BANG, Qi et al., 2021; MIST, Jiang et al.,
2021). All these baselines have the same num-
ber of layers and hidden sizes as our PreDAT,
except that LatentGLAT utilizes a 4-layer latent
predictor and a 4-layer decoder based on the origi-
nal implementation. Note that CTC-based models
also require an up-sampling strategy, so we add
a length predictor following the description of
Section 4.2. Their up-sampling ratio is sampled
from [1.5, 2] in training and tuned on the vali-
dation set in inference. For AR baselines, unless
otherwise specified, we use BeamSearch with a
beam size of 5 and the tri-gram repetition preven-
tion trick (Paulus et al., 2018), and tune the length
penalty on the validation set. For NAR baselines,
we use greedy decoding and further remove con-
secutive repeated tokens after generation (Le et al.,
2019). Some results are collected from Liu et al.
(2021); Qi et al. (2021); Jiang et al. (2021).

5.3 Automatic Evaluation

Closed-Ended Text Generation We first test
PreDAT on three closed-ended text generation
tasks, including question generation, summariza-
tion, and paraphrase generation. Closed-ended text

generation tasks usually have strict semantic con-
straints on the outputs, aiming to test the model’s
ability to extract and organize information.

As shown in Table 2, PreDAT achieves sur-
prisingly good results in both speed and quality.
We highlight our advantages as follows:

• PreDAT remarkably improves the quality
of NAR generation. Compared with previ-
ous pretrained NAR models, PreDAT brings
large improvement (+4.2 scores on average)
due to our DSTI pre-training and the DAT
architecture. Moreover, PreDAT even out-
performs the best AR baseline by 0.7 scores.
To our knowledge, it is the first time that
an NAR model achieves comparable and
even stronger performance than AR mod-
els in n-gram-based metrics on a wide range
of text generation tasks.

• PreDAT is highly efficient. Although our
model is slightly slower than previous NAR
models due to a longer sequence prediction,
it still achieves a speedup of 5∼14 times in
latency and 17∼23 times in throughput com-
pared with AR generation. It verifies that
PreDAT can largely reduce computing con-
sumption in decoding, showing its potential
for real-time applications.


Model                Pre-trained?  PersonaChat                   ROCStory               Latency      Throughput
                                   B-1    B-2    D-1   D-2       B-1    B-2    D-4       ms/sample    samples/s

Autoregressive Text Generation Models
Transformer          N             18.37  8.07   1.43  10.04     30.68  14.67  35.18     168 (1.1X)   28 (1.1X)
MASS                 Y             26.82  14.70  1.20  7.58      35.02  16.96  51.20     180 (1.0X)   25 (1.0X)
  w/ Sampling        Y             23.90  12.13  1.85  13.09     32.56  14.97  73.72     130 (1.4X)   77 (3.0X)
BART                 Y             26.84  14.69  1.39  8.85      35.45  17.22  49.03     199 (0.9X)   23 (0.9X)
  w/ Sampling        Y             24.00  12.31  1.97  14.50     33.95  15.28  73.62     143 (1.3X)   69 (2.7X)

Non-autoregressive Text Generation Models
Vanilla NAT          N             18.33  6.37   0.43  0.96      28.44  11.29  89.13     23 (7.8X)    703 (27.7X)
BANG                 Y             17.38  7.33   2.12  23.02     29.38  11.78  92.10     18 (10.1X)   649 (25.6X)
MIST                 Y             18.55  8.86   0.54  2.56      23.57  9.09   8.15      25 (7.3X)    330 (13.0X)

PreDAT (Ours)        Y             27.06  15.05  1.33  8.31      34.11  17.17  57.50     24 (7.6X)    507 (20.0X)
  w/ Sampling        Y             24.23  12.29  1.77  15.62     32.52  15.61  74.37     24 (7.4X)    514 (20.3X)
  w/ BeamSearch      Y             27.31  15.39  1.15  6.30      34.61  17.84  50.55     48 (3.7X)    318 (12.6X)
  w/o Pre-training   N             21.96  10.38  0.52  3.29      31.81  15.41  52.97     25 (7.2X)    562 (22.2X)

Table 3: Performance on open-ended text generation datasets. Latency and Throughput are evaluated
on the test set of PersonaChat. Average scores are not shown because they cannot reflect the trade-off
between quality and diversity. We utilize corpus BLEU on all datasets, whose values may be different
from some previous results utilizing sentence BLEU (Liu et al., 2021). The metrics include BLEU-1/2
(B-1/2) and Distinct-1/2/4 (D-1/2/4).

Open-Ended Text Generation We further test
PreDAT on two open-ended text generation tasks,
dialog generation and story generation. Open-
ended text generation tasks encourage the model
to produce novel and diverse outputs, where sam-
pling decoding methods are commonly adopted to
promote generation diversity.

Therefore, in addition to lookahead decoding
and beamsearch, we also introduce Nucleus sam-
pling (Holtzman et al., 2020). Specifically, we set
pag = 0.9 and the temperature τ = 1 for PreDAT.
For MASS and BART, we also use p = 0.9,
but τ = 0.8 on PersonaChat and τ = 0.7 on
ROCStory to achieve similar diversity as PreDAT.
We present the evaluation results in Table 3
and the trade-off of quality and diversity by tun-
ing the temperature in Figure 4. Generally, the
comparison of quality metrics is similar to closed-
ended generation: PreDAT largely outperforms
NAR baselines and achieves comparable BLEU
scores to AR models. Moreover, we highlight two
findings:

• PreDAT generates plausible outputs in
open-ended tasks while previous NAR mod-
els cannot. Open-ended generation tasks
usually have targets with diverse expressions,
which worsens the multi-modality problem

Figure 4: Trade-off curves of quality and diversity
on PersonaChat. All models use Nucleus sampling
with p = 0.9 and temperature τ from {0, 0.2, 0.4, 0.6,
0.8, 1}.

and seriously degrades the NAR generation
quality. Specifically, MIST shows very low
diversity because it generates numerous repe-
titions, and BANG shows very high diversity
because it introduces many incomprehensible
n-grams. In contrast, PreDAT has a rea-
sonable quality-diversity trade-off, showing
its ability to address the serious challenges
brought by the multi-modality problem.

• PreDAT achieves a flexible quality and diver-
sity trade-off. As shown in Figure 4, Pre-
DAT is slightly better than two AR baselines


Figure 5: Average performance of previous baselines
(Gray) and NAR models pre-trained by our proposed
task with different loss functions (Red). The shown
scores are the average of automatic metrics on XSUM.

wrt. the trade-off curves by tuning the
decoding temperature. It demonstrates that
PreDAT can meet the diversity requirement
of open-ended text generation, verifying its
generality in text generation.

Figure 6: Comparisons of pre-training strategies by the
average validation score on SQuAD1.1 and XSUM.
All models are pre-trained for 100k steps for energy
saving. The strategies in our final model are marked
in Red.

5.4 Ablation Study

In this section, we conduct ablation studies to
reveal how our designs contribute to the results.

Loss Function In PreDAT, we utilize the DAT
loss to alleviate the multi-modality problem,
which plays an important role in the pre-training.
Notably, our pre-training task can be combined
with other NAR losses, so we compare the DAT
loss against CTC (Graves et al., 2006; Libovický
and Helcl, 2018) and the token-level cross-entropy
loss (CE).

Specifically, the same loss function is applied
in both pre-training and fine-tuning to avoid dis-
crepancies between the two training stages. For
CTC, we randomly sample the up-sampling ratio
from [1.5, 2]. For CE, we do not use up-sampling
(i.e., λ = 1) because the CE loss requires an out-
put sequence with the same length as the target.

As shown in Figure 5, we find: (1) It is impor-
tant to incorporate alignment-based NAR losses
in pre-training, where CTC and DAT losses bring
substantial improvements compared with the CE
loss. (2) The NAR model pre-trained with CE still
outperforms previous pre-trained NAR baselines,
verifying the effectiveness of our pre-training
task in preserving sentence-level consistency and
promoting bidirectional dependencies.

Pre-training Strategy Our proposed pre-
training task includes several strategies for con-
structing the training data pair. To evaluate the
effects of these strategies, we design four groups
of comparisons as follows, whose results are
shown in Figure 6.

(a) Stage 1: Encoder Masking. Besides Span
masking, we use two other strategies including
Token masking that independently samples masked
positions (Devlin et al., 2019), and Sequence
masking that masks a single consecutive sequence.
All strategies mask the same ratio of tokens. We
conclude that the masked spans should not be
too short (about 1∼3 tokens in token masking) or
too long (about 90 tokens in sequence masking),
which prevents the NAR model from learning
prediction consistency or make the prediction too
difficult.

(b) Stage 2, Step 1: Up-sample Ratios. We com-
pare the random sampling ratio (4∼8x) against
fixed up-sampling ratios (4x and 8x). We find that
random up-sampling can diversify the DAG struc-
ture, which works as a data augmentation method
and thus benefits the downstream performance.

(c) Stage 2, Step 2: Assignment Strategies. Be-
sides the proposed assignment strategy according
to the path probability (MaxProb), we use Uni-
form and Random assignment that assigns the
target into the decoder input uniformly or ran-
domly. We find the MaxProb assignment can
better determine the lengths of each masked seg-
ment according to the model’s own prediction,
leading to slightly better results than the other
strategies.


Figure 7: Validation performance under various com-
binations of up-sampling strategies in pre-training and
fine-tuning. The shown score is the average of auto-
matic metrics. 4x and 8x indicates fixed up-sampling
ratios, and 4∼8x indicates random ratios sampled from
[4, 8]. All models are pre-trained for only 100k steps.

(d) Stage 2, Step 3: Re-masking Strategies.
Besides the Fixed masking strategy, we also try
Adaptive and Adaptive + Annealing masking
strategies proposed by Qian et al. (2021), where
they adjust the masking ratio by the difficul-
ties of the sample. It shows that these strategies
have similar performance, outperforming the fully
masked decoder input (All Masked), which ver-
ifies the importance of introducing information
in the decoder input for bidirectional dependency
modelling. However, the adaptive masking is less
effective in pre-training, so we use the fixed
masking ratio for simplicity.

Up-sampling Ratio in Fine-tuning As de-
scribed in Section 4.2, we obtain the decoder input
length in fine-tuning by up-sampling. To inves-
tigate how the up-sampling strategies affect per-
rendimiento, we evaluate different combinations of
up-sampling ratios in pre-training and fine-tuning.
As shown in Figure 7, random up-sampling
always benefits the performance in pre-training
and fine-tuning, together bringing an improve-
ment of about 1.2 puntuaciones. It indicates that varying
the DAG size is an important trick in training
PreDAT. Moreover,
the up-sampling ratios in
pre-training and fine-tuning do not need to be
lo mismo, which can be helpful if smaller DAG
sizes are preferred in downstream tasks due to
limited memory budget.

Overlapped Decoding Overlapped decoding
aims to improve the decoding throughput by over-
lapping the execution of DAG prediction and
beam search decoding. To verify its effectiveness,
we evaluate the speedup with various batch sizes
on XSUM.

Figure 8: Throughput speedups with the vanilla and
overlapped decoding on the test set of XSUM.

As shown in Figure 8, our overlapped decoding
brings a 17.3x speedup with a batch size of 32,
largely outperforming the vanilla one. We also
note that throughput starts to decline as batch
size increases, possibly because the introduced
paddings increase the consumption of invalid
computations.

5.5 Analysis

In this section, we investigate the reasons why
PreDAT achieves better automatic scores than
pre-trained AR baselines, which may provide
some insights for future NAR generation studies.

PreDAT Alleviates Error Accumulation. Er-
ror accumulation is a major concern of auto-
regressive generation (Bengio et al., 2015;
Ranzato et al., 2016; Arora et al., 2022), where
a prediction error may be propagated into later
decoding steps, leading to low-quality generated
sentences. In contrast, NAR models naturally
avoid the problem due to their unbiased prediction
order.

To verify that PreDAT has advantages in tack-
ling error accumulation, we compare PreDAT
against two AR models with different decoding
orders, a left-to-right (L2R) one and a right-to-
left (R2L) one. Specifically, we fine-tune MASS
on the downstream datasets using the two gener-
ation orders. We find that MASS still performs
well in right-to-left decoding, with a performance
drop of less than 0.5 scores. Then we calculate
the average token prediction accuracy bucketed
by the relative position, formally defined as


$$\mathrm{Acc}(D) = \mathrm{Average}\Big(\mathbb{I}\big(\hat{Y}^{(i)}_j \in Y^{(i)}\big)\Big), \quad \text{for } 1 \le i \le N,\ 1 \le j \le |\hat{Y}^{(i)}|,\ \frac{j}{|\hat{Y}^{(i)}|+1} \in D,$$

where Acc(D) is the average prediction accu-
racy on the interval D, Y (i) is the i-th sample in


Dataset        Model     Knowledge              PARENT-T
                         P      R      F1       P      R      F1

XSUM           MASS      35.1   9.7    14.7     35.1   8.5    13.1
               PreDAT    36.3   9.9    14.9     36.4   8.6    13.3

PersonaChat    MASS      19.6   17.2   17.8     13.2   11.3   11.5
               PreDAT    21.1   17.7   18.5     13.8   11.0   11.5

Figure 9: Normalized prediction accuracy ΔAcc(D)
bucketed by relative positions. The shaded area is 95%
confidence interval by bootstrap sampling (Efron and
Tibshirani, 1994). L2R: left-to-right, R2L: right-to-left.

Table 4: Relevance to the input on XSUM and
PersonaChat. We utilize two automatic metrics,
Knowledge F1 and PARENT-T. PAG: Precision,
R: Recall.

the test set, Ŷ(i) is the i-th model output, and
j indicates the position. Moreover, since predic-
tion difficulties vary with the positions (e.g., the
last token is always punctuation), we utilize a
normalized accuracy:

$$\Delta\mathrm{Acc}(D) = \mathrm{Acc}(D) - \frac{\mathrm{Acc}_{\mathrm{L2R}}(D) + \mathrm{Acc}_{\mathrm{R2L}}(D)}{2},$$

where AccL2R(D) and AccR2L(D) indicate the
prediction accuracy of L2R and R2L MASS.
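A small sketch of how such bucketed accuracies can be computed is given below, reading the indicator in the formula as bag-of-words membership of each predicted token in the reference; the bucket count is an assumption for the example.

```python
def bucketed_accuracy(outputs, references, n_buckets=10):
    """Acc(D) sketch: bucket token-level accuracy by the relative position
    j / (|Y_hat| + 1); a predicted token counts as correct if it appears
    anywhere in the corresponding reference."""
    hits = [[] for _ in range(n_buckets)]
    for hyp, ref in zip(outputs, references):
        ref_set = set(ref)
        for j, tok in enumerate(hyp, start=1):
            rel = j / (len(hyp) + 1)
            b = min(int(rel * n_buckets), n_buckets - 1)
            hits[b].append(1.0 if tok in ref_set else 0.0)
    return [sum(h) / len(h) if h else float("nan") for h in hits]
```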

As shown in Figure 9, we find that MASS has
a strong tendency to predict earlier generated to-
kens more accurately than later generated ones,
which applies to both left-to-right and right-to-left
models. In contrast, our PreDAT shows no sig-
nificant preference for any positions because it
predicts all tokens simultaneously, which reveals
the advantages of unbiased prediction order in
NAR generation models.

PreDAT Improves the Relevance to the Input.
Previous studies empirically found that AR gen-
erated texts may lose relevance to the input sen-
tences, which is also known as hallucination
(Maynez et al., 2020; Ji et al., 2022) or off-prompt
errors (Dou et al., 2022). One explanation is that
AR models may be distracted by its generated
prefixes, which can be avoided in NAR genera-
tion (Huang et al., 2021).

To verify our hypothesis, we introduce two met-
rics to evaluate the relevance to inputs: Knowl-
edge F1 (Shuster et al., 2021) and PARENT-T
(Dhingra et al., 2019; Wang et al., 2020). Knowl-
edge F1 measures the unigram F1 between gen-
erated sentences and the input knowledge, while
PARENT-T measures n-gram entailment. Both
metrics require the extraction of knowledge pieces
that should appear in the generated sentences. For

simplicity, we take each sentence in the passage
(of XSUM) or the persona profile (of Persona-
Chat) as a piece of knowledge and further filter
out the stop words.
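For concreteness, a simplified unigram Knowledge-F1 can be computed as below; Shuster et al. (2021) define the metric, and this sketch assumes tokenization and stop-word filtering have already been applied.

```python
from collections import Counter

def knowledge_f1(generated_tokens, knowledge_tokens):
    """Unigram precision/recall/F1 of the generated text against the
    (stop-word-filtered) knowledge pieces (a simplified rendering)."""
    gen, know = Counter(generated_tokens), Counter(knowledge_tokens)
    overlap = sum((gen & know).values())   # clipped unigram overlap
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / sum(gen.values())
    r = overlap / sum(know.values())
    return p, r, 2 * p * r / (p + r)
```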

As shown in Table 4, PreDAT achieves bet-
ter precision on both datasets in using the input
knowledge compared with MASS (+1.2 on av-
erage). It indicates that PreDAT is less likely to
produce irrelevant keywords, justifying our hy-
pothesis that the NAR model can better con-
centrate on the input. However, we also notice
that PreDAT and MASS have comparable perfor-
mance on recall, showing that it is still challenging
to cover more keywords.

5.6 Manual Evaluation

Although PreDAT shows surprising performance
in automatic evaluation, it is still questionable
whether these automatic metrics are reliable when
comparing AR and NAR models. In this sec-
tion, we conduct a manual evaluation that com-
pares PreDAT against pre-trained AR and NAR
baselines.

Settings We compare PreDAT against three
baselines, two NAR models (BANG and MIST)
and an AR model (MASS). We randomly se-
lected 150 samples in SQuAD1.1, accounting for
600 generated sentences for the four models. Para
each sample, three annotators were asked to rank
the outputs from two dimensions: grammatical-
ity measures whether the output contains any
grammatical errors, and appropriateness mea-
sures whether the output is reasonable for the
given context.

Results The results are shown in Table 5, where
we highlight two findings: (1) PreDAT achieves
a significant quality improvement over previous


                 Grammaticality                    Appropriateness
                 Win      Tie    Lose     κ        Win      Tie    Lose   κ

Comparison against Non-autoregressive Models
vs. BANG         75.3**   12.0   12.7     0.66     69.8**   17.3   12.9   0.59
vs. MIST         66.7**   18.0   15.3     0.50     57.1**   26.0   16.9   0.47

Comparison against Autoregressive Models
vs. MASS         15.1     47.8   37.1**   0.32     32.2     36.7   31.1   0.46

Table 5: Manual evaluation results on SQuAD1.1.
Fleiss' κ is shown for inter-rater reliability (all
are fair agreement or above). * and ** indi-
cate p-value < 0.05 and 0.01 in the sign test, respectively.

NAR models, where annotators highly prefer Pre-
DAT (with Win% + Tie% > 83%). (2) There is still
a quality gap between PreDAT and the AR model.
Although PreDAT achieves higher word overlap
in automatic evaluation, it exhibits poorer gram-
maticality in human ratings. A possible reason is
that PreDAT preserves better relevance to the in-
puts, leading to the higher word overlap, however,
is still weaker than AR models in preserving the
consistency among generated tokens.

Typical Errors and Case Study To better
understand how PreDAT makes errors, we inves-
tigate the typical errors in the generated outputs.
Specifically, we randomly chose 100 samples
from SQuAD1.1, collected the outputs of the four
models, and then manually annotated the errors
in these outputs.

Figure 10 presents the proportions of error
types. In terms of grammaticality, we find that
PreDAT addresses the major problems in previous
NAR models, such as incomprehensible outputs
and repetitions, well. However, there are still some
word errors, which affect only a small fragment
of the sentence but are very obvious to human
readers, leading to the unsatisfying result. We be-
lieve the problem can be alleviated by post-editing
or iterative refinement, which we leave for fu-
ture work. In terms of appropriateness, PreDAT
has comparable performance to MASS in error
distributions, showing its ability to extract and
organize information to form appropriate outputs.
To support the above discussions, we show
some output cases in Table 6. We find that pre-
vious NAR models usually generate low-quality
texts, whereas PreDAT achieves significant im-
provement. Moreover, PreDAT maintains a strong

Figure 10: Proportion of samples with different error
types in terms of grammaticality and appropriate-
ness on SQuAD1.1. Word Error: containing less than
two wrong/missing/redundant tokens. Sentence Mix-
ing: can be split into two (nearly) grammatically
correct sentences with a shared fragment. Answer Un-
matched: the generated question is not matched with
the given answer. Wrong Information: using incorrect
or unmentioned information in the passage.

relevance to the inputs, yet it can occasionally in-
troduce grammatical errors. In contrast, MASS
generates plausible outputs, but they may not
always be faithful. This observation highlights
the distinctive behaviors between AR and NAR
models.

6 Limitations

Although PreDAT achieves a significant advance-
ment in NAR generation, it still faces the following
limitations:

(1) Although PreDAT achieves superior
performance in automatic evaluation, it still signi-
ficantly underperforms AR models in grammati-
cality according to human evaluation (as discussed
in Section 5.6). This inconsistency can be at-
tributed to the different biases of AR and NAR
models: AR models tend to generate fluent
outputs but may sacrifice relevance to the input,
while NAR models prioritize relevance but may
incur grammatical errors. It is important to take
the behavior into consideration when applying
PreDAT to real-world applications.

(2) PreDAT may struggle with capturing long-
range coherence, because NAR models are in-
herently weak in modeling token dependencies,
and PreDAT is pre-trained only on predicting 15-
token-long fragments. Notably, our experiments


SQuAD1.1

Passage: (74 words omitted) . . . JFK and Newark Liberty were the busiest and fourth busiest U.S. gateways for
  international air passengers, respectively, in 2012 . . . (72 words omitted)
Answer: Newark Liberty International Airport
BANG: what airport in busiest airport in the u .
MIST: what is john f . kennedy international airport john f busiest international airport and laguardia ?
MASS: what is the name of the busiest airport in new york ?
PreDAT (Ours): what is the name of the fourth busiest airport for international air passengers in 2012 ?

Passage: (102 words omitted) . . . The FDIC guarantees the funds of all insured accounts up to US $ 100, 000 . . . (72 words omitted)
Answer: US $ 100, 000
BANG: what much the deposits of deposits of allmac deposits ?
MIST: what is the funds of all insured ins allmac accounts to ?
MASS: how much does the fdic guarantee the funds of all insured accounts ?
PreDAT (Ours): how much is the fdic guarantee the funds of all insured accounts ?

Passage: When one Republican presidential candidate for the 2016 election ridiculed the liberalism of "New York
  values" in January 2016, Donald Trump, leading in the polls, vigorously defended his city . . . (68 words omitted)
Answer: Donald Trump
BANG: who did the republican the against new values » in
MIST: who was the leader the » new york in 2016 ?
MASS: who was the republican presidential candidate for 2016 ?
PreDAT (Ours): who led the polls in 2016 ?

PersonaChat

Persona: (9 words omitted) . . . I like to listen to country music . . . (24 words omitted)
Dialog History:
  A: Hello I like to travel.
  B: Hello, how are you tonight? I do too and love to cook.
  A: I would love to see europe.
BANG: i would like to too but i am a to .
MIST: what do you do for a living . for a living ?
MASS: i have never been to europe , but i have always wanted to go to australia .
PreDAT (Ours): i would love to go to europe . i am listening to country music .

Persona: I am an eccentric hair stylist for dogs . . . (27 words omitted)
Dialog History: (24 tokens omitted) …
  A: I am doing wonderful, now that I avoided the mangoes. I am allergic.
  B: Oh sorry to hear that I like going out with my friends.
BANG: i do you like.
MIST: what do you do for a living for a living ?
MASS: do you have any pets ? i have a dog .
PreDAT (Ours): what do you like to do with fun ? i am a hair stylist .

Persona: . . . (43 words omitted)
Dialog History: (131 tokens omitted) …
  B: I bet you can learn a lot studying ice, must be cold though.
  A: It is. Some people freeze to death.
  B: Yikes, too cold for me. i will stay home with my pets!
BANG: i do you do any
MIST: what do you do . pets . how . you ?
MASS: do you have any hobbies besides music ?
PreDAT (Ours): what kind of pets do you do ?
Table 6: Cases of model outputs on SQuAD1.1 and PersonaChat. Grammatical errors are marked in red.
The phrases that are faithful to the input are marked in blue, whereas the unfaithful ones are marked in
brown. All generated sentences are in lowercase.

are conducted on relatively short text generation
(whose length statistics are shown in Table 1), and
the performance on longer text generation tasks
requires further investigation.

(3) Compared with AR models, PreDAT re-
quires more GPU memory during inference and
takes more time in fine-tuning (typically 2∼4
times in our experiments). This is because Pre-

DAT’s decoder has to process a much longer
sequence.

7 Conclusion

In this paper, we propose a pre-training task to
promote sentence-level consistency and bidirec-
tional dependencies for NAR generation. We


demonstrate that combining the state-of-the-art
NAR models with appropriate pre-training can
lead to efficient and high-quality text genera-
tion on a wide range of tasks, where our PreDAT
largely outperforms previous NAR pre-trained
models in generation quality. We further show
that, compared with AR models, PreDAT al-
leviates error accumulation and enhances rele-
vance to inputs, but still introduces non-negligible
grammatical problems, thereby providing new in-
sights into the strengths and weaknesses of NAR
generación.

Acknowledgments

This paper was supported by the National Science
Foundation for Distinguished Young Scholars
(with grant no. 62125604) and the Guoqiang
Institute of Tsinghua University, with grant no.
2020GQG0005. We are grateful to the action ed-
itor and the anonymous reviewers for their valu-
able suggestions and feedback.

References

Kushal Arora, Layla El Asri, Hareesh Bahuleyan,
and Jackie Chi Kit Cheung. 2022. Why
exposure bias matters: An imitation learning
perspective of error accumulation in language
generation. In Findings of the Association for
Computational Linguistics: ACL 2022, Dublin,
Ireland, May 22–27, 2022, pages 700–710.
https://doi.org/10.18653/v1/2022
.findings-acl.58

Yu Bao, Hao Zhou, Shujian Huang, Dongqi
Wang, Lihua Qian, Xinyu Dai, Jiajun Chen,
and Lei Li. 2022. latent-GLAT: Glancing at
latent variables for parallel text generation. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin,
Ireland, May 22–27, 2022, pages 8398–8409.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2022
.acl-long.575

Samy Bengio, Oriol Vinyals, Navdeep Jaitly,
and Noam Shazeer. 2015. Scheduled sampling
for sequence prediction with recurrent neural
redes. In Advances in Neural Information

Sistemas de procesamiento 28: Annual Conference on
Neural Information Processing Systems 2015,
December 7–12, 2015, Montréal, Quebec,
Canada, pages 1171–1179.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, Minnesota, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Bhuwan Dhingra, Manaal Faruqui, Ankur P. Parikh, Ming-Wei Chang, Dipanjan Das, and William W. Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 – August 2, 2019, Volume 1: Long Papers, pages 4884–4895. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1483

Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021a. Rejuvenating low-frequency words: Making the most of parallel data in non-autoregressive translation. In Proceedings of the 59th Annual Meeting of the Association for Computational

Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, August 1–6, 2021, pages 3431–3441. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.266

Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, and Zhaopeng Tu. 2021b. Understanding and improving lexical choice in non-autoregressive translation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net.

Liang Ding, Longyue Wang, Shuming Shi, Dacheng Tao, and Zhaopeng Tu. 2022. Redistributing low-frequency words: Making the most of monolingual data in non-autoregressive translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022, pages 2417–2426. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.172

Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah A. Smith, and Yejin Choi. 2022. Is GPT-3 text indistinguishable from human text? Scarecrow: A framework for scrutinizing machine text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022, pages 7250–7274. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.501

Cunxiao Du, Zhaopeng Tu, and Jing Jiang. 2021. Order-agnostic cross entropy for non-autoregressive machine translation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2849–2859. PMLR.

Bradley Efron and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593

Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, and Omer Levy. 2020. Aligned cross entropy for non-autoregressive machine translation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 3515–3523. PMLR.

Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 6111–6120. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1633

Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25–29, 2006, volume 148 of ACM International Conference Proceeding Series, pages 369–376. ACM. https://doi.org/10.1145/1143844.1143891

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Jiatao Gu and Xiang Kong. 2021. Fully non-autoregressive neural machine translation: Tricks of the trade. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1–6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 120–133. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.11

Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019,

December 8–14, 2019, Vancouver, BC, Canada,
pages 11179–11189.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108. https://doi.org/10.1162/tacl_a_00302

Junliang Guo, Linli Xu, and Enhong Chen. 2020a. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 376–385. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.36

Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, and Enhong Chen. 2020b. Incorporating BERT into parallel sequence decoding with adapters. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.

Chenyang Huang, Hao Zhou, Osmar R. Zaïane, Lili Mou, and Lei Li. 2022a. Non-autoregressive translation with layer-wise prediction and deep supervision. The Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022. https://doi.org/10.48550/arXiv.2110.07515

Fei Huang, Zikai Chen, Chen Henry Wu, Qihan Guo, Xiaoyan Zhu, and Minlie Huang. 2021. NAST: A non-autoregressive generator with word alignment for unsupervised text style transfer. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1–6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 1577–1590. https://doi.org/10.18653/v1/2021.findings-acl.138

Fei Huang, Tianhua Tao, Hao Zhou, Lei Li, and Minlie Huang. 2022b. On the learning of non-autoregressive transformers. In International Conference on Machine Learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 9356–9376. PMLR.

Fei Huang, Hao Zhou, Yang Liu, Hang Li, and Minlie Huang. 2022c. Directed acyclic transformer for non-autoregressive machine translation. In International Conference on Machine Learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 9410–9428. PMLR.

Xiao Shi Huang, Felipe Pérez, and Maksims
Volkovs. 2022d. Improving non-autoregressive
translation models without distillation. In The
Tenth International Conference on Learning
Representaciones, ICLR 2022, Virtual Event,
April 25–29, 2022. OpenReview.net.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. CoRR, abs/2202.03629. https://doi.org/10.1145/3571730

Ting Jiang, Shaohan Huang, Zihan Zhang,
Deqing Wang, Fuzhen Zhuang, Furu Wei,
Haizhen Huang, Liangjie Zhang, and Qi Zhang.
2021. Improving non-autoregressive generation
with mixup training. arXiv preprint, abs/2110
.11115v1. https://doi.org/10.48550
/arXiv.2110.11115

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 1317–1327. The Association for Computational Linguistics. http://doi.org/10.18653/v1/D16-1139

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, California, USA, May 7–9, 2015, Conference Track Proceedings. https://doi.org/10.48550/arXiv.1412.6980

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive

neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 1173–1182. Association for Computational Linguistics. http://doi.org/10.18653/v1/D18-1149

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 7871–7880. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703

Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2022. ELMER: A non-autoregressive pre-trained language model for efficient and effective text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7–11, 2022, pages 1044–1058. Association for Computational Linguistics.

Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Hint-based training for non-autoregressive machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 5707–5712. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1573

Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 3016–3021. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1336

Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu, Linjun Shou, Ming Gong, Pengcheng Wang, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, Ruofei Zhang, Winnie Wu, Ming Zhou, and Nan Duan. 2021. GLGE: A new general language generation evaluation benchmark. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1–6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 408–420. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.36

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 1906–1919. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.173

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12–17, 2016, pages 839–849. Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1098

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 1797–1807. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1206

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, Minnesota, USA, June 2–7, 2019, Demonstrations, pages 48–53. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-4009

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6–12, 2002, Philadelphia, Pennsylvania, USA, pages 311–318. ACL. https://doi.org/10.3115/1073083.1073135

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Weizhen Qi, Yeyun Gong, Jian Jiao, Yu Yan, Weizhu Chen, Dayiheng Liu, Kewen Tang, Houqiang Li, Jiusheng Chen, Ruofei Zhang, Ming Zhou, and Nan Duan. 2021. BANG: Bridging autoregressive and non-autoregressive generation with large scale pretraining. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8630–8639. PMLR.

Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2401–2410. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.217

Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2021. Glancing transformer for non-autoregressive neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021 (Volume 1: Long Papers), Virtual Event, August 1–6, 2021, pages 1993–2003. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.155

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 2383–2392. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings. https://doi.org/10.48550/arXiv.1511.06732

Chenze Shao, Xuanfu Wu, and Yang Feng. 2022. One reference is not enough: Diverse distillation with reference selection for non-autoregressive translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, Washington, United States, July 10–15, 2022, pages 3779–3791. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.277

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval

augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16–20 November, 2021, pages 3784–3803. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.320

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936. PMLR.

Yixuan Su, Deng Cai, Yan Wang, David Vandyke, Simon Baker, Piji Li, and Nigel Collier. 2021. Non-autoregressive text generation with pre-trained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19–23, 2021, pages 234–243. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.18

Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhi-Hong Deng. 2019. Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, pages 3011–3020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, California, USA, pages 5998–6008.

Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. 2022. Lightseq2: Accelerated training for transformer-based models on GPUs. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, Texas, USA, November 13–18, 2022, pages 1–14. IEEE. https://doi.org/10.1109/SC41404.2022.00043

Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, and Changyou Chen. 2020. Towards faithful neural table-to-text generation with content-matching constraints. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 1072–1086. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.101

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint, abs/1910.03771v5. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 1: Long Papers, pages 2204–2213. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1205

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7–13, 2015, pages 19–27. IEEE Computer Society. https://doi.org/10.1109/ICCV.2015.11
