On Generative Spoken Language Modeling from Raw Audio

Kushal Lakhotia∗, Eugene Kharitonov∗, Wei-Ning Hsu, Yossi Adi, Adam Polyak,
Benjamin Bolte§, Tu-Anh Nguyen†, Jade Copet, Alexei Baevski,
Abdelrahman Mohamed, Emmanuel Dupoux‡
Facebook AI Research
textlessNLP@fb.com

Abstract

We introduce Generative Spoken Language
Modeling, the task of learning the acoustic and
linguistic characteristics of a language from
raw audio (no text, no labels), and a set of
metrics to automatically evaluate the learned
representations at acoustic and linguistic lev-
els for both encoding and generation. We set
up baseline systems consisting of a discrete
speech encoder (returning pseudo-text units), a
generative language model (trained on pseudo-
text), and a speech decoder (generating a wave-
form from pseudo-text) all trained without
supervision and validate the proposed metrics
with human evaluation. Across 3 speech en-
coders (CPC, wav2vec 2.0, HuBERT), we find
that the number of discrete units (50, 100, or
200) matters in a task-dependent and encoder-
dependent way, and that some combinations
approach text-based systems.1

1 Introduction

An open question for AI research is creating sys-
tems that learn from natural interactions as infants
learn their first language(s): from raw uncurated
data, and without access to text or expert labels
(Dupoux, 2018). Natural Language Processing
(NLP) systems are currently far from this re-
quirement. Even though great progress has been
made in reducing or eliminating the need for ex-
pert labels through self-supervised training obj-
ectives (Brown et al., 2020; Peters et al., 2018;
Radford et al., 2019; Devlin et al., 2019; Liu et al.,
2019b; Dong et al., 2019; Lewis et al., 2020), the
basic units on which these systems are trained are
still textual. Young children learn to speak several

∗Equal contribution. ‡ Also at EHESS. † Also at INRIA.
§Work done while at FAIR.
1Evaluation code and trained models are here: https://github.com/pytorch/fairseq/tree/master/examples/textless_nlp/gslm. Sample examples are here: https://speechbot.github.io/gslm.

years before they can read and write, providing
a proof of principle that language can be learned
without any text. Being able to achieve ‘textless
NLP’ would be good news for the majority of the
world’s languages, which do not have large textual
resources or even a widely used standardized or-
thography (Swiss German, dialectal Arabic, Igbo,
etc.), and which, despite being used by millions of
users, have little chance of being served by cur-
rent text-based technology. It would also be good
for ‘high-resource’ languages, where the oral and
written forms often mismatch in terms of lexicon
and syntax, and where some linguistically rele-
vant signals carried by prosody and intonation are
basically absent from text. While text is still the
dominant form of language present on the web,
a growing number of audio resources such as podcasts,
local radio, social audio apps, and online video
games provide the necessary input data to push
NLP to an audio-based future and thereby expand
the inclusiveness and expressivity of AI systems.
Is it possible to build an entire dialogue sys-
tem from audio inputs only? This is a difficult
challenge, but breakthroughs in unsupervised
representation learning may address part of it. Un-
supervised learning techniques applied to speech
were shown to learn continuous or discrete repre-
sentations that capture speaker invariant phonetic
content (Versteegh et al., 2016; Dunbar et al.,
2020), despite themselves not being phonemic
(Schatz et al., 2021). Recent developments in
self-supervised learning have shown impressive
results as a pretraining technique (van den Oord
et al., 2017; Chung et al., 2019; Hsu et al., 2021),
to the extent that Automatic Speech Recognition
(ASR) on par with the state of the art from two
years back can be built with 5000 times less la-
belled speech (Baevski et al., 2020b), or even
with no labelled speech at all (Baevski et al.,
2021). Of course, ASR still assumes access to text
to learn a language model (LM) and the mapping

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021. https://doi.org/10.1162/tacl_a_00430
Action Editor: Richard Sproat. Submission batch: 6/2021; Revision batch: 8/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

to the learning objectives like perplexity or log
likelihood. Here, such an approach is not directly
applicable even if we rely on discrete pseudo-text
units, since such metrics would depend in an
unknown fashion on their granularity (number, du-
ration, and distribution), making the comparison
of models that use different units infeasible.

Conceptually, generative spoken language mod-
els can be evaluated at two levels, the acoustic and
the language levels, and through two modes of op-
eration, encoding and generation, resulting in 2×2
tasks (see Table 1 and Figure 1). Acoustic Unit
Discovery (encoding at the acoustic level) consists
of representing speech in terms of discrete units,
discarding non-linguistic factors like speaker and
noise. Spoken Language Modeling (encoding at
the language level) consists of learning the prob-
abilities of language patterns. Speech Resynthesis
(generation for acoustic modeling) consists of gen-
erating audio from given acoustic units. This boils
down to repeating in a voice of choice an input lin-
guistic content encoded with speech units. Speech
Generation (generation for language modeling)
consists of generating novel and natural speech
(conditioned on some prompt or not). Compared
to standard text generation, a critical and novel
component of the audio variant is clearly the dis-
covery of units since it conditions all the other
components. This is why we devote our analyses
of model architectures to the unit-to-speech com-
ponent specifically, and leave it for further work
to evaluate how the downstream components can
also be optimized for spoken language generation.
The major contributions of this paper are as
follows: (1) We introduce two novel evaluation
metrics for the generation mode of spoken language
modeling at the acoustic and language levels,
respectively. Our key insight is to use
a generic pretrained ASR system to establish
model-independent assessments of the intelligibility
(acoustic level) and meaningfulness (language
level) of the produced outputs. The ASR system
converts the generated waveform back to text,
enabling us to adapt standard text-based metrics
for these two levels. (2) We validate these met-
rics through comparison with human evaluation.
We show a high degree of concordance between
human and machine evaluations of intelligibility
and meaningfulness of generated audio. (3) We
show that these metrics can be predicted by sim-
pler ones geared to evaluate the encoding mode of
the spoken LM. Zero-shot metrics borrowed from

Figure 1: Setup of the baseline model architecture,
tasks, and metrics.

to the audio units. Here, we study the case where
the LM is directly trained from the audio units
without any recourse to text.

The high-level idea (see Figure 1) is that automatically
discovered discrete units can be used to
encode speech into ‘‘pseudo-text’’ (speech-to-unit,
S2u), which is used in turn to train a generative lan-
guage model (unit-based language model, uLM)
and to train a speech synthesizer (unit-to-speech,
u2S). This enables learning an LM from scratch
without text and using it to generate speech conditionally
or unconditionally, essentially replicating
what toddlers achieve before learning to read.
Early studies using discrete codes learned from
an autoencoder show the feasibility of such an
approach, but remain at the level of a demo (van den
Oord et al., 2017).

In this paper, we address one major conceptual
stumbling block which has, thus far, prevented
such early studies from having the transformative
impact they could have in language technology:
model evaluation. We contend that it will be
impossible to make progress in this area beyond
demos unless proper evaluation methods enabling
system comparison are established.

Evaluation for speech generation is difficult due
to the continuous, variable and multi-level nature
of the speech waveform, and the necessity both to
capture fine grained acoustic details to generate in-
telligible audio and to abstract away from them to
learn higher level language concepts. Text-based
models do not have this problem, since the in-
put is already expressed in terms of mid-level
discrete units (characters or words), and are typ-
ically evaluated with unsupervised metrics close

Level    | Encoding: task      | Encoding: automatic metric | Generation: task | Generation: automatic metric            | Generation: human metric
Language | Spoken LM           | Spot-the-word, Syntax-Acc  | Speech Gen.      | AUC-of-VERT/PPX, cont-BLEU, PPX@o-VERT  | MMOS
Acoustic | Acoustic Unit Disc. | ABX-across, ABX-within     | Resynthesis      | PER-from-ASR, CER-from-ASR              | CER, MOS

Table 1: Tasks and metrics proposed to evaluate encoding/generation quality of models at the acoustic
or language levels. Bold font highlights the main metric used for each category (see Section 3 for details).

previous studies in the Zero Resource Speech
Challenges (Versteegh et al., 2016; Nguyen et al.,
2020) correlate well with their generative counter-
part, offering an easier proxy to rapidly iterate on
model selection. (4) We systematically study the
effect of the type of encoding units by factorially
crossing three recent speech-to-unit encoders—
CPC, wav2vec 2.0, and HuBERT—with three
codebook sizes for the discrete units: 50, 100, 200.
We keep constant the rest of the system built from
out-of-the-box components (standard Transformer
for the uLM, Tacotron 2 for u2S). We show that
both the encoder type and the number of units mat-
ter, and that they matter differently depending on
the evaluation task. (5) We open source our evalu-
ation tools and models to help reproducibility and
comparability with future work.

In Section 3, we introduce the ASR, zero-shot,
and human evaluation metrics; in Section 4, we
present the models; in Section 5, we analyze the
results; and in Section 6, we discuss them.

2 Related Work

Unsupervised Speech Representation Learning
aims to distill features useful for downstream
tasks, such as phone discrimination (Kharitonov
et al., 2021; Schneider et al., 2019) and semantic
prediction (Lai et al., 2021; Wu et al., 2020), by
constructing pretext tasks that can exploit large
quantities of unlabeled speech. Pretext tasks in
the literature can be roughly divided into two
categories: reconstruction and prediction. Reconstruction
is often implemented in the form of
auto-encoding (Hsu et al., 2017a), where speech
is first encoded into a low-dimensional space, and
then decoded back to speech. Various constraints
can be imposed on the encoded space, such as
temporal smoothness (Ebbers et al., 2017; Glarner
et al., 2018; Khurana et al., 2019, 2020), discreteness
(Ondel et al., 2016; van den Oord et al.,

2017), and presence of hierarchy (Lee and Glass,
2012; Hsu et al., 2017b).

Prediction-based approaches, which task a model
to predict information about unseen speech based
on its context, have gained increasing interest re-
cently. Examples of information include spectro-
grams (Chung et al., 2019; Wang et al., 2020; Chi
et al., 2021; Liu et al., 2020; Chung and Glass,
2020; Liu et al., 2020; Ling et al., 2020; Ling
and Liu, 2020), cluster indices (Baevski et al.,
2019; Hsu et al., 2021), derived signal processing
features (Pascual et al., 2019; Ravanelli et al.,
2020), and binary labels of whether a candidate
is the target unseen spectrogram (van den Oord
et al., 2018; Schneider et al., 2019; Baevski et al.,
2020a; Kharitonov et al., 2021; Baevski et al.,
2020b).

Speech Resynthesis. Recent advancements in
neural vocoders enabled generating natural sound-
ing speech and music (Oord et al., 2016; Kumar
et al., 2019; Kong et al., 2020). These are often
conditioned on the log mel-spectrogram for the
generation process. Learning low bitrate speech
representations in an unsupervised manner has
attracted attention from both the machine learning
and the speech communities (Liu et al., 2019a;
Feng et al., 2019; Nayak et al., 2019; Tjandra et al.,
2019; Schneider et al., 2019; Baevski et al., 2020a;
Chen and Hain, 2020; Morita and Koda, 2020;
Tobing et al., 2020). These representations can
later be used for generation without text, which is
particularly important for low-resource languages
(Dunbar et al., 2019, 2020). van den Oord et al.
(2017) proposed a Vector-Quantized Variational
Auto-Encoder (VQ-VAE) model to learn discrete
speech units, which will be later used for speech
synthesis using a WaveNet model. Eloff et al.
(2019) suggested a VQ-VAE model followed
by a FFTNet vocoder model (Jin et al., 2018).
Tjandra et al. (2020) suggested using transformer

(Vaswani et al., 2017) together with a VQ-VAE
model for unsupervised unit discovery, and van
Niekerk et al. (2020) combine vector quantization
together with contrastive predictive coding
for acoustic unit discovery. Another line of work
uses representations from an ASR acoustic model
that are combined with identity and prosodic
information for voice conversion (Polyak et al.,
2020b,a, 2021b). In terms of evaluation, the
Zero Resource challenge (Dunbar et al., 2019, 2020;
Nguyen et al., 2020) used bitrate together with
human evaluation. In this paper we additionally
introduce an ASR-based evaluation metric.

3 Evaluation Methods

We present two sets of automatic evaluation met-
rics; the first one assesses the output of generative
speech models (ASR metrics, Section 3.1); the sec-
ond one, the encoded representations (zero-shot
probe metrics, Section 3.2). Finally, we present
the human evaluations (Section 3.3).

3.1 Generation: ASR Metrics

We present our new evaluation metrics for gen-
eration tasks. The first task, speech resynthesis,
involves S2u, which encodes input speech into
units and u2S, which decodes it back to speech. In
this task, we wish to evaluate intelligibility of the
resulting speech. The second task, speech genera-
tion, involves the full S2u→uLM→u2S pipeline,
and we wish to evaluate meaningfulness of the
generated speech. Our overall idea is to use ASR
to convert the generated speech back to text and
then use text-based metrics.

Speech Resynthesis Intelligibility: ASR-PER.
The ideal metric for intelligibility would be to use
humans to transcribe the resynthesized speech and
compare the text to the original input. An auto-
matic proxy can be obtained by using a state-of-
the-art ASR system pretrained on a large corpus
of real speech.2 Our main metric is Phone Error
Rate (PER), which only uses an acoustic-model
ASR, without fusing with an additional language
model (Chorowski and Jaitly, 2016). In prelim-
inary experiments we also experimented with a
full ASR with an LM and computed Word Error
Rate (WER) and Character Error Rate (CER) to
give partial credit. The latter is probably closer
to human intelligibility metrics, as humans cannot

turn off their lexicon or language model. We also
computed such metrics by training a fitted ASR
model for each resynthesis model on a specific
training corpus. The logic of this last test is that it
provides a more direct measure of the information
lost in the S2u→u2S pipeline, because it could
adapt to systematic errors introduced by the u2S
model. Since the scores between these different
approaches correlated highly, we only report here
the results on the PER for a pretrained ASR model
that is the simplest to deploy.
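
As an illustration of how an error rate such as PER or CER can be computed once the ASR output is available, the following is a minimal sketch. It assumes that the reference and the ASR hypothesis are already tokenized (phones for PER, characters for CER). The function names `edit_distance` and `error_rate` are ours, chosen for illustration, and are not taken from the released evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(refs, hyps):
    """Corpus-level error rate: total edits divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Toy phone sequences (made up for illustration).
refs = [["k", "ae", "t"], ["b", "ih", "t"]]
hyps = [["k", "ae", "p"], ["b", "ih", "t"]]
per = error_rate(refs, hyps)
```

Passing plain strings instead of phone lists gives a character-level rate, which is how a CER could be obtained with the same helper.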

Speech Generation Quality and Diversity: AUC
on Perplexity and VERT. Text generation eval-
uation typically involves two axes: the quality of
the generated text (with automatic metrics like
mean perplexity or negative log likelihood computed
on a reference large language model) and the
diversity (with metrics like self-BLEU;3 Zhu et al.,
2018). Typically, there is a trade-off between these
two dimensions based on the temperature hyperparameter
used for sampling from the language
model, whereby at low temperatures, the system
outputs good but not varied sentences, and at
high temperatures, it outputs varied sentences, but
not very good. This results in model compari-
son being either based on 2D plots with lines
representing the trade-off between quality and di-
versity, or on aggregate metrics like the area under
the curve. Preliminary explorations (see Appendix
Section 7.2) with our models revealed two prob-
lems preventing a straightforward application of
such a scoring strategy.
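
The role of the temperature can be made concrete with a small sketch of temperature sampling, under the usual assumption that the language model outputs one logit per unit and that the logits are divided by the temperature before the softmax. The names `temperature_probs` and `sample_unit` are hypothetical helpers, not part of any library.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_unit(logits, temperature, rng):
    """Draw one unit index from the temperature-adjusted distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy logits for a two-unit vocabulary (made up for illustration).
logits = [2.0, 1.0]
sharp = temperature_probs(logits, 0.1)   # low temperature: nearly deterministic
plain = temperature_probs(logits, 1.0)
flat = temperature_probs(logits, 10.0)   # high temperature: close to uniform
unit = sample_unit(logits, 1.0, random.Random(0))
```

Low temperatures concentrate the probability mass on the most likely unit, which tends to favor quality over diversity, while high temperatures flatten the distribution.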

First, we found that for some models, at a low
enough temperature, self-BLEU score stopped increasing,
but the systems started to repeat more and
more words within a sentence (e.g., ‘‘the property
the property the property’’). We therefore introduce
a new metric, auto-BLEU, that measures
within-sentence diversity. For a single utterance
u, auto-BLEU is calculated as the ratio of k-grams
s ∈ NG_k(u) that are repeated at least once:

\[
\text{auto-BLEU}(u, k) = \frac{\sum_{s} \mathbb{1}\left[ s \in \left( NG_k(u) \setminus s \right) \right]}{\left| NG_k(u) \right|} \tag{1}
\]

As with BLEU score, to obtain n-gram
auto-BLEU we calculate the geometric mean of
auto-BLEU(u, k) obtained for k ∈ [1, n] and
average over the set of generated utterances. By
2We use a BASE wav2vec 2.0 phoneme detection model trained on LibriSpeech-960h with CTC loss from scratch.

3Higher self-BLEU scores indicate lower diversity of the produced text.

We entirely draw on evaluations from the Zero
Resource challenge series (Versteegh et al., 2016;
Dunbar et al., 2019; Nguyen et al., 2020)6 for com-
parability with published work and refer to these
challenges for details. These metrics are ‘‘zero-
shot’’ because they do not require training any
classifier, and are either based on distances over
embeddings, or on computing probabilities over
entire utterances. When they have hyperparam-
eters, these are selected using a validation set.

For acoustic-level evaluation, we use the
between-speaker ABX score to quantify how
well-separated phonetic categories are. Briefly,
it consists of estimating the probability that two
tokens of the same category A (x and a) are closer
to one another than a token of A (x) and a token of B (b).
The categories are triphones that only differ in the
middle phoneme (like bit and bet) and the score
is averaged over all possible such pairs. For the
across-speaker ABX, a and b are spoken by the
same speaker and x by a different one, requiring
feature invariance over a speaker change. We also
include the bitrate, which has been used in the
TTS-without-T challenges (Dunbar et al., 2019)
to quantify the efficiency of the discrete units used
to resynthesize speech. It is simply the entropy of
the sequence of units divided by the total duration.
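
The two acoustic-level quantities can be sketched as follows. This is a simplified illustration under stated assumptions, not the Zero Resource implementation: each token is assumed to be a single fixed-size vector compared with a Euclidean distance (the actual ABX evaluation works on frame sequences), and the entropy of the unit sequence is assumed to be the number of units times the entropy of their empirical unigram distribution. The names `abx_score` and `bitrate` are hypothetical.

```python
import math
from collections import Counter


def abx_score(cat_a, cat_b, dist=math.dist):
    """Fraction of (a, x, b) triples, with a and x distinct tokens of A and
    b a token of B, for which x is closer to a than to b (ties count 0.5)."""
    wins, total = 0.0, 0
    for i, a in enumerate(cat_a):
        for j, x in enumerate(cat_a):
            if i == j:
                continue
            for b in cat_b:
                d_a, d_b = dist(x, a), dist(x, b)
                wins += 1.0 if d_a < d_b else 0.5 if d_a == d_b else 0.0
                total += 1
    if total == 0:
        raise ValueError("need at least two tokens of A and one token of B")
    return wins / total


def bitrate(units, duration_s):
    """Bits per second: n * H(units) / duration, with H the unigram entropy."""
    n = len(units)
    counts = Counter(units)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return n * entropy / duration_s


# Toy data (made up for illustration).
score = abx_score([(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0)])
rate = bitrate([0, 1, 0, 1], duration_s=2.0)
```

The ABX error rate is one minus the returned score. For the across-speaker variant, the caller would restrict the triples so that a and b come from one speaker and x from another.
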
For language-level evaluation, we use spot-
the-word accuracy from the Zero Resource 2021
Benchmark (Nguyen et al., 2020). It consists of
detecting the real word from a pair of short utter-
ances like ‘brick’ vs. ‘blick’, matched for unigram
and bigram phoneme frequency to ensure that
low-level cues do not make the task trivial. This
task can be done by computing the probability
(or pseudo-probability) of the utterances from the
uLM. The test set (sWUGGY) consists of 5,000
word-pseudoword pairs generated by the Google
TTS API, filtered for the word being present
in the LibriSpeech 960h training set (Panayotov
et al., 2015). The ZR21 benchmark also uses
higher level metrics, notably, syntactic (based
on the sBLIMP dataset), which we did not use
because the baselines were too close to chance.
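
The scoring rule behind spot-the-word accuracy can be sketched as below, assuming access to a function that returns the (pseudo-)log-probability of an utterance under the uLM. Here that function is a stand-in lookup table with made-up values, and `spot_the_word_accuracy` is a hypothetical name.

```python
def spot_the_word_accuracy(pairs, score):
    """Fraction of (word, pseudoword) pairs where the word scores higher."""
    if not pairs:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for word, nonword in pairs if score(word) > score(nonword))
    return correct / len(pairs)


# Stand-in for uLM log-probabilities; the values are invented for illustration.
toy_log_probs = {"brick": -3.0, "blick": -7.5, "house": -9.0, "houke": -4.0}
pairs = [("brick", "blick"), ("house", "houke")]
accuracy = spot_the_word_accuracy(pairs, toy_log_probs.get)
```

In the real setting, the arguments to `score` would be the unit sequences of the two spoken items rather than strings, and chance level is 0.5.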

3.3 Human Evaluation Metrics

As above, we asked humans to evaluate two
aspects of speech generation: intelligibility and
meaningfulness. Intelligibility was assessed using
two metrics: i) Mean Opinion Scores (MOS) in

6www.zerospeech.com.

Figure 2: Comparison of diversity and perplexity
of the generated speech. We plot VERT vs. median
perplexity. The blue diamond corresponds to the oracle
reference point. It defines two cut-offs on the curve:
VERT@oracle-PPX and PPX@oracle-VERT. The
green area corresponds to the AUC metric.

calculating the geometric mean of self- and auto-
BLEU, we obtain an aggregate metric which we
call VERT (for diVERsiTy). We used a bigram
version of self- and auto-BLEU.
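
The definitions of auto-BLEU and VERT can be sketched in a few lines. This is an illustrative reading of the text, not the released code: utterances are assumed to be lists of word tokens, a k-gram counts as repeated when it occurs more than once in the same utterance, and self-BLEU is assumed to be computed elsewhere and passed in as a number. All function names are hypothetical.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, k):
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def auto_bleu_k(tokens, k):
    """Share of k-gram occurrences that also appear elsewhere in the utterance."""
    grams = ngrams(tokens, k)
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(1 for g in grams if counts[g] > 1) / len(grams)


def geometric_mean(values):
    if any(v <= 0 for v in values):
        return 0.0
    return exp(sum(log(v) for v in values) / len(values))


def auto_bleu(utterances, n=2):
    """Geometric mean over k in [1, n], averaged over utterances."""
    per_utt = [geometric_mean([auto_bleu_k(u, k) for k in range(1, n + 1)])
               for u in utterances]
    return sum(per_utt) / len(per_utt)


def vert(self_bleu, auto_bleu_score):
    """VERT as the geometric mean of self-BLEU and auto-BLEU."""
    return geometric_mean([self_bleu, auto_bleu_score])


repetitive = "the property the property the property".split()
varied = "a b c d".split()
```

On the degenerate example from the text, every unigram and bigram is repeated, so the auto-BLEU of `repetitive` is maximal, while `varied` has no repeated k-gram at all.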

Second, we found that critical temperatures for
which the output was reasonable were not con-
stant across models. This makes sense, because
temperature controls the probability of sampling
individual units, and the probabilistic distribution
and duration of these units depend on the models.
Here, we chose to use the oracle text as an anchor
to compute reference temperatures, that is, the
temperatures at which the perplexity or the VERT
score reach the values of the oracle text.

This gives us boundary conditions at which we
can compare (the perplexity at oracle diversity
and the diversity at oracle perplexity), as well as
a method to compute the area under curve (AUC)
between these two boundaries (see Figure 2). As
AUC decreases, the system gets closer to the ora-
cle point. Thus with AUC, lower is better.

To calculate perplexity of the generated ut-
terances, we use a pre-trained ASR4 to convert
speech to text, and an off-the-shelf Transformer
model trained on the English NewsCrawl dataset.5

3.2 Encoding: Zero-shot Probe Metrics

The purpose of the encoding metrics is to eval-
uate the quality of the learned representations at
each linguistic level along the pipeline linking the
S2u and the uLM. They are inspired by human
psycholinguistics and can be thought of as
unit tests providing interpretation and diagnosis.

4We use a LARGE wav2vec 2.0 model, trained on LibriSpeech-960h with CTC loss from scratch. Its decoder uses the standard KenLM 4-gram language model.

5https://github.com/pytorch/fairseq/tree/master/examples/language_model.

which raters were asked to evaluate subjectively
how intelligible a given audio sample is; and
ii) Character Error Rate (CER) computed from
written transcriptions providing an objective in-
telligibility test. As for meaningfulness, we set up
a meaningfulness-MOS (MMOS) in which raters
were asked to evaluate how natural (considering
both grammar and meaning) a given sample is. For
both subjective tests, raters evaluate the samples
on a scale of 1–5 with an increment of 1.

For the MMOS, we had to select a temperature
to sample from. Preliminary experiments showed
that humans preferred lower temperatures (also yielding
less diverse outputs). Here, we settled on
selecting the temperature on a model-by-model
basis by constructing a continuation task: We take
the 1,000 shortest utterances from LibriSpeech
test-clean that are at least 6 seconds long, and
use the first 3 seconds as prompts for the uLM
(after transcribing them into pseudo-texts). For
each prompt, we generated 10 candidate continuations
of the same length (in seconds) as the utterance
from which we took the prompt. We varied
the temperature (0.3, 0.4, . . . , 1.4, 1.5, 1.7, 1.9,
2.1, 2.3, 2.5, 3.0), and selected the one yielding
the maximal BLEU-2 score with the reference
sentence (after ASR). These temperatures were
typically between the two boundary temperatures
described above.

We evaluated 100 samples from each of the
evaluated methods and enforced at least 15
raters for each sample. The CrowdMOS package
(Ribeiro et al., 2011) was used for all subjective
experiments using the recommended recipes for
detecting and discarding inaccurate scores. The
recordings for the naturalness test were generated
by the LM unconditionally and conditionally from
a 3-second prompt. Participants were recruited
using a crowd-sourcing platform.

4 Proposed Systems

Here, we present our S2u (Section 4.1), uLM
(Section 4.2), and u2S (Section 4.3) components.

4.1 Speech-to-Unit Models

We selected 3 recent state-of-the-art unsupervised
encoders, which we used ‘out of the box’: we did
not retrain them nor change their hyperparameters.
We also included a log Mel filter-bank baseline
(80 filters, computed every 10 ms). We then dis-
cretized the embeddings using k-means. We only

give a high-level description of these models, and
refer to the original publications for details.

CPC. Contrastive Predictive Coding (van den Oord
et al., 2017) as applied to speech consists of
two components: an encoder and a predictor. The
encoder produces an embedding z from speech
input. The predictor predicts the future states
of the encoder based on the past, and the system
is trained with a contrastive loss. We use the CPC
model from Rivière and Dupoux (2020), which
was trained on a ‘‘clean’’ 6k hour sub-sample
of the LibriLight dataset (Kahn et al., 2020;
Rivière and Dupoux, 2020). We extract a representation
from an intermediate layer of the predictor,
which provides a 256-dimensional embedding
(one per 10ms), as in the original paper.

wav2vec 2.0. Similar to CPC, this model uses
an encoder and a predictor, which is trained con-
trastively to distinguish positive and negative
samples from discretized and masked segments
of the encoder’s output. We use the LARGE vari-
ant of pretrained wav2vec 2.0 (Baevski et al.,
2020b) trained on 60k hours of LibriLight dataset
(Kahn et al., 2020). This model encodes raw audio
into frames of 1024-dimensional vectors (one
per 20 ms). To choose the best layer, we extracted
frozen representations of the 10-hour LibriLight
subset from every layer of the model and trained
a linear classifier with the CTC loss to predict the
phonetic version of the text labels. Layer 14 obtained
the lowest PER on LS dev-other (a similar
approach was used in Baevski et al. [2021], which
in this case selected Layer 15).

HuBERT. Unlike CPC and wav2vec 2.0, which
use a contrastive loss, HuBERT is trained with a
masked prediction task similar to BERT (Devlin
et al., 2019) but with masked continuous audio
signals as inputs. The targets are obtained through
unsupervised clustering of raw speech features or
learned features from earlier iterations, motivated
by DeepCluster (Caron et al., 2018). We use the
BASE 12 transformer-layer model trained for two
iterations (Hsu et al., 2021) on 960 hours of
LibriSpeech (Panayotov et al., 2015). This model
encodes raw audio into frames of 768-dimensional
vectors (one per 20 ms) at each layer and we extract
those from the 6th layer as in the original paper.

LogMel. As a baseline, we consider a Log Mel
Filterbank encoder using 80 frequency bands.

Quantization. We use k-means to convert con-
tinuous frame representations into discrete repre-
sentation by training on LibriSpeech clean-100h
(Panayotov et al., 2015). We experiment with
codebooks that have 50, 100, and 200 units.
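
Once the k-means codebook is trained, discretizing an utterance amounts to mapping each frame to the index of its nearest centroid. The sketch below shows only this assignment step, with a tiny hand-written codebook standing in for the learned one. The names `nearest_centroid` and `quantize` are hypothetical, and a squared Euclidean distance is assumed.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, centroids[i])),
    )


def quantize(frames, centroids):
    """Map a sequence of continuous frames to a sequence of discrete units."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy 2-dimensional codebook and frames (made up for illustration); the real
# codebooks have 50, 100, or 200 centroids in the encoder's embedding space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
frames = [(1.0, 0.0), (9.0, 9.0), (0.0, 1.0)]
units = quantize(frames, centroids)
```

The resulting list of integers is the pseudo-text on which the unit language model is trained.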

4.2 Unit-Language Model

We use the Transformer model as implemented
in fairseq (Ott et al., 2019). We use the transformer_lm_big
architecture: it has 12 layers, 16
attention heads, an embedding size of 1024, an FFN size
of 4096, and a dropout probability of 0.1, and we
train it as a causal LM on sequences of pseudo-text
units. Each sample contains up to 3,072 units. We
use sampling with temperature for generation.

All language models are trained on a ‘‘clean’’
6k hours sub-sample of LibriLight used in
Rivière and Dupoux (2020), transcribed with the
corresponding discrete units. In preliminary experiments,
we found that removing sequential repetitions
of units improves performance, hence we
apply it universally.7 We hypothesize that this simple
modification allows us to use the Transformer’s
limited attention span more efficiently, as in Hsu
et al. (2020).
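
Removing sequential repetitions can be written as a one-line run-length collapse. The sketch below reproduces the example given in footnote 7; the function name `remove_repetitions` is ours.

```python
from itertools import groupby


def remove_repetitions(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


pseudo_text = [10, 11, 11, 11, 21, 32, 32, 32, 21]
deduplicated = remove_repetitions(pseudo_text)
```

Only consecutive duplicates are merged, so the second occurrence of unit 21 is kept. Note that this step discards unit durations.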

4.3 Unit-To-Speech Model

We adapt the Tacotron-2 model (Shen et al., 2018)
such that it takes pseudo-text units as input and
outputs a log Mel spectrogram. To enable the
model to synthesize arbitrary unit sequences, including
those representing incomplete sentences,
we introduce two modifications. First, we append
a special ‘‘end-of-input’’ (EOI) token to the input
sequence, hinting the decoder to predict the
‘‘end-of-output’’ token when attending to this new
token. However, this modification alone may not
be sufficient, as the decoder could still learn to
ignore the EOI token and correlate end-of-output
prediction with the learned discrete token that
represents silence as most of the speech contains
trailing silence. To address this, we train the model
using random chunks of aligned unit sequence and
spectrogram, and append the EOI token to unit se-
quence chunks, such that the audio does not always
end with silence. We implement chunking in the
curriculum learning fashion, where the chunk size
gradually grows (starting with 50 frames with an
increment of 5 per epoch) to increase the difficulty
of the task. For waveform generation, we use
the pre-trained flow-based neural vocoder WaveGlow
(Prenger et al., 2019). This model outputs
the time-domain signal given the log Mel spectrogram
as input. All u2S models were trained on LJ
Speech (LJ) (Ito and Johnson, 2017).

7For example, a pseudo-text 10 11 11 11 21 32 32 32 21 becomes 10 11 21 32 21.

5 Results

In Figure 3, we report the overall results of our
models and our LogMel baseline as a function of
the number of quantized units on our main auto-
mated and human metrics. More detailed results
follow in the following sections, including two
character-based toplines: one uses the oracle tran-
scripts for training the LM, the other uses tran-
scripts produced by the pre-trained ASR model.

5.1 Results on the Resynthesis Task

Overall resynthesis results are shown in the bottom
middle and right cells of Figure 3 for our
main automatic (PER) and human scores (MOS),
respectively, averaged across the LS and LJ evaluation
sets. We observe that across all models,
increasing the number of units uniformly leads
to better scores, suggesting that the u2S component
can benefit from extra details of the
input to produce a more realistic output. HuBERT
and CPC seem to give the best results for
both humans and models, capturing phonetic
information better than other models at equivalent
bitrates.

Figure 3: Overall results with automatic and human metrics. The results are presented in terms of bitrate for 4
encoders (LogMel, CPC, HuBERT, and wav2vec 2.0) varying in number of units (50, 100, 200). For definition of
the tasks and metrics, see Table 1 and Figure 1. Negative human opinion scores are shown for ease of comparison
with automatic metrics (lower is better). The generation metrics have been averaged across LS and LJ (PER and
MOS; resynthesis task) and across prompted and unprompted conditions (AUC and MMOS; speech generation
task). The Log Mel Fbank based systems were not evaluated by humans in the speech generation task.

More detailed results are in Table 2, separating
the scores for the LJ and LS resynthesis, and
adding extra automatic metrics (CER) and human
metrics (human CER). On PER, we found a domain
effect: Resynthesizing input from LJ Speech
yields lower PER than from LibriSpeech on all
unsupervised models. From the viewpoint of the
encoder, LJ Speech is out-of-domain; therefore,
one would expect that the units are making more
errors than for the in-domain LibriSpeech. On the
other hand, the u2S component has learned from
LJ Speech encoded with these units, and might
have learned to compensate for these lower quality
units. When LibriSpeech is offered as input,
the u2S component cannot adapt to this nominally
better input and ends up yielding lower quality
outputs. This observation is worth further exploration,
as other metrics like CER (using an LM)
and human evaluations only replicated this for the
models with the lowest scores (like LogMel and
wav2vec). The automatic PER and CER scores
and the human MOS and CER scores all correlate
well with one another across the 4 × 3 models and
baselines. Within the LJ or LS domain, the Pearson
r ranged from .95 to .99; across domains it was
lower (from .79 to .96), illustrating again the
existence of a domain effect. Not shown here, we
reached similar conclusions with our fitted-ASR
metrics, but with lower scores and correlations.
Table 2 also shows the results of the two toplines
(original text+TTS and ASR+TTS). Interestingly,
our best models come within 3% absolute in PER
or CER compared to these toplines, are quite close
to them in terms of MOS, and even beat them in
terms of human CER.
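
As an illustration of how such within-domain correlations can be computed, the sketch below takes the PER (LJ) and MOS (LJ) values of the 12 baseline and unsupervised systems as read from Table 2 and computes a Pearson coefficient. The assignment of these two columns to the LJ domain is our reading of the table, and the helper `pearson_r` is a hypothetical name, not part of the released evaluation code. The sign is expected to be negative because lower PER is better while higher MOS is better.

```python
from math import sqrt

# Values read from Table 2, in row order:
# LogMel 50/100/200, CPC 50/100/200, HuBERT-L6 50/100/200, wav2vec-L14 50/100/200.
# Treating these columns as the LJ domain is an assumption of this sketch.
per_lj = [27.72, 25.83, 19.78, 10.87, 10.75, 8.74,
          11.45, 9.53, 8.87, 24.95, 14.58, 10.65]
mos_lj = [2.41, 2.65, 2.96, 3.63, 3.42, 3.85,
          3.69, 3.84, 4.00, 2.45, 3.50, 3.83]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(per_lj, mos_lj)
print(f"Pearson r between PER (LJ) and MOS (LJ): {r:.2f}")
```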

5.2 Results on the Generation Task

The upper middle and right cells of Figure 3 show
generation results averaged across the unconditional
and conditional conditions, on automatic
and human evaluations, respectively. The main
result is that there is an effect both of the number of
units and of the system. As for resynthesis, 50 units is
always worst, but contrary to resynthesis, 200 units
is not always better. Overall, the results on generation
are congruent with the idea that speech
generation requires good scores on both language
modeling and speech synthesis. The best results
for a particular model are then a compromise
between the numbers of units that give the best
score on each of these two tasks. In terms of systems, the
best one here is HuBERT. Regarding human evaluations,
they show similar patterns, with a clear
dispreference for 50 units, and either 100 or 200
being better.

                                     End-to-end ASR-based metrics    Human opinion
S2u architect.  Nb units  Bitrate  PER↓    PER↓    CER↓    CER↓    MOS↑    MOS↑    CER↓    CER↓
                                   (LJ)    (LS)    (LJ)    (LS)    (LJ)    (LS)    (LJ)    (LS)
Toplines
original wav    -         -        -       -       -       -       4.83    4.30    8.88    6.73
orig text+TTS   -         -        7.78    7.92    8.87    5.14    4.02    4.03    13.25   10.73
ASR + TTS       27        -        9.45    8.18    9.48    5.30    4.04    4.06    15.98   11.56
Baselines
LogMel          50        214.8    27.72   49.38   27.73   52.05   2.41    2.07    43.78   66.75
LogMel          100       292.7    25.83   45.58   24.88   48.71   2.65    2.01    37.39   62.72
LogMel          200       373.8    19.78   45.16   17.86   46.12   2.96    2.16    23.33   62.6
Unsupervised
CPC             50        159.4    10.87   17.16   10.68   12.06   3.63    3.51    13.97   19.92
CPC             100       213.1    10.75   15.82   9.84    9.46    3.42    3.68    13.53   14.73
CPC             200       279.4    8.74    14.23   9.20    8.29    3.85    3.54    9.36    14.33
HuBERT-L6       50        125.7    11.45   16.68   11.02   11.85   3.69    3.49    14.54   13.14
HuBERT-L6       100       168.1    9.53    13.24   9.31    7.19    3.84    3.68    13.02   11.43
HuBERT-L6       200       211.3    8.87    11.06   8.88    5.35    4.00    3.85    11.67   10.84
wav2vec-L14     50        141.3    24.95   33.69   25.42   32.91   2.45    2.87    46.82   54.9
wav2vec-L14     100       182.1    14.58   22.07   13.72   17.22   3.50    3.32    23.76   28.1
wav2vec-L14     200       226.8    10.65   16.34   10.21   10.50   3.83    3.51    13.14   15.27

Table 2: Results on the resynthesis task for 3 unsupervised models plus one LogMel baseline and 3
unit sizes. Bitrates are in bit/sec, PER are for a pretrained phone recognition model without lexicon and
LM, CER are derived from a full ASR model (lower is better). Human MOS (higher is better) and CER
(computed from transcription, lower is better) are provided (the 95% confidence interval was on average
.32 for MOS and 1.8 for human CER).

Detailed results are shown in Table 3 with
separate statistics for conditional and unconditional
generation and additional results with
PPX@o-VERT and VERT@o-PPX. As expected,
the perplexity metric improved with prompts, but
not the diversity score. The human results are
congruent with the automatic scores, although
human raters tend to prefer more units, perhaps showing
that they cannot fully dissociate their judgment
of meaning from their judgment of intelligibility.
The three metrics correlate well with one another
(r between .86 and .99) and correlate with their
counterpart across task (prompted vs. unprompted:
r between .82 and .99). Human evaluations correlated
well with the automatic metrics (AUC:
r = .87; PPX: r = .92; VERT: r = 0.75).

5.3 Results for Zero-shot Probe Metrics

In Table 4, we show the results for zero-shot
metrics across the different models and baselines.
Overall, the performance depends on the linguistic
level while remaining above chance. While performance
is excellent at the acoustic level (6.5%
error for the best model on ABX-across), it is
intermediate at the lexical level (31.3% error for
the best model on spot-the-word). Not shown,
the syntactic test is close to chance (42% error
for the best model on the sBLIMP test). These
values are worse than the ASR-topline (3.1% and
29%, for lexicon and syntax, resp.), showing room
for improvement.

The metrics correlate well: The ABX score
predicts the lexical score (r = 0.85) and the
syntax score (r = 0.71). Across the different
models, CPC gets the best units (ABX score) and
HuBERT gets the best LM scores. In addition, we
see a clear effect of number of units (Figure 3). For
wav2vec, the performances on all metrics increase
with more units, whereas for CPC and HuBERT
a U-shaped pattern emerges on most metrics, with
best scores for units of intermediate sizes. It is
interesting that the models with the highest bitrate
do not always have the best results. This means
that encoding too much acoustic information can
be detrimental to linguistic encoding in the uLM.
See Appendix Section 7.1 showing that ABX
has good correlations with automatic and human
metrics (r > .88).
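
The effect of the number of units can be read directly off the spot-the-word column of Table 4. As a small illustration (the dictionary layout and the helper name `best_units` are hypothetical, chosen only for this sketch), the snippet below picks, for each encoder, the number of units with the lowest error rate.

```python
# Spot-the-word error rates (%) from Table 4, keyed by encoder and number of units.
spot_the_word = {
    "LogMel": {50: 48.52, 100: 48.12, 200: 49.62},
    "CPC": {50: 32.18, 100: 31.72, 200: 37.40},
    "HuBERT-L6": {50: 32.88, 100: 31.30, 200: 36.52},
    "wav2vec-L14": {50: 51.92, 100: 50.24, 200: 44.68},
}


def best_units(scores):
    """Return the number of units with the lowest error rate."""
    return min(scores, key=scores.get)


best = {encoder: best_units(scores) for encoder, scores in spot_the_word.items()}
for encoder, units in best.items():
    print(f"{encoder}: best spot-the-word score with {units} units")
```

On this metric the minimum falls at the intermediate size for CPC and HuBERT and at the largest size for wav2vec, which matches the pattern described in the text.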

                        Generation-based metrics                              Human opinion
                        unconditional              prompt                     prompt    uncond.
Encoder       Nb units  PPX↓     VERT↓    AUC↓     PPX↓     VERT↓    AUC↓     MMOS↑     MMOS↑
Controls
oracle text   -         154.5    19.43    -        154.5    19.43    -        4.02      4.26
ASR + LM      -         178.4    21.31    0.18     162.8    20.49    0.04     3.91      4.38
Baseline
LogMel        50        1588.97  -        1083.76  -        -        -        -         -
LogMel        100       1500.11  95.50    510.26   -        -        -        -         -
LogMel        200       1539.00  -        584.16   -        -        -        -         -
Unsupervised
CPC           50        374.26   46.26    19.68    323.9    39.92    18.44    3.31      3.61
CPC           100       349.56   41.797   15.74    294.7    42.93    14.06    3.65      3.65
CPC           200       362.84   40.28    16.46    303.5    43.42    26.67    3.58      3.67
HuBERT-L6     50        376.33   43.06    19.27    339.8    45.85    21.03    3.53      3.00
HuBERT-L6     100       273.86   31.36    5.54     251.2    33.67    5.88     3.95      3.53
HuBERT-L6     200       289.36   33.04    7.49     262.4    34.30    6.13     4.01      4.32
wav2vec-L14   50        936.97   -        307.91   1106.3   -        330.8    2.26      1.91
wav2vec-L14   100       948.96   79.51    208.38   775.1    -        205.7    2.28      1.92
wav2vec-L14   200       538.56   61.06    61.48    585.8    -        91.07    2.64      3.04

Table 3: Results on the generation task for three unsupervised models plus the LogMel baseline and
3 unit sizes. PPX@o-VERT and VERT@o-PPX are reported as PPX and VERT. '-': missing or
non-calculable results. Human MMOS are also provided (the 95% confidence interval was on average .29
for uncond. and .61 for cond.).

                        S2u                  uLM
System        Nb units  ABX      ABX      spot-the-  accept.
                        with.↓   acr.↓    word↓      judg.↓
Toplines
ASR+LM        -         -        -        3.12       29.02
Baselines
LogMel        50        23.95    35.86    48.52      46.78
LogMel        100       24.33    37.86    48.12      46.83
LogMel        200       25.71    39.65    49.62      47.76
Unsupervised
CPC           50        5.50     7.20     32.18      45.43
CPC           100       5.09     6.55     31.72      44.35
CPC           200       5.18     6.83     37.40      45.19
HuBERT-L6     50        7.37     8.61     32.88      44.06
HuBERT-L6     100       6.00     7.41     31.30      42.94
HuBERT-L6     200       5.99     7.31     36.52      47.03
wav2vec-L14   50        22.30    24.56    51.92      45.75
wav2vec-L14   100       18.16    20.44    50.24      45.97
wav2vec-L14   200       16.59    18.69    44.68      45.70

Table 4: Results for zero-shot probe metrics for
3 unsupervised models plus one LogMel baseline
and 3 unit sizes. ABX within and across speakers,
spot-the-word, and acceptability judgments are
error rates (lower is better); chance is 50%.

6 Discussion and Conclusion

We introduced Generative Spoken Language Modeling
as a new unsupervised task bridging the gap
between speech and natural language processing
and related it conceptually to previously studied
unsupervised tasks: Acoustic Unit Discovery, Spoken
Language Modeling, Discrete Speech Resynthesis,
and Text Generation. We introduced a
suite of metrics, baselines, and first results on Librilight
that set the playing field for future work.
For comparability, we open source our evaluation
stack and the best of our baseline models.

Our main contributions are as follows. (1) We
established a set of easy-to-use automatic ASR-based
metrics for model comparison at two critical
levels for this task: intelligibility of the speech
output and meaningfulness in terms of higher
linguistic content. We assessed the first through
ASR-based PER and CER metrics, and the second
using text-generation-based metrics (AUC for
PPX/VERT). (2) We found that these two sets
of metrics correlated well with human judgment
and (3) that they can be approximated with their
inference-mode counterparts, which are faster to
compute using zero-shot probe tasks. (4) Applying
these metrics to pipeline models based on current
speech representation learning models and
out-of-the-box LM and TTS components, we found
that our basic premise is fulfilled: It is possible
to train a language model from quantized units
derived from audio and to use it to generate new
speech. The generated speech is English-sounding,
with recognizable phonemes and words and locally
acceptable syntax (see transcribed examples
in the Appendix and audio snippets here:
https://speechbot.github.io/gslm). Our automatic
metrics confirm the quality of the representations
and outputs at the acoustic/phonetic level, but
show that improvements are needed at the language
level. It is to be expected that performance
will increase with larger training sets beyond our
6k hours, as has been noted in the case of text.
(5) We also uncovered specific issues regarding
the number of quantized units. For speech
resynthesis, the optimum number of units was always
200 by a large margin, reflecting the well-known
bitrate/intelligibility trade-off (Dunbar
et al., 2019). However, for language modeling,
this was not necessarily the case, as the more
detailed acoustic information may introduce too
many phonetic details that have no impact at
the level of lexical and syntactic representations.
(6) Finally, we found that the choice of units also
affected the temperature parameter, which is used
to control the trade-off between quality and diversity
in text-based language models. To address
this effect, we proposed a method to normalize
the temperature by using an oracle text to build
perplexity and diversity anchor points.
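
The anchor-point idea in (6) can be illustrated with a small sketch. It assumes, as the metric names PPX@o-VERT and VERT@o-PPX suggest, that a temperature sweep yields one perplexity and one diversity (VERT) value per temperature, and that the normalized score is the value of one metric at the point where the other metric crosses the oracle text's value. The function name, the use of linear interpolation, and all numbers below are hypothetical.

```python
def value_at_anchor(xs, ys, anchor):
    """Linearly interpolate y at the first point where x crosses `anchor`.

    xs and ys are paired measurements ordered by temperature.
    Returns None if x never reaches the anchor (a non-calculable result).
    """
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if min(x0, x1) <= anchor <= max(x0, x1):
            if x0 == x1:
                return y0
            frac = (anchor - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    return None


# Hypothetical sweep over three temperatures, ordered from low to high.
ppx = [100.0, 200.0, 400.0]   # perplexity of the generated text
vert = [10.0, 20.0, 40.0]     # diversity score of the generated text
oracle_ppx, oracle_vert = 150.0, 30.0  # hypothetical oracle-text anchors

ppx_at_oracle_vert = value_at_anchor(vert, ppx, oracle_vert)
vert_at_oracle_ppx = value_at_anchor(ppx, vert, oracle_ppx)
print(ppx_at_oracle_vert, vert_at_oracle_ppx)
```

Reading each metric at a fixed oracle anchor makes systems with different unit inventories comparable without having to agree on a single raw temperature value.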

Obviously, this is only a first step towards
building textless NLP applications that could be
applied to any language, even low resource ones.
To reach this long term goal, three important
challenges need to be addressed.

First, even though we did compare three different
encoders and obtained different results, we
cannot conclude that one encoder is definitely su-
perior to the others. Our point here was merely
to use previously published pretrained encoders,
and study systematically the effect of number of
units on these encoders. A fuller study including
a wider set of encoders and a proper hyperparam-
eter search (including the selection of the embed-
ding layer and the clustering algorithm) would be
needed in order to determine which of them is
most appropriate for speech generation.

Second, it is to be expected that to further
improve generation results, more needs to be done
than applying this pipeline to larger training sets.
Contrary to text, speech unfolds through time and
varies continuously in phonetic space. Speech also
contains multilayered representations (phonetic,
prosodic, speaker identity, emotions, background
noise, etc.). However, both our TTS and our LM
were out-of-the-box systems typically used for text
applications. More work is needed to adapt these
architectures to the richness and variability of the
speech signal (see Polyak et al., 2021a, for first
steps towards integrating prosody into discrete
units). The metrics and baselines we introduced
here provide landmarks against which we will
measure future progress.

Third, the automatic metrics that we defined
here depend on textual resources to build the
evaluation ASR and LM models, and on linguistic
resources to build the zero-shot metrics.
How could this ever be applied to low-resource
languages? Note that the linguistic resources we
require are used only for model selection, not
model training. Our metrics allow for fast iterations
in architecture and hyperparameter search,
but the overall algorithm is totally unsupervised.
Therefore, an important next step is to extend this
work to other languages, in order to find a com-
mon architecture/hyperparameter set that gives
good results in held-out languages (high or low
resource). The hope is that once good learning
models are tuned using a diverse sample of high
resource languages, the same models could be
deployed in languages where no such resources
are available, and work in a purely unsupervised
fashion.

Acknowledgments

We thank Michael Auli and Alexis Conneau for
their useful input on wav2vec, and Lior Wolf,
Pierre Emmanuel Mazaré, and Gargi Gosh for
their support for this project. We would also like to
thank the reviewers and editors for their thorough
review and constructive feedback.

References

Alexei Baevski, Michael Auli, and Abdelrahman
Mohamed. 2019. Effectiveness of self-supervised
pre-training for speech recognition. CoRR,
abs/1911.03912.

Alexei Baevski, Wei-Ning Hsu, and Alexis
Conneau. 2021. Unsupervised speech recog-
nition. arXiv preprint arXiv:2012.15454.

Alexei Baevski, Steffen Schneider, and Michael
Auli. 2020a. vq-wav2vec: Self-supervised
learning of discrete speech representations.
International Conference on Learning Repre-
sentations (ICLR).

Alexei Baevski, Henry Zhou, Abdelrahman
Mohamed, and Michael Auli. 2020b. wav2vec
2.0: A framework for self-supervised learning
of speech representations. In Proceedings of
the 34th International Conference on Neural
Information Processing Systems, volume 33,
pages 12449–12460.

Steven Bird, Edward Loper, and Ewan Klein.
2009. Natural Language Processing with
Python.

Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. 2020. Language models are
few-shot learners. In Proceedings of the 34th
International Conference on Neural Information
Processing Systems.

Mathilde Caron, Piotr Bojanowski, Armand
Joulin, and Matthijs Douze. 2018. Deep cluster-
ing for unsupervised learning of visual features.
In Proceedings of the European Conference on
Computer Vision (ECCV), pages 132–149.

Mingjie Chen and Thomas Hain. 2020. Unsu-
pervised acoustic unit representation learning
for voice conversion using WaveNet auto-
encoders. In Proceedings of INTERSPEECH,
pages 4866–4870. https://doi.org/10
.21437/Interspeech.2020-1785

Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu,
Chun-Cheng Hsieh, Shang-Wen Li, and
Hung-yi Lee. 2021. Audio ALBERT: A lite
BERT for self-supervised learning of audio
representation. In IEEE Spoken Language
Technology Workshop (SLT), pages 344–350.
https://doi.org/10.1109/SLT48900
.2021.9383575

Jan Chorowski and Navdeep Jaitly. 2016. Towards
better decoding and language model integration
in sequence to sequence models. In Proceedings
of INTERSPEECH, pages 523–527.
https://doi.org/10.21437/Interspeech
.2017-343

Yu-An Chung and James Glass. 2020. Improved
speech representations with multi-target auto-
regressive predictive coding. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2353–2358.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.213

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and
James Glass. 2019. An unsupervised au-
toregressive model for speech representation
learning. In Proceedings of INTERSPEECH,
pages 146–150. https://doi.org/10.21437
/Interspeech.2019-1473

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/N19-1423

Li Dong, Nan Yang, Wenhui Wang, Furu Wei,
Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming
Zhou, and Hsiao-Wuen Hon. 2019. Unified lan-
guage model pre-training for natural language
understanding and generation. In Advances
in Neural Information Processing Systems,
volume 32, pages 13063–13075. Curran Asso-
ciates, Inc.

Ewan Dunbar, Robin Algayres, Julien Karadayi,
Mathieu Bernard, Juan Benjumea, Xuan-Nga
Cao, Lucie Miskic, Charlotte Dugrain, Lucas
Ondel, Alan W. Black, Laurent Besacier,
Sakriani Sakti, and Emmanuel Dupoux. 2019.
The Zero Resource Speech Challenge 2019:
TTS without T. In Proceedings of INTERSPEECH,
pages 1088–1092. https://doi
.org/10.21437/Interspeech.2019-2904

Ewan Dunbar, Julien Karadayi, Mathieu Bernard,
Xuan-Nga Cao, Robin Algayres, Lucas
Ondel, Laurent Besacier, Sakriani Sakti, and
Emmanuel Dupoux. 2020. The Zero Resource
Speech Challenge 2020: Discovering discrete
subword and word units. In Proceedings of
INTERSPEECH, pages 4831–4835. https://
doi.org/10.21437/Interspeech.2020
-2743

Emmanuel Dupoux. 2018. Cognitive science in
the era of artificial intelligence: A roadmap
for reverse-engineering the infant language-learner.
Cognition, 173:43–59. https://
doi.org/10.1016/j.cognition.2017
.11.008

Janek Ebbers, Jahn Heymann, Lukas Drude,
Thomas Glarner, Reinhold Haeb-Umbach,
and Bhiksha Raj. 2017. Hidden Markov
Model variational autoencoder for acoustic unit
discovery. In Proceedings of INTERSPEECH,
pages 488–492. https://doi.org/10.21437
/Interspeech.2017-1160

Ryan Eloff, André Nortje, Benjamin van Niekerk,
Avashna Govender, Leanne Nortje, Arnu
Pretorius, Elan van Biljon, Ewald van der
Westhuizen, Lisa van Staden, and Herman
Kamper. 2019. Unsupervised acoustic unit
discovery for speech synthesis using discrete
latent-variable neural networks. In Proceed-
ings of INTERSPEECH, pages 1103–1107.
https://doi.org/10.21437/Interspeech
.2019-1518

Siyuan Feng, Tan Lee, and Zhiyuan Peng. 2019.
Combining adversarial training and disentan-
gled speech representation for robust zero-
resource subword modeling. In Proceedings of
INTERSPEECH, pages 1093–1097. https://
doi.org/10.21437/Interspeech.2019
-1337

Thomas Glarner, Patrick Hanebrink, Janek
Ebbers, and Reinhold Haeb-Umbach. 2018. Full
Bayesian Hidden Markov Model variational autoencoder
for acoustic unit discovery. In Proceedings
of INTERSPEECH, pages 2688–2692.
https://doi.org/10.21437/Interspeech
.2018-2148

Wei-Ning Hsu, David Harwath, Christopher Song,
and James Glass. 2020. Text-free image-
to-speech synthesis using learned segmental
units. arXiv preprint arXiv:2012.15454.

Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin
Bolte, Ruslan Salakhutdinov, and Abdelrahman
Mohamed. 2021. HuBERT: How much can a
bad teacher benefit ASR pre-training? In Neural
Information Processing Systems Workshop
on Self-Supervised Learning for Speech and
Audio Processing Workshop, pages 6533–6537.
https://doi.org/10.1109/ICASSP39728
.2021.9414460

Wei-Ning Hsu, Yu Zhang, and James Glass.
2017a. Learning latent representations for
speech generation and transformation. In Proceedings
of INTERSPEECH, pages 1273–1277.
https://doi.org/10.21437/Interspeech
.2017-349

Wei-Ning Hsu, Yu Zhang, and James Glass.
2017b. Unsupervised learning of disentan-
gled and interpretable representations from
sequential data. In Advances in Neural In-
formation Processing Systems, volume 30,
pages 1878–1889. Curran Associates, Inc.

Keith Ito and Linda Johnson. 2017. The LJ
Speech dataset. https://keithito.com
/LJ-Speech-Dataset/

Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu.
2018. FFTNet: A real-time speaker-dependent
neural vocoder. In IEEE International Confer-
ence on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 2251–2255. https://
doi.org/10.1109/ICASSP.2018.8462431

J. Kahn, M. Rivière, W. Zheng, E. Kharitonov,
Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky,
R. Collobert, C. Fuegen, T. Likhomanenko,
G. Synnaeve, A. Joulin, A. Mohamed, and
E. Dupoux. 2020. Libri-light: A benchmark
for ASR with limited or no supervision. In
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),
pages 7669–7673. https://doi.org/10
.1109/ICASSP40776.2020.9052942

Eugene Kharitonov, Morgane Rivière, Gabriel
Synnaeve, Lior Wolf, Pierre-Emmanuel
Mazaré, Matthijs Douze, and Emmanuel
Dupoux. 2021. Data augmenting contrastive
learning of speech representations in the time
domain. arXiv preprint arXiv:2007.00991,
pages 215–222. https://doi.org/10
.1109/SLT48900.2021.9383605

Sameer Khurana, Shafiq Rayhan Joty, Ahmed
Ali, and James Glass. 2019. A factorial
deep Markov Model for unsupervised disentangled
representation learning from speech.
In IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),
pages 6540–6544. IEEE.

Sameer Khurana, Antoine Laurent, Wei-Ning
Hsu, Jan Chorowski, Adrian Lancucki, Ricard
Marxer, and James Glass. 2020. A convolu-
tional deep Markov model for unsupervised
speech representation learning. In Proceedings
of INTERSPEECH, pages 3790–3794. https://
doi.org/10.21437/Interspeech.2020
-3084

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae.
2020. HiFi-GAN: Generative adversarial net-
works for efficient and high fidelity speech syn-
thesis. In Proceedings of the 34th International
Conference on Neural Information Process-
ing Systems, volume 33, pages 17022–17033.
https://arxiv.org/abs/2010.05646.

Kundan Kumar, Rithesh Kumar, Thibault
de Boissiere, Lucas Gestin, Wei Zhen Teoh,
Jose Sotelo, Alexandre de Brébisson, Yoshua
Bengio, and Aaron C. Courville. 2019. Mel-
GAN: Generative adversarial networks for
conditional waveform synthesis. In Advances
in Neural Information Processing Systems,
volume 32, pages 14910–14921. Curran Asso-
ciates, Inc.

Cheng-I Lai, Yung-Sung Chuang, Hung-Yi Lee,
Shang-Wen Li, and James Glass. 2021. Semi-
supervised spoken language understanding via
self-supervised speech and language model pre-training.
In IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
pages 7468–7472. https://doi.org/10
.1109/ICASSP39728.2021.9414922

Chia-ying Lee and James Glass. 2012. A nonpara-
metric Bayesian approach to acoustic model
discovery. In Proceedings of the 50th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 40–49.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
translation, and comprehension. In Proceedings
of the 58th Annual Meeting of the
Association for Computational Linguistics,
pages 7871–7880. Association for Computational
Linguistics. Online. https://doi
.org/10.18653/v1/2020.acl-main.703

Shaoshi Ling and Yuzong Liu. 2020. DeCoAR
2.0: Deep contextualized acoustic representa-
tions with vector quantization. arXiv preprint
arXiv:2012.06659.

Shaoshi Ling, Yuzong Liu, Julian Salazar, and
Katrin Kirchhoff. 2020. Deep contextualized
acoustic representations for semi-supervised
speech recognition. In IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6429–6433. IEEE.

A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee.
2020. Mockingjay: Unsupervised speech representation
learning with deep bidirectional
transformer encoders. In IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6419–6423. https://
doi.org/10.1109/ICASSP40776.2020
.9054458

Alexander H. Liu, Yu-An Chung, and James
Glass. 2020. Non-autoregressive predictive
coding for learning speech representations
from local dependencies. arXiv preprint
arXiv:2011.00406.

Andy T. Liu, Po chun Hsu, and Hung-Yi
Lee. 2019a. Unsupervised end-to-end learning
of discrete linguistic units for voice conversion.
In Proceedings of INTERSPEECH,
pages 1108–1112. https://doi.org/10
.21437/Interspeech.2019-2048

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019b. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.

Takashi Morita and Hiroki Koda. 2020. Explor-
ing TTS without T using biologically/psycho-
logically motivated neural network modules
(ZeroSpeech 2020). In Proceedings of INTERSPEECH,
pages 4856–4860. https://doi
.org/10.21437/Interspeech.2020-3127


Shekhar Nayak, C. Shiva Kumar, G. Ramesh,
Saurabhchand Bhati, and K. Sri Rama Murty.
2019. Virtual phone discovery for speech
synthesis without text. In IEEE Global Con-
ference on Signal and Information Processing
(GlobalSIP). https://doi.org/10.1109
/GlobalSIP45357.2019.8969412

Vassil Panayotov, Guoguo Chen, Daniel Povey,
and Sanjeev Khudanpur. 2015. LibriSpeech: An
ASR corpus based on public domain audio
books. In IEEE International Conference on
Acoustics, Speech and Signal Processing
(ICASSP), pages 5206–5210. IEEE. https://
doi.org/10.1109/ICASSP.2015.7178964

Tu Anh Nguyen, Maureen de Seyssel, Patricia
Rozé, Morgane Rivière, Evgeny Kharitonov,
Alexei Baevski, Ewan Dunbar, and Emmanuel
Dupoux. 2020. The Zero Resource Speech
Benchmark 2021: Metrics and baselines for
unsupervised spoken language modeling. In
Advances in Neural Information Processing
Systems (NeurIPS) – Self-Supervised Learning
for Speech and Audio Processing Workshop.

Benjamin van Niekerk, Leanne Nortje, and
Herman Kamper. 2020. Vector-quantized neural
networks for acoustic unit discovery in the
ZeroSpeech 2020 Challenge. In Proceedings of
INTERSPEECH, pages 4836–4840. https://
doi.org/10.21437/Interspeech.2020
-1693

Lucas Ondel, Lukáš Burget, and Jan Černocký.
2016. Variational inference for acoustic
unit discovery. Procedia Computer Science,
81:80–86.

Aaron van den Oord, Yazhe Li, and Oriol
Vinyals. 2018. Representation learning with
contrastive predictive coding. arXiv preprint
arXiv:1807.03748.

Aaron van den Oord, Oriol Vinyals, and Koray
Kavukcuoglu. 2017. Neural discrete represen-
tation learning. In Advances in Neural Informa-
tion Processing Systems, pages 6306–6315.

Aaron van den Oord, Sander Dieleman, Heiga
Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and
Koray Kavukcuoglu. 2016. WaveNet: A gen-
erative model for raw audio. arXiv preprint
arXiv:1609.03499.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier,
and Michael Auli. 2019. Fairseq: A fast, extensible
toolkit for sequence modeling. In Proceedings
of NAACL-HLT, pages 48–53.

Santiago Pascual, Mirco Ravanelli, Joan Serra,
Antonio Bonafonte, and Yoshua Bengio. 2019.
Learning problem-agnostic speech representations
from multiple self-supervised tasks. In
Proceedings of INTERSPEECH, pages 161–165.
https://doi.org/10.21437/Interspeech
.2019-2605

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for Computational
Linguistics. https://doi.org
/10.18653/v1/N18-1202

Adam Polyak, Yossi Adi, Jade Copet, Eugene
Kharitonov, Kushal Lakhotia, Wei-Ning Hsu,
Abdelrahman Mohamed, and Emmanuel
Dupoux. 2021a. Speech resynthesis from discrete
disentangled self-supervised representation.
In Proceedings of INTERSPEECH.

Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli,
and Yaniv Taigman. 2021b. High fidelity
speech regeneration with application to speech
enhancement. In IEEE International Confer-
ence on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 7143–7147. https://
doi.org/10.1109/ICASSP39728.2021
.9414853

Adam Polyak, Lior Wolf, Yossi Adi, and Yaniv
Taigman. 2020a. Unsupervised cross-domain
singing voice conversion. In Proceedings of
INTERSPEECH, pages 801–805. https://
doi.org/10.21437/Interspeech.2020
-1862

Adam Polyak, Lior Wolf, and Yaniv Taigman.
2020b. TTS skins: Speaker conversion via
ASR. In Proceedings of INTERSPEECH,
pages 786–790. https://doi.org/10
.21437/Interspeech.2020-1416

R. Prenger, R. Valle, and B. Catanzaro. 2019.
Waveglow: A flow-based generative network
for speech synthesis. In IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3617–3621.
https://doi.org/10.1109/ICASSP.2019
.8683143

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.

M. Ravanelli, J. Zhong, S. Pascual, P.
Swietojanski, J. Monteiro, J. Trmal, and
Y. Bengio. 2020. Multi-task self-supervised
learning for robust speech recognition. In
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),
pages 6989–6993. https://doi.org/10
.1109/ICASSP40776.2020.9053569

F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer.
2011. CROWDMOS: An approach for crowdsourcing
mean opinion score studies. In
IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),
pages 2416–2419. https://doi.org/10
.1109/ICASSP.2011.5946971

Morgane Rivière and Emmanuel Dupoux. 2020.
Towards unsupervised learning of speech fea-
tures in the wild. In IEEE Spoken Language
Technology Workshop (SLT), pages 156–163.
https://doi.org/10.1109/SLT48900
.2021.9383461

Thomas Schatz, Naomi H. Feldman, Sharon
Goldwater, Xuan-Nga Cao, and Emmanuel
Dupoux. 2021. Early phonetic learning without
phonetic categories: Insights from large-scale
simulations on realistic input. In Proceedings
of the National Academy of Sciences, 118(7).
https://doi.org/10.1073/pnas
.2001844118

Steffen Schneider, Alexei Baevski, Ronan
Collobert, and Michael Auli. 2019. wav2vec:
Unsupervised pre-training for speech recognition.
In Proceedings of INTERSPEECH,
pages 3465–3469. http://doi.org/10
.21437/Interspeech.2019-1873

J. Shen, R. Pang, R. J. Weiss, M. Schuster,
N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y.
Wang, R. Skerry-Ryan, R. A. Saurous, Y.
Agiomyrgiannakis, and Y. Wu. 2018. Natural
TTS synthesis by conditioning WaveNet on
MEL spectrogram predictions. In IEEE International
Conference on Acoustics, Speech and
Signal Processing (ICASSP), pages 4779–4783.
https://doi.org/10.1109/ICASSP
.2018.8461368

Andros Tjandra, Sakriani Sakti, and Satoshi
Nakamura. 2020. Transformer VQ-VAE for
unsupervised unit discovery and speech synthesis:
ZeroSpeech 2020 Challenge. In Proceedings
of INTERSPEECH, pages 4851–4855.
https://doi.org/10.21437/Interspeech
.2020-3033

Andros Tjandra, Berrak Sisman, Mingyang
Zhang, Sakriani Sakti, Haizhou Li, and Satoshi
Nakamura. 2019. VQVAE unsupervised unit
discovery and multi-scale Code2Spec inverter
for Zerospeech Challenge 2019. In Proceedings
of INTERSPEECH, pages 1118–1122. https://
doi.org/10.21437/Interspeech.2019
-3232

Patrick Lumban Tobing, Tomoki Hayashi,
Yi-Chiao Wu, Kazuhiro Kobayashi, and
Tomoki Toda. 2020. Cyclic spectral modeling
for unsupervised unit discovery into
voice conversion with excitation and waveform
modeling. In Proceedings of INTERSPEECH,
pages 4861–4865. https://doi.org/10
.21437/Interspeech.2020-2559

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neu-
ral Information Processing Systems, volume 30,
pages 5998–6008. Curran Associates, Inc.

Maarten Versteegh, Xavier Anguera, Aren Jansen,
and Emmanuel Dupoux. 2016. The Zero
Resource Speech Challenge 2015: Proposed
approaches and results. Procedia Computer
Science, 81:67–72. https://doi.org/10.1016/j
.procs.2016.04.031

W. Wang, Q. Tang, and K. Livescu. 2020.
Unsupervised pre-training of bidirectional speech
encoders via masked reconstruction.
In IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),
pages 6889–6893. https://doi.org/10
.1109/ICASSP40776.2020.9053541

Anne Wu, Changhan Wang, Juan Pino, and
Jiatao Gu. 2020. Self-supervised representations
improve end-to-end speech translation.
In Proceedings of INTERSPEECH,
pages 1491–1495. https://doi.org/10
.21437/Interspeech.2020-3094

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian
Guo, Weinan Zhang, Jun Wang, and Yong
Yu. 2018. Texygen: A benchmarking plat-
form for text generation models. In SIGIR,
pages 1097–1100. https://doi.org/10
.1145/3209978.3210080

7 Appendix

7.1 Zero-shot Metrics Correlation Results

In Figure A1, we present the Pearson correlations
between the zero-shot metrics and the human
and automatic metrics on downstream tasks. The
fact that the ABX metric correlates well with these
downstream metrics makes it a useful proxy metric
for preliminary model and unit size selection, as
it is much less costly than generating TTS output
and running human or ASR evaluations.
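
As an illustration of the statistic behind Figure A1, the following minimal sketch computes a Pearson correlation between one zero-shot metric and one downstream metric across systems. The function name and the score lists are hypothetical and made up for the example; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five systems (one value per system, lower is better).
abx_error = [4.0, 5.5, 6.1, 7.8, 9.2]      # zero-shot metric
asr_error = [10.0, 12.5, 13.0, 17.0, 19.5]  # downstream metric

r = pearson(abx_error, asr_error)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 on real scores would indicate that ranking systems by the cheap metric gives roughly the same ordering as the costly one.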

7.2 Effect of Temperature on Outputs

In this section, we describe preliminary experiments
we conducted to test the effects of
temperature on the generated outputs. As shown
in Table A1, the temperature qualitatively defines
four operating zones. At the lowest temperature,
we get repetitive outputs, where the system keeps
repeating the same few words. At a slightly higher
temperature, the system outputs complete sentences,
but they are sampled from a narrow set
of topics. At the highest temperature, the system
utters an unstructured bag of words. In the
mid-temperature range, we observe relatively coherent
and varied outputs. This is the range we
want to select for our systems. As described in
Figure 2, the lower bound was set by using
the oracle PPX (temperatures between 0.2
and 0.65 across unsupervised models) and the
upper bound by using the oracle VERT (temperatures
between 1.1 and 1.4). In Figure A2,
we present human opinion results for samples
from these two temperatures, plus an extra mean
temperature falling in between. Humans typically
preferred the lower temperature.
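
The qualitative zones can be related to how temperature reshapes the sampling distribution. The sketch below assumes the usual convention of dividing the logits by the temperature before the softmax; the paper does not spell out this detail here, and the logits are hypothetical. It shows that a low temperature concentrates probability on the top unit (low entropy, repetitive outputs), while a high temperature flattens the distribution (high entropy, babble).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (assumed sampling convention)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-unit logits for a four-unit vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]

for temperature in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```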

In Figure A3, we illustrate the continuation
method for selecting a single temperature
for human meaningfulness judgments in a
model-neutral way, as explained in Section 3.3.
It consists of generating possible continuations of
each prompt and computing the BLEU-2 score8
with oracle continuation. The temp@cont temperature
is defined as the temperature maximizing
this score. Computing these estimates with 10
continuations gave continuation temperatures
varying between 0.5 and 0.9 across models and
unit sizes. These are the temperatures we used for
the MMOS results reported in the main paper.
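
The selection procedure can be sketched as follows for a single prompt. This is a simplified illustration, not the authors' code: `bleu2` is a hand-written, unsmoothed stand-in for the NLTK BLEU computation mentioned in the footnote, `select_continuation_temperature` is a hypothetical helper, and the toy continuations are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu2(reference, hypothesis):
    """Simplified sentence-level BLEU-2 (clipped precisions, brevity penalty)."""
    if len(hypothesis) < 2:
        return 0.0
    precisions = []
    for n in (1, 2):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(overlap / sum(hyp_counts.values()))
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(hypothesis))
    return brevity * math.sqrt(precisions[0] * precisions[1])


def select_continuation_temperature(oracle, continuations_by_temp):
    """Return the temperature whose continuations score highest on average."""
    def mean_score(continuations):
        return sum(bleu2(oracle, c) for c in continuations) / len(continuations)
    return max(continuations_by_temp, key=lambda t: mean_score(continuations_by_temp[t]))


# Toy example: one oracle continuation and one sampled continuation per temperature.
oracle = "the cat sat on the mat".split()
continuations_by_temp = {
    0.3: ["the the the the the the".split()],
    0.7: ["the cat sat on a mat".split()],
    1.5: ["blue oath kiln stump floor mood".split()],
}
best = select_continuation_temperature(oracle, continuations_by_temp)
print("selected temperature:", best)
```

In practice the score would be averaged over many prompts and several continuations per prompt before taking the maximum.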

8We used NLTK to compute BLEU (Bird et al., 2009).


Figure A1: Patterns of correlations between the zero-shot metrics and the automatic and human metrics.
Color scale indicates strength of the Pearson correlation coefficient (we used negative MOS and MMOS so that
lower is better for all metrics).

Temp  Example

Very low temperature samples (stuttering zone)
0.3   the property by james resell red for liberata or by jason downy the property by jason downy the property the property the property the property
0.3   and to take in another path and to take in another path and to take in another path and to take in another path and to take in another path and to take in another path and take in a

Low temperature samples (obsessive zone)
0.7   chapter nineteen of the life of the upper part of the ocean this is ali bravos recording only bravos recordings are in the public domain i for more information or to volunteer
0.7   this is a lipper vox are courting oliver vox or courting are in the public domain for afraid art to volunteer pleases it lipper vox dot or this

Mid temperature samples
1.0   but it is attendant from the people to defend himself from this information pride of the potential in criminal activity a curiosity and impetuosity of the world a war soon acquired
1.0   finally we ought to have a strong plan a without positively the best type of the public with which we ascend it or extend it our business and as we are a persons of the most strong designs and other affairs of the case we

High temperature samples (babble zone)
1.5   ation of pure blue he said at once a licking streamy at her warm spot of half performed note was a raging oath let it as bir of amole in mood strolling er crass
1.5   at the swing here as to motions out of the events not time and abe he was any stump headed and flow any he’s the kiln are tama why do ye take the floor

Table A1: Unconditional uLM (trained on CPC-100 units) samples, transcribed by ASR, at different
temperatures.


Figure A2: MMOS for unconditional (no prompt)
and conditional generated speech sampled at the three
reference temperatures (oracle VERT, oracle PPX, and
average temperature) (preliminary experiments).

Figure A3: Method for selecting the continuation
temperature for MMOS judgments.
