Consistent Transcription and Translation of Speech

Matthias Sperber, Hendra Setiawan, Christian Gollan,
Udhyakumar Nallasamy, Matthias Paulik

Apple
{sperber,hendra,cgollan,udhay,mpaulik}@apple.com

Abstract

The conventional paradigm in speech translation starts with a speech recognition step to generate transcripts, followed by a translation step with the automatic transcripts as input. To address various shortcomings of this paradigm, recent work explores end-to-end trainable direct models that translate without transcribing. However, transcripts can be an indispensable output in practical applications, which often display transcripts alongside the translations to users.

We make this common requirement explicit and explore the task of jointly transcribing and translating speech. Although high accuracy of transcript and translation are crucial, even highly accurate systems can suffer from inconsistencies between both outputs that degrade the user experience. We introduce a methodology to evaluate consistency and compare several modeling approaches, including the traditional cascaded approach and end-to-end models. We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency. We further introduce simple techniques for directly optimizing for consistency, and analyze the resulting trade-offs between consistency, transcription accuracy, and translation accuracy.1

1 Introduction

Speech translation (ST) is the task of translating acoustic speech signals into text in a foreign language. According to the prevalent framing of ST (e.g., Ney, 1999), given some input speech x, ST seeks an optimal translation ˆt ∈ T, while possibly marginalizing over transcripts s ∈ S:

$$\hat{t} = \operatorname*{argmax}_{t \in T} \left\{ P(t \mid x) \right\} \approx \operatorname*{argmax}_{t \in T} \left\{ \sum_{s \in S} P_{\mathrm{MT}}(t \mid s)\, P_{\mathrm{ASR}}(s \mid x) \right\} \qquad (1)$$

1We release human annotations of consistency under https://github.com/apple/ml-transcript-translation-consistency-ratings.
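As a toy illustration of Eq. 1 (the probability values below are invented for the example, not from the paper), the following sketch contrasts committing to the 1-best transcript with marginalizing over a small transcript hypothesis set. With these numbers, marginalization overturns the decision made by the hard cascade:

```python
# Toy illustration of Eq. 1: marginalizing over transcript hypotheses
# versus committing to the 1-best transcript. All values are invented.
P_ASR = {"replay the game": 0.45, "replace the game": 0.55}  # P_ASR(s | x)
P_MT = {  # P_MT(t | s)
    "replay the game": {"Spiel wiederholen": 0.9, "Spiel ersetzen": 0.1},
    "replace the game": {"Spiel wiederholen": 0.2, "Spiel ersetzen": 0.8},
}
translations = {t for dist in P_MT.values() for t in dist}

# Hard cascade: translate only the 1-best transcript.
s_best = max(P_ASR, key=P_ASR.get)                 # "replace the game"
t_hard = max(P_MT[s_best], key=P_MT[s_best].get)   # "Spiel ersetzen"

# Marginalized objective of Eq. 1: sum over transcript hypotheses.
def marginal(t):
    return sum(dist[t] * P_ASR[s] for s, dist in P_MT.items())

t_marginal = max(translations, key=marginal)       # "Spiel wiederholen"
print(t_hard, t_marginal)  # the two decision rules disagree here
```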

According to this formulation, ST models primarily focus on translation quality, while transcription receives less emphasis. In contrast, practical ST user interfaces often display transcripts to the user alongside the translations. A typical example is a two-way conversational ST application that displays the transcript to the speaker for verification, and the translation to the conversation partner (Hsiao et al., 2006). Therefore, there is a mismatch between this practical requirement and the prevalent framing as described above.

While traditional ST models often do commit to a single automatic speech recognition (ASR) transcript that is then passed on to a machine translation (MT) component (Stentiford and Steer, 1988; Waibel et al., 1991), researchers have undertaken much effort to mitigate resulting error propagation issues by developing models that avoid making decisions on transcripts. Recent examples include direct models (Weiss et al., 2017) that bypass transcript generation, and lattice-to-sequence models (Sperber et al., 2017) that translate the ASR search space as a whole. Despite their merits, such models may not be ideal for scenarios that display both a translation and a corresponding transcript to users.

In this paper, we replace Eq. 1 by a joint transcription/translation objective to reflect this requirement:

$$\hat{s}, \hat{t} = \operatorname*{argmax}_{s \in S,\, t \in T} \left\{ P(s, t \mid x) \right\} \qquad (2)$$

This change in perspective has significant implications not only on model design but also on evaluation. First, besides translation accuracy, transcription accuracy becomes relevant and equally important. Second, the issue of consistency between transcript and translation becomes essential. For example, let us consider a naive approach of transcribing and translating with two completely independent, potentially erroneous models. These independent models would expectedly produce inconsistencies, including inconsistent lexical choice caused by acoustic or linguistic ambiguity (Figure 1), and inconsistent spelling of named entities (Figure 2). Even if output quality is high on average, such inconsistencies may considerably degrade the user experience.

Our contributions are threefold: First, we introduce the notion of consistency between transcripts and translations and propose methods to assess consistency quantitatively. Second, we survey and extend existing models, and develop novel training and inference schemes, under the hypothesis that both joint model training and a coupled inference procedure are desirable for our goal of accurate and consistent models. Third, we provide a comprehensive analysis, comparing accuracy and consistency for a wide variety of model types across several language pairs to determine the most suitable models for our task and analyze potential trade-offs.
Transactions of the Association for Computational Linguistics, vol. 8, pp. 695–709, 2020. https://doi.org/10.1162/tacl_a_00340
Action Editor: David Chiang. Submission batch: 4/2020; Revision batch: 7/2020; Published 11/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


2 Evaluation Beyond Accuracy—The Need for Consistency

To better understand the desiderata of models that perform transcription and translation, it is helpful to discuss how one should evaluate such models. A first step is to evaluate transcription accuracy and translation accuracy in isolation. For this purpose, we can use well-established evaluation metrics such as word error rate (WER) for transcripts and BLEU (Papineni et al., 2002) for translations. When considering scenarios in which both transcript and translation are displayed, consistency is an essential additional requirement.2 Let us first clarify what we mean by this term.
this term.

Definition: Consistency between transcript and translation is achieved if both are semantically equivalent, with a preference for a faithful translation approach (Newmark, 1988), meaning that stylistic, lexical, and grammatical characteristics should be transferred whenever fluency is not compromised. Importantly, consistency measures are defined over the space of both well-formed and erroneous sentence pairs. In the case of ungrammatical sentence pairs, consistency may be achieved by adhering to a literal or word-for-word translation strategy.

Consistency is only loosely related to accuracy, and can even be in opposition in some cases. For example, when a translation error cannot be avoided, consistency is improved at the cost of transcription accuracy by placing the back-translated error in the transcript. Because accuracy and error metrics assess transcript or translation quality in isolation, these metrics cannot capture phenomena that involve the interplay between transcript and translation.

2Other important ST use cases do not show both transcript and translation at the same time, such as multilingual movie subtitling. For such cases, consistency may be less critical.

Figure 1: Example of lexical inconsistencies we encountered when generating transcript and translation independently. Although the transcript correctly contains replay, the German translation (mistakenly) chooses ersetzen (English: replace). The inconsistency is explained by the acoustic similarity between replay and replace, which is not obvious to a monolingual user.

Figure 2: Illustration of surface-level consistency between English transcript and German translation. Only translation 1 spells both named entities (Bill Gross and eSolar) consistently, and the German translation Solarthermaltechnologie (translation 1) is preferred over Solarwärme-Technologie (translation 2), by itself a correct choice but less similar on the surface level.




2.1 Motivational Use Cases

Although ultimately user studies must assess to what extent consistency improves user satisfaction, our intention in this paper is to provide a universally useful notion of consistency that does not depend too much on specific use cases. Nevertheless, our definition may be most convincing when put in the context of specific example use cases.

Lecture Use Case. Here, a person follows a presentation or lecture-like event, presented in a foreign language, by reading transcript and translation on a screen (Fügen, 2008). This person may have partial knowledge of the source language, but knows only the target language sufficiently well. She, therefore, pays attention mainly to the translation outputs, but may occasionally consult the transcription output in cases where the translation seems wrong. In this case, quick orientation can be critical, and inconsistencies would cause distraction and undermine trust and perceived transparency of the transcription/translation service.

Dialog Use Case. Next, consider the scenario of a dialog between two people who speak different languages. One person, the speaker, attempts to convey a message to the recipient, relying on an ST service that displays a transcript and a translation. Here, the transcript is shown to the speaker, who speaks only the source language, for purposes of verification and possibly correction. The translation is shown to the recipient, who only understands the target language, to convey the message (Hsiao et al., 2006). We can expect that if transcript and translation are error-free, then the message is conveyed smoothly. However, when the transcript or translation contains errors, miscommunication occurs. To efficiently recover from such miscommunication, both parties should agree on the nature and details of the mistaken content. In other words, occurring errors are preferred to be consistent between transcript and translation.

3 Estimating Consistency

Having argued for consistency as a desirable property, we now wish to empirically quantify the level of consistency between a particular model's transcripts and translations. To our knowledge, consistency has not been addressed in the context of ST before, perhaps because traditional cascaded models have not been observed to suffer from inconsistencies in the outputs. Therefore, we propose several metrics for estimating transcript/translation consistency in this section. In §7.3, we demonstrate strong agreement of these metrics with human ratings of consistency.

3.1 Lexical Consistency

Our first metric focuses on semantic equivalency in general, and consistent lexical choice in particular, as illustrated in Figure 1. To this end, we use a simple lexical coverage model based on word-level translation probabilities. This approach might also capture some aspects of grammatical consistency by rewarding the use of comparable function words. We sum negative translation log-probabilities for each utterance: $t_{t \to s} = -\sum_{t_j \in t} \max_{s_i \in s} \log p(t_j \mid s_i)$. We then normalize across the test corpus C and average over both translation directions: $\frac{1}{2} \left( \frac{1}{n} \sum_{(s,t) \in C} t_{t \to s} + \frac{1}{m} \sum_{(s,t) \in C} t_{s \to t} \right)$, where n and m denote the number of translated and transcribed words in the corpus, respectively. In practice, we use fast_align (Dyer et al., 2013) to estimate probability tables from our training data. When a word has no translation probability assigned, including out-of-vocabulary cases, we use a simple smoothing method by assigning the lowest score found in the lexicon.
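A minimal sketch of this lexical consistency score, assuming a word-level probability table keyed as lexicon[(target_word, source_word)], e.g., estimated externally with fast_align; the table layout and the smoothing floor are illustrative assumptions, not the paper's code:

```python
import math

def utterance_score(tgt_words, src_words, lexicon, floor):
    """Sum of negative log-probs: -sum_j max_i log p(t_j | s_i)."""
    total = 0.0
    for t in tgt_words:
        probs = [lexicon.get((t, s), floor) for s in src_words]
        total -= max(math.log(p) for p in probs)
    return total

def lexical_consistency(corpus, lex_t_given_s, lex_s_given_t):
    """corpus: list of (transcript_words, translation_words) pairs.
    Each direction is normalized by its corpus word count, then averaged."""
    floor_ts = min(lex_t_given_s.values())  # smoothing: lowest lexicon score
    floor_st = min(lex_s_given_t.values())
    n = sum(len(t) for _, t in corpus)      # number of translated words
    m = sum(len(s) for s, _ in corpus)      # number of transcribed words
    sum_ts = sum(utterance_score(t, s, lex_t_given_s, floor_ts) for s, t in corpus)
    sum_st = sum(utterance_score(s, t, lex_s_given_t, floor_st) for s, t in corpus)
    return 0.5 * (sum_ts / n + sum_st / m)
```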

Although it may seem tempting to use a more elaborate translation model such as an encoder-decoder model, we deliberately choose this simple lexical approach. The main reason is that we need to estimate consistency for potentially erroneous transcript/translation pairs. In such cases, we found severe robustness issues when computing translation scores using a full-fledged encoder-decoder model.

3.2 Surface Form Consistency

Our consistency definition mentions a preference for a stylistic similarity between transcript and translation. One way of assessing stylistic aspects is to compare transcripts and translations at the surface level. This is most sensible when the source and target language are related, and could help capture phenomena such as consistent spelling of named entities, or translations using words with similar surface form as found in the transcript. Figure 2 provides an illustration.


We propose to assess surface form consistency through substring overlap. Our notion of substring overlap follows CharCut, which was proposed as a metric for reference-based MT evaluation (Lardilleux and Lepage, 2017). Following Eq. 2 of that paper, we determine substring insertions, deletions, and shifts in the translation, when compared with the transcript, and compute $1 - \frac{\text{deletions} + \text{insertions} + \text{shifts}}{|s| + |t|}$. Counts are aggregated and normalized at corpus level. To avoid spurious matches, we match only substrings of at least length n (here: 5), compare in case-sensitive fashion, and deactivate CharCut's special treatment of longest common prefixes/suffixes.
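A simplified sketch of surface form consistency via substring overlap. Note this stand-in uses Python's difflib rather than the actual CharCut implementation, so reordered ("shifted") substrings receive no credit here; it only illustrates the character-coverage idea:

```python
from difflib import SequenceMatcher

def matched_chars(transcript, translation, min_len=5):
    """Characters covered by case-sensitive common substrings of
    length >= min_len (non-crossing matches only, unlike CharCut)."""
    sm = SequenceMatcher(None, transcript, translation, autojunk=False)
    return sum(2 * b.size for b in sm.get_matching_blocks() if b.size >= min_len)

def corpus_surface_consistency(pairs, min_len=5):
    """pairs: list of (transcript, translation) strings.
    Counts are aggregated over the corpus before normalizing, mirroring
    1 - (deletions + insertions + shifts) / (|s| + |t|)."""
    matched = total = 0
    for s, t in pairs:
        matched += matched_chars(s, t, min_len)
        total += len(s) + len(t)
    return matched / total if total else 0.0
```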

We note that surface form consistency is less suited to language pairs that use different alphabets, and leave it to future work to explore alternatives, such as the assessment of cross-lingual phonetic similarity in such cases.

3.3 Correlation of Transcription/Translation Error

This third metric bases consistency on well-established accuracy metrics or error metrics. We posit that a necessary (though not sufficient) condition for consistency is that the accuracy of the transcript should be correlated with the accuracy of the translation, where both are measured against some respective gold standard. We therefore propose to assess consistency through computing statistical correlation between utterance-level error metrics for transcript and translation.

Specifically, for a test corpus of size N, we compute Kendall's τ coefficient across utterance-level error metrics. On the transcript side, we use utterance-level WER as the error metric. Because BLEU is a poor utterance-level metric, we make use of CharCut on the translation side, which has been shown to correlate well with human judgment at utterance level (Lardilleux and Lepage, 2017). Formally, we compute:

$$\mathrm{kendall}_\tau \left( \mathrm{WER}^{\text{clipped}}_{1:N},\ \mathrm{CharCut}_{1:N} \right) \qquad (3)$$

Because CharCut is clipped above 1, we also apply clipping to utterance-level WER for stability.
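A hedged sketch of this correlation metric, assuming utterance-level CharCut scores are obtained from the released CharCut tool; the WER helper below is a standard edit-distance implementation, not the paper's code:

```python
import numpy as np
from scipy.stats import kendalltau

def clipped_wer(hyp_words, ref_words):
    """Utterance-level word error rate via edit distance, clipped above 1."""
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1), dtype=int)
    d[0, :] = np.arange(len(hyp_words) + 1)
    d[:, 0] = np.arange(len(ref_words) + 1)
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1, j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return min(1.0, d[-1, -1] / max(1, len(ref_words)))

def error_correlation(wers, charcuts):
    """Eq. 3: Kendall's tau between utterance-level WER and CharCut."""
    tau, _ = kendalltau(wers, charcuts)
    return tau
```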


Figure 3: Dialog use case. Whenever the transcript or the translation has errors, additional effort is needed.

3.4 Combined Metric for Dialog Task

The previous metrics estimate consistency in a fashion that is complementary to accuracy, such that it is possible to achieve good consistency despite poor accuracy. This allows trading off accuracy against consistency, depending on specific task requirements. Here, we explore a particular instance of such a task-specific trade-off that arises naturally through the formulation of a communication model. We consider a dialog situation (§2.1), and assume that communication will be successful if and only if both transcript and translation do not contain significant deviations from some reference, as motivated in Figure 3. Conceptually, the main difference to §3.3 is that here we penalize, rather than reward, the bad/bad situation (Figure 3). To estimate the probability of some generated transcript and translation allowing successful communication, given reference transcript and translation, we thus require that both the transcript and the translation are sufficiently accurate. For utterance with index k:

$$P(\mathrm{succ}_k \mid \mathrm{ref}) = P(s_k\ \mathrm{ok} \cap t_k\ \mathrm{ok} \mid \mathrm{ref}) = P(s_k\ \mathrm{ok} \mid \mathrm{ref}) \times P(t_k\ \mathrm{ok} \mid s_k, \mathrm{ref}) \approx P(s_k\ \mathrm{ok} \mid \mathrm{ref}) \times P(t_k\ \mathrm{ok} \mid \mathrm{ref}) \qquad (4)$$

We then use utterance-level accuracy metrics as a proxy, computing $\mathrm{accuracy}(s_k) = 1 - \mathrm{WER}^{\text{clipped}}_k$ and $\mathrm{accuracy}(t_k) = 1 - \mathrm{CharCut}_k$. For a test corpus of size N we compute corpus-level scores as $\frac{1}{N} \sum_{1 \le k \le N} P(\mathrm{succ}_k)$.
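A minimal sketch of this combined metric, assuming per-utterance clipped WER and CharCut scores have already been computed as above:

```python
def communication_success(wer_clipped, charcut):
    """P(succ_k) under Eq. 4: both transcript and translation must be
    sufficiently accurate; accuracies serve as probability proxies."""
    acc_s = 1.0 - wer_clipped   # accuracy(s_k) = 1 - clipped WER
    acc_t = 1.0 - charcut       # accuracy(t_k) = 1 - CharCut_k
    return acc_s * acc_t

def corpus_success(wers, charcuts):
    """Corpus-level score: mean per-utterance success probability."""
    scores = [communication_success(w, c) for w, c in zip(wers, charcuts)]
    return sum(scores) / len(scores)
```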

4 Models for Transcription and Translation

We now turn to discuss model candidates for consistent transcription and translation of speech (Figures 4–5). We hypothesize that there are two desirable model characteristics in our scenario. First, motivated by Eq. 2, models may achieve better consistency by performing joint inference, in the sense that no independence assumption between transcript and translation is introduced at decoding time.
Chiffre 4: Cascaded and direct model types.

For ease of reference, we use enc(·) to refer to the encoder component that transforms speech inputs (or embedded text inputs) into hidden encoder representations, dec(·) to refer to the attentional decoder component that produces hidden decoder states auto-regressively, and SoftmaxOut(·) to refer to the output softmax layer that models discrete output token probabilities. We will subscript components with the parameter sets π, φ to indicate cases in which model components are separately parametrized.

4.2 Cascaded Model (CASC)

The cascaded model (Figure 4a) represents ST's traditional approach of using separately trained ASR and MT models (Stentiford and Steer, 1988; Waibel et al., 1991). Here, we use modern sequence-to-sequence ASR and MT components. CASC runs a speech input $x_{1:l}$ through an ASR model:

$$g_{1:l} = \mathrm{enc}_\phi(x_{1:l}), \quad u_i = \mathrm{dec}_\phi(u\ldots$$
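A minimal sketch of the cascaded inference this subsection begins to define, assuming hypothetical asr_model and mt_model sequence-to-sequence objects with a decode() method returning the 1-best output sequence; this is an illustration of the pipeline structure, not the paper's implementation:

```python
def cascaded_inference(speech_features, asr_model, mt_model):
    """Hedged sketch of CASC inference: the ASR stage commits to a single
    transcript, which the separately trained MT stage then translates.
    asr_model / mt_model are hypothetical seq2seq components."""
    transcript = asr_model.decode(speech_features)  # s-hat = argmax P_ASR(s | x)
    translation = mt_model.decode(transcript)       # t-hat = argmax P_MT(t | s-hat)
    return transcript, translation
```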