CANINE: Pre-training an Efficient Tokenization-Free Encoder
for Language Representation

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting
Google Research, United States

{jhclark,dhgarrette,iuliaturc,jwieting}@google.com

Abstract

Pipelined NLP systems have largely been su-
perseded by end-to-end neural modeling, yet
nearly all commonly used models still require
an explicit tokenization step. While recent to-
kenization approaches based on data-derived
subword lexicons are less brittle than manually
engineered tokenizers, these techniques are
not equally suited to all languages, and the use
of any fixed vocabulary may limit a model’s
ability to adapt. In this paper, we present
CANINE, a neural encoder that operates directly
on character sequences—without explicit to-
kenization or vocabulary—and a pre-training
strategy that operates either directly on char-
acters or optionally uses subwords as a soft
inductive bias. To use its finer-grained input
effectively and efficiently, CANINE combines
downsampling, which reduces the input se-
quence length, with a deep transformer stack,
which encodes context. CANINE outperforms a
comparable mBERT model by 5.7 F1 on TYDI
QA, a challenging multilingual benchmark,
despite having fewer model parameters.

1 Introduction

End-to-end neural models have generally replaced
the traditional NLP pipeline, and with it, the er-
ror cascades and feature engineering common to
such systems, preferring instead to let the model
automatically induce its own sophisticated repre-
sentations. Tokenization, however, is one of the
few holdovers from that era, with nearly all com-
monly used models today requiring an explicit
preprocessing stage to segment a raw text string
into a sequence of discrete model inputs. Broadly
speaking, tokenizers are generally either carefully
constructed systems of language-specific rules,

CANINE: Character Architecture with No tokenization In Neural Encoders.

Code and checkpoints are available on GitHub at http://caninemodel.page.link/code.


which are costly, requiring both manual feature
engineering and linguistic expertise, or data-
driven algorithms such as Byte Pair Encoding
(Sennrich et al., 2016), WordPiece (Wu et al.,
2016), or SentencePiece (Kudo and Richardson,
2018) that split strings based on frequencies in a
corpus, which are less brittle and easier to scale,
but are ultimately too simplistic to properly handle
the wide range of linguistic phenomena that can’t
be captured by mere string-splitting (§2.1).

The degree of sophistication required to ac-
curately capture the full breadth of linguistic
phenomena, along with the infeasibility of writing
such rules by hand across all languages and do-
mains, suggests that explicit tokenization itself is
problematic. In contrast, an end-to-end model that
operates directly on raw text strings would avoid
these issues, instead learning to compose indi-
vidual characters into its own arbitrarily complex
features, with potential benefits for both accuracy
and ease of use. While this change is conceptu-
ally very simple—one could replace the subword
vocabulary in a model like BERT (Devlin et al.,
2019) with a vocabulary made solely of individ-
ual characters—doing so leads to two immediate
problems. First, the computational complexity of a
transformer (Vaswani et al., 2017), the main com-
ponent in BERT as well as other models such as
GPT (Radford et al., 2019; Brown et al., 2020)
and T5 (Raffel et al., 2020), grows quadrati-
cally with the length of the input. Since standard
subword models have roughly four characters per subword on average, the 4x increase in input sequence length would result in a significantly slower model. Second, simply switching to a character vocabulary yields empirically poor results (§4.2).

In order to enable tokenization-free model-
ing that overcomes these obstacles, we present
CANINE. CANINE is a large language encoder with
a deep transformer stack at its core. Inputs to the

Transactions of the Association for Computational Linguistics, vol. 10, pp. 73–91, 2022. https://doi.org/10.1162/tacl_a_00448
Action Editor: Shay Cohen. Submission batch: 4/2021; Revision batch: 8/2021; Published 1/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


model are sequences of Unicode characters.1 To
represent the full space of Unicode characters2
without a vocabulary, we use a hashing strategy.
To avoid the slowdown from increasing the se-
quence length, CANINE uses strided convolutions
to downsample input sequences to a shorter length
before the deep transformer stack.

Like BERT, we pre-train CANINE on the Masked
Language Model (MLM) and Next Sentence Pre-
diction (NSP) tasks. For the MLM task, CANINE
offers two options:

1. A fully character-level loss that autoregressively predicts characters in masked spans.

2. A vocabulary-based loss that predicts the identities of masked subword tokens. Critically, this tokenization is used only for the pre-training loss; tokens are never input to the encoder, and the tokenizer and subword vocabulary can be safely discarded after pre-training. This effectively converts the hard constraint of token boundaries found in other models into a soft inductive bias in CANINE.

k-t-b       ‘‘write’’ (root form)
kataba      ‘‘he wrote’’
kattaba     ‘‘he made (someone) write’’
iktataba    ‘‘he signed up’’

Table 1: Non-concatenative morphology in Arabic.3 When conjugating, letters are interleaved within the root. The root is therefore not separable from its inflection via any contiguous split.

In this article, we contribute:

• the first pre-trained tokenization-free deep encoder;

• an efficient model architecture that directly encodes long sequences of characters with speed comparable to vanilla BERT; and

• a model that performs no tokenization on the input, avoiding the lossy information bottleneck associated with most pre-processing.

2 Motivation

2.1 Linguistic Pitfalls of Tokenization

Subword tokenizers are the de facto standard in
modern NLP (Devlin et al., 2019; Raffel et al.,

1We consider splitting on Unicode characters to be
tokenization-free because it depends only on the (determinis-
tic) process defined by the Unicode standard, and not on any
models, hand-crafted rules, or other linguistic knowledge.

2Unicode defines 1,114,112 total codepoints, of which
only 143,698 are assigned to characters as of Unicode 13.0.
This covers 154 scripts and over 900 languages.

2020; Brown et al., 2020). These algorithms are
limited to only simple word-splitting operations.
While this is perhaps a reasonable approach for a language with impoverished morphology such as English, it is much less appropriate in the face of phenomena like agglutinative morphology, non-concatenative morphology (Table 1), consonant mutation, vowel harmony, and so on.

Even in high-resource languages, subword
models still tend to struggle on challenging do-
mains, such as informal text, which often includes
typos, spelling variation,4 transliteration, or emoji
(O’Connor et al., 2010). BERT, which uses Word-
Piece tokenization, is sensitive to corruptions of
the input, both natural typos (Sun et al., 2020) and
adversarial manipulations (Pruthi et al., 2019),
with some of the loss attributable to corrupted
strings no longer being covered by the vocabulary.
Seemingly safe heuristics used by these al-
gorithms, such as splitting on whitespace and
punctuation, are problematic when applied to lan-
guages that do not use spaces between words
(Thai, Chinese) or use punctuation as letters
(Hawaiian,5 Twi6). While SentencePiece does of-
fer the option to skip whitespace splitting, it is not
typically used due to poor empirical performance.
Fixed vocabulary methods can also force mod-
elers to choose between difficult preprocessing
tradeoffs: Should one keep accents, casing, and so
forth, and avoid destructive preprocessing?—Or

3From en.wikipedia.org/wiki/Arabic_verbs.
4For example, Spanish speakers may drop accents when typing.

5Hawaiian uses an apostrophe to indicate a glottal stop.
6Informal Twi uses a right paren ) to represent the letter ɔ.


keep such orthographic information and risk im-
portant words dropping out of the frequency-based
vocabulary altogether due to the presence of mul-
tiple variants of otherwise-similar words? For
instance, mBERT initially removed all diacritics,
thus dropping tense information in Spanish7 and
conflating many unrelated words in Vietnamese.8
Finally, using a fixed vocabulary during pre-
training also creates complications for down-
stream tasks, which are subsequently tied to
the same tokenizer and vocabulary used for pre-
training, even if it is not well-suited for the target
domain and/or end-task. Boukkouri et al. (2020)
showed that BERT’s Wikipedia+BooksCorpus
WordPiece vocabulary results in excessive seg-
mentation when fine-tuning on medical data, dim-
inishing the benefit of pre-training as a strategy.

2.2 Enabling Better Generalization

Much as Tenney et al. (2019) showed that large
encoders learn elements of
the classic NLP
pipeline, it seems natural to let the model dis-
cover tokenization as well. With this in mind, we
seek an approach that can better generalize be-
yond the orthographic forms encountered during
pre-training.

In terms of scientific inquiry, we would like to
know whether we can build models that learn how
to compose words where appropriate, and mem-
orize them where memorization is needed. Large
frequency-derived vocabularies partially mitigate
this problem by simply memorizing more, but
language inherently requires aspects of both mem-
orization and composition. By building a model
that directly engages with these issues within the
small scale of word composition, we hope to en-
able future work studying these problems at larger
scales such as phrasal constructions.

Practically, generalization is hindered for vo-
cabulary elements that are slight orthographic
variations, where one is very infrequent. Hypo-
thetically, a model may estimate a very good
embedding for a common vocabulary element kit-
ten, but a poor embedding for the less frequent
element kittens since the model has no a priori
knowledge that they are related. Embeddings that
are rarely touched during pre-training will not be
updated much beyond their random initializations.

2.3 Reducing Engineering Effort

Mature tokenizers often include years of hand-
engineered rules around special cases such as
email addresses, URLs, and handling unknown
words;9 even fairly minimal modern tokenizers
include initial word-splitting heuristics followed
by a specific algorithm and vocabulary for further
breaking these tokens into subwords.

Modern pre-trained models also have many re-
quirements throughout their lifecycle: Entre
the time a model is pre-trained, fine-tuned, et
served—potentially months or years apart—its
weights and model implementation may be con-
verted to be compatible with another toolkit, its
fine-tuning data may be tokenized in a different
way, and the natural distribution of words may be
quite different. All of these things introduce ample
opportunities for mismatches to arise between to-
kenization and the vocabulary from pre-training.
Yet this same pre-training paradigm presents an
advantage for character models: access to far more
(unsupervised) data to learn word composition
from characters; without transfer learning, this
has historically been impractical for many tasks
having little supervised data.

3 CANINE

CANINE consists of three primary components:
(1) a vocabulary-free technique for embedding
text; (2) a character-level model that is efficient
by means of downsampling and upsampling; and
(3) an effective means of performing masked
language modeling on a character-level model.

3.1 Model

CANINE is designed to be a minimally modified
variant of the deep transformer stack found in
modern encoders such as GPT, (m)BERT, XLM,
and XLM-R such that its architecture is easily
adoptable by other models in this family. Le
simplest implementation of such a character model
would be to feed characters at each position in
place of subwords. However, this approach would
result in far more sequence positions given the
same input text, leading to linearly more compute
in feed forward layers and quadratically more
compute in self-attention layers.

7Spanish past tense uses an accented final vowel.
8Vietnamese uses diacritics to indicate tones—often the

only difference among several unrelated content words.

9For example, should a subword containing an unknown
character be a separate token, or should the unknown
character be separated as its own token?


Figure 1: CANINE neural architecture.

The overall form of the CANINE model (Figure 1)
is the composition of a downsampling function
DOWN, a primary encoder ENCODE, and an upsam-
pling function UP;10 given an input sequence of
character embeddings e ∈ Rn×d with length n
and dimensionality d:

Yseq ← UP (ENCODE (DOWN(e)))

where Yseq ∈ Rn×d is the final representation for sequence prediction tasks. Similarly, for classification tasks, the model simply uses the zeroth element of the primary encoder:

ycls ← [ENCODE (DOWN(e))]0

Preprocessing Like existing models, the input
to CANINE must ultimately be represented as a
sequence of integers, but because the nature of
characters is well-defined and standardized by
Unicode, preprocessing code that would typically
be hundreds or thousands of lines can be replaced
by a very simple procedure: just iterate over the characters in the input string, and return their codepoint integer values (e.g., a single line of code11 in Python). In addition, because codepoint values are part of the Unicode Standard, they are documented publicly, already supported by programming languages, and will not change over time, unlike arbitrary vocabulary-based IDs.

Character Hash Embeddings CANINE uses
hashing (Svenstrup et al., 2017) to support em-
bedding the full space of Unicode codepoints with

10Enveloping the attention stack between downsampling and upsampling layers is similar to the Funnel-Transformer (Dai et al., 2020), which operates on WordPiece. However, many of its design choices (e.g., average pooling, their residual structure) did not work well in CANINE.

11Python preprocessing: [ord(c) for c in text].

a relatively small number of parameters—but, to reduce the chance that different codepoints will share exactly the same representation, we define a generalization of the standard hashing approach in which we apply multiple hash functions to each codepoint and concatenate the representations associated with the various hash values.

More formally, given a single codepoint12 xi ∈ N, we apply K hash functions Hk : N → N, and look up each hashing result in its own embedding matrix13 Ek ∈ RB×d′, yielding K embeddings of size d′ = d/K, which are then concatenated into a single representation of size d:

ei ← ⊕_{k=1…K} LOOKUP(Hk(xi) % B, Ek)

where ⊕ denotes vector concatenation. We refer to these as the character embeddings e ∈ Rn×d. In our experiments, we use d = 768, K = 8, and B = 16k.14

While each individual hash function is subject
to hash collisions,15 the overall effect is minimal
since each function only accounts for a small por-
tion of the codepoint’s overall embedding, and it
is highly improbable that the other hash functions
will produce the same collisions.
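As an illustration only, the following NumPy sketch mirrors this multi-hash embedding; the specific hash family (affine maps with fixed multipliers) and the random initialization are assumptions made for the example, while d = 768, K = 8, and B = 16k follow the values above.

import numpy as np

d, K, B = 768, 8, 16_000             # embedding size, hash functions, buckets (values from the text)
d_prime = d // K                     # each hash function contributes a d/K-dimensional slice

rng = np.random.default_rng(0)
# One learned embedding matrix per hash function (randomly initialized here for the sketch).
embedding_tables = [rng.normal(0.0, 0.02, size=(B, d_prime)) for _ in range(K)]

# Illustrative hash family; the paper does not specify the hash functions.
MULTIPLIERS = [31, 43, 59, 61, 73, 97, 101, 127]

def hash_k(codepoint: int, k: int) -> int:
    return (codepoint * MULTIPLIERS[k] + k) % B

def embed_text(text: str) -> np.ndarray:
    """Map a string to character embeddings of shape (len(text), d), no vocabulary needed."""
    rows = []
    for ch in text:
        cp = ord(ch)                                    # Unicode codepoint (cf. footnote 11)
        slices = [embedding_tables[k][hash_k(cp, k)]    # one d/K slice per hash function
                  for k in range(K)]
        rows.append(np.concatenate(slices))             # concatenate into a single d-dim vector
    return np.stack(rows)

print(embed_text("क्या हाल है?").shape)                  # (12, 768): works for any script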

12Conceptually, a codepoint is a character; however, a Unicode codepoint is defined precisely and unambiguously.
13CANINE uses learned embeddings, not random embeddings as in other hash embeddings (Kaliamoorthi et al., 2019).
14The memory footprint of these hash embeddings is equivalent to a vocabulary embedding with 16k items.
15This is not a probing/chaining hash table, but rather an approximate map, where we expect and tolerate collisions, similar to a Bloom Map (Talbot and Talbot, 2008).

Because the model always supports all codepoints, it is possible to learn representations during fine-tuning for characters (and, by extension, words, scripts, etc.) that were never seen


during pre-training, while still making use of what
pre-training learned about word composition and
sentence structure.

Optional Vocabulary-Free n-Grams We can
also redefine the embeddings ei above to include
character n-grams, again without a fixed vocab-
ulary, such that each n-gram order contributes
equally to a summed embedding:16

eNi ← ⊕_{k=1…K} Σ_{j=1…N} LOOKUP(H′k(x_{i…j}) % B, E_{j,k})

H′k(x_{i…j}) = Hk(xi)                               if i = j
H′k(x_{i…j}) = H′k(xi) + H′k(x_{(i+1)…j})           otherwise

This formulation still admits tokenization-free
modeling, but provides the model with an induc-
tive bias that favors slightly more memorization
via a compute-cheap means of adding parameters.
Notably, it also allows the model’s input signature
to remain a simple sequence of codepoints.
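Reusing hash_k, rng, d_prime, and K from the previous sketch, one possible reading of this n-gram extension is the rolling hash below: spans hash to the sum of their single-character hashes, and embeddings are summed over n-gram orders and concatenated over hash functions. The bucket count is shrunk here only to keep the demo light; the values reported in footnote 16 are B = 15k and N = 4.

N = 4                      # n-gram orders (footnote 16)
B_NG = 1_000               # small bucket count for the demo; the paper reports B = 15k
ngram_tables = [[rng.normal(0.0, 0.02, size=(B_NG, d_prime)) for _ in range(K)]
                for _ in range(N)]

def ngram_hash(cps, i, j, k):
    """H'_k over the span x_i..x_j: the sum of the single-character hashes (recursive form above)."""
    if i == j:
        return hash_k(cps[i], k)
    return hash_k(cps[i], k) + ngram_hash(cps, i + 1, j, k)

def embed_with_ngrams(text):
    """e^N_i: sum over the n-gram orders starting at i, concatenated over the K hash functions."""
    cps = [ord(c) for c in text]
    rows = []
    for i in range(len(cps)):
        slices = []
        for k in range(K):
            acc = np.zeros(d_prime)
            for n in range(N):                 # n-gram orders 1..N, truncated at the sequence end
                j = i + n
                if j >= len(cps):
                    break
                acc += ngram_tables[n][k][ngram_hash(cps, i, j, k) % B_NG]
            slices.append(acc)
        rows.append(np.concatenate(slices))
    return np.stack(rows)

print(embed_with_ngrams("tokenization-free").shape)     # (17, 768)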

Downsampling To make CANINE efficient, we use a multi-part downsampling strategy. First, we encode characters using a single-layer block-wise local attention transformer. This model performs self-attention only within each block of a pre-defined size,17 saving the quadratic cost of attention while leveraging the linguistic intuition that word composition—that is, the kind of composition relevant in the lowest layers of the model (Tenney et al., 2019)—tends to happen at a very local level. Next, we use a strided convolution to reduce the number of sequence positions to be similar to that of a word piece model.18 Given character embeddings e ∈ Rn×d with a sequence length of n characters and dimensionality d, we use a convolution with a stride of r to downsample the sequence:

hinit ← LOCALTRANSFORMER1(e)
hdown ← STRIDEDCONV(hinit, r)

We refer to this output as the downsampled positions: hdown ∈ Rm×d, where m = n/r is the number of downsampled positions. In our experiments, we use r = 4 and n = 2048 such that m = 512, giving CANINE's primary encoder—the transformer stack—the same length as in mBERT.

Deep Transformer Stack After downsampling, CANINE applies a deep transformer stack with L layers to the resulting downsampled positions. This is the same as the core of BERT and derivative models, and remains the core of CANINE in that it accounts for the vast majority of its compute and parameters, though we note that this middle portion of the model could easily be replaced with any other sequence-to-sequence model, including those with better compute performance such as Performer (Choromanski et al., 2021), Big Bird (Zaheer et al., 2020), RFA (Peng et al., 2021), ETC (Ainslie et al., 2020), and so on. This portion of the model yields a new downsampled representation h′down ∈ Rm×d:

h′down ← TRANSFORMERL(hdown)
ycls = [h′down]0

We used L = 12 to match mBERT.

Upsampling While the above architecture is sufficient for classification tasks, sequence prediction tasks require that the model expose an output layer with the same sequence length as the input (i.e., characters are the model's input and output ‘‘API’’ for tasks like tagging and span prediction).

We reconstruct a character-wise output representation by first concatenating the output of the original character transformer (above) with the downsampled representation produced by the deep transformer stack. (Note that since each downsampled position is associated with exactly r characters for a downsampling rate of r, each position of the downsampled representation is replicated r times before concatenation.) More formally,

hup ← CONV(hinit ⊕ h′down, w)
yseq ← TRANSFORMER1(hup)

where ⊕ indicates vector concatenation of the representations (i.e., not sequences) such that CONV projects from Rn×2d back to Rn×d across a window of w characters.19 Applying a final transformer layer (standard, not local) yields a final sequence representation yseq ∈ Rn×d.

16We use B = 15k and N = 4 for our n-grams.
17We use blocks of 128 characters in our experiments.
18In our experiments, we found a downsampling rate of 4X to result in high quality with a speed comparable to BERT.
19We use w = 4 in our experiments.
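To make the composition of these pieces concrete, here is a shape-level sketch in Python/NumPy. The local_transformer, deep_transformer, and conv_project functions are placeholders standing in for the learned layers (assumptions for illustration, not the actual implementation); only the downsampling, replication, and concatenation steps mirror the equations above.

import numpy as np

n, d, r, w = 2048, 768, 4, 4        # characters, model dim, downsampling rate, conv window

def local_transformer(x):           # stand-in for the single-layer block-wise local transformer
    return x                        # (the real layer attends within 128-character blocks)

def strided_conv(x, stride):        # stand-in: average each window of `stride` characters
    length, dim = x.shape
    return x.reshape(length // stride, stride, dim).mean(axis=1)

def deep_transformer(x):            # stand-in for the L = 12 layer core stack
    return x

def conv_project(x, window):        # stand-in for the width-w convolution projecting 2d -> d
    return x[:, :d]                 # keeps the shapes honest; the real projection is learned

e = np.zeros((n, d))                               # character hash embeddings
h_init = local_transformer(e)                      # (2048, 768)
h_down = strided_conv(h_init, r)                   # (512, 768); m = n / r
h_down_prime = deep_transformer(h_down)            # (512, 768)
y_cls = h_down_prime[0]                            # classification reads position 0
repeated = np.repeat(h_down_prime, r, axis=0)      # replicate each downsampled position r times
h_up = conv_project(np.concatenate([h_init, repeated], axis=1), w)   # (2048, 1536) -> (2048, 768)
y_seq = h_up                                       # final (non-local) transformer layer omitted here
print(h_down.shape, y_seq.shape)                   # (512, 768) (2048, 768)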


Residual Connections While the initial char-
acter encoder (before downsampling) and final
character encoder (after upsampling) both repre-
sent character positions, they conceptually have
very different purposes in the network. Intuitively,
we think of the initial character encoder as com-
posing characters to create a more word-like
representation, while the final character encoder is
extracting the in-context representation that’s rel-
evant for predicting the ‘‘meaning’’ of the content
at each position; CANINE must be able to deal with
additional ambiguity during upsampling since a
single downsampled position may span more than
one conceptual word. Because of the different
roles of these induced features, we do not use
residual connections from hinit to hup.

3.2 Pre-training

Recent pre-trained models ranging from BERT to
T5 have largely used variations on a masked lan-
guage model (MLM) task (also known as span
corruption) as an unsupervised pre-training loss
function—a means of generating synthetic ex-
amples that are not from any realistic task, yet
prepare a model to learn realistic tasks in fu-
ture phases of training (c'est à dire., fine-tuning). Le
CANINE pre-training procedure retains the MLM
task, and offers two distinct strategies for com-
puting the MLM loss—autoregressive character
prediction vs. subword prediction—both of which
yield a fully tokenization-free model following
pre-training. In our experiments, we use only one
of these losses at a time.

3.2.1 Autoregressive Character Loss

Span-wise Masking CANINE-C is an autoregres-
sive character loss that masks character spans
within each sequence. These spans are chosen
based on whitespace boundaries. No punctuation
splitting nor other heuristics are used. All char-
acters within the masked span are replaced by a
special mask codepoint in the input.20 No random
subword replacement is performed as there is no
subword vocabulary.21

20We use codepoints in Unicode's Private Use Area block such that the input remains a valid Unicode string.

Span Prediction CANINE-C autoregressively predicts the masked characters. The order of the masked positions is shuffled such that masked context is not necessarily revealed left-to-right, but rather a single character at a time. The pre-training data preparation is shown in Figure 2. Masked inputs are fed to the model as x.
The output of the CANINE model yseq and the
embeddings eg of the gold characters g (i.e., the character positions selected for MLM prediction)
are concatenated and then fed through a small
feed-forward neural network to project back to
the original dimensionality d; these are finally
shuffled and used by a single layer autoregressive
transformer with a left-to-right self-attention
mask:22

ˆy ← TRANSFORMERAUTOREG (eg ⊕ yseq)

This representation ˆy is then used to predict
each character. To avoid wasting time on a large
output weight matrix and softmax, the gold tar-
get classes t are bucketed codepoint IDs such
that ti = gi % B. This is similar to the strategy used in the character hash embedder (§3.1). The occasional collisions among characters are less problematic due to (a) the fact that this is an encoder-only model and (b) the fact that the embeddings must still retain contextual information in order to correctly predict characters. Because we're only
predicting a relatively small subsequence of the
input (15% in our experiments), the cost of this
layer is small.
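As a sketch of how such targets could be constructed (the exact data pipeline is not given here, so the mask codepoint and helper below are illustrative assumptions):

B = 16_000                          # output buckets, matching the character hash buckets
MASK_CP = 0xF8FF                    # some Private Use Area codepoint (assumption; cf. footnote 20)

def mask_span(text, start, end):
    """Replace one whitespace-delimited span with mask codepoints and bucket its gold targets."""
    cps = [ord(c) for c in text]
    gold = cps[start:end]                                 # g: the characters to be predicted
    masked = cps[:start] + [MASK_CP] * (end - start) + cps[end:]
    targets = [g % B for g in gold]                       # t_i = g_i % B (bucketed codepoint IDs)
    return masked, targets

masked, targets = mask_span("canine is tokenization free", 10, 22)
# During pre-training the masked positions are then predicted in a shuffled order,
# one character at a time, by the single autoregressive transformer layer above.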

3.2.2 Subword Loss

We also experiment with CANINE-S, a subword-
based loss function, to demonstrate how a token-
aware pre-training loss can still be paired with
a tokenization-free model such that the tokenizer
and vocabulary are discarded after pre-training.

Span-wise Masking Like mBERT’s MLM setup,
each span in CANINE-S corresponds to a single
subword. As with the autoregressive loss, all char-
acters within the masked span are replaced with a
special ‘‘mask’’ codepoint. Random replacements
of subwords are chosen from the vocabulary of
same-length subwords such that the length of the character sequence remains unchanged; more formally, given a subword selected for random replacement x and a vocabulary of subwords V, x's replacement will be drawn from the subset of v ∈ V where LEN(v) = LEN(x).
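A minimal sketch of this length-matched replacement, with a toy vocabulary standing in for the real subword vocabulary:

import random

def random_replacement(subword, vocab):
    """Pick a random vocabulary entry of the same character length, so that the
    character sequence length is unchanged; fall back to the original if none exists."""
    same_length = [v for v in vocab if len(v) == len(subword)]
    return random.choice(same_length) if same_length else subword

toy_vocab = ["cat", "dog", "run", "##ing", "##er", "house"]
print(random_replacement("dog", toy_vocab))   # e.g. "cat" or "run"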

21Though we expect that future work on vocabulary-free random replacement may improve quality.
22The left-to-right self-attention masking is with regard to the shuffled sequence.


Figure 2: CANINE-C pre-training data preparation (§3.2.1). Character-wise predictions are made by an autoregressive transformer layer that predicts then reveals one character at a time, in a shuffled order.

Span Prediction Within each masked character
span, CANINE-S randomly selects a character posi-
tion where the model will make a prediction; the model predicts the identity of the masked subword
via softmax. The associated subword embeddings
are discarded after pre-training.

3.2.3 Targeted Upsampling

By design, each final character representation
(after upsampling) is a function of the output of
the initial character encoder (before downsam-
pling) and the output of the deep transformer
stack—there are no inter-position dependencies
across the upsampled sequence. This depends on
the upsampler using position-wise feed-forward
projections and a single transformer layer. During
pre-training, we leverage this design to improve
speed by only performing upsampling on the se-
quence positions that will be used by the MLM
task p. More formally, we use the following
equivalent23 form of the UP function during
pre-training:

h∗up ← GATHER(p, hup)
y∗seq ← TRANSFORMER1(Q = h∗up, KV = hup)

23This highly effective targeted upsampling optimization is the primary reason that CANINE uses a full Transformer layer for the final full-length character sequence rather than a local transformer. Because a block-wise local transformer assumes uniform position-wise locality over attention blocks, it is not trivial to combine these two optimizations; the local self-attention mask would no longer be a simple block diagonal. However, this final upsampling layer is discarded for classification tasks and so does not contribute any cost. Thus, while it is possible to combine local attention and targeted upsampling, this is left as future work.

3.2.4 Modularity

Unlike previous models, CANINE removes both the vocabulary and tokenization algorithm as fossilized parts of the final model that must be replicated during fine-tuning and prediction.
Regardless of which pre-training loss is cho-
sen (characters or subwords), the use of these
components in CANINE is limited to a detail of
the pre-training procedure—an inductive bias of
the loss function—that is then discarded. Le
fine-tuning and prediction phases of the model
lifecycle never have any knowledge of what vo-
cabulary or tokenization algorithm (if any) were
used in pre-training. This allows the model to na-
tively process untokenized data, or even process
data that has been pre-processed by different tok-
enizers, a situation that would otherwise introduce
a significant skew between training phases.

4 Experiments

4.1 Experimental Setup

4.1.1 Information-Seeking QA Data

TYDI QA: Primary Tasks TYDI QA is a dataset
of information-seeking questions in 11 typolog-
ically diverse languages (Clark et al., 2020).
Questions are written before answers, leading to
less lexical and morphological overlap between
questions and answers, which are drawn from
Wikipedia. We evaluate on the primary tasks.24

Passage Selection Task (SELECTP) Given a list
of the passages in a Wikipedia article, return either
the index of the passage that answers the question,
or return NULL if the article contains no accept-
able answer.

Minimal Answer Span Task (MINSPAN) Given
a full Wikipedia article, return the start and end
byte indices of the minimal span that completely
answers the question. Alternatively, a system may

24As opposed to the simplified TYDIQA-GOLDP task, which is part of the XTREME meta-benchmark.


indicate that the article does not contain an answer,
or return YES or NO for yes/no type questions.

4.1.2 Named Entity Recognition Data
We also consider the task of named entity recogni-
tion (NER), which requires the model to identify
which spans of a sentence correspond to entities
and label the entity type. In all of our experi-
ments, we framed the task as sequence labeling,
predicting BIO-encoded span labels.
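As an illustration of BIO-encoded span labels (this is not the paper's data pipeline, just the label format):

def spans_to_bio(tokens, entities):
    """Convert (start, end, type) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Nairobi", "is", "in", "Kenya"]
print(spans_to_bio(tokens, [(0, 1, "LOC"), (3, 4, "LOC")]))
# ['B-LOC', 'O', 'O', 'B-LOC']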

CoNLL NER We use Spanish and Dutch data
from the CoNLL 2002 NER task (Tjong Kim Sang,
2002) and English and German from the CoNLL
2003 NER task (Tjong Kim Sang and De Meulder,
2003), all from the newswire domain.

MasakhaNER To widen the scope of our exper-
iments beyond European languages, we also in-
clude MasakhaNER (Adelani et al., 2021), which
includes ten African languages (Amharic, Hausa,
Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin,
Swahili, Wolof, and Yor`ub´a) with human anno-
tations on local news text.

4.1.3 Model Configuration
Direct Comparison with mBERT
In order to de-
termine which pre-training architecture produces
better quality downstream predictions, we com-
pare CANINE to mBERT, which we re-implemented
and re-trained in order to hold as many variables
as possible constant. Note that we intentionally
do not compare against public pre-trained check-
points that use different pre-training corpora since
(a) this would be a major confounding variable and (b) most publicly available pre-trained models are simply instantiations of BERT, including
XLM-R25 and X-STILTS.26

Setup We pre-train on the multilingual Wikipedia
data of mBERT, which includes 104 languages.
Similarly, we reuse mBERT's exponential smooth-
ing technique to weight the languages within the
pre-training samples. We train for 124k steps with
batch size 4096 (2.5 passes over the data) using the
LAMB optimizer (You et al., 2020) with a linearly
decayed learning rate of 0.018 where 2.5% of the
steps are used for warm-up. We use a sequence
length of 512 for mBERT, and 2048 for CANINE,
which results in 512 downsampled positions in

25XLM-R instantiates BERT with a larger pre-training

corpus, larger model size, and larger vocabulary size.

26X-STILTS performs English fine-tuning on an existing

XLM-R checkpoint (Phang et al., 2020).

its core deep transformer stack. We pre-train on
64 Cloud TPUs v327 for approximately one day
(see results for precise timings). For both mBERT
and CANINE-S (CANINE with the subword loss), we
select 15% of subwords for the MLM loss and
predict up to 80 output positions; 80% of these are
masked in the input, 10% are randomly replaced,
and 10% are unmodified. For CANINE-C (CANINE
with the autoregressive character loss), we se-
lect 15% of contiguous spans for the MLM loss
and predict up to 320 output characters, and no
random replacement is performed. For TYDI QA,
we use a maximum answer length of 100 char-
acters, which is approximately the 99th percentile
answer length. Sequences longer than the max-
imum sequence length are zero-padded, follow-
ing BERT.28

4.2 TYDI QA Results

Our main result is shown in Table 2. CANINE-S
(CANINE with the subword loss) improves over
mBERT in the TYDI QA SELECTP task by 2.8 F1,
while using about 30% fewer parameters. Simi-
larly, CANINE-C (CANINE with the autoregressive
character loss) improves over mBERT by 2.5 F1.
Adding vocab-free character n-grams leads to even
further gains over mBERT (+3.8 F1) and even more
on the MINSPAN task (+6.9 F1). A language-wise
breakdown is provided in Table 7 in the Appendix.
We also present results from some ablation
models as additional baselines in rows 3–4 of
Table 2. First, for row 3, we simply replace BERT's subword vocabulary with a pure character vocabulary, which makes characters both the input granularity and the unit of masking and prediction for the MLM task, and observe that not only is the model 10X slower than subword-based BERT, but the quality also suffers greatly. Then, for row 4, we modify that model to use subwords for masking and MLM predictions, while keeping characters as the input granularity, and we see a substantial quality improvement, though pre-training remains extremely slow. Finally, by comparing to the full

27v3 TPUs have 16 GiB memory / core (128 GiB total).
28Each pre-training uses approximately 24 hours on 64 TPUs (1.5k TPU-hours), so the 18 pre-trainings in Tables 2/3/4 required about 28k TPU-hours. The 18 TyDi QA experiments in these tables each take about 1 hour on 16 TPUs, each with 3 replicas (48 TPU-hours), about 1k TPU-hours total. The 3 NER experiments in Table 5 each took 3 hours on 4 TPUs with 3 replicas each (36 TPU-hours), 108 TPU-hours total. Thus replicating the experiments in this paper would take approximately 29k TPU-hours.


Model | Input | MLM | r | Length | Examples/sec | Params | TYDI QA SELECTP | TYDI QA MINSPAN
mBERT (public) | Subwords | Subwords | – | 512 | – | 179M | 63.1 | 50.5
mBERT (ours) | Subwords | Subwords | – | 512 | 9000 | 179M | 63.2 | 51.3
– | Chars | Single Chars | 1 | 2048 | 925 | 127M | 59.5 (–3.7) | 43.7 (–7.5)
– | Chars | Subwords | 1 | 2048 | 900 | 127M | 63.8 (+0.6) | 50.2 (–1.0)
CANINE-S | Chars | Subwords | 4 | 2048 | 6400 | 127M | 66.0 (+2.8) | 52.5 (+1.2)
CANINE-C | Chars | Autoreg. Chars | 4 | 2048 | 6050 | 127M | 65.7 (+2.5) | 53.0 (+1.7)
CANINE-C + n-grams | Chars | Autoreg. Chars | 4 | 2048 | 5600 | 167M | 68.1 (+4.9) | 57.0 (+5.7)

Table 2: Direct comparison between mBERT (rows 1–2) and CANINE (rows 5–7) on TYDI QA. Public mBERT results are taken from the TYDI QA paper. Rows 3 and 4 show simple baselines that yield inefficient / low-quality performance. Despite operating on 4x more sequence positions, CANINE remains comparable to mBERT in terms of speed. Pre-training examples/sec are shown for our reported hardware (see Setup, §4.1). r represents the ratio for downsampling. Parameters are calculated at fine-tuning time. All results are averaged over 3 fine-tuning replicas. TYDI QA scores are F1 scores, macro-averaged across languages. Deltas from our mBERT (the most comparable baseline) are shown in parentheses.

Question: Chelsea ina milikiwa na nani?
(Who owns Chelsea?)
Passage Answer: Kwa kawaida Chelsea huvaa jezi ya blu, kaptula blu na soksi nyeupe. Nembo ya klabu imebadilishwa mara nyingi kulingana na wakati na kuboresha muonekano wa klabu. Nembo ya sasa inaonesha picha ya simba akiwa amebeba mkuki. Tangu Julai 2003, Chelsea imekuwa ikimilikiwa na Bilionea wa Kirusi, Roman Abramovich.
(Chelsea usually wear blue jerseys, blue shorts and white socks. The club logo has been changed many times over time and improved the club's appearance. The current emblem shows a picture of a lion carrying a spear. Since July 2003, Chelsea has been owned by Russian billionaire Roman Abramovich.)

Question: Kampuni isambazayo umeme nchini Kenya inaitwaje?
(What is the name of the company that distributes electricity in Kenya?)
Passage Answer: Kenya Power and Lighting (KPLC) ni kampuni inayohusika na maambukizi ya umeme na usambazaji wa umeme nchini Kenya.
(Kenya Power and Lighting (KPLC) is a company responsible for electricity transmission and distribution in Kenya.)

Table 3: Kiswahili examples in which CANINE improved over mBERT in the TYDI QA SELECTP task. On examining mBERT's subword tokenization, we observe that the segmentations do not align well, putting more pressure on the model to combine them and leaving more opportunities for some embeddings to be poorly estimated. Top: The model must match a key word in the question milikiwa (own) to a morphological variant in the answer iki-milikiwa (to be owned). mBERT's WordPiece segmentation produces milik -iwa and iki -mi -iki -wa for these, respectively. Bottom: The model must match i-sambaza-yo (distributes) in the question with u-sambaza-ji (distribution). mBERT's WordPiece segmentation produces isam -ba -za -yo and usa -mba -zaj -i.

CANINE model in row 5, we can see that adding the
downsampling strategy improves speed by 700%,
and also leads to an additional small bump in quality. We speculate that this additional quality gain
comes from giving the model a better inductive

bias toward more word-like units within the deep
transformer stack.

Analysis CANINE fares particularly well on mor-
phologically rich languages such as Kiswahili.


Table 3 shows examples where CANINE outperforms mBERT on the TYDI QA SELECTP task. In particular, we observe examples where Kiswahili's rich morphology does not hinder the matching process for CANINE.

Model | SELECTP | MINSPAN
CANINE-C | 65.7 | 53.0
No concatenation | 17.2 | 35.6
+Final-to-initial resid. | 17.3 | 35.9
+Final-to-downsampled resid. | 62.0 | 50.2

Table 4: Ablations for residuals and feature concatenation on TYDI QA. Rows are cumulative (each row contains all changes from the previous).

4.3 Ablations

In Table 6, we consider minor modifications to the
final CANINE architecture, and evaluate the effect
of each on the downstream quality of the model.29

Attending Directly to h′down Instead of attending to the character-wise sequence hup, we attend to the downsampled sequence:

y+seq = TRANSFORMER1(Q = hup, KV = h′down)

While this change reduces the overall FLOPS of the model due to the reduced attention computation, it does not have a major effect on pre-training throughput. However, it does substantially degrade quality.

Number of Hash Buckets We reduce the num-
ber of hash buckets (B) from 16k to 8k, meaning more (partial) collisions in embedding lookups.
This significantly hinders the MINSPAN task.

Character Vocab We switch from our hash-based no-vocabulary strategy to using a normal character vocabulary (which we derive from the pre-training corpus). We observe that this underperforms the hashing approach. We speculate that this might be due to skew between the pre-training corpus and the final downstream task since not all codepoints can be included in the vocabulary.

Input Character Dimension We reduced the
embedding size of the initial character encoder
(c'est à dire., the embedding size of hinit and e—not hup
nor yseq) and observe that quality falls off rapidly.

No Initial Transformer We remove the local
transformer from hinit and similarly observed a
marked reduction in quality.

Increased Downsampling While more aggres-
sive downsampling (a factor of 5X or 6X, rather
than 4X) brings substantial speed gains, the
passage-level quality degrades substantially and
the minimal span predictions suffer even more.

29These ablations were carried out during initial model

development, hence comparisons to a non-final model.


Model | CoNLL | MasakhaNER
mBERT (ours) | 87.8 | 72.4
CANINE-C | 74.0 (–13.8) | 65.5 (–6.9)
CANINE-C + n-grams | 86.7 (–1.1) | 76.8 (+4.3)

Table 5: F1 scores on NER tasks.

No Position-Limited MLM When we do not
use the trick of applying the final character trans-
former (yseq) only to the positions that will be
computed by the MLM task, we observe a large
reduction in speed. Since this model is theoreti-
cally equivalent in terms of operations, we show
only the speed for exposition.

We also performed ablations aimed at exploring
the effect of feature concatenation and residuals;
results are in Table 4. Not concatenating the down-
sampled representation with the initial character
representation when computing hup causes the
model to become unstable (row 2); adding a resid-
ual from hup back to hinit does not help (row 3).
However, additionally inserting a residual from hup back to h′down does stabilize the model (row 4), though it does not recover the original quality.

4.4 NER Results

Named entity recognition is a task in which mem-
orization is often a very effective strategy. For example, if a model has London in its vocabulary
and sees it with the label LOCATION during train-
ing, then it simply has to retrieve this memorized
association when it sees the token London at test
time. Therefore, evaluating on NER is helpful for
understanding the ways in which different models
emphasize memorization vs. generalization.

As shown in Table 5, CANINE-C performs significantly worse than mBERT on NER, likely due to mBERT's memorization-friendly vocabulary. However, when (tokenization-free) n-gram


Condition | Examples/sec | TYDI QA SELECTP | TYDI QA MINSPAN
Attend to h′down (instead of hup) | 6400 | 64.5 | 52.2
8k codepoint hash buckets (instead of 16k) | 6400 | 64.1 (–0.4) | 50.5 (–1.7)
Character vocab (no hashing) | 6400 | 64.6 (+/–) | 51.2 (–1.0)
Input character dim 384 (instead of 768) | 6600 | 62.9 (–1.2) | 49.3 (–1.2)
Input character dim 192 (instead of 768) | 6400 | 61.7 (–2.4) | 47.3 (–3.2)
No initial character transformer | 6700 | 63.2 (–1.4) | 48.3 (–2.9)
Downsample by a factor of 5 (instead of 4) | 7000 | 62.9 (–1.7) | 49.2 (–2.0)
Downsample by a factor of 6 (instead of 4) | 9200 | 62.7 (–1.9) | 47.6 (–3.6)
Don't limit final character transformer to MLM positions | 5200 | – | –
CANINE-S | 6400 | 66.0 | 52.5


Table 6: Ablation experiments on the CANINE model with TYDI QA F1 scores. Deltas are shown in parentheses with regard to the top-most experiment, which serves as the baseline configuration for all experiments in this table. Each result is averaged over 3 fine-tuning and evaluation replicas.

features are added to CANINE-C, performance re-
bounds, showing that it is possible to cheaply boost
a model’s memorization ability while remaining
fully tokenization-free.

A full language-wise breakdown is provided in the appendix (Table 8). It's worth noting that part of the performance difference on MasakhaNER is due to mBERT producing no usable outputs for Amharic. The mBERT pre-training data does not contain Amharic (or any Amharic-script text), so it has no vocabulary entries for Amharic's script (meaning that mBERT sees only sequences of [UNK] on Amharic inputs). However, since CANINE always supports the full Unicode space, it is able to achieve 50 F1 even though it, too, had never seen Amharic text during pre-training. We take this as validation of CANINE's vocabulary-free approach. It may also be evidence that CANINE exhibits cross-script transfer abilities analogous to those in mBERT (Pires et al., 2019).

Error Analysis CANINE-C tends not to label rarer lexical items that mBERT appears to have memorized. For example, with CANINE-C, JCPenney (a relatively rare lexical item) is not recognized as an entity. CANINE-C also tends to separate long entities; for example, ‘‘State Street Bank and Trust Company’’ is labeled as two separate spans: ‘‘State Street Bank’’ and ‘‘Trust Company’’; and the location TAMPA BAY is recognized only as TAMPA. However, adding n-gram features appears to mostly resolve this issue.

5 Related Work

5.1 Improvements to Subword Tokenization

Further improvements to standard subword token-
ization like Byte Pair Encoding (BPE) (Sennrich
et coll., 2016), WordPiece (Wu et al., 2016), et
SentencePiece (Kudo and Richardson, 2018)
have been proposed. Subword regularization (Kudo,
2018) and BPE-dropout (Provilkov et al., 2020)
recognize that deterministic segmentation during
training limits the ability to leverage morphology
and word composition; instead, they sample at
random one of the multiple tokenizations of the
training input, made possible by the inherent ambi-
guity of subword vocabularies. Wang et al. (2021)
recently expanded on this paradigm to enforce
consistency of predictions over different segmen-
tations. Unigram LM (Kudo, 2018), which builds
its vocabulary top–down, was shown to align
with morphology better than BPE on pre-trained
encoders (Bostrom and Durrett, 2020).

Others have built hybrid models that use mul-
tiple granularities, combining characters with to-
kens (Luong and Manning, 2016) or different
subword vocabularies (Zhang and Li, 2021).

5.2 Character-Level Models

Following the larger NLP trend, character-level n-gram
models (Huang et al., 2013; Wieting et al., 2016;
Bojanowski et al., 2017) have mostly been replaced
by neural networks. While generally lagging be-
hind their word-level counterparts, character-level


features are important for morphologically rich
languages, particularly in low-resource settings
(Garrette and Baldridge, 2013).

For Language Modeling Character language
models (CLMs) have used vanilla RNN ar-
chitectures to produce distributions over seq-
uences of characters in a purely tokenization-free
manner (Sutskever et al., 2011; Graves, 2013;
Hwang and Sung, 2017; Radford et al., 2017).
Hierarchical RNNs modeled the assumption that
language operates on increasing layers of ab-
straction: Chung et al. (2017) jointly trained a
sub-module to segment the character-level input
into larger spans at each layer of a stacked LSTM.
Due to the consistent lag in performance behind
their word-level counterparts, attention shifted
from pure CLMs towards merely character-aware
models, still reliant on traditional tokenization.
Some hybrid models processed the input at
character
level, but predicted words from a
closed vocabulary (Kim et al., 2016; Gerz et al.,
2018). Others reintroduced explicit tokenization
on the input side, and either generated bursts of
character sequences that formed an open vocabulary
(Kawakami et al., 2017) or used a character-only
generator as a fallback when the main closed-
vocabulary word generator produced a rare or un-
known token (Matthews et al., 2019; Mielke and
Eisner, 2019). Especially after the popularization
of the inherently ambiguous subword vocabularies
like BPE, several studies moved beyond a single
input segmentation and marginalized over all pos-
sible segmentations (van Merri¨enboer et al., 2017;
Buckman and Neubig, 2018; Grave et al., 2019).
Coming full circle, Kawakami et al. (2019) induced a lexicon without any explicit supervision,
reverting back to pure CLMs. In a revitalized
effort to bring them on par with coarser granu-
larities, researchers leveraged external resources
such as grounding in vision (Kawakami et al.,
2019) or multi-task learning together with super-
vised morphology tasks (Blevins and Zettlemoyer,
2019).

After the transformer (Vaswani et al., 2017) replaced RNNs as the dominant architecture in NLP, character-level models followed. Al-Rfou et al. (2019) showed that byte-level vanilla Transformers significantly underperform their word-level counterparts. A similar finding was reported by
Radford et al. (2019). Although the gap has been

reduced (Choe et al., 2019), subword transformers
remain the status quo for pure language modeling.

For Specific Tasks
In parallel with LM efforts,
the neural machine translation (NMT) community
sought to solve its open-vocabulary problem via
character-level modeling. Luong and Manning
(2016) proposed a hybrid model that operated mainly
at the word level, but consulted a character-level
LSTM for unknown words; this was a practical
compromise, as their character-only model took
3 months to train. Lee et al. (2017) enabled pure
character NMT by shortening the input length via
convolutional, pooling, and highway layers. Notably, their many-to-English model outperformed
its subword counterpart and most bilingual base-
lines, with a 35% increase in training time (on a
single GPU) compared to a baseline BPE-to-char
model. CANINE has a similar motivation, but op-
erates in the context of pre-trained transformers;
training is 7x faster compared to a char-to-char
baseline (on TPU v3), and has a 28% increase in
training time over mBERT (Tableau 2).

Character information has been leveraged for
many other end tasks as well, including: text clas-
sification (Zhang et al., 2015; Zhang and LeCun,
2017), part-of-speech tagging and NER (Gillick
et coll., 2016; Akbik et al., 2018; Pinter et al., 2019),
named entity detection (Yu et al., 2018), depen-
dency parsing (Vania et al., 2018), and machine
reading comprehension (Hewlett et al., 2018).
Character information proved particularly useful
for low-resource languages (Xie et al., 2018), phe-
nomena such as code-switching and transliteration
(Ball and Garrette, 2018), and rich morphology
(Vania and Lopez, 2017), previously receiving
special modeling including adaptor grammars
(Botha and Blunsom, 2013).

For Transfer Learning Token-based models
have also been augmented with character-level in-
formation in the context of transfer learning, where
encoders trained with unsupervised objectives
are repurposed to solve downstream tasks. Pinter
et autres. (2017) addressed the out-of-vocabulary
problem of static pre-trained word embeddings
by training a model to map the surface of a word
to its pre-trained representation, and used it on
unknown words. ELMo (Peters et al., 2018), a bi-
directional LSTM model, applied character con-
volutions to its whitespace-separated input tokens.
CharacterBERT (Boukkouri et al., 2020) ported


this technique to BERT, augmenting its existing
WordPiece-tokenized input. Consistent with pre-
vious observations that feeding characters into a
transformer stack comes with a huge computational
cost while not improving over tokenization-based
approaches (Al-Rfou et al., 2019), a BERT model
fine-tuned for semantic parsing achieved gains
only when characters complemented subwords
(van Noord et al., 2020).

5.3 Multilingual Models

Multilingual NLP has been dominated by deep
pre-trained multilingual models whose subword
vocabularies are shared across languages. Such models borrow their architectures from monolin-
gual predecessors and apply joint training in 100+
languages, either with unsupervised LM losses:
mBERT, mT5 (Xue et al., 2021), or with additional
translation losses: XLM (Lample and Conneau,
2019), XLM-R (Conneau et al., 2020). Chung et al.
(2020) extended this by forming language clusters
with per-cluster vocabularies. To accommodate
languages unseen during pre-training, Wang et al.
(2020) extended the vocabulary and continued
pre-training.

6 Conclusion

In this article, we described CANINE, which is,
to our knowledge, the first pre-trained deep en-
coder for language understanding that uses a
tokenization-free, vocabulary-free model, while
surpassing the quality of models built on top of
heuristic tokenizers. CANINE eliminates many en-
gineering pitfalls for practitioners and opens up
new research directions for the community.

Acknowledgments

The authors wish to thank Noah Constant, Rami
Al-Rfou, Kristina Toutanova, Kenton Lee, Ming-
Wei Chang, and Tim Dozat for their feedback on
this work. We would also like to thank Martin
Njoroge and Nanjala Misiko for their consulta-
tions on the Kiswahili examples, Diana Akrong
for consulting on Twi orthography, and Waleed
Ammar for consulting on Arabic morphology.

References

David Ifeoluwa Adelani, Jade Abbott, Graham
Neubig, Daniel D’souza,
Julia Kreutzer,
Constantine Lignos, Chester Palen-Michel,

Happy Buzaaba, Shruti Rijhwani, Sebastian
Ruder, Stephen Mayhew, Israel Abebe Azime,
Shamsuddeen H. Muhammad, Chris Chinenye
Emezue, Joyce Nakatumba-Nabende, Perez
Ogayo, Aremu Anuoluwapo, Catherine Gitau,
Derguene Mbaye, Jesujoba Alabi, Seid Muhie
Yimam, Tajuddeen Rabiu Gwadabe, Ignatius
Ezeani, Rubungo Andre Niyongabo, Jonathan
Mukiibi, Verrah Otiende, Iroro Orife, Davis
David, Samba Ngom, Tosin Adewumi, Paul
Rayson, Mofetoluwa Adeyemi, Gerald Muriuki,
Emmanuel Anebi, Chiamaka Chukwuneke,
Nkiruka Odu, Eric Peter Wairagala, Samuel
Oyerinde, Clemencia Siro, Tobius Saul Bateesa,
Temilola Oloyede, Yvonne Wambui, Victor
Akinode, Deborah Nabagereka, Maurice
Katusiime, Ayodele Awokoya, Mouhamadane
MBOUP, Dibora Gebreyohannes, Henok Tilaye,
Kelechi Nwaike, Degaga Wolde, Abdoulaye
Faye, Blessing Sibanda, Orevaoghene Ahia,
Bonaventure F. P. Dossou, Kelechi Ogueji,
Thierno Ibrahima DIOP, Abdoulaye Diallo,
Adewale Akinfaderin, Tendai Marengereke,
and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. TACL. https://doi.org/10.1162/tacl_a_00416

Joshua Ainslie, Santiago Ontanon, Chris Alberti,
Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang,
and Li Yang. 2020. ETC: Encoding long and
structured inputs in transformers. In Proceed-
ings of EMNLP. https://doi.org/10.18653/v1/2020.emnlp-main.19

Alan Akbik, Duncan Blythe, and Roland Vollgraf.
2018. Contextual string embeddings for se-
quence labeling. In Proceedings of COLING.

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention. In Proceedings of AAAI. https://doi.org/10.1609/aaai.v33i01.33013159

Kelsey Ball and Dan Garrette. 2018. Part-of-
speech tagging for code-switched, transliterated
texts without explicit language identification. In Proceedings of EMNLP. https://doi.org/10.18653/v1/D18-1347

Terra Blevins and Luke Zettlemoyer. 2019. Better
character language modeling through morphol-
ogy. In Proceedings of ACL. https://doi.org/10.18653/v1/P19-1156

Piotr Bojanowski, Edouard Grave, Armand Joulin,
and Tomas Mikolov. 2017. Enriching word
vectors with subword information. TACL.
https://doi.org/10.1162/tacl_a_00051

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language
model pretraining. In Findings of the Association for Computational Linguistics:
EMNLP. https://doi.org/10.18653/v1/2020.findings-emnlp.414

Jan A. Botha and Phil Blunsom. 2013. Adaptor
Grammars for learning non-concatenative mor-
phology. In Proceedings of EMNLP.

Hicham El Boukkouri, Olivier Ferret, Thomas
Lavergne, Hiroshi Noji, Pierre Zweigenbaum,
and Junichi Tsujii. 2020. CharacterBERT:
Reconciling ELMo and BERT for word-level
open-vocabulary representations from charac-
ters. In Proceedings of COLING. https://
doi.org/10.18653/v1/2020.coling-main.609

Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish,
Alec Radford,
Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-shot
learners. In Proceedings of NeurIPS.

Jacob Buckman and Graham Neubig. 2018. Neural
lattice language models. TACL. https://est ce que je
.org/10.1162/tacl a 00036

Dokook Choe, Rami Al-Rfou, Mandy Guo,
Heeyoung Lee, and Noah Constant. 2019.
Bridging the gap for tokenizer-free language
models. arXiv preprint arXiv:1908.10322.

Krzysztof Choromanski, Valerii Likhosherstov,
David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Davis,
Afroz Mohiuddin, Lukasz Kaiser, David
Belanger, Lucy Colwell, and Adrian Weller.
2021. Rethinking attention with performers. Dans
Proceedings of ICLR.

Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and Jason Riesa. 2020. Improving
multilingual models with language-clustered vocabularies. In Proceedings of EMNLP.
https://doi.org/10.18653/v1/2020.emnlp-main.367

Junyoung Chung, Sungjin Ahn, and Yoshua
Bengio. 2017. Hierarchical multiscale recurrent
neural networks. In Proceedings of ICLR.

Jonathan H. Clark, Eunsol Choi, Michael Collins,
Dan Garrette, Tom Kwiatkowski, Vitaly
Nikolaev, and Jennimaria Palomaki. 2020. TyDi
QA: A benchmark for
information-seeking
question answering in typologically diverse
languages. TACL.

Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of
ACL. https://doi.org/10.18653/v1/2020.acl-main.747

Zihang Dai, Guokun Lai, Yiming Yang, and Quoc
V. Le. 2020. Funnel-Transformer: Filtering out
sequential redundancy for efficient language
traitement. In Proceedings of NeurIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of NAACL.

Dan Garrette and Jason Baldridge. 2013. Learn-
ing a part-of-speech tagger from two hours of
annotation. In Proceedings of NAACL.

Daniela Gerz, Ivan Vuli´c, Edoardo Ponti, Jason
Naradowsky, Roi Reichart, and Anna Korhonen.
2018. Language modeling for morphologically
rich languages: Character-aware modeling for
word-level prediction. TACL. https://doi.org/10.1162/tacl_a_00032

Dan Gillick, Cliff Brunk, Oriol Vinyals, and
Amarnag Subramanya. 2016. Multilingual lan-
guage processing from bytes. In Proceedings
of NAACL. https://doi.org/10.18653
/v1/N16-1155

Edouard Grave, Sainbayar Sukhbaatar, Piotr
Bojanowski, and Armand Joulin. 2019. Train-
ing hybrid language models by marginaliz-
ing over segmentations. In Proceedings of
ACL. https://doi.org/10.18653/v1
/P19-1143

Alex Graves. 2013. Generating sequences with
recurrent neural networks. arXiv preprint
arXiv:1308.0850.

Daniel Hewlett, Alexandre Lacoste, Llion Jones,
Illia Polosukhin, Andrew Fandrianto, Jay Han,
Matthew Kelcey, and David Berthelot. 2018.
Byte-level machine reading across morpho-
logically varied languages. In Proceedings of
AAAI.

Po-Sen Huang, Xiaodong He, Jianfeng Gao,
Li Deng, Alex Acero, and Larry Heck.
2013. Learning deep structured semantic mod-
els for web search using clickthrough data.
In Proceedings of
the ACM International
Conference on Information and Knowledge
Management (CIKM). https://doi.org
/10.1145/2505515.2505665

Kyuyeon Hwang and Wonyong Sung. 2017.
Character-level language modeling with hier-
archical recurrent neural networks. In Proceedings of ICASSP.
https://doi.org/10.1109/ICASSP.2017.7953252

Prabhu Kaliamoorthi, Sujith Ravi, and Zornitsa
Kozareva. 2019. PRADO: Projection attention
networks for document classification on-device.
In Proceedings of EMNLP. https://doi.org/10.18653/v1/D19-1506

Kazuya Kawakami, Chris Dyer, and Phil Blunsom.
2017. Learning to create and reuse words
in open-vocabulary neural language modeling.
In Proceedings of ACL. https://doi.org
/10.18653/v1/P17-1137

Kazuya Kawakami, Chris Dyer, and Phil Blunsom.
2019. Learning to discover, ground and use
words with segmental neural language models.
In Proceedings of ACL.

Guillaume Lample and Alexis Conneau. 2019.
Cross-lingual language model pretraining. Dans
Proceedings of NeurIPS.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural
machine translation without explicit segmentation. TACL.
https://doi.org/10.1162/tacl_a_00067

Minh-Thang Luong and Christopher D. Manning.
2016. Achieving open vocabulary neural ma-
chine translation with hybrid word-character
models. In Proceedings of ACL. https://
doi.org/10.18653/v1/P16-1100

Austin Matthews, Graham Neubig, and Chris
Dyer. 2019. Using morphological knowledge
in open-vocabulary neural language models. In
Proceedings of NAACL. https://doi.org
/10.18653/v1/N18-1130

Sebastian J. Mielke and Jason Eisner. 2019.
Spell once, summon anywhere: A two-level open-vocabulary language model. In
Proceedings of AAAI. https://doi.org/10.1609/aaai.v33i01.33016843

Rik van Noord, Antonio Toral, and Johan Bos. 2020. Character-level representations
improve DRS-based semantic parsing even in the age of BERT. In Proceedings of EMNLP.
https://doi.org/10.18653/v1/2020.emnlp-main.371

Brendan O’Connor, Michel Krieger, and David
Ahn. 2010. TweetMotif: Exploratory search and
topic summarization for twitter introduction and
description. In Proceedings of the International
AAAI Conference on Web and Social Media.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware
neural language models. In Proceedings of AAAI.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy
Schwartz, Noah Smith, and Lingpeng Kong.
2021. Random feature attention. In Proceedings
of ICLR.

Taku Kudo. 2018. Subword regularization: Im-
proving neural network translation models with
multiple subword candidates. In Proceedings of
ACL.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent
subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of
EMNLP: System Demonstrations. https://doi.org/10.18653/v1/D18-2012

Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextual-
ized word representations. In Proceedings of
NAACL. https://doi.org/10.18653
/v1/N18-1202

Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Pruksachatkun, Haokun Liu, Clara Vania,
Katharina Kann, and Samuel R. Bowman. 2020. English intermediate-task training
improves zero-shot cross-lingual transfer too. In Proceedings of AACL.

Yuval Pinter, Robert Guthrie, and Jacob Eisenstein.
2017. Mimicking word embeddings using
subword RNNs. In Proceedings of EMNLP.

Yuval Pinter, Marc Marone, and Jacob Eisenstein. 2019. Character eyes: Seeing
language through character-level taggers. In Proceedings of BlackboxNLP.
https://doi.org/10.18653/v1/W19-4811

Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual is Multilingual BERT?
In Proceedings of ACL. https://doi.org
/10.18653/v1/P19-1493

Ivan Provilkov, Dmitrii Emelianenko, and Elena
Voita. 2020. BPE-Dropout: Simple and effec-
tive subword regularization. In Proceedings of
ACL. https://doi.org/10.18653/v1
/2020.acl-main.170

Danish Pruthi, Bhuwan Dhingra, and Zachary
C. Lipton. 2019. Combating adversarial mis-
spellings with robust word recognition. In
Proceedings of ACL. https://doi.org
/10.18653/v1/P19-1561

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate
reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.

Alec Radford, Jeff Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Technical Report. https://
www.semanticscholar.org/paper/Language
-Models-are-Unsupervised-Multitask
-Learners-Radford-Wu/9405cc0d61699
88371b2755e573cc28650d14dfe

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. JMLR.

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
ACL. https://doi.org/10.18653/v1
/P16-1162

Lichao Sun, Kazuma Hashimoto, Wenpeng Yin,
Akari Asai, Jia Li, Philip Yu, and Caiming
Xiong. 2020. Adv-BERT: BERT is not robust on misspellings! Generating nature
adversarial samples on BERT. arXiv preprint arXiv:2003.04985.

Ilya Sutskever, James Martens, and Geoffrey E.
Hinton. 2011. Generating text with recurrent
neural networks. In Proceedings of ICML.

Dan Svenstrup, Jonas Meinertz Hansen, and Ole
Winther. 2017. Hash embeddings for effi-
cient word representations. In Proceedings of
NeurIPS.

David Talbot and John Talbot. 2008. Bloom maps. In Proceedings of the Workshop on
Analytic Algorithmics and Combinatorics (ANALCO).
https://doi.org/10.1137/1.9781611972986.4

Ian Tenney, Dipanjan Das, and Ellie Pavlick.
2019. BERT rediscovers the classical NLP pipe-
line. In Proceedings of ACL. https://doi.org/10.18653/v1/P19-1452

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task:
Language-independent named entity recognition. In Proceedings of CoNLL.
https://doi.org/10.3115/1118853.1118877

Erik F. Tjong Kim Sang and Fien De Meulder.
2003. Introduction to the CoNLL-2003 shared
task: Language-independent named entity recognition. In Proceedings of NAACL.
https://doi.org/10.3115/1119176.1119195

Bart van Merriënboer, Amartya Sanyal, H.
Larochelle, and Yoshua Bengio. 2017. Mul-
tiscale sequence modeling with a learned
dictionary. arXiv preprint arXiv:1707.00762.

Clara Vania, Andreas Grivas, and Adam Lopez.
2018. What do character-level models learn
about morphology? The case of dependency
parsing. In Proceedings of EMNLP.

Clara Vania and Adam Lopez. 2017. From
characters to words to in between: Do we
capture morphology? In Proceedings of ACL.
https://doi.org/10.18653/v1/P17-1184

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Proceedings of
NeurIPS.

Xinyi Wang, Sebastian Ruder, and Graham
Neubig. 2021. Multi-view subword regular-
ization. In Proceedings of NAACL. https://
doi.org/10.18653/v1/2021.naacl-main.40

Zihan Wang, Karthikeyan K, Stephen Mayhew,
and Dan Roth. 2020. Extending multilingual
BERT to low-resource languages. In Findings
of EMNLP. https://doi.org/10.18653
/v1/2020.findings-emnlp.240

John Wieting, Mohit Bansal, Kevin Gimpel, and
Karen Livescu. 2016. Charagram: Embedding
words and sentences via character n-grams. In
Proceedings of EMNLP. https://doi.org
/10.18653/v1/D16-1157

Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
Gao, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Łukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George
Kurian, Nishant Patil, Wei Wang, Cliff Young,
Jason Smith, Jason Riesa, Alex Rudnick, Oriol
Vinyals, Greg Corrado, Macduff Hughes, and
Jeffrey Dean. 2016. Google’s neural machine
translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah
A. Smith, and Jaime Carbonell. 2018. Neural
cross-lingual named entity recognition with
minimal resources. In Proceedings of EMNLP.

Linting Xue, Noah Constant, Adam Roberts,
Mihir Kale, Rami Al-Rfou, Aditya Siddhant,
Aditya Barua, and Colin Raffel. 2021. mT5: A
massively multilingual pre-trained text-to-text
transformer. In Proceedings of NAACL.

Yang You, Jing Li, Sashank Reddi, Jonathan
Hseu, Sanjiv Kumar, Srinadh Bhojanapalli,
Xiaodan Song, James Demmel, Kurt Keutzer,
and Cho-Jui Hsieh. 2020. Large batch opti-
mization for deep learning: Training BERT in
76 minutes. In Proceedings of ICLR.

Xiaodong Yu, Stephen Mayhew, Mark Sammons,
and Dan Roth. 2018. On the strength of char-
acter language models for multilingual named
entity recognition. In Proceedings of EMNLP.

Manzil Zaheer, Guru Guruganesh, Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago
Ontanon, Philip Pham, Anirudh Ravula, Qifan
Wang, Li Yang, and Amr Ahmed. 2020. Big
Bird: Transformers for longer sequences. In
Proceedings of NeurIPS.

Xiang Zhang and Yann LeCun. 2017. Which
encoding is the best for text classification in
Chinese, English, Japanese and Korean? arXiv
preprint arXiv:1708.02657v2.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
Character-level convolutional networks for text
classification. In Proceedings of NeurIPS.

Xinsong Zhang and Hang Li. 2021. AMBERT: A pre-trained language model with
multi-grained tokenization. In Findings of ACL.
https://doi.org/10.18653/v1/2021.findings-acl.37

A Appendix

SELECTP

Language     mBERT   CANINE-S      CANINE-C      CANINE-C + n-grams
(English)    62.2    58.6 (–3.6)   61.6 (–0.6)   64.6 (+2.4)
Arabic       82.3    82.8 (+0.5)   82.5 (+0.2)   84.3 (+2.0)
Bengali      58.5    61.8 (+3.3)   62.5 (+4.0)   66.0 (+7.5)
Finnish      60.4    62.2 (+1.8)   63.6 (+3.2)   66.7 (+6.3)
Indonesian   61.3    63.5 (+2.2)   64.2 (+2.9)   65.9 (+4.6)
Japanese     46.2    51.7 (+5.5)   49.7 (+3.5)   51.2 (+5.0)
Korean       60.2    60.3 (+0.1)   59.7 (–0.5)   60.6 (+0.4)
Russian      62.2    64.6 (+2.4)   65.6 (+3.4)   68.5 (+6.3)
Swahili      58.8    67.8 (+9.0)   67.0 (+8.2)   67.2 (+8.4)
Telugu       81.0    82.5 (+1.5)   81.1 (+0.1)   84.6 (+3.6)
Thai         61.1    62.8 (+1.7)   61.2 (+0.1)   65.8 (+4.7)
Macro Avg    63.2    66.0 (+2.8)   65.7 (+2.5)   68.1 (+4.9)

MINSPAN

Language     mBERT   CANINE-S      CANINE-C      CANINE-C + n-grams
(English)    46.0    46.3 (+0.3)   49.0 (+3.0)   51.8 (+5.8)
Arabic       70.7    66.9 (–3.8)   65.6 (–5.1)   73.0 (+2.3)
Bengali      47.3    46.7 (–0.6)   52.5 (+5.2)   57.1 (+9.8)
Finnish      51.1    53.0 (+1.9)   53.8 (+2.7)   57.1 (+6.0)
Indonesian   52.2    53.6 (+1.4)   54.4 (+2.2)   56.8 (+4.6)
Japanese     36.1    40.3 (+4.2)   40.7 (+4.6)   42.0 (+5.9)
Korean       36.8    35.7 (–1.1)   36.5 (–0.3)   39.9 (+3.1)
Russian      45.6    46.7 (+1.1)   47.2 (+1.6)   51.5 (+5.9)
Swahili      49.4    59.0 (+9.6)   57.6 (+8.2)   59.2 (+9.8)
Telugu       75.6    75.2 (–0.4)   74.2 (–1.4)   79.7 (+4.1)
Thai         48.4    47.9 (–0.5)   47.1 (–1.3)   54.2 (+5.8)
Macro Avg    51.3    52.5 (+1.2)   53.0 (+1.7)   57.0 (+5.7)

Table 7: Language-wise breakdown for TYDI QA primary tasks.
English is parenthesized because it is not included in the overall
score calculation for TYDI QA.
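
As a concrete reading of this averaging convention, the short Python check below (our illustration only) recomputes the mBERT SELECTP macro average from the ten non-English rows of Table 7 and recovers the reported 63.2.

    # Recompute the Table 7 macro average: English is excluded, so the mean is
    # taken over the ten remaining languages (mBERT, SELECTP column).
    selectp_mbert = {
        "Arabic": 82.3, "Bengali": 58.5, "Finnish": 60.4, "Indonesian": 61.3,
        "Japanese": 46.2, "Korean": 60.2, "Russian": 62.2, "Swahili": 58.8,
        "Telugu": 81.0, "Thai": 61.1,
    }
    macro_avg = sum(selectp_mbert.values()) / len(selectp_mbert)
    print(round(macro_avg, 1))  # 63.2, matching the Macro Avg row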

CONLL

Language          mBERT   CANINE-C        CANINE-C + n-grams
Dutch             90.2    74.7 (–15.5)    88.5 (–1.7)
English           91.1    79.8 (–11.3)    89.8 (–1.3)
German            82.5    64.1 (–18.4)    82.1 (–0.4)
Spanish           87.6    77.4 (–10.2)    86.5 (–1.1)
Macro Avg         87.8    74.0 (–13.8)    86.7 (–1.1)

MASAKHANER

Language          mBERT   CANINE-C        CANINE-C + n-grams
Amharic           0.0     44.6 (+44.6)    50.0 (+50.0)
Hausa             89.3    76.1 (–13.2)    88.0 (–1.3)
Igbo              84.6    75.6 (–9.0)     85.0 (+0.4)
Kinyarwanda       73.9    58.3 (–15.6)    72.8 (–1.1)
Luganda           80.2    69.4 (–10.8)    79.6 (–0.6)
Luo               75.8    63.4 (–12.4)    74.2 (–1.6)
Nigerian Pidgin   89.8    66.6 (–23.2)    88.7 (–1.1)
Swahili           87.1    72.7 (–14.4)    83.7 (–3.4)
Wolof             64.9    60.7 (–4.2)     66.5 (+1.6)
Yorùbá            78.7    67.9 (–10.8)    79.1 (+0.4)
Macro Avg         72.4    65.5 (–6.9)     76.8 (+4.3)

Table 8: Language-wise breakdown for Named Entity Recognition
for the CoNLL and MasakhaNER datasets (labeled F1). mBERT
obtains a score of zero on Amharic due to having no vocabulary
entries in the Amharic script.
