Tabula Nearly Rasa: Probing the Linguistic Knowledge of Character-level
Neural Language Models Trained on Unsegmented Text

Michael Hahn∗
Stanford University
mhahn2@stanford.edu

Marco Baroni
Facebook AI Research
UPF Linguistics Department
Catalan Institution for Research
and Advanced Studies
mbaroni@gmail.com

Abstract

Recurrent neural networks (RNNs) have
reached striking performance in many natural
language processing tasks. This has renewed
interest in whether these generic sequence
processing devices are inducing genuine lin-
guistic knowledge. Nearly all current ana-
lytical studies, however, initialize the RNNs
with a vocabulary of known words, and
feed them tokenized input during training.
We present a multi-lingual study of the
linguistic knowledge encoded in RNNs trained
as character-level language models, on input
data with word boundaries removed. These
networks face a tougher and more cognitively
realistic task, having to discover any useful
linguistic unit from scratch based on input
statistics. The results show that our ‘‘near
tabula rasa’’ RNNs are mostly able to
solve morphological, syntactic and semantic
tasks that intuitively presuppose word-level
knowledge, and indeed they learned, to some
extent, to track word boundaries. Our study
opens the door to speculations about the
necessity of an explicit, rigid word lexicon
in language learning and usage.

1 Introduction

Recurrent neural networks (RNNs; Elman, 1990),
in particular in their long short term memory
variant (LSTMs; Hochreiter and Schmidhuber,
1997), are widely used in natural language pro-
cessing. RNNs, often pre-trained on the simple

∗Work partially done while interning at Facebook AI

Research.


language modeling objective of predicting the
next symbol in natural text, are a crucial com-
ponent of state-of-the-art architectures for
machine translation, natural language inference,
and text categorization (Goldberg, 2017).
RNNs are very general devices for sequence
processing, hardly assuming any prior linguistic
knowledge. Moreover, the simple prediction task
they are trained on in language modeling is well-
attuned to the core role that prediction plays in
cognition (e.g., Bar, 2007; Clark, 2016). RNNs
have thus long attracted researchers interested in
language acquisition and processing. Their recent
success in large-scale tasks has rekindled this
interest (e.g., Frank et al., 2013; Lau et al., 2017;
Kirov and Cotterell, 2018; Linzen et al., 2018;
McCoy et al., 2018; Pater, 2018).

The standard pre-processing pipeline of modern
RNNs assumes that the input has been tokenized
into word units that are pre-stored in the RNN
vocabulary (Goldberg, 2017). This is a reasonable
practical approach, but it makes simulations less
interesting from a linguistic point of view. First,
discovering words (or other primitive constituents
of linguistic structure) is one of the major chal-
lenges a learner faces, and by pre-encoding them in
the RNN we are facilitating its task in an unnatural
way (not even the staunchest nativists would take
specific word dictionaries to be part of our genetic
code). Second, assuming a unique tokenization
into a finite number of discrete word units is
in any case problematic. The very notion of what
counts as a word in languages with a rich morphol-
ogy is far from clear (e.g., Dixon and Aikhenvald,
2002; Bickel and Zúñiga, 2017), and, universally,
lexical knowledge is probably organized into a
not-necessarily-consistent hierarchy of units at
different levels: morphemes, words, compounds,

Transactions of the Association for Computational Linguistics, vol. 7, pp. 467–484, 2019. https://doi.org/10.1162/tacl_a_00283
Action Editor: Hinrich Schütze. Submission batch: 2/2019; Revision batch: 4/2019; Published 9/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


constructions, and so on (e.g., Goldberg, 2005).
Indeed, it has been suggested that the notion
of word cannot even be meaningfully defined
cross-linguistically (Haspelmath, 2011).

Motivated by these considerations, we study
here RNNs that are trained without any notion
of word in their input or in their architecture.
We train our RNNs as character-level neural
language models (CNLMs, Mikolov et al., 2011;
Sutskever et al., 2011; Graves, 2014) by removing
whitespace from their input, so that, like children
learning a language, they don’t have access to
explicit cues to wordhood.1 This set-up is almost
as tabula rasa as it gets. By using unsegmented
orthographic input (and assuming that, in the
alphabetic writing systems we work with, there is
a reasonable correspondence between letters and
phonetic segments), we are only postulating that
the learner figured out how to map the continuous
speech stream to a sequence of phonological units,
an ability children already possess a few months
after birth (e.g., Maye et al., 2002; Kuhl, 2004).
We believe that focusing on language modeling of
an unsegmented phoneme sequence, abstracting
away from other complexities of a fully realistic
child language acquisition set-up, is particularly
instructive in order to study which linguistic struc-
tures naturally emerge.

We evaluate our character-level networks on
a bank of linguistic tests in German, Italian, and
English. We focus on these languages because of
resource availability and ease of benchmark con-
struction. Also, well-studied synthetic languages
with a clear, orthographically driven notion of
word might be a better starting point to test non-
word-centric models, compared with agglutina-
tive or polysynthetic languages, where the very
notion of what counts as a word is problematic.

Our tasks require models to develop the latent
ability to parse characters into word-like items
associated to morphological, syntactic, and
broadly semantic features. The RNNs pass most
of the tests, suggesting that they are in some
way able to construct and manipulate the right
lexical objects. In a final experiment, we look
more directly into how the models are handling
word-like units. We find, confirming an earlier
observation by Kementchedjhieva and Lopez
(2018), that the RNNs specialized some cells to

1We do not erase punctuation marks, reasoning that they
have a similar function to prosodic cues in spoken language.

the task of detecting word boundaries (or, more
generally, salient linguistic boundaries, in a sense
to be further discussed below). Taken together,
our results suggest that character-level RNNs
capture forms of linguistic knowledge that are
traditionally thought to be word-based, without
being exposed to an explicit segmentation of their
input and, more importantly, without possessing
an explicit word lexicon. We will discuss the
implications of these findings in the Discussion.2

2 Related Work

On the primacy of words Several linguistic
studies suggest that words, at least as delimited
by whitespace in some writing systems, are neither
necessary nor sufficient units of linguistic anal-
ysis. Haspelmath (2011) claims that there is no
cross-linguistically valid definition of the notion
of word (see also Schiering et al., 2010, who
address specifically the notion of prosodic word).
Others have stressed the difficulty of characteriz-
ing words in polysynthetic languages (Bickel and
Zúñiga, 2017). Children are only rarely exposed
to words in isolation during learning (Tomasello,
2003),3 and it is likely that the units that adult speak-
ers end up storing in their lexicon are of variable
size, both smaller and larger than conventional
words (e.g., Jackendoff, 2002; Goldberg, 2005).
From a more applied perspective, Schütze (2017)
recently defended tokenization-free approaches to
NLP, proposing a general non-symbolic approach
to text representation.

We hope our results will contribute to the
theoretical debate on word primacy, suggesting,
through computational simulations, that word
priors are not crucial to language learning and
processing.

Character-based neural language models re-
ceived attention in the last decade because of their
greater generality compared with word-level mod-
els. Early studies (Mikolov et al., 2011; Sutskever
et al., 2011; Graves, 2014) established that CNLMs
might not be as good at language modeling as their
word-based counterparts, but lag only slightly


2Our input data, test sets, and pre-trained models are
available at https://github.com/m-hahn/tabula-rasa-rnns.

3Single-word utterances are not uncommon in child-
directed language, but they are still rather the exception than
the rule, and many important words, such as determiners,
never occur in isolation (Christiansen et al., 2005).


behind. This is particularly encouraging in light
of the fact that character-level sentence predic-
tion involves a much larger search space than
prediction at the word level, as a character-level
model must make a prediction after each charac-
ter, rather than after each word. Sutskever et al.
(2011) and Graves (2014) ran qualitative anal-
yses showing that CNLMs capture some basic
linguistic properties of their input. The latter, who
used LSTM cells, also showed, qualitatively, that
CNLMs are sensitive to hierarchical structure.
In particular, they balance parentheses correctly
when generating text.

Most recent work in the area has focused on
character-aware architectures combining character-
and word-level information to develop state-of-
the-art language models that are also effective in
morphologically rich languages (e.g., Bojanowski
et al., 2016; Kim et al., 2016; Gerz et al., 2018).
For example, Kim and colleagues perform predic-
tion at the word level, but use a character-based
convolutional network to generate word repre-
sentations. Other work focuses on splitting words
into morphemes, using character-level RNNs and
an explicit segmentation objective (e.g., Kann
et al., 2016). These latter lines of work are only
distantly related to our interest in probing what
a purely character-level network trained on run-
ning text has implicitly learned about linguistic
structure. There is also extensive work on seg-
mentation of the linguistic signal that does not rely
on neural methods, and is not directly relevant
here (e.g., Brent and Cartwright, 1996; Goldwater
et al., 2009; Kamper et al., 2016, and references
therein).

Probing linguistic knowledge of neural language
models is currently a popular research
topic (Le et al., 2016; Linzen et al., 2016; Shi
et al., 2016; Adi et al., 2017; Belinkov et al., 2017;
K`ad`ar et al., 2017; Hupkes et al., 2018; Conneau
et al., 2018; Ettinger et al., 2018; Linzen et al.,
2018). Among studies focusing on character-level
models, Elman (1990) already reported a proof-
of-concept experiment on implicit learning of word
segmentation. Christiansen et al. (1998) trained
an RNN on phoneme-level language modeling
of transcribed child-directed speech with tokens
marking utterance boundaries, and found that
the network learned to segment the input by
predicting the utterance boundary symbol also
at word edges. More recently, Sennrich (2017)

explored the grammatical properties of character-
and subword-unit-level models that are used as
components of a machine translation system. He
concluded that current character-based decoders
generalize better to unseen words, but capture
less grammatical knowledge than subword units.
Still, his character-based systems lagged only
marginally behind the subword architectures on
grammatical tasks such as handling agreement
and negation. Radford et al. (2017) focused on
CNLMs deployed in the domain of sentiment
análisis, where they found the network to special-
ize a unit for sentiment tracking. We will discuss
below how our CNLMs also show single-unit
specialization, but for boundary tracking. Godin
et al. (2018) investigated the rules implicitly used
by supervised character-aware neural morpholog-
ical segmentation methods, finding linguistically
sensible patterns. Alishahi et al. (2017) probed the
linguistic knowledge induced by a neural network
that receives unsegmented acoustic input. Focus-
ing on phonology, they found that the lower layers
of the model process finer-grained information,
whereas higher layers are sensitive to more
abstract patterns. Kementchedjhieva and Lopez
(2018) recently probed the linguistic knowledge
of an English CNLM trained with whitespace in
the input. Their results are aligned with ours. The
model is sensitive to lexical and morphological
structure, and it captures morphosyntactic cate-
gories as well as constraints on possible morpheme
combinations. Intriguingly, the model tracks
word/morpheme boundaries through a single spe-
cialized unit, suggesting that such boundaries are
salient (at least when marked by whitespace, as in
their experiments) and informative enough that it
is worthwhile for the network to devote a special
mechanism to process them. We replicated this
finding for our networks trained on whitespace-
free text, as discussed in Section 4.4, where we
discuss it in the context of our other results.

3 Experimental Set-up

We extracted plain text from full English, German,
and Italian Wikipedia dumps with WikiExtractor.4
We randomly selected test and validation sections
consisting of 50,000 paragraphs each, and used
the remainder for training. The training sets
contained 16M (German), 9M (Italian), and 41M
(English) paragraphs, corresponding to 819M,

4https://github.com/attardi/wikiextractor.



463M, and 2,333M words, respectively. Paragraph
order was shuffled for training, without attempting
to split by sentences. All characters were
lower-cased. For benchmark construction and
word-based model
training, we tokenized and
tagged the corpora with TreeTagger (Schmid,
1999).5 We used as vocabularies the most frequent
characters from each corpus, setting thresholds
so as to ensure that all characters representing
phonemes were included, resulting in vocabularies
of sizes 60 (English), 73 (German), and 59
(Italian). We further constructed word-level neural
language models (WordNLMs); their vocabulary
included the most frequent 50,000 words per
corpus.
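The preprocessing just described can be sketched as follows. This is a minimal illustration, not the paper's actual code: the OOV replacement symbol is our assumption, since the paper only specifies lower-casing, whitespace removal, and a frequency-thresholded character vocabulary.

```python
from collections import Counter

def build_char_vocab(text, size):
    """Keep the `size` most frequent characters of the corpus."""
    return {c for c, _ in Counter(text).most_common(size)}

def preprocess(paragraph, vocab, oov="?"):
    """Lower-case and delete whitespace, mirroring the paper's pipeline.
    Characters outside the vocabulary map to an OOV symbol (the symbol
    itself is our assumption, not from the paper)."""
    chars = (c for c in paragraph.lower() if not c.isspace())
    return "".join(c if c in vocab else oov for c in chars)
```

Note that punctuation survives this step whenever it is frequent enough to enter the vocabulary, consistent with footnote 1.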

We trained RNN and LSTM CNLMs; we
will refer to them simply as RNN and LSTM,
respectively. The ‘‘vanilla’’ RNN will serve as
a baseline to ascertain if/when the longer-range
information-tracking abilities afforded to the
LSTM by its gating mechanisms are necessary.
Our WordNLMs are always LSTMs. For each
model/language, we applied random hyperpa-
rameter search. We terminated training after 72
hours.6 None of the models had overfitted, as
measured by performance on the validation set.7
Language modeling performance on the test
partitions is shown in Table 1. Recall that we re-
moved whitespace, which is both easy to
predict, and aids prediction of other characters.
As a consequence, the fact that our character-level
models are below the state of the art is expected.8
For example, the best model of Merity et al. (2018)
achieved 1.23 English bits per character (BPC) on
a Wikipedia-derived dataset. On EuroParl data,
Cotterell et al. (2018) report 0.85 for English, 0.90
for German, and 0.82 for Italian. Still, our English
BPC is comparable to that reported by Graves

5http://www.cis.uni-muenchen.de/∼schmid/

tools/TreeTagger/.

6This was due to resource availability. The reasonable
language-modeling results in Table 1 suggest that no model
is seriously underfit, but the weaker overall RNN results in
particular should be interpreted in the light of the following
qualification: models are compared given equal amount of
training, but possibly at different convergence stages.

7Hyperparameter details are in Table 8. Chosen
architectures (layers/embedding size/hidden size): LSTM:
En. 3/200/1024, Ge. 2/100/1024, It. 2/200/1024; RNN:
En. 2/200/2048, Ge. 2/50/2048, It. same; WordNLM:
En. 2/1024/1024, Ge. 2/200/1024, It. same.

8Training our models with whitespace, without further
hyperparameter tuning, resulted in BPCs of 1.32 (English),
1.28 (German), and 1.24 (Italian).

          LSTM   RNN    WordNLM
English   1.62   2.08   48.99
German    1.51   1.83   37.96
Italian   1.47   1.97   42.02

Table 1: Performance of language models. For
CNLMs, we report bits-per-character (BPC). For
WordNLMs, we report perplexity.

(2014) for his static character-level LSTM trained
on space-delimited Wikipedia data, suggesting
that we are achieving reasonable performance.
The perplexity of the word-level model might not
be comparable to that of highly optimized state-
of-the-art architectures, but it is at the expected
level for a well-tuned vanilla LSTM language
model. For example, Gulordava et al. (2018)
report 51.9 and 44.9 perplexities, respectively, on
English and Italian for their best LSTMs trained
on Wikipedia data with same vocabulary size as
ours.
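As a reminder, bits-per-character is the average negative log2 probability the model assigns to each character of the test text. A minimal sketch, with made-up per-character probabilities (not taken from any of the models reported here):

```python
import math

def bits_per_character(char_probs):
    """BPC = average negative log2 probability the language model
    assigns to each character of the test text."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# Hypothetical probabilities for a four-character test string.
bpc = bits_per_character([0.5, 0.25, 0.5, 0.125])  # 1.75
```

A lower BPC means a better model; removing whitespace raises BPC because an easy-to-predict symbol is gone and a useful prediction cue is lost.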

4 Experiments

4.1 Discovering morphological categories

Words belong to part-of-speech categories, such
as nouns and verbs. Moreover, they typically carry
inflectional features such as number. We start by
probing whether CNLMs capture such properties.
We use here the popular method of ‘‘diagnostic
classifiers’’ (Hupkes et al., 2018). That is, we treat
the hidden activations produced by a CNLM
whose weights were fixed after language model
training as input features for a shallow (logistic)
classifier of the property of interest (p.ej., plural
vs. singular). If the classifier is successful, this
means that the representations provided by the
model are encoding the relevant information. The
classifier is deliberately shallow and trained on a
small set of examples, as we want to test whether
the properties of interest are robustly encoded in
the representations produced by the CNLMs, and
amenable to a simple linear readout (Fusi et al.,
2016). In our case, we want to probe word-level
properties in models trained at the character level.
To do this, we let the model read each target word
character-by-character, and we treat the state of
its hidden layer after processing the last character
in the word as the model’s implicit representation
of the word, on which we train the diagnostic
classifier. The experiments focus on German
and Italian, as it is harder to design reliable test



sets for the impoverished English morphological
system.
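The probing pipeline can be sketched as follows. The recurrence below is a random stand-in for the frozen CNLM (the paper's models are trained LSTMs/RNNs), and the words and labels are toy examples; only the logic (read the word character-by-character, keep the final hidden state, fit a shallow logistic classifier on it) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for a frozen CNLM recurrence; the probing logic is the
# same regardless of how the recurrence weights were obtained.
VOCAB = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
EMB = rng.normal(size=(len(VOCAB), 16))
W = rng.normal(size=(16, 32))
U = rng.normal(size=(32, 32)) * 0.1

def word_representation(word):
    """Read the word character-by-character; the final hidden state is
    the model's implicit representation of the word."""
    h = np.zeros(32)
    for c in word:
        h = np.tanh(EMB[VOCAB[c]] @ W + h @ U)
    return h

def train_logistic(X, y, epochs=2000, lr=0.1):
    """Shallow diagnostic classifier: logistic regression via gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

# Toy probing set: 1 = verb, 0 = noun (labels are illustrative only).
words = ["stehen", "gehen", "westen", "garten"]
labels = np.array([1, 1, 0, 0])
X = np.stack([word_representation(w) for w in words])
w, b = train_logistic(X, labels)
accuracy = ((X @ w + b > 0).astype(int) == labels).mean()
```

In the experiments, the classifier is kept deliberately shallow and data-poor precisely so that success can be attributed to the representations rather than to the probe.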

Word classes (nouns vs. verbs) For both
German and Italian, we sampled 500 verbs and
500 nouns from the Wikipedia training sets,
requiring that they are unambiguously tagged in
the corpus by TreeTagger. Verbal and nominal
forms are often cued by suffixes. We removed
this confound by selecting examples with the
same ending across the two categories (-en in
German: Westen ‘west’,9 stehen ‘to stand’; and
-re in Italian: autore ‘author’, dire ‘to say’). We
randomly selected 20 training examples (10 nouns
and 10 verbs), and tested on the remaining items.
We repeated the experiment 100 times to account
for random train-test split variation.

Although we controlled for suffixes as
described above, it could still be the case that
other substrings reliably cue verbs or nouns. We
other substrings reliably cue verbs or nouns. Nosotros
thus considered a baseline trained on word-internal
information only, namely, a character-level LSTM
autoencoder trained on the Wikipedia datasets to
reconstruct words in isolation.10 The hidden state
of the LSTM autoencoder should capture dis-
criminating orthographic features, but, by design,
will have no access to broader contexts. We
further considered word embeddings from the
output layer of the WordNLM. Unlike CNLMs,
the WordNLM cannot make educated guesses
about words that are not in its training vocabulary.
These out of vocabulary (OOV) words are by con-
struction less frequent, and thus likely to be in gen-
eral more difficult. To get a sense of both ‘‘best-
case scenario’’ and more realistic WordNLM
performance, we report its accuracy both exclud-
ing and including OOV items (WordNLMsubs. and
WordNLM in Table 2, respectively). In the lat-
ter case, we let the model make a random guess
for OOV items. The percentage of OOV items
over the entire dataset, balanced for nouns and
verbs, was 92.3% for German and 69.4% for
Italian. Note that none of the words were OOV

9German nouns are capitalized; this cue is unavailable to

the CNLM as we lower-case the input.

10The autoencoder is implemented as a standard LSTM
sequence-to-sequence model (Sutskever et al., 2014). For
each language, autoencoder hyperparameters were chosen
using random search, as for the language models; details
are in supplementary material to be made available upon
publication. For both German and Italian models, the fol-
lowing parameters were chosen: 2 layers, 100 embedding
dimensions, 1024 hidden dimensions.


                German          Italian
Random          50.0            50.0
Autoencoder     65.1 (± 0.22)   82.8 (± 0.26)
LSTM            89.0 (± 0.14)   95.0 (± 0.10)
RNN             82.0 (± 0.64)   91.9 (± 0.24)
WordNLM         53.5 (± 0.18)   62.5 (± 0.26)
WordNLMsubs.    97.4 (± 0.05)   96.0 (± 0.06)

Table 2: Accuracy of diagnostic classifier on
predicting word class, with standard errors across
100 random train-test splits. ‘subs.’ marks in-
vocabulary subset evaluation, not comparable
with the other results.

for the CNLM, as they all were taken from the
Wikipedia training set.

Results are in Table 2. All language models
outperform the autoencoders, showing that they
learned categories based on broader distributional
evidence, not just typical strings cuing nouns and
verbs. Moreover, the LSTM CNLM outperforms
the RNN, probably because it can track broader
contexts. Unsurprisingly, the word-based model
fares better on in-vocabulary words, but the gap,
especially in Italian, is rather narrow, and there
is a strong negative impact of OOV words (as
expected, given that WordNLM is at random on
them).

Number We turn next to number, a more gran-
ular morphological feature. We study German,
as it possesses a rich system of nominal classes
forming plural through different morphological
processes. We train a diagnostic number clas-
sifier on a subset of these classes, and test on
the others, in order to probe the abstract number
generalization capabilities of the tested models.
If a model generalizes correctly, it means that
the CNLM is sensitive to number as an abstract
feature, independently of its surface expression.

We extracted plural nouns from the Wiktionary
and the German UD treebank (McDonald et al.,
2013; Brants et al., 2002). We selected nouns with
plurals in -n, -s, or -e to train the classifier (e.g.,
Geschichte(n) ‘story(-es)’, Radio(s) ‘radio(s)’,
Pferd(e) ‘horse(s)’, respectively). We tested on
plurals formed with -r (e.g., Lieder for singular
Lied ‘song’), or through vowel change (Umlaut,
e.g., Äpfel from singular Apfel ‘apple’). Certain
nouns form plurals through concurrent suffixing
and Umlaut. We grouped these together with
nouns using the same suffix, reserving the Umlaut
group for nouns only undergoing vowel change


               train classes   test classes
               -n/-s/-e        -r             Umlaut
Random         50.0            50.0           50.0
Autoencoder    61.4 (± 0.9)    50.7 (± 0.8)   51.9 (± 0.4)
LSTM           71.5 (± 0.8)    78.8 (± 0.6)   60.8 (± 0.6)
RNN            65.4 (± 0.9)    59.8 (± 1.0)   56.7 (± 0.7)
WordNLM        77.3 (± 0.7)    77.1 (± 0.5)   74.2 (± 0.6)
WordNLMsubs.   97.1 (± 0.3)    90.7 (± 0.1)   97.5 (± 0.1)

Table 3: German number classification accuracy,
with standard errors computed from 200 random
train-test splits. ‘subs.’ marks in-vocabulary
subset evaluation, not comparable to the other
results.

(e.g., Saft/Säfte ‘juice(s)’ would be an instance
of -e suffixation). The diagnostic classifier was
trained on 15 singulars and plurals randomly se-
lected from each training class. As plural suffixes
make words longer, we sampled singulars and
plurals from a single distribution over lengths,
to ensure that their lengths were approximately
matched. Moreover, because in uncontrolled sam-
ples from our training classes a final -e- vowel
would constitute a strong surface cue to plurality,
we balanced the distribution of this property across
singulars and plurals in the samples. For the test
set, we selected all plurals in -r (127) or Umlaut
(38), with their respective singulars. We also used
all remaining plurals ending in -n (1,467), -s (98),
and -e (832) as in-domain test data. To control for
the impact of training sample selection, we report
accuracies averaged over 200 random train-test
splits and standard errors over these splits. For
WordNLM OOV, there were 45.0% OOVs in the
training classes, 49.1% among the -r forms, and
52.1% for Umlaut.

Results are in Table 3. The classifier based on
word embeddings is the most successful. It out-
performs in most cases the best CNLM even
in the more cogent OOV-inclusive evaluation.
This confirms the common observation that word
embeddings reliably encode number (Mikolov
et al., 2013b). Again, the LSTM-based CNLM
is better than the RNN, but both significantly
outperform the autoencoder. The latter is near-
random on new class prediction, confirming that
we properly controlled for orthographic confounds.
We observe a considerable drop in the LSTM
CNLM performance between generalization to
-r and Umlaut. On the one hand, the fact that
performance is still clearly above chance (and
autoencoder) in the latter condition shows that


the LSTM CNLM has a somewhat abstract
notion of number not tied to specific orthographic
exponents. On the other, the -r vs. Umlaut dif-
ference suggests that the generalization is not com-
pletely abstract, as it works more reliably when
the target is a new suffixation pattern, albeit one
that is distinct from those seen in training, than
when it is a purely non-concatenative process.

4.2 Capturing syntactic dependencies

Words encapsulate linguistic information into
units that are then put into relation by syntac-
tic rules. A long tradition in linguistics has even
claimed that syntax is blind to sub-word-level
processes (e.g., Chomsky, 1970; Di Sciullo and
Williams, 1987; Bresnan and Mchombo, 1995;
Williams, 2007). Can our CNLMs, despite the
lack of an explicit word lexicon, capture relational
syntactic phenomena, such as agreement and case
assignment? We investigate this by testing them
on syntactic dependencies between non-adjacent
words. We adopt the ‘‘grammaticality judgment’’
paradigm of Linzen et al. (2016). We create mini-
mal sets of grammatical and ungrammatical phrases
illustrating the phenomenon of interest, and let the
language model assign a likelihood to all items in
the set. The language model is said to ‘‘prefer’’
the grammatical variant if it assigns a higher like-
lihood to it than to its ungrammatical counterparts.
We must stress two methodological points. First,
because a character-level language model assigns
a probability to each character of a phrase, and the
phrase likelihood is the product of these values
(all between 0 and 1), minimal sets must be con-
trolled for character length. This makes existing
benchmarks unusable. Second, the ‘‘distance’’ of a
relation is defined differently for a character-level
model, and it is not straightforward to quan-
tify. Consider the German phrase in Example (1)
below. For a word model, two items separate the
article from the noun. For a (space-less) character
model, eight characters intervene until the noun
onset, but the span to consider will typically be
longer. For example, Baum could be the beginning
of the feminine noun Baumwolle ‘cotton’, which
would change the agreement requirements on the
article. So, until the model finds evidence that it
fully parsed the head noun, it cannot reliably check
agreement. This will typically require parsing at
least the full noun and the first character following
it. We again focus on German and Italian, as their


Figure 1: Accuracy in the German syntax tasks, as a function of number of intervening words.

richer inflectional morphology simplifies the task
of constructing balanced minimal sets.
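The scoring procedure above can be sketched with a toy character model standing in for the CNLM (an add-one-smoothed bigram model here, trained on made-up text; the real models condition on the full left context). The sketch keeps both methodological points: whitespace-free scoring with delimiting full stops, and length-matched minimal sets.

```python
import math
from collections import Counter

def make_char_model(training_text):
    """Toy stand-in for the CNLM: add-one-smoothed character bigrams."""
    bigrams = Counter(zip(training_text, training_text[1:]))
    unigrams = Counter(training_text)
    n_chars = len(set(training_text))
    def logprob(prev, c):
        return math.log((bigrams[(prev, c)] + 1) / (unigrams[prev] + n_chars))
    return logprob

def phrase_logprob(logprob, phrase):
    """Phrase likelihood = product of per-character probabilities, i.e. the
    sum of log-probs; whitespace is removed and full stops delimit the phrase."""
    s = "." + phrase.replace(" ", "") + "."
    return sum(logprob(p, c) for p, c in zip(s, s[1:]))

def prefers_grammatical(logprob, grammatical, *ungrammatical):
    """Hit iff the grammatical variant scores highest. Variants must be
    length-matched, since shorter strings get higher likelihood for free."""
    assert all(len(u) == len(grammatical) for u in ungrammatical)
    return phrase_logprob(logprob, grammatical) > max(
        phrase_logprob(logprob, u) for u in ungrammatical)
```

For instance, a model exposed only to ".derbaum." strings should prefer "der baum" over the length-matched "die baum" and "das baum".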

4.2.1 German
Article-noun gender agreement Each German
noun belongs to one of three genders (masculine,
feminine, neuter), morphologically marked on
the article. As the article and the noun can be
separated by adjectives and adverbs, we can probe
knowledge of lexical gender together with long-
distance agreement. We create stimuli of the form

(1)  {der, die, das}  sehr  rote  Baum
     the              very  red   tree

where the correct nominative singular article (der,
in this case) matches the gender of the noun.
We then run the CNLM on the three versions
of this phrase (removing whitespace) and record
the probabilities it assigns to them. If the model
assigns the highest probability to the version with
the right article, we count it as a hit for the model.
To avoid phrase segmentation ambiguities (as in
the Baum/Baumwolle example above), we present
phrases surrounded by full stops.

To build the test set, we select all 4,581 nomina-
tive singular nouns from the German UD treebank:
49.3% feminine, 26.4% masculine, and 24.3%
neuter. WordNLM OOV noun ratios are: 40.0%
for masculine, 36.2% for feminine, and 41.5% for
neuter. We construct four conditions varying the
number of adverbs and adjectives between article
and noun. We first consider stimuli where no
material intervenes. In the second condition, an
adjective with the correct case ending, randomly
selected from the training corpus, is added. Cru-
cially, the ending of the adjective does not reveal
the gender of the noun. We only used adjectives

occurring at least 100 times, and not ending in
-r.11 We obtained a pool of 9,742 adjectives to
sample from, also used in subsequent experi-
ments. A total of 74.9% of these were OOV for
the WordNLM. In the third and fourth conditions,
one (sehr) or two adverbs (sehr extrem) intervene
between article and adjective. These do not cue
gender either. We obtained 2,290 (m.), 2,261 (f.),
and 1,111 (n.) stimuli, respectively. To control for
surface co-occurrence statistics in the input, nosotros
constructed an n-gram baseline picking the article
most frequently occurring before the phrase in the
training data, breaking ties randomly. OOVs were
excluded from WordNLM evaluation, Resultando en
an easier test for this rival model. Sin embargo, aquí
and in the next two tasks, CNLM performance on
this reduced set was only slightly better, and we
do not report it. We report accuracy averaged over
nouns belonging to each of the three genders. Por
diseño, the random baseline accuracy is 33%.
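The n-gram control can be sketched roughly as below. This is our reconstruction under stated assumptions (token lists, function name, and tie-breaking details are illustrative, not the authors' code).

```python
from collections import Counter
import random

# Sketch of the n-gram baseline: pick the article most frequently
# preceding the exact phrase in the training data, breaking ties randomly.
def ngram_baseline(training_tokens, articles, phrase_tokens):
    counts = Counter()
    n = len(phrase_tokens)
    for i in range(1, len(training_tokens) - n + 1):
        if (training_tokens[i : i + n] == phrase_tokens
                and training_tokens[i - 1] in articles):
            counts[training_tokens[i - 1]] += 1
    if not counts:
        return random.choice(articles)  # unseen phrase: random guess
    top = max(counts.values())
    return random.choice([a for a, c in counts.items() if c == top])
```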

Results are presented in Figure 1 (left). WordNLM performs best, followed by the LSTM CNLM.
The n-gram baseline performs similarly to the CNLM when there is no intervening material,
which is expected, as a noun will often be preceded by its article in the corpus. However, its
accuracy drops to chance level (0.33) in the presence of an adjective, whereas the CNLM is
still able to track agreement. The RNN variant is much worse. It is outperformed by the n-gram
model in the adjacent condition, and it drops to random accuracy as more material intervenes.
We emphasized at the outset of this section that

11Adjectives ending in -r often reflect lemmatization problems, as TreeTagger occasionally failed
to remove the inflectional suffix -r when lemmatizing. We needed to extract lemmas, as we
constructed the appropriate inflected forms on their basis.

Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00283/1923543/tacl_a_00283.pdf by guest on 08 September 2023
CNLMs must track agreement across much wider spans than word-based models. The LSTM
variant's ability to preserve information for longer might play a crucial role here.

Article-noun case agreement We selected the two determiners dem and des, which unambig-
uously indicate dative and genitive case, respectively, for masculine and neuter nouns:

(2) a. {dem, des} sehr roten Baum
       the        very red   tree (dative)
    b. {dem, des} sehr roten Baums
       the        very red   tree (genitive)

We selected all noun lemmas of the appropriate genders from the German UD treebank, and
extracted morphological paradigms from Wiktionary to obtain case-marked forms, retaining
only nouns unambiguously marking the two cases (4,509 nouns). We created four conditions, vary-
ing the amount of intervening material, as in the gender agreement experiment (4,509 stimuli per
condition). For 81.3% of the nouns, at least one of the two forms was OOV for the WordNLM, and
we tested the latter on the full-coverage subset. Random baseline accuracy is 50%.

Results are in Figure 1 (center). Again, WordNLM has the best performance, but the LSTM CNLM
is competitive as more elements intervene. Accuracy stays well above 80% even with three inter-
vening words. The n-gram model performs well if there is no intervening material (again reflect-
ing the obvious fact that article-noun sequences are frequent in the corpus), and at chance other-
wise. The RNN CNLM accuracy is above chance with one and two intervening elements, but drops
considerably with distance.

Prepositional case subcategorization German verbs and prepositions lexically specify their
object's case. We study the preposition mit 'with', which selects a dative object. We focus on mit, as
it unambiguously requires a dative object, and it is extremely frequent in the Wikipedia corpus we are
using. To build the test set, we select objects whose head noun is a nominalized adjective, with regular,
overtly marked case inflection. We use the same adjective pool as in the preceding experiments.
We then select all sentences containing a mit prepositional phrase in the German Universal
Dependencies treebank, subject to the constraints


that (1) the head of the noun phrase governed by the preposition is not a pronoun (replacing
such items with a nominal object often results in ungrammaticality), and (2) the governed noun
phrase is continuous, in the sense that it is not interrupted by words that do not belong to it.12 We
obtained 1,629 such sentences. For each sentence, we remove the prepositional phrase and replace
it by a phrase of the form

(3) mit  der sehr {rote, roten}
    with the very red one

where only the -en (dative) version of the adjective is compatible with the case requirement of the
preposition (and the intervening material does not disambiguate case). We construct three conditions
by varying the presence and number of adverbs (sehr 'very', sehr extrem 'very extremely', sehr
extrem unglaublich 'very extremely incredibly'). Note that here the correct form is longer than
the wrong one. As the overall likelihood is the product of character probabilities ranging between
0 and 1, if this introduces a length bias, the latter will work against the character models. Note also
that we embed test phrases into full sentences (e.g., Die Figur hat mit der roten gespielt und
meistens gewonnen. 'The figure played with the red one and mostly won'). We do this because this
will disambiguate the final element of the phrase as a noun (not an adjective), and exclude the
reading in which mit is a particle not governing the noun phrase of interest (Dudenredaktion, 2019).13
When running the WordNLM, we excluded OOV adjectives as in the previous experiments, but did
not apply further OOV filtering to the sentence frames. For the n-gram baseline, we only counted
occurrences of the prepositional phrase, omitting the sentential contexts. Random baseline accuracy
is 50%.

We also created control stimuli where all words
up to and including the preposition are removed
(the example sentence above becomes: der roten
gespielt und meistens gewonnen). If a model’s
accuracy is lower on these control stimuli than on
the full ones, its performance cannot be simply

12The main source of noun phrase discontinuity in the
German UD corpus is extraposition, a common phenomenon
where part of the noun phrase is separated from the rest by
the verb.

13An example of this unintended reading of mit is: Ich war
mit der erste, der hier war. ‘I was one of the first who arrived
here.’ In this context, dative ersten would be ungrammatical.


explained by the different unigram probabilities
of the two adjective forms.

Results are shown in Figure 1 (right). Only the n-gram baseline fails to outperform control
accuracy (dotted). Surprisingly, the LSTM CNLM slightly outperforms the WordNLM, even though
the latter is evaluated on the easier full-lexical-coverage stimulus subset. Neither model shows
accuracy decay as the number of adverbs increases. As before, the n-gram model drops to
chance as adverbs intervene, whereas the RNN CNLM starts with low accuracy that progressively
decays below chance.

4.2.2 Italian

Article-noun gender agreement Similar to German, Italian articles agree with the noun in
gender; however, Italian has a relatively extended paradigm of masculine and feminine nouns dif-
fering only in the final vowel (-o and -a, respectively), allowing us to test agreement in fully
controlled paradigms such as the following:

(4) a. {il, la} congeniale candidato
       the      congenial  candidate (m.)
    b. {il, la} congeniale candidata
       the      congenial  candidate (f.)

The intervening adjective, ending in -e, does not cue gender. We constructed the stimuli with
words appearing at least 100 times in the training corpus. We moreover required the -a and -o
forms of a noun to be reasonably balanced in frequency (neither form is more than twice as
frequent as the other), or both rather frequent (appear at least 500 times). As the prenominal
adjectives are somewhat marked, we only considered -e adjectives that occur prenominally
with at least 10 distinct nouns in the training corpus. Here and below, stimuli were manually
checked, removing nonsensical adjective-noun (below, adverb-adjective) combinations. Finally,
adjective-noun combinations that occurred in the training corpus were excluded, so that an n-
gram baseline would perform at chance level. We obtained 15,005 stimulus pairs in total. 35.8%
of them contained an adjective or noun that was OOV for the WordNLM. Again, we report this

              CNLM LSTM   CNLM RNN   WordNLM
Noun Gender        93.1       79.2      97.4
Adj. Gender        99.5       98.9      99.5
Adj. Number        99.0       84.5     100.0

Table 4: Italian agreement results. Random baseline accuracy is 50% in all three experiments.

model’s results on its full-coverage subset, dónde
the CNLM performance is only slightly above the
one reported.

Results are shown on the first line of Table 4.
WordNLM shows the strongest performance,
closely followed by the LSTM CNLM. The RNN
CNLM performs strongly above chance (50%),
but again lags behind the LSTM.

Article-adjective gender agreement We next consider agreement between articles and adjec-
tives with an intervening adverb:

(5) a. il meno {alieno, aliena}
       the (m.) less alien one
    b. la meno {alieno, aliena}
       the (f.) less alien one

where we used the adverbs più 'more', meno 'less', tanto 'so much'. We considered only ad-
jectives that occurred 1K times in the training corpus (as adjectives ending in -a/-o are very com-
mon). We excluded all cases in which the adverb-adjective combination occurred in the training
corpus, obtaining 88 stimulus pairs. Because of the restriction to common adjectives, there were
no WordNLM OOVs. Results are shown on the second line of Table 4; all three models perform
almost perfectly. Possibly, the task is made easy by the use of extremely common adverbs and
adjectives.

Article-adjective number agreement Finally, we constructed a version of the last test that
probed number agreement. For feminine forms, it is possible to compare same-length phrases
such as:


(6) a. la meno {aliena, aliene}
       the (s.) less alien one(s)
    b. le meno {aliena, aliene}
       the (p.) less alien one(s)


Stimulus selection was as in the last experiment, but we used a 500-occurrences threshold for ad-
jectives, as feminine plurals are less common, obtaining 99 pairs. Again, no adverb-adjective
combination was attested. There were no OOV items for the WordNLM. Results are shown on
the third line of Table 4; the LSTMs perform almost perfectly, and the RNN is strongly above
chance.

4.3 Semantics-driven sentence completion

We probe whether CNLMs are capable of tracking the shallow form of word-level semantics
required in a fill-the-gap test. We turn now to English, as for this language we can use the
Microsoft Research Sentence Completion task (Zweig and Burges, 2011). The challenge consists
of sentences with a gap, with 5 possible choices to fill it. Language models can be directly applied
to the task, by calculating the likelihood of sentence variants with all possible completions, and
selecting the one with the highest likelihood.
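In sketch form, with a hypothetical `logprob` sentence scorer standing in for the language model (the function name and gap marker are our illustrative assumptions):

```python
# Minimal sketch of likelihood-based gap filling; `logprob` is a
# hypothetical function returning a sentence's log-likelihood under the LM.
def complete(logprob, template, choices):
    # Fill the "___" gap with each candidate and keep the likeliest variant.
    return max(choices, key=lambda c: logprob(template.replace("___", c)))
```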

The creators of the benchmark took multiple precautions to ensure that success on
the task implies some command of semantics. The multiple choices were controlled for
frequency, and the annotators were encouraged to choose confounders whose elimination
required ''semantic knowledge and logical inference'' (Zweig and Burges, 2011). For
example, the right choice in ''Was she his
[client|musings|discomfiture|choice|opportunity], his friend, or his mistress?'' depends on the cue
that the missing word is coordinated with friend and mistress, and the latter are animate entities.

The task domain (Sherlock Holmes novels) is very different from the Wikipedia dataset on
which we originally trained our models. For a fairer comparison with previous work, we re-
trained our models on the corpus provided with the benchmark, consisting of 41 million words
from 19th century English novels (we removed whitespace from this corpus as well).

Results are in Table 5. We confirm the importance of in-domain training, as the models
trained on Wikipedia perform poorly (but still above chance level, which is at 20%). With in-
domain training, the LSTM CNLM outperforms many earlier word-level neural models, and is
only slightly below our WordNLM. The RNN is not successful even when trained in-domain,


Our models (wiki/in-domain)
LSTM         34.1/59.0
RNN          24.3/24.0
WordNLM      37.1/63.3

From the literature
KN5               48.0     Skipgram          40.0
Word RNN          58.9     Skipgram + RNNs   45.0
Word LSTM         61.4     PMI               56.0
LdTreeLSTM        65.1     Context-Embed     60.7

Table 5: Results on MSR Sentence Completion. For our models (top), we show accuracies for
Wikipedia (left) and in-domain (right) training. We compare with language models from prior
work (left): Kneser-Ney 5-gram model (Mikolov, 2012), Word RNN (Zweig et al., 2012), Word
LSTM and LdTreeLSTM (Zhang et al., 2016). We further report models incorporating
distributional encodings of semantics (right): Skipgram(+RNNs) from Mikolov et al. (2013a),
the PMI-based model of Woods (2016), and the Context-Embedding-based approach of Melamud
et al. (2016).

contrasting with the word-based vanilla RNN from the literature, whose performance, while still
below LSTMs, is much stronger. Once more, this suggests that capturing word-level generalizations
with a word-lexicon-less character model requires the long-span processing abilities of an LSTM.

4.4 Boundary tracking in CNLMs

The good performance of CNLMs on most tasks above suggests that, although they lack a hard-
coded word vocabulary and they were trained on unsegmented input, there is enough pressure from
the language modeling task for them to learn to track word-like items, and associate them with
various morphological, syntactic, and semantic properties. In this section, we take a direct look at
how CNLMs might be segmenting their input.
Kementchedjhieva and Lopez (2018) found a single unit in their English CNLM that seems,
qualitatively, to be tracking morpheme/word boundaries. Because they trained the model with
whitespace, the main function of this unit could simply be to predict the very frequent whitespace
character. We conjecture instead (like them) that the ability to segment the input into meaningful
items is so important when processing language that CNLMs will specialize units for boundary
tracking even when trained without whitespace.

To look for ''boundary units,'' we created a random set of 10,000 positions from the training
set, balanced between those corresponding to a


Figure 2: Examples of the LSTM CNLM boundary unit activation profile, with ground-truth word boundaries
marked in green. English: It was co-produced with Martin Buttrich over at. . . . German: Systeme, deren
Hauptaufgabe die transformati(-en) 'systems, whose main task is the transformation. . . '. Italian: in seguito alle
dimissioni del Sommo Pontefice 'following the resignation of the Supreme Pontiff. . . '.

word-final character and those occurring word-initially or word-medially. We then computed,
for each hidden unit, the Pearson correlation between its activations and a binary variable that
takes value 1 in word-final position and 0 elsewhere. For each language and model (LSTM or
RNN), we found very few units with a high correlation score, suggesting that the models have
indeed specialized units for boundary tracking. We further study the units with the highest corre-
lations, which are, for the LSTMs, 0.58 (English), 0.69 (German), and 0.57 (Italian). For the RNNs,
the highest correlations are 0.40 (English), and 0.46 (German and Italian).14
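The unit search can be sketched as below. This is our reconstruction, not the paper's code; activation and label containers are illustrative assumptions.

```python
import math

# Sketch of the boundary-unit search: correlate every hidden unit's
# activations with a 0/1 variable marking word-final character positions,
# and return the most strongly correlated unit.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def find_boundary_unit(activations, is_final):
    # `activations[i][u]` is unit u's activation at sampled position i;
    # `is_final[i]` is 1 if that character ends a word, else 0.
    n_units = len(activations[0])
    corrs = [pearson([row[u] for row in activations], is_final)
             for u in range(n_units)]
    best = max(range(n_units), key=lambda u: abs(corrs[u]))
    return best, corrs[best]
```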

Examples We looked at the behavior of the selected LSTM units qualitatively by extracting
random sets of 40-character strings from the development partition of each language (left-
aligned with word onsets) and plotting the corresponding boundary unit activations. Figure 2
reports illustrative examples. In all languages, most peaks in activation mark word boundaries.
However, other interesting patterns emerge. In English, we see how the unit reasonably treats

14In an early version of this analysis, we arbitrarily imposed a minimum 0.70 correlation threshold,
missing the presence of these units. We thank the reviewer who encouraged us to look further into
the matter.

co- and produced in co-produced as separate elements, and it also posits a weaker boundary
after the prefix pro-. As it proceeds left-to-right, with no information on what follows, the network
posits a boundary after but in Buttrich. In the German example, we observe how the complex
word Hauptaufgabe ('main task') is segmented into the morphemes haupt, auf and gabe. Sim-
ilarly, in the final transformati- fragment, we observe a weak boundary after the prefix trans.
In the pronoun deren 'whose', the case suffix -n is separated. In Italian, in seguito a is a lexi-
calized multi-word sequence meaning 'following' (literally: 'in continuation to'). The boundary
unit does not spike inside it. Similarly, the fixed expression Sommo Pontefice (referring to
the Pope) does not trigger inner boundary unit activation spikes. On the other hand, we notice
peaks after di and mi in dimissioni. Again, in left-to-right processing, the unit has a tendency
to immediately posit boundaries when frequent function words are encountered.

Detecting word boundaries To gain a more quantitative understanding of how well the
boundary unit is tracking word boundaries, we trained a single-parameter diagnostic classifier on
the activation of the unit (the classifier simply sets an optimal threshold on the unit activation

          LSTM    LSTM   RNN     RNN
          single  full   single  full
English    87.7   93.0    65.6   90.5
German     86.6   91.9    70.4   85.0
Italian    85.6   92.2    71.3   91.5

Table 6: F1 of single-unit and full-hidden-state word-boundary diagnostic classifiers, trained and
tested on uncontrolled running text.

          LSTM    LSTM   RNN     RNN
          single  full   single  full
English    77.5   90.0    65.9   76.8
German     80.8   79.7    67.0   75.8
Italian    75.5   82.9    71.4   75.9

Table 7: Accuracy of single-unit and full-hidden-state word-boundary diagnostic classifiers,
trained and tested on balanced data requiring new-word generalization. Chance accuracy is at
50%.

to separate word boundaries from word-internal positions). We ran two experiments. In the first,
following standard practice, we trained and tested the classifier on uncontrolled running text. We
used 1k characters for training, 1M for testing, both taken from the left-out Wikipedia test
partitions. We will report F1 performance on this task.
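The single-parameter classifier amounts to a threshold search, which might look like this (a sketch under our assumptions, not the authors' implementation):

```python
# Sketch of the single-parameter diagnostic classifier: the only learned
# parameter is a threshold on the boundary unit's activation, chosen to
# maximize F1 on the training sample.
def fit_threshold(activations, labels):
    best_t, best_f1 = None, -1.0
    for t in sorted(set(activations)):
        tp = sum(1 for a, y in zip(activations, labels) if a >= t and y == 1)
        fp = sum(1 for a, y in zip(activations, labels) if a >= t and y == 0)
        fn = sum(1 for a, y in zip(activations, labels) if a < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

At test time, a position is classified as a word boundary whenever the unit's activation reaches the fitted threshold.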

We also considered a more cogent evaluation regime, in which we split training and test data
so that the number of boundary and non-boundary conditions are balanced, and there is no overlap
between training and test words. Specifically, we randomly selected positions from the test
partitions of the Wikipedia corpus, such that half of these were the last character of a token,
and the other half were not. We sampled the test data points subject to the constraint that
the word (in the case of a boundary position) or word prefix (in the case of a word-internal
position) ending at the selected character does not overlap with the training set. This ensures that a
classifier cannot succeed by looking for encodings reflecting specific words. For each datapoint, we
fed a substring of the 40 preceding characters to the CNLM. We collected 1,000 such points for
training, and tested on 1M additional datapoints. In this case, we will report classification accuracy as
figure of merit. For reference, in both experiments we also trained diagnostic classifiers on the full
hidden layer of the LSTMs.

Looking at the F1 results on uncontrolled running text (Table 6), we observe first that the
LSTM-based full-hidden-layer classifier has strong performance in all 3 languages, confirming that
the LSTM model encodes boundary information. Moreover, in all languages, a large proportion
of this performance is already accounted for by the single-parameter classifier using boundary unit
activations. This confirms that tracking boundaries is important enough for the network to devote a
specialized unit to this task. Full-layer RNN results


are below LSTM-level but still strong. There is, however, a stronger drop from full-layer to single-
unit classification. This is in line with the fact that, as reported above, the candidate RNN boundary
units have lower boundary correlations than the LSTM ones.

Results for the balanced classifiers tested on new-word generalization are shown in Table 7
(because of the different nature of the experiments, these are not directly comparable to the F1
results in Table 6). Again, we observe a strong performance of the LSTM-based full-hidden-
layer classifier across the board. The LSTM single-parameter classifier using boundary unit
activations is also strong, even outperforming the full classifier in German. Moreover, in this
more cogent setup, the single-unit LSTM classifier is at least competitive with the full-layer RNN
classifier in all languages. The weaker results of RNNs in the word-centric tasks of the previous
sections might in part be due to their poorer overall ability to track word boundaries, as specifically
suggested by this stricter evaluation setup.

Error analysis As a final way to characterize the function and behaviour of the boundary units,
we inspected the most frequent under- and over-segmentation errors made by the classifier based
on the single boundary units, in the more difficult balanced task. We discuss German here, as it is
the language where the classifier reaches highest accuracy, and its tendency to have long, mor-
phologically complex words makes it particularly interesting. However, similar patterns were also
detected in Italian and, to a lesser extent, English (in the latter, there are fewer and less interpretable
common oversegmentations, probably because words are on average shorter and morphology
more limited).

Considering first the 30 most common undersegmentations, the large majority (24 of 30) are



                      LSTM                  RNN                 WordNLM
                   Ge.   En.   It.     Ge.    En.    It.     Ge.    En.    It.
Batch Size         512   128   128     256    256    256     128    128    128
Embedding Size     100   200   200      50    200     50     200   1024    200
Dimension         1024  1024  1024    2048   2048   2048    1024   1024   1024
Layers               2     3     2       2      2      2       2      2      2
Learning Rate      2.0   3.6   3.2     0.1   0.01    0.1     0.9    1.1    1.2
Decay              1.0  0.95  0.98    0.95    0.9   0.95     1.0    1.0   0.98
BPTT Length         50    80    80      30     50     30      50     50     50
Hidden Dropout     0.0  0.01   0.0     0.0   0.05    0.0    0.15   0.15   0.05
Embedding Dropout 0.01   0.0   0.0     0.0   0.01    0.0     0.1    0.0    0.0
Input Dropout      0.0 0.001   0.0    0.01  0.001   0.01   0.001   0.01   0.01
Nonlinearity         -     -     -    tanh   ReLU   tanh       -      -      -

Table 8: Chosen hyperparameters.

to’),

common sequences of grammatical terms or very
frequent items that can sometimes be reasonably
re-analyzed as single function words or adverbs
(p.ej., bis zu, ‘up to’ (lit. ‘until
je nach
‘depending on’ (lit. ‘per after’), bis heute ‘to
date’ (lit. ‘until today’)). Three cases are multi-
word city names (Los Angeles). The final 3 casos
interestingly involve Bau ‘building’ followed by
von ‘of’ or genitive determiners der/des. In its
eventive reading,
this noun requires a patient
licensed by either a preposition or the genitive
determiner (p.ej., Bau der Mauer ‘building of the
wall’ (lit. ‘building the-GEN wall’)). Apparently
the model decided to absorb the case assigner into
the form of the noun.

We looked next at the 30 most common oversegmentations, that is, at the substrings that
were wrongly segmented out of the largest number of distinct words. We limited the analysis to
those containing at least 3 characters, because shorter strings were ambiguous and hard to
interpret. Among the top oversegmentations, 6 are prefixes that can also occur in isolation as
prepositions or verb particles (auf 'on', nach 'after', etc.). Seven are content words that form
many compounds (e.g., haupt 'main', occurring in Hauptstadt 'capital', Hauptbahnhof 'main
station'; Land 'land', occurring in Deutschland 'Germany', Landkreis 'district'). Another 7 items
can be classified as suffixes (e.g., -lich as in südlich 'southern', wissenschaftlich 'scientific'), although
their segmentation is not always canonical (e.g., -chaft instead of the expected -schaft
in Wissenschaft 'science'). Four very common function words are often wrongly segmented out of
longer words (e.g., sie 'she' from sieben 'seven'). The kom and kon cases are interesting, as the
model segments them as stems (or stem fragments) in forms of the verbs kommen 'to come' and
können 'to be able to', respectively (e.g., kommt and konnte), but it also treats them as pseudo-
affixes elsewhere (komponist 'composer', kontakt 'contact'). The remaining 3 oversegmentations,
rie, run and ter, do not have any clear interpretation.
To conclude, the boundary unit, even when analyzed through the lens of a classifier that was
optimized on word-level segmentation, is actually tracking salient linguistic boundaries at different
levels. Although in many cases these boundaries naturally coincide with words (hence the high
classifier performance), the CNLM is also sensitive to frequent morphemes and compound
elements, as well as to different types of multi-word expressions. This is in line with a view
of wordhood as a useful but ''soft'', emergent property, rather than a rigid primitive of linguistic
processing.

5 Discussion

We probed the linguistic information induced by a character-level LSTM language model trained
on unsegmented text. The model was found to possess implicit knowledge about a range of
intuitively word-mediated phenomena, such as sensitivity to lexical categories and syntactic and
shallow-semantics dependencies. A model initialized with a word vocabulary and fed
tokenized input was in general superior, but the performance of the word-less model did
not lag much behind, suggesting that word priors are helpful but not strictly required. A
character-level RNN was less consistent than the LSTM, suggesting that the latter's ability


to track information across longer time spans is important to make the correct generalizations. The
character-level models consistently outperformed n-gram controls, confirming they are tapping into
more abstract patterns than local co-occurrence statistics.

As a first step towards understanding how character-level models handle supra-character
phenomena, we searched and found specialized boundary-tracking units in them. These units
are not only and not always sensitive to word boundaries, but also respond to other salient items,
such as morphemes and multi-word expressions, in accordance with an ''emergent'' and flexible
view of the basic constituents of language (Schiering et al., 2010).

Our results are preliminary in many ways. Our tests are relatively simple. We did not attempt,
for example, to model long-distance agreement in presence of distractors, a challenging task
even for humans (Gulordava et al., 2018). The results on number classification in German sug-
gest that the models might not be capturing linguistic generalizations of the correct degree
of abstractness, settling for shallower heuristics. Still, as a whole, our work suggests that a large
corpus, combined with the weak priors encoded in an LSTM, might suffice to learn generalizations
about word-mediated linguistic processes without a hard-coded word lexicon or explicit wordhood
cues.

Nearly all contemporary linguistics recognizes a central role to the lexicon (see, e.g., Sag et al.,
2003; Goldberg, 2005; Radford, 2006; Bresnan et al., 2016; Ježek, 2016, for very different
perspectives). Linguistic formalisms assume that the lexicon is essentially a dictionary of words,
possibly complemented by other units, not unlike the list of words and associated embeddings in
a standard word-based NLM. Intriguingly, our CNLMs captured a range of lexical phenomena
without anything resembling a word dictionary. Any information a CNLM might acquire about
units larger than characters must be stored in its recurrent weights. This suggests a radically
different and possibly more neurally plausible view of the lexicon as implicitly encoded in a
distributed memory, that we intend to characterize more precisely and test in future work (similar
ideas are being explored in a more applied NLP perspective, e.g., Gillick et al., 2016; Lee et al.,
2017; Cherry et al., 2018).

Concerning the model input, we would like to study whether the CNLM successes crucially
depend on the huge amount of training data it receives. Are word priors more important when
learning from smaller corpora? In terms of comparison with human learning, the Wikipedia text
we fed our CNLMs is far from what children acquiring a language would hear. Future work
should explore character/phoneme-level learning from child-directed speech corpora. Still, by
feeding our networks ''grown-up'' prose, we are arguably making the job of identifying basic
constituents harder than it might be when processing the simpler utterances of early child-
directed speech (Tomasello, 2003).

As discussed, a rigid word notion is problematic both cross-linguistically (cf. polysynthetic and
agglutinative languages) and within single linguistic systems (cf. the view that the lexicon
hosts units at different levels of the linguistic hierarchy, from morphemes to large syntactic
constructions; e.g., Jackendoff, 1997; Croft and Cruse, 2004; Goldberg, 2005). This study
provided a necessary initial check that word-free models can account for phenomena traditionally
seen as word-based. Future work should test whether such models can also account for
grammatical patterns that are harder to capture in word-based formalisms, exploring both a
typologically wider range of languages and a broader set of grammatical tests.

Acknowledgments

We would like to thank Piotr Bojanowski, Alex Cristia, Kristina Gulordava, Urvashi Khandelwal,
Germán Kruszewski, Sebastian Riedel, Hinrich Schütze, and the anonymous reviewers for
feedback and advice.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov,
Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained
analysis of sentence embeddings using auxiliary
prediction tasks. In Proceedings of ICLR Conference
Track, Toulon, France. Published online:
https://openreview.net/group?id=ICLR.cc/2017/conference.

Afra Alishahi, Marie Barking, and Grzegorz
Chrupała. 2017. Encoding of phonology in a
recurrent neural model of grounded speech.
In Proceedings of CoNLL, pages 368–378,
Vancouver.

Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00283/1923543/tacl_a_00283.pdf by guest on 08 September 2023

Moshe Bar. 2007. The proactive brain: Using anal-
ogies and associations to generate predictions.
Trends in Cognitive Science, 11(7):280–289.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad, and James Glass. 2017. Qué
do neural machine translation models learn
about morphology? In Proceedings of ACL,
pages 861–872, Vancouver.

Balthasar Bickel and Fernando Zúñiga. 2017. The
'word' in polysynthetic languages: Phonological
and syntactic challenges. In Michael Fortescue,
Marianne Mithun, and Nicholas Evans, editors,
Oxford Handbook of Polysynthesis, pages 158–186.
Oxford University Press, Oxford.

Piotr Bojanowski, Armand Joulin, and Tomas
Mikolov. 2016. Alternative structures for
character-level RNNs. In Proceedings of ICLR
Workshop Track, San Juan, Puerto Rico.
Published online:
https://openreview.net/group?id=ICLR.cc/2016/workshop.

Sabine Brants, Stefanie Dipper, Silvia Hansen,
Wolfgang Lezius, and George Smith. 2002.
The TIGER treebank. In Proceedings of
the Workshop on Treebanks and Linguistic
Theories, volume 168.

Michael Brent and Timothy Cartwright. 1996.
Distributional regularity and phonotactic
constraints are useful for segmentation.
Cognition, 61:93–125.

Joan Bresnan, Ash Asudeh, Ida Toivonen, and
Stephen Wechsler. 2016. Lexical-Functional
Syntax, 2nd ed. Blackwell, Malden, MA.

Joan Bresnan and Sam Mchombo. 1995. The
lexical integrity principle: Evidence from
Bantu. Natural Language and Linguistic
Theory, 181–254.

Colin Cherry, George Foster, Ankur Bapna,
Orhan Firat, and Wolfgang Macherey. 2018.
Revisiting character-based neural machine
translation with capacity and compression.
arXiv preimpresión arXiv:1808.09943.


Noam Chomsky. 1970. Remarks on nominalization.
In Roderick Jacobs and Peter Rosenbaum,
editors, Readings in English Transformational
Grammar, pages 184–221. Ginn, Waltham, MA.

Morten Christiansen, Christopher Conway, and
Suzanne Curtin. 2005. Multiple-cue integration
in language acquisition: A connectionist model
of speech segmentation and rule-like behavior.
In James Minett and William Wang, editors,
Language Acquisition, Change and Emergence:
Essays in Evolutionary Linguistics,
pages 205–249. City University of Hong Kong
Press, Hong Kong.

Morten Christiansen, Joseph Allen, and Mark
Seidenberg. 1998. Learning to segment
speech using multiple cues: A connectionist
model. Language and Cognitive Processes,
13(2/3):221–268.

Andy Clark. 2016. Surfing Uncertainty. Oxford
University Press, Oxford.

Alexis Conneau, Germán Kruszewski, Guillaume
Lample, Loïc Barrault, and Marco Baroni.
2018. What you can cram into a single
$&!#* vector: Probing sentence embeddings
for linguistic properties. In Proceedings of ACL,
pages 2126–2136, Melbourne.

Ryan Cotterell, Sebastian J. Mielke, Jason
Eisner, and Brian Roark. 2018. Are all
languages equally hard to language-model? In
Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 2 (Short Papers),
pages 536–541.

William Croft and Alan Cruse. 2004. Cognitive
Linguistics. Cambridge University Press,
Cambridge.

Anna-Maria Di Sciullo and Edwin Williams.
1987. On the Definition of Word. MIT Press,
Cambridge, MA.

Robert Dixon and Alexandra Aikhenvald, editors.
2002. Word: A Cross-linguistic Typology. Cambridge
University Press, Cambridge.

Dudenredaktion. 2019. mit (Adverb). Duden
online. https://www.duden.de/node/


152710/revision/152746, retrieved June 3,
2019.

Alex Graves. 2014. Generating sequences with
recurrent neural networks. CoRR, abs/1308.0850v5.

Jeffrey Elman. 1990. Finding structure in time.

Ciencia cognitiva, 14:179–211.

Allyson Ettinger, Ahmed Elgohary, Colin
Phillips, and Philip Resnik. 2018. Assessing
composition in sentence vector representations.
In Proceedings of COLING, pages 1790–1801.
Santa Fe, NM.

Robert Frank, Donald Mathis, and William
Badecker. 2013. The acquisition of anaphora
by simple recurrent networks. Language
Acquisition, 20(3):181–227.

Stefano Fusi, Earl Miller, and Mattia Rigotti.
2016. Why neurons mix: High dimensionality
for higher cognition. Current Opinion in
Neurobiology, 37:66–74.

Daniela Gerz, Ivan Vulic, Edoardo Maria Ponti,
Jason Naradowsky, Roi Reichart, and Anna
Korhonen. 2018. Language modeling for mor-
phologically rich languages: Character-aware
modeling for word-level prediction. Transactions
of the Association for Computational
Linguistics, 6:451–465.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and
Amarnag Subramanya. 2016. Multilingual
language processing from bytes. In Proceedings
of NAACL-HLT, pages 1296–1306.

Frédéric Godin, Kris Demuynck, Joni Dambre,
Wesley De Neve, and Thomas Demeester.
2018. Explaining character-aware neural
networks for word-level prediction: Do they
discover linguistic rules? In Proceedings of
EMNLP, Brussels.

Adele Goldberg. 2005. Constructions at Work:
The Nature of Generalization in Language.
Oxford University Press, Oxford.

Yoav Goldberg. 2017. Neural Network Methods
for Natural Language Processing. Morgan &
Claypool, San Francisco, CA.

Sharon Goldwater, Thomas L. Griffiths, and Mark
Johnson. 2009. A Bayesian framework for word
segmentation: Exploring the effects of context.
Cognición, 112(1):21–54.

Kristina Gulordava, Piotr Bojanowski, Edouard
Grave, Tal Linzen, and Marco Baroni. 2018.
Colorless green recurrent networks dream
hierarchically. In Proceedings of NAACL, pages
1195–1205, New Orleans, LA.

Martin Haspelmath. 2011. The indeterminacy
of word segmentation and the nature of
morphology and syntax. Folia Linguistica,
45(1):31–80.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780.

Dieuwke Hupkes, Sara Veldhoen, and Willem
Zuidema. 2018. Visualisation and ‘‘diagnostic
classifiers’’ reveal how recurrent and recursive
neural networks process hierarchical structure.
Journal of Artificial Intelligence Research,
61:907–926.

Ray Jackendoff. 1997. Twistin’ the night away.

Language, 73:534–559.

Ray Jackendoff. 2002. Foundations of Language:
Brain, Meaning, Grammar, Evolution. Oxford
University Press, Oxford.

Elisabetta Ježek. 2016. The Lexicon: An
Introduction. Oxford University Press, Oxford.

Ákos Kádár, Grzegorz Chrupała, and Afra
Alishahi. 2017. Representation of linguistic
form and function in recurrent neural networks.
Computational Linguistics, 43(4):761–780.

Herman Kamper, Aren Jansen, and Sharon
Goldwater. 2016. Unsupervised word segmentation
and lexicon discovery using acoustic
word embeddings. IEEE Transactions on
Audio, Speech and Language Processing,
24(4):669–679.

Katharina Kann, Ryan Cotterell, and Hinrich
Schütze. 2016. Neural morphological analysis:
Encoding-decoding canonical segments. In
Proceedings of EMNLP, pages 961–967,
Austin, TX.

Yova Kementchedjhieva and Adam Lopez.
2018. ‘‘Indicatements’’ that character language



models learn English morpho-syntactic units
and regularities. In Proceedings of the EMNLP
BlackboxNLP Workshop, pages 145–153.
Brussels.


Yoon Kim, Yacine Jernite, David Sontag, y
Alexander Rush. 2016. Character-aware neural
language models. In Proceedings of AAAI,
pages 2741–2749, Phoenix, AZ.

Christo Kirov and Ryan Cotterell. 2018. Recurrent
neural networks in linguistic theory: Revisiting
Pinker and Prince (1988) and the past
tense debate. Transactions of the Association
for Computational Linguistics. arXiv preprint
arXiv:1807.04783v2.

Patricia Kuhl. 2004. Early language acquisition:
Cracking the speech code. Nature Reviews
Neuroscience, 5(11):831–843.

Jey Han Lau, Alexander Clark, and Shalom Lappin.
2017. Grammaticality, acceptability, and
probability: A probabilistic view of linguistic
knowledge. Cognitive Science, 41(5):1202–1241.

Jason Lee, Kyunghyun Cho, and Thomas
Hofmann. 2017. Fully character-level neural
machine translation without explicit segmentation.
Transactions of the Association for
Computational Linguistics, 5:365–378.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan
Jurafsky. 2016. Visualizing and understanding
neural models in NLP. In Proceedings of
NAACL, pages 681–691, San Diego, CA.

Tal Linzen, Grzegorz Chrupała, and Afra
Alishahi, editors. 2018. Proceedings of
the EMNLP BlackboxNLP Workshop. ACL,
Brussels.

Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Transactions
of the Association for Computational
Linguistics, 4:521–535.

Jessica Maye, Janet Werker, and Lou Ann
Gerken. 2002. Infant sensitivity to distributional
information can affect phonetic discrimination.
Cognición, 82(3):B101–B111.

Thomas McCoy, Robert Frank, and Tal
Linzen. 2018. Revisiting the poverty of the
stimulus: Hierarchical generalization without a
hierarchical bias in recurrent neural networks.
In Proceedings of CogSci, pages 2093–2098,
Madison, WI.


Ryan McDonald, Joakim Nivre, Yvonne
Quirmbach-Brundage, Yoav Goldberg, Dipanjan
Das, Kuzman Ganchev, Keith Hall, Slav
Petrov, Hao Zhang, Oscar Täckström, et al.
2013. Universal dependency annotation for
multilingual parsing. In Proceedings of the
51st Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 92–97.

Oren Melamud, Jacob Goldberger, and Ido
Dagan. 2016. context2vec: Learning generic
context embedding with bidirectional LSTM. In
Proceedings of the 20th SIGNLL Conference
on Computational Natural Language Learning,
pages 51–61.

Stephen Merity, Nitish Shirish Keskar, y
Richard Socher. 2018. An analysis of neural
language modeling at multiple scales. arXiv
preprint arXiv:1803.08240.

Tomas Mikolov. 2012. Statistical Language Mod-
els Based on Neural Networks. Dissertation,
Brno University of Technology.

Tomas Mikolov, Kai Chen, Greg Corrado, and
Jeffrey Dean. 2013a. Efficient estimation of
word representations in vector space. CoRR,
abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Anoop Deoras,
Hai-Son Le, Stefan Kombrink, and Jan
Černocký. 2011. Subword language modeling
with neural networks. http://www.fit.
vutbr.cz/~imikolov/rnnlm/.

Tomas Mikolov, Wen-tau Yih, and Geoffrey
Zweig. 2013b. Linguistic regularities in
continuous space word representations. In
Proceedings of NAACL, pages 746–751,
Atlanta, GA.

Joe Pater. 2018. Generative linguistics and neural
networks at 60: Base, friction, and fusion.
Language. doi:10.1353/lan.2019.0005.

Alec Radford, Rafal Józefowicz, and Ilya
Sutskever. 2017. Learning to generate
reviews and discovering sentiment. CoRR,
abs/1704.01444.


Andrew Radford. 2006. Minimalist syntax revisited.
http://www.public.asu.edu/~gelderen/Radford2009.pdf.

Ivan Sag, Thomas Wasow, and Emily Bender.
2003. Syntactic Theory: A Formal Introduction,
CSLI, stanford, California.

Ren´e Schiering, Balthasar Bickel, and Kristine
Hildebrandt. 2010. The prosodic word is not
universal, but emergent. Journal of Linguistics,
46(3):657–709.

Helmut Schmid. 1999. Improvements in part-of-speech
tagging with an application to German.
In Natural Language Processing Using Very
Large Corpora, pages 13–25. Springer.

Hinrich Schütze. 2017. Nonsymbolic text repre-
sentation. In Proceedings of EACL, pages 785–796,
Valencia.

Rico Sennrich. 2017. How grammatical is
character-level neural machine translation?
Assessing MT quality with contrastive
translation pairs. In Proceedings of EACL
(Short Papers), pages 376–382, Valencia.

Ilya Sutskever, Oriol Vinyals, and Quoc V.
Le. 2014. Sequence to sequence learning
with neural networks. In Advances in
Neural Information Processing Systems,
pages 3104–3112.

Michael Tomasello. 2003. Constructing a Lan-
guage: A Usage-Based Theory of Lan-
guage Acquisition, Prensa de la Universidad de Harvard,
Cambridge, MAMÁ.

Edwin Williams. 2007. Dumping lexicalism. In
Gillian Ramchand and Charles Reiss, editors,
The Oxford Handbook of Linguistic Interfaces.
Oxford University Press, Oxford.

Aubrie Woods. 2016. Exploiting linguistic features
for sentence completion. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short
Papers), pages 438–442.

Xingxing Zhang, Liang Lu, and Mirella
Lapata. 2016. Top-down tree long short-term
memory networks. In Proceedings of
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 310–320.


Xing Shi, Inkit Padhi, and Kevin Knight.
2016. Does string-based neural MT learn
source syntax? In Proceedings of EMNLP,
pages 1526–1534, Austin, TX.

Geoffrey Zweig and Christopher Burges. 2011.
The Microsoft Research sentence completion
challenge, Technical Report MSR-TR-2011-
129, Microsoft Research.

Geoffrey Zweig, John C. Platt, Christopher Meek,
Christopher J. C. Burges, Ainur Yessenalina,
and Qiang Liu. 2012. Computational approaches
to sentence completion. In Proceedings
of the 50th Annual Meeting of the
Association for Computational Linguistics:
Long Papers - Volume 1, pages 601–610.

Ilya Sutskever, James Martens, and Geoffrey
Hinton. 2011. Generating text with recurrent
neural networks. In Proceedings of ICML, paginas
1017–1024, Bellevue, Washington.
