Report

Segmentability Differences Between
Child-Directed and Adult-Directed
Speech: A Systematic Test With
an Ecologically Valid Corpus

Alejandrina Cristia1, Emmanuel Dupoux1,2,3, Nan Bernstein Ratner4, and Melanie Soderstrom5


1Dept d’Etudes Cognitives, ENS, PSL University, EHESS, CNRS
2INRIA
3FAIR Paris
4Department of Hearing and Speech Sciences, University of Maryland
5Department of Psychology, University of Manitoba

Keywords: computational modeling, learnability, infant word segmentation, statistical learning,
lexicon

ABSTRACT

Previous computational modeling suggests it is much easier to segment words from
child-directed speech (CDS) than adult-directed speech (ADS). However, this conclusion is
based on data collected in the laboratory, with CDS from play sessions and ADS between a
parent and an experimenter, which may not be representative of ecologically collected CDS
and ADS. Fully naturalistic ADS and CDS, collected with a nonintrusive recording device
as the child went about her day, were analyzed with a diverse set of algorithms. The
difference between registers was small compared to differences between algorithms; it
shrank when corpora were matched, and it even reversed under some conditions.
These results highlight the value of studying learnability using naturalistic corpora
and diverse algorithmic definitions.

Citation: Cristia, A., Dupoux, E., Ratner, N. B., & Soderstrom, M. (2019). Segmentability Differences Between Child-Directed and Adult-Directed Speech: A Systematic Test With an Ecologically Valid Corpus. Open Mind: Discoveries in Cognitive Science, 3, 13–22. https://doi.org/10.1162/opmi_a_00022

DOI: https://doi.org/10.1162/opmi_a_00022

Supplemental Materials: https://osf.io/th75g/

Received: 15 May 2018
Accepted: 11 December 2018

Competing Interests: None of the authors declare any competing interests.

Corresponding Author: Alejandrina Cristia, alecristia@gmail.com

Copyright: © 2019 Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The MIT Press

INTRODUCTION

Although children are exposed to both child-directed speech (CDS) and adult-directed speech
(ADS), children appear to extract more information from the former than the latter (e.g., Cristia,
2013; Shneidman & Goldin-Meadow, 2012). This has led some to propose that most or all
linguistic phenomena are more easily learned from CDS than ADS (e.g., Fernald, 2000), with a
flurry of empirical literature examining specific phenomena (see Guevara-Rukoz et al., 2018,
for a recent review). Deciding whether the learnability of linguistic units is higher in CDS than
ADS is difficult for at least two reasons: It is difficult to find appropriate CDS and ADS corpora;
and one must have an idea of how children learn to check whether such a strategy is more
successful in one register than the other. In this article, we studied a highly ecological corpus
of CDS and child-overheard ADS with a variety of word segmentation strategies.

What is word segmentation? Since there are typically no silences between words in
running speech, infants may need to carve out, or segment, word forms from the continuous
stream. Several differences between CDS and ADS could affect word segmentation learnability.
Caregivers may speak in a more variable pitch, leading both to increased arousal in the child
(which should boost attention and overall performance; Thiessen, Hill, & Saffran, 2005) but
also increased acoustic variability (which makes word identification harder; Guevara-Rukoz et al.,
2018). To study word segmentation while controlling for other differences (e.g., attention capture,
fine-grained acoustics), we use computational models of word segmentation from phonologized
transcripts. Word segmentation may still be easier in CDS than ADS: CDS is characterized
by short utterances, including a high proportion of isolated words (e.g., Bernstein Ratner &
Rooney, 2001, Soderstrom, 2007, pp. 508–509, and Swingley & Humphrey, 2018, for empirical
arguments that frequency in isolation matters). Short utterances represent an easier segmentation
problem than long ones, since utterance boundaries are also word boundaries, and
proportionally more boundaries are provided for free. Other features of CDS may be beneficial
or not depending on the segmentation strategy. For example, CDS tends to have more partial
repetitions than ADS (“Where’s the dog? There’s the dog!”), which may be more helpful to
lexical algorithms (which discover recombinable units) than sublexical algorithms (which look
for local breaks, such as illegal within-word phonotactics or dips in transition probability).

Previous modeling research documents much higher segmentation scores for CDS than
ADS corpora (Batchelder, 1997, 2002; Daland & Pierrehumbert, 2011; Fourtassi, Borschinger,
Johnson, & Dupoux, 2013). Most of this work compared CDS recorded in the home or in
the lab (in the CHILDES database; MacWhinney, 2014) against lab-based corpora of adult–
adult interviews including open-ended questions ranging from profession to politics (e.g., the
Buckeye corpus; Pitt, Johnson, Hume, Kiesling, & Raymond, 2005). Consequently, differences in
segmentability could be due to confounded variables: Home recordings capture more informal
speech than interviews do, with shorter utterances and reduced lexical diversity; moreover,
since different researchers transcribed the CDS and ADS corpora, their criteria for utterance
boundaries may not be the same.

Only two studies used matched corpora, which had been collected in the laboratory
as mothers talked to their children and an experimenter. Batchelder (2002) applied a lexical
algorithm to the American English Bernstein Ratner corpus (Bernstein Ratner, 1984), and
found a 15% advantage for CDS over ADS. Ludusan, Mazuka, Bernard, Cristia, and Dupoux
(2017) applied two lexical and two sublexical algorithms to the Japanese Riken corpus
(Mazuka, Igarashi, & Nishikawa, 2006), where the CDS advantage was between 2% and
10%. Still, it is unclear whether either corpus is representative of the CDS and ADS children
hear every day. Being observed might affect parents’ CDS patterns, and thus segmentability.
Moreover, ADS portions were elicited by unfamiliar experimenters, with whom mothers may
have been more formal than in children’s typical overheard ADS. Experimenter-directed ADS
can differ significantly from ADS addressed to family members even in laboratory settings, to
the point that phonetic differences across registers are much reduced when using family-based
(rather than experimenter-based) ADS as a benchmark (E. K. Johnson, Lahey, Ernestus, & Cutler,
2013). Since prior work used laboratory-recorded samples, it is possible that it has over- or
misestimated differences in segmentability between CDS and ADS.

Therefore, we studied an ecological child-centered corpus containing both ADS and
CDS. We followed Ludusan and colleagues (2017) by using both lexical and sublexical
algorithms; in addition, we varied important parameters within these classes and further added two
baselines. In all, we aimed to provide a more accurate and generalizable estimate of the size
of segmentability differences in CDS versus ADS.

METHODS

This article is reproducible thanks to the use of R, papaja, and knitr (Aust & Barth, 2015;
R Core Team, 2015; Xie, 2015). Raw data, supplementary explanations of the methods, and
supplementary analyses are also available (Cristia, 2018a).



Table 1. Characteristics of the ADS and CDS portions of the corpus, depending on whether the human or automatic utterance boundaries were considered.

                  Human                                           Automatic
        Sylls    Tokens   Types   MTTR   Utts          Sylls    Tokens   Types   MTTR   Utts
ADS    10,051     8,224   1,342   0.93   1,772        10,100     8,267   1,342   0.93   1,892
CDS    24,933    20,786   2,015   0.89   5,320        24,933    20,777   2,012   0.89   5,630

Note. Tokens differ for the Human versus Automatic versions because utterances where human coders (mistakenly) changed register within a continuation were dropped from the Human analyses. ADS = adult-directed speech; CDS = child-directed speech; Sylls = syllables; tokens and types refer to words; MTTR = moving average type-to-token ratio (over a sliding 10-word window); Utts = utterances.
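The MTTR reported above is a moving average type-to-token ratio over a sliding 10-word window. As a rough illustration of how such a measure can be computed (a minimal R sketch under our own naming; the function `mttr` and the decision to average window-wise ratios are illustrative assumptions, not the authors' scripts):

    # Moving-average type-to-token ratio: take the type/token ratio in every
    # 10-word window of the token sequence, then average those ratios.
    mttr <- function(tokens, window = 10) {
      n <- length(tokens)
      if (n < window) return(NA_real_)
      ratios <- vapply(seq_len(n - window + 1), function(i)
        length(unique(tokens[i:(i + window - 1)])) / window, numeric(1))
      mean(ratios)
    }

    # Repetitive speech yields lower values, e.g.:
    mttr(rep(c("where", "is", "the", "dog"), 5))  # only 4 types per window -> 0.4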

Corpus

The dataset consists of 104 recordings transcribed from the Winnipeg Corpus (Soderstrom,
Grauer, Dufault, & McDivitt, 2018; Soderstrom & Wittebolle, 2013; some of the recordings
are archived on homebank.talkbank.org; VanDam et al., 2016), gathered from 35 children
(19 boys), aged between 13 and 38 months, recorded using the LENA system1 at home (14
children), at a home daycare (6), or at a daycare center (13), with one more child recorded both at
home and at a home daycare. Soderstrom et al. (2018) report that, between 9 a.m. and 5 p.m., there
were 1–4 adults in home recordings (median across 5-min units: 1), 1–3 in home daycares
(median 1), and 1–5+ in daycare centers (median 2). Although the caregivers’ sex was not
systematically noted, a majority was female in all settings.

The first 15 min, one hr into the recording (min 60–75), were independently transcribed
by two undergraduate assistants, who resolved any disagreements by discussion. Transcription
was done at the lexical level, adapting the CHILDES minCHAT guidelines for transcription
(MacWhinney, 2009),2 without reproducing details of pronunciation (see Discussion). The
transcribers also coded whether an utterance was directed to the target child, another child,
an adult, or other, using content and context. Utterances directed to the target child constituted
the CDS corpus; those directed to an adult constituted the ADS corpus.

Although LENA’s utterance boundaries were mostly accurate, coders sometimes split a
single LENA segment into two utterances. Since LENA may miss boundaries, we always divided
segments following human coding. In addition, coders sometimes considered a sequence of
segments as continuations of each other (6% of CDS utterances and 7% of ADS utterances).

We derived several versions of the ADS and CDS subcorpora by crossing two factors (see
Table 1 for characteristics). First, we used the automatic utterance boundaries provided by the
LENA software (“A,” short for “automatic boundaries”), as well as combined together the text
from segments labeled as continuations of each other by coders (“H” for “human boundaries”).
Second, since performance is dependent on corpus size (see Bernard et al., 2018), we had
three versions of each CDS corpus: the full one, a shortened CDS corpus to match the ADS
corpus in number of words, and a shortened CDS corpus to match the ADS corpus in number of
utterances. After crossing these two factors, performance could be compared between, on the
one hand, ADS-A/H (ADS with automatic or human utterance boundaries), and, on the other
hand, one of (1) CDS-A/H-full (the corresponding full CDS corpus), (2) CDS-A/H-WM (cut at the
same number of word tokens found in the corresponding ADS), or (3) CDS-A/H-UM (cut at the
same number of utterances). As shown in the Results, these different boundary and matching
conditions only reinforce our main conclusion that CDS–ADS differences are very small.
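As a concrete illustration of the two matching schemes, here is a minimal R sketch (the function names are ours, and cutting from the start of the corpus is an illustrative simplification; `cds_utts` is assumed to be a character vector of space-delimited CDS utterances):

    # Utterance match (UM): keep only as many CDS utterances as there are ADS utterances.
    match_utts <- function(cds_utts, n_utts) head(cds_utts, n_utts)

    # Word match (WM): keep CDS utterances until the ADS word-token count is reached.
    match_words <- function(cds_utts, n_words) {
      words_per_utt <- lengths(strsplit(cds_utts, "\\s+"))
      cds_utts[cumsum(words_per_utt) <= n_words]
    }

    # With human boundaries, the ADS targets from Table 1 would be:
    # match_utts(cds_utts, 1772); match_words(cds_utts, 8224)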

1 The LENA Foundation built a hardware and software system to record and automatically analyze day-long child-centered recordings. For more information, see Soderstrom and Wittebolle (2013).

2 The transcription manual is available from https://osf.io/rvdbq/.



Processing and Evaluation

Scripts used for corpus preprocessing, phonologization, and segmentation are available
(Cristia, 2018b). During preprocessing, all extraneous codes (such as marks for overlapping
speech or lexical reference for unusual pronunciations) were removed, leaving only the or-
thographic representation of the adults’ speech. These were phonologized using the American
English voice of Festival Text-to-Speech (Taylor, Black, & Caley, 1998), which provides pho-
nemically based transcriptions, including syllable boundaries. These transcriptions emerge
mostly from dictionary lookup, but the system can also perform grapheme–phoneme con-
versions for neologisms, which are frequent in child-directed speech. Spaces between words
are removed from the resulting corpus to provide input to the algorithms. Each algorithm then
returns the corpus with spaces where word boundaries are hypothesized.
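To make the input and output format concrete, consider the “here we go” example used below (a small R illustration; the parse shown is invented, and the phone and syllable separators that the actual pipeline retains are omitted here):

    gold  <- "here we go"          # phonologized gold transcript, spaces at word boundaries
    input <- gsub(" ", "", gold)   # "herewego": the space-free input given to the algorithms
    parse <- "here wego"           # one possible algorithm output, with hypothesized boundaries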

Each algorithm (with default parameters, except as noted below) was run using the
WordSeg package (Bernard et al., 2018), which also performs the evaluation. Due to space
restrictions, we cannot provide fuller descriptions here, but we refer readers to Bernard et al.
(2018), where the algorithms and the evaluation are explained. In a nutshell, both training
and evaluation are done over the whole corpus because these algorithms are unsupervised,
and thus there is no risk of overfitting. In the case of incremental algorithms, performance was
calculated on an output corpus that represented the algorithm’s segmentation level in the last
20% of the data.

We provide pseudo-confidence intervals estimated as two standard deviations over
10 runs of resampling with replacement. That is, we created 10 versions of each corpus by
resampling children’s transcripts to achieve approximately the same number of utterances as
in the original. For example, in one of the runs, the ADS corpus may be composed of the data
from child 2’s day 1, 24’s day 3, 5’s day 1, and so on. We then extracted the standard deviation
in performance across resamples for each algorithm and corpus version.
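A minimal R sketch of this resampling scheme follows (names and details are illustrative assumptions: `transcripts` is taken to be a list with one character vector of utterances per child-day, and `fscore` a function returning the token F-score for a set of utterances):

    resample_scores <- function(transcripts, fscore, n_runs = 10) {
      target <- sum(lengths(transcripts))        # utterance count of the original corpus
      replicate(n_runs, {
        resample <- list()
        while (sum(lengths(resample)) < target)  # draw child-day transcripts with replacement
          resample <- c(resample, sample(transcripts, 1))
        fscore(unlist(resample))
      })
    }
    # pseudo-confidence interval: mean(scores) +/- 2 * sd(scores), plotted as error bars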

For comparability with previous work, we focus on lexical token F-scores, derived by
comparing the gold-standard version of the input against the parsed version returned by the
algorithm. Precision measures what proportion of the word tokens posited by a given algorithm
correspond to tokens found in the gold segmentation, while recall measures what proportion
of the gold word tokens were correctly segmented by the algorithm. For example, for the gold
phrase “here we go,” if an algorithm returns “here wego,” precision is .5 (one out of two
posited tokens is correct) and recall is .33 (one out of three gold words is correct). The overall
F-score ranges from 0 to 1, as it is the harmonic mean of precision P and recall R, namely,
2 × (P × R)/(P + R), which is multiplied by 100 and reported as a percentage here. Results for
all other possible alternative metrics, and further discussion of these methods, are provided
in the Supplemental Materials (Cristia, 2018a).
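The worked example above can be written out explicitly; the following is an illustrative R re-implementation that matches tokens by their character spans in the unspaced utterance (the scoring reported in this article was done by the WordSeg package, not by this sketch):

    token_scores <- function(gold, parse) {
      spans <- function(utt) {                   # start:end offsets of each token
        w <- strsplit(utt, " ", fixed = TRUE)[[1]]
        ends <- cumsum(nchar(w))
        paste(ends - nchar(w), ends, sep = ":")
      }
      hits <- length(intersect(spans(gold), spans(parse)))
      p <- hits / length(spans(parse))           # precision
      r <- hits / length(spans(gold))            # recall
      c(precision = p, recall = r, fscore = 2 * p * r / (p + r))
    }

    token_scores("here we go", "here wego")      # precision .5, recall .33, F-score .4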

Segmentation Algorithms

There were two variants for each of two popular sublexical algorithms. The first one, DiBS
(short for Diphone-Based Segmentation; Daland & Pierrehumbert, 2011), posits word boundaries
where phonotactic probabilities are low. The “gold” version (phonotactic-gold) sets the
diphone probability threshold based on gold word boundaries. The unsupervised version
(phonotactic-unsupervised) sets the threshold using utterance boundaries only. The phonotactics
were computed on the concatenation of the CDS and ADS versions of the corpus. The second
algorithm, labeled TP, posits boundaries using transition probabilities between syllables, as
proposed in Saffran, Aslin, and Newport (1996). The first version uses a relative dip in
probabilities (henceforth TP-relative). That is, given the syllable sequence WXYZ, a boundary is
posited between X and Y if the transition probability of X-Y is lower than that of W-X and
that of Y-Z. The second version uses the average transition probability over all pairs of syllables
in the corpus as the threshold (TP-average; Saksida, Langus, & Nespor, 2017).
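To illustrate the two thresholding schemes, here is a minimal R sketch (not the WordSeg implementation; utterances are assumed to be character vectors of syllables, and the treatment of utterance-edge transitions, which the WXYZ definition leaves open, is our own simplification):

    tp_segment <- function(utts, mode = c("relative", "average")) {
      mode <- match.arg(mode)
      # Estimate forward transition probabilities P(next syllable | current syllable).
      bigrams <- do.call(rbind, lapply(utts, function(u)
        if (length(u) > 1) cbind(head(u, -1), tail(u, -1)) else NULL))
      counts <- table(bigrams[, 1], bigrams[, 2])
      tp <- counts / rowSums(counts)
      utt_tp <- lapply(utts, function(u)
        if (length(u) > 1) mapply(function(x, y) tp[x, y], head(u, -1), tail(u, -1))
        else numeric(0))
      avg <- mean(unlist(utt_tp))                   # corpus-wide average transition probability
      mapply(function(u, p) {
        if (length(u) < 2) return(list(u))
        boundary <- if (mode == "average") {
          p < avg                                   # TP-average: below the corpus-wide mean
        } else {
          vapply(seq_along(p), function(i)          # TP-relative: a local dip in probability
            i > 1 && i < length(p) && p[i] < p[i - 1] && p[i] < p[i + 1], logical(1))
        }
        unname(split(u, cumsum(c(TRUE, boundary)))) # runs of syllables between boundaries = words
      }, utts, utt_tp, SIMPLIFY = FALSE)
    }

    # e.g., tp_segment(list(c("ba", "by", "dog"), c("the", "dog")), mode = "relative")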

Of the three lexical algorithms, two are variants on the Adaptor Grammar by
M. Johnson and Goldwater (2009). In this system, there is a set of generic rules, such as “a
word is a sequence of phonemes, an utterance is a sequence of words,” and the algorithm
further learns, based on the corpus, particular instances of these rules (“d + o + g is a word”) as
well as all of the rules’ probabilities. One variant relied on the simple rules just defined (lexical-
unigram). The other variant, which we call lexical-multigram, is based on a more complicated
rule set with hierarchically defined levels that are both smaller and larger than words (details
in the Supplemental Materials, Cristia, 2018a; M. Johnson, Christophe, Dupoux, & Demuth,
2014). The third lexical algorithm, lexical-incremental, implements a very different approach
(Monaghan & Christiansen, 2010). It processes the corpus one utterance at a time. For each, it
checks whether the utterance contains a subsequence that is in its long-term lexicon; if so, it
checks whether extracting that subsequence would result in phonotactically legal remainders
(with phonotactics derived from the lexicon). Otherwise, the whole utterance is stored in its
lexicon.

To these seven algorithms we add two baselines, introduced to provide segmentation
results for relatively uninformed strategies. One posits word boundaries at utterance edges
(henceforth base-utt). The other posits word boundaries at syllable edges (henceforth base-
syll). The latter is likely to be effective for English CDS, which has a very high proportion of
monosyllabic words (e.g., Swingley, 2005).
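Continuing the R sketches above (with utterances as character vectors of syllables, and words represented as runs of syllables), the two baselines are trivial to express:

    # base-utt: boundaries only at utterance edges, so each utterance is one "word".
    base_utt  <- function(utts) lapply(utts, function(u) list(u))

    # base-syll: a boundary at every syllable edge, so every syllable is a "word".
    base_syll <- function(utts) lapply(utts, as.list)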

RESULTS

Figure 1 illustrates token F-scores in CDS as a function of those in ADS, when using the full
corpora and the human-based utterance boundaries (for figures on all other conditions and
dependent variables, please see the Supplemental Materials; Cristia, 2018a). If CDS input is
easier to segment, then points should be above the 45-degree, equal-performance dotted line. This
is the case for most points. However, the median difference across registers (CDS minus ADS,
for each algorithm separately) was 3%, ranging from −2% to 8%. Moreover, for most points,
the pseudo-confidence intervals (defined as two times the standard deviation over 10 samples)
cross the equal-performance line, meaning that only for lexical-incremental, TP-relative, and
base-utt are differences above this measure of sampling error. Finally, Figure 1 conveys register
differences in the larger context: The greatest source of variation in performance is clearly due
to the different algorithms, with token F-scores for the full CDS corpus ranging from 10% to
75%. This 65% difference is much greater than the CDS–ADS differences (maximally 8%).

How stable are these differences as a function of utterance-boundary and size-matching
decisions? We looked at performance in various conditions, varying whether utterance
boundaries were purely automatic (which is less likely to reflect human-coder bias than human
utterance-boundary placement) and whether CDS and ADS were matched in length (since
several algorithms’ performance is affected by corpus size). Positive difference scores, indicative
of better CDS than ADS performance, were found in most matching conditions, regardless
of whether automatic or human utterance boundaries were used (Table 2).

Figure 1. Token F-score (in percentage) achieved by each algorithm in child-directed speech (CDS) as a function of that in adult-directed speech (ADS) in the full Winnipeg corpus with human-set utterance boundaries. Error bars indicate two standard deviations (over 10 resamples; see main text and Supplemental Materials, Cristia, 2018a, for details).


Table 2. CDS F-score minus ADS F-score (in percentages) by algorithm, type of match, and whether human (H) or automatic (A) utterance boundaries were considered.

Algo                        H: full   H: UM   H: WM   A: full   A: UM   A: WM   Median
base-utt                       2.9      1.4     1.5      3.3      1.8     1.9     1.85
base-syll                      1.3     −0.2     0        1.2     −0.2    −0.3    −0.1
phonotactic-unsupervised      −1.6     −2.5    −2.5     −1.3     −2.3    −2.4    −2.35
phonotactic-gold               3        2.6     2.8      3.1      2.8     2.8     2.8
TP-relative                    5        0.7     0.9      5        0.4     0.7     0.8
TP-average                    −1.6     −2.9    −2.8     −1.5     −2.9    −3.1    −2.85
lexical-incremental            7.9     −0.6     1.6      7.1      1.2     2.3     1.95
lexical-unigram                3.8      2.9     3.6      2.8      2.3     2.4     2.85
lexical-multigram              3.4      1.9     0.2      3.1     −0.8     0.4     1.15
Median                         3        0.7     0.9      3.1      0.4     0.7     1.15

Note. Full means the full child-directed speech (CDS) corpus was used; UM = utterance match: CDS corpus shortened to have as many utterances as the adult-directed speech (ADS) corpus; WM = word match: idem for words.


However, phonotactic-unsupervised and TP-average showed a consistent CDS disadvantage in all
boundary and matching conditions. Moreover, the difference between CDS and ADS was reduced
when considering automatic rather than human utterance boundaries, and length-matched
CDS rather than the full CDS.

In short, we observe smaller CDS advantages than those found in previous work. To
check whether this was due to algorithms or corpora, we applied our extensive suite of
algorithms to the Bernstein Ratner corpus (analyzed by Batchelder, 2002). The results showed
a more consistent and larger CDS advantage than in the Winnipeg corpus (median of 6%,
range −2–17%; see details in the Supplemental Materials; Cristia, 2018a).

DISCUSSION

Previous computational work using laboratory-based CDS and ADS corpora has documented
an impressive CDS advantage in segmentability (15% in Batchelder, 2002, although reduced
to 6% when more varied algorithms are considered; 10% in Ludusan et al., 2017). However,
when applying these diverse segmentation algorithms to an ecological CDS–ADS corpus, the
evidence of increased segmentability for CDS over ADS was less compelling. The CDS advantage
was numerically small (median of 3%), and often within the margin of error estimated
via resampling (1–6%). These conclusions were based on the full CDS and ADS corpora, with
human-coded utterance boundaries, where the CDS performance was based on twice the input
and potentially biased utterance-boundary decisions. The CDS advantage was even smaller
when considering length-matched corpora with automatic utterance boundaries (medians of
0.4–0.7%).

A key strength of the present work lies in the use of a unique corpus, in which both CDS
and ADS were collected from the children’s everyday input. It is unlikely that the difference
in conclusions drawn by previous authors and those we draw is due to corpus size or child
age (see Table 3; note also that we and previous authors all considered corpus size differences
in some analyses). Rather, the most salient difference is the setting of the recording, which in
our case is at home or in the daycare, and the fact that our ADS arises naturally in this context,
rather than in an interview-like situation with an experimenter. By sampling from the home and
two types of daycare environments, the CDS is likely to represent a wide range of interactions
between children and a variety of caregivers, and the ADS captures speech between colleagues
(e.g., professional carers in the daycares), partners (e.g., mother and father in the home), and
other adult relationships (e.g., visitor, delivery person, interlocutor over the phone). Note that
our ADS is only representative of the ADS present in infants’ input rather than of all ADS styles
(from presidential speeches to intimate bedside conversations).

Table 3. Characteristics of ADS and CDS studied in past and present work.

Corpus             Addressee(s)              Tokens    Types   Utterances
Bernstein Ratner   Experimenter              19,753    1,797       2,668
                   Children 9–27 months      30,996    1,501       8,252
Riken              Experimenter              22,844    2,022       3,582
                   Children 18–24 months     51,315    2,850      14,570
Winnipeg           Adults                     8,224    1,342       1,772
                   Children 13–38 months     20,786    2,015       5,320

Note. ADS = adult-directed speech; CDS = child-directed speech; MTTR for the Bernstein Ratner ADS was .93; CDS .88.


Another interesting feature of the Winnipeg corpus is that its automatic annotation contains utterance boundaries defined
bottom-up (using talker switches or lengthy pauses). These features lead us to argue that our
results represent well the input naturally available to English-learning Canadian children, and,
in this input, CDS and ADS do not differ greatly in word segmentation learnability. Our results
are compatible with the hypothesis proposed by Benders (2013), among others, whereby CDS
is shaped less by the caregivers’ attempt to specifically promote language acquisition than by
other potential functions (such as communicating affect).

Another strength of this work is that we employed multiple word segmentation
algorithms. This is important not only because results change even as minor parameters are set but
also because there is no clear evidence as to which algorithm infants use. Children may
even take advantage of diverse procedures depending on context and previous experience, for
example, using transition probabilities when nothing else is available (Saffran et al., 1996) and
utilizing their budding lexicon when probabilities are less clear (Mersad & Nazzi, 2012).
Increasing the diversity of algorithms allows us to revise Ludusan and colleagues’ (2017)
conclusion that there may be greater CDS advantages when using local cues (which perform worse
overall, at about 30% token F-score in the Riken corpus) rather than lexical algorithms (with
performance at about 50%). In contrast, we find that sublexical algorithms can lead to poor
or good performance depending on their parametrization (compare phonotactic-gold versus
phonotactic-unsupervised; TP-average versus TP-relative; base-utt versus base-syll). Further,
we do not see larger CDS advantages for better performing or lexical algorithms compared to
worse performing or sublexical algorithms. In fact, we see divergences even within two versions
of the same algorithm, with, for example, phonotactic-gold and TP-relative leading to a CDS
advantage, whereas phonotactic-unsupervised and TP-average lead to a CDS disadvantage.

We see two promising paths that future computational work should take. First, even
though our algorithms covered a wide range of hypotheses regarding early word segmentation,
they may differ in critical ways from the algorithms and input used by infants. For example,
words here were systematically attributed a pronunciation from a dictionary, and thus did not
capture the possible application of phonological rules and other sources of variation that cause
a single underlying word to have many different surface forms (see Buckler, Goy, & Johnson,
2018, on how phonetic variability plays out differently in CDS versus ADS; and Elsner, Goldwater, Feldman,
& Wood, 2013, for a possible incorporation of phonetic variability in segmentation
algorithms). Such variability will most greatly affect the discovery of paradigms (i.e., figuring out
that “what is that” can also be pronounced “whaz that”), and not necessarily the segmentation of
word forms. Therefore, it would be most interesting to study it in the context of morphological
discovery rather than only segmentation. Eventually, we may want to test algorithms that
operate directly on the acoustic representation (Ludusan et al., 2014; Versteegh et al., 2015).

Second, we studied only North American English. We look forward to extending the
current approach to ecologically valid databases in additional, typologically diverse languages,
although none containing both CDS and ADS is currently available, and therefore a priority in
future research should be to build larger, matched, multilingual corpora. We predict that
segmentation scores are lower in languages where word and syllable boundaries are less well aligned
than in English (Loukatou, Stoll, Blasi, & Cristia, 2018), but that, regardless of overall performance
levels, there will be little or no learnability advantage for CDS over ADS in segmentation:
North American English has been described as having more marked CDS–ADS differences than
other languages (e.g., Japanese; Fernald et al., 1989). Therefore, one might expect the greatest
learnability advantages to be found in North American English, suggesting that cross-linguistic
work is even less likely to find results supporting a segmentation advantage for CDS.


To conclude, we found that advantages in segmentability for CDS over ADS in an ecological
corpus were smaller and more inconsistent than previous estimates based on laboratory
CDS–ADS. Overall, our word segmentation results align with other work on sound discriminability
(Martin et al., 2015) and word discriminability (Guevara-Rukoz et al., 2018), suggesting
that the high learnability attributed to CDS may have been overestimated. Research
assessing the learnability properties of child-directed speech at other levels (e.g., syntax) would
benefit from using similarly natural corpora, as well as a variety of algorithmic approaches.

ACKNOWLEDGMENTS

We are grateful to Mark Johnson, Robert Daland, and Amanda Saksida for helpful discussions
and comments on previous versions of this manuscript; and to members of the LAAC, CoML,
and Language teams at the LSCP for helpful discussion.

FUNDING INFORMATION

AC acknowledges financial support from Agence Nationale de la Recherche (ANR-14-CE30-
0003 MechELex); ED from European Research Council (ERC-2011-AdG-295810 BOOTPHON),
the Fondation de France, the Ecole de Neurosciences de Paris, the Region Ile de France (DIM
cerveau et pensée); MS from SSHRC (Insight Development Grant 430-2011-0459, and Insight
Grant 435-2015-0628). AC and ED acknowledge the institutional support of Agence Nationale
de la Recherche (ANR-17-EURE-0017).

AUTHOR CONTRIBUTIONS

AC: Conceptualization: Lead; Data curation: Lead; Formal analysis: Lead; Funding acquisition: Lead; Methodology: Lead; Project administration: Lead; Resources: Lead; Software: Lead; Validation: Lead; Visualization: Lead; Writing – original draft: Lead; Writing – review & editing: Lead. ED: Conceptualization: Supporting; Formal analysis: Supporting; Methodology: Supporting; Software: Supporting; Visualization: Supporting; Writing – review & editing: Supporting. NBR: Conceptualization: Supporting; Resources: Supporting; Writing – review & editing: Supporting. MS: Conceptualization: Supporting; Methodology: Supporting; Resources: Lead; Validation: Supporting; Visualization: Supporting; Writing – review & editing: Supporting.


REFERENCES

Aust, F., & Barth, M. (2015). Papaja: Create APA manuscripts with RMarkdown. Retrieved from https://github.com/crsh/papaja

Batchelder, E. O. (1997). Computational evidence for the use of frequency information in discovery of the infant’s first lexicon (Unpublished doctoral dissertation). New York: The City University of New York.

Batchelder, E. O. (2002). Bootstrapping the lexicon: A computational model of infant speech segmentation. Cognition, 83, 167–206.

Benders, T. (2013). Mommy is only happy! Dutch mothers’ realisation of speech sounds in infant-directed speech expresses emotion, not didactic intent. Infant Behavior and Development, 36, 847–862.

Bernard, M., Thiolliere, R., Saksida, A., Loukatou, G., Larsen, E., Johnson, M., . . . Cristia, A. (2018). WordSeg: Standardizing unsupervised word form segmentation from text. Preprint. Retrieved from https://osf.io/5qkm3/

Bernstein Ratner, N. (1984). Patterns of vowel modification in mother–child speech. Journal of Child Language, 11, 557–578.

Bernstein Ratner, N., & Rooney, B. (2001). How accessible is the lexicon in Motherese? Language Acquisition and Language Disorders, 23, 71–78.

Buckler, H., Goy, H., & Johnson, E. K. (2018). What infant-directed speech tells us about the development of compensation for assimilation. Journal of Phonetics, 66, 45–62.

Cristia, A. (2013). Input to language: The phonetics and perception of infant-directed speech. Language and Linguistics Compass, 7(3), 157–170.

Cristia, A. (2018a, April 18). Segmentability differences between child-directed and adult-directed speech: A systematic test with an ecologically valid corpus. Open Mind: Discoveries in Cognitive Science, 3, 13–22. Retrieved from https://osf.io/th75g/

Cristia, A. (2018b). Segmentation recipes for CDS versus ADS in the Winnipeg corpus. Computer code. Retrieved from https://github.com/alecristia/seg_cds_ads_winnipeg


Daland, R., & Pierrehumbert, J. B. (2011). Learning diphone-based segmentation. Cognitive Science, 35, 119–155.

Elsner, M., Goldwater, S., Feldman, N., & Wood, F. (2013). A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proceedings of Empirical Methods in Natural Language Processing (pp. 42–54). Seattle, WA.

Fernald, A. (2000). Speech to infants as hyperspeech: Knowledge-driven processes in early word recognition. Phonetica, 57, 242–254.

Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants. Journal of Child Language, 16, 477–501.

Fourtassi, A., Borschinger, B., Johnson, M., & Dupoux, E. (2013). Why is English so easy to segment? In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (pp. 1–10). Sofia, Bulgaria.

Guevara-Rukoz, A., Cristia, A., Ludusan, B., Thiollière, R., Martin, A., Mazuka, R., & Dupoux, E. (2018). Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation. Cognitive Science, 42, 1586–1617.

Johnson, E. K., Lahey, M., Ernestus, M., & Cutler, A. (2013). A multimodal corpus of speech to infant and adult listeners. The Journal of the Acoustical Society of America, 134, EL534–EL540.

Johnson, M., Christophe, A., Dupoux, E., & Demuth, K. (2014, June). Modelling function words improves unsupervised word segmentation. In Proceedings of the Annual Conference of the Association for Computational Linguistics (pp. 282–292). Baltimore, MD.

Johnson, M., & Goldwater, S. (2009). Improving nonparameteric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of the Annual Conference of the Association for Computational Linguistics (pp. 317–325). Suntec, Singapore.

Loukatou, G., Stoll, S., Blasi, D., & Cristia, A. (2018, May). Modeling infant segmentation of two morphologically diverse languages. In V. Claveau & P. Sébillot (Eds.), Actes de la conférence Traitement Automatique de la Langue Naturelle, TALN 2018. Volume 1: Articles longs, articles courts de TALN (pp. 47–57). Rennes, France.

Ludusan, B., Mazuka, R., Bernard, M., Cristia, A., & Dupoux, E. (2017). The role of prosody and speech register in word segmentation: A computational modelling perspective. In Proceedings of the Annual Conference of the Association for Computational Linguistics (Volume 2: Short papers) (pp. 178–183). Vancouver, Canada.

Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.-N., Johnson, M., & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: An evaluation toolbox for term discovery systems. In Proceedings of the Language Resources and Evaluation Conference (pp. 560–576). Reykjavik, Iceland.

MacWhinney, B. (2009). The CHILDES project part 1: The CHAT transcription format. New York: Psychology Press.

MacWhinney, B. (2014). The CHILDES project part II: The database. New York: Psychology Press.

Martin, A., Schatz, T., Versteegh, M., Miyazawa, K., Mazuka, R., Dupoux, E., & Cristia, A. (2015). Mothers speak less clearly to infants than to adults: A comprehensive test of the hyperarticulation hypothesis. Psychological Science, 26, 341–347.

Mazuka, R., Igarashi, Y., & Nishikawa, K. (2006). Input for learning Japanese: RIKEN Japanese Mother-Infant Conversation Corpus. Technical Report of IEICE, TL2006-16, 106(165), 11–15.

Mersad, K., & Nazzi, T. (2012). When Mommy comes to the rescue of statistics: Infants combine top-down and bottom-up cues to segment speech. Language Learning and Development, 8, 303–315.

Monaghan, P., & Christiansen, M. H. (2010). Words in puddles of sound: Modelling psycholinguistic effects in speech segmentation. Journal of Child Language, 37, 545–564.

Pitt, M. A., Johnson, K., Hume, E., Kiesling, S., & Raymond, W. (2005). The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45, 89–95.

R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.

Saksida, A., Langus, A., & Nespor, M. (2017). Co-occurrence statistics as a language-dependent cue for speech segmentation. Developmental Science, 20(3). doi:10.1111/desc.12390

Shneidman, L. A., & Goldin-Meadow, S. (2012). Language input and acquisition in a Mayan village: How important is directed speech? Developmental Science, 15, 659–673.

Soderstrom, M. (2007). Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants. Developmental Review, 27, 501–532.

Soderstrom, M., Grauer, E., Dufault, B., & McDivitt, K. (2018). Influences of number of adults and adult:child ratios on the quantity of adult language input across childcare settings. First Language, 38, 563–581.

Soderstrom, M., & Wittebolle, K. (2013). When do caregivers talk? The influences of activity and time of day on caregiver speech and child vocalizations in two childcare environments. PLoS ONE, 8(11), e80646.

Swingley, D. (2005). Statistical clustering and the contents of the infant vocabulary. Cognitive Psychology, 50, 86–132.

Swingley, D., & Humphrey, C. (2018). Quantitative linguistic predictors of infants’ learning of specific English words. Child Development, 89, 1247–1267.

Taylor, P., Black, A. W., & Caley, R. (1998, November). The architecture of the FESTIVAL speech synthesis system. In Proceedings of the 3rd European Speech Communication Association Workshop on Speech Synthesis (pp. 147–151). Jenolan Caves, Australia.

Thiessen, E. D., Hill, E., & Saffran, J. R. (2005). Infant-directed speech facilitates word segmentation. Infancy, 7, 53–71.

VanDam, M., Warlaumont, A. S., Bergelson, E., Cristia, A., Soderstrom, M., De Palma, P., & MacWhinney, B. (2016). HomeBank: An online repository of daylong child-centered audio recordings. Seminars in Speech and Language, 37(2), 128.

Versteegh, M., Thiollière, R., Schatz, T., Cao, X.-N., Anguera, X., Jansen, A., & Dupoux, E. (2015, September). The Zero Resource Speech Challenge 2015. In Proceedings of Interspeech (pp. 3169–3173). Dresden, Germany.

Xie, Y. (2015). Dynamic documents with R and knitr (2nd ed.). Retrieved from http://yihui.name/knitr/
