REPORT

SAYCam: A Large, Longitudinal Audiovisual
Dataset Recorded From the Infant's Perspective

Jessica Sullivan1, Michelle Mei1, Andrew Perfors2, Erica Wojcik1, and Michael C. Frank3

an open access journal

Keywords: headcam, first-person video, child development

1Skidmore College

2University of Melbourne

3Stanford University

ABSTRACT

We introduce a new resource: the SAYCam corpus. Infants aged 6–32 months wore a
head-mounted camera for approximately 2 hr per week, over the course of approximately
two-and-a-half years. The result is a large, naturalistic, longitudinal dataset of infant- and
child-perspective videos. Over 200,000 words of naturalistic speech have already been
transcribed. Similarly, the dataset is searchable using a number of criteria (e.g., age of
participant, location, setting, objects present). The resulting dataset will be of broad use to
psychologists, linguists, and computer scientists.

INTRODUCTION

From the roots of children’s language learning to their experiences with objects and faces, nat-
uralistic data about children’s home environment provides an important constraint on theories
of development (Fausey et al., 2016; MacWhinney, 2000; Oller et al., 2010). By analyzing the
information available to children, researchers can make arguments about the nature of chil-
dren's innate endowment and their learning mechanisms (Brown, 1973; Rozin, 2001). Further,
children's learning environments are an important source of individual variation between
children (Fernald et al., 2012; Hart & Risley, 1992; Sperry et al., 2019); hence, characterizing these
environments is an important step in developing interventions to enhance or alter children’s
learning outcomes.

When datasets—for example, the transcripts stored in the Child Language Data Exchange
System (MacWhinney, 2000)—are shared openly, they form a resource for the validation of
theories and the exploration of new questions (Sanchez et al., 2019). Further, while tabular and
transcript data were initially the most prevalent formats for data sharing, advances in storage
and computation have made it increasingly easy to share video data and associated metadata.
For developmentalists, one major advance is the use of Databrary, a system for sharing devel-
opmental video data that allows fine-grained access control and storage of tabular metadata
(Gilmore & Adolph, 2017). The advent of this system allows for easy sharing of video related
to children’s visual experience. Such videos are an important resource for understanding chil-
dren’s perceptual, conceptual, linguistic, and social development. Because of their richness,
such videos can be reused across many studies, making the creation of open video datasets a
high-value enterprise.

Citation: Sullivan, J., Mei, M., Perfors, A., Wojcik, E., & Frank, M. C. (2021). SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded From the Infant's Perspective. Open Mind: Discoveries in Cognitive Science, 5, 20–29. https://doi.org/10.1162/opmi_a_00039

DOI:
https://doi.org/10.1162/opmi_a_00039

Supplemental Materials:
https://nyu.databrary.org/volume/564

Received: 19 July 2020
Accepted: 11 February 2021

Conflict of Interest: The authors declare no conflict of interest.

Corresponding Author:
Jessica Sullivan
jsulliv1@skidmore.edu

Copyright: © 2021 Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The MIT Press


One especially promising method for characterizing children’s visual experience is the
head-mounted camera (Aslin, 2012; Franchak et al., 2011), a lightweight camera or eye-tracker
that is worn by the child, often mounted on a hat or harness system. A “headcam” allows access
to information from the perspective of the child–albeit with some differences in view angle,
resolution, and orienting latency (Pusiol et al., 2014; Smith et al., 2015). Headcam data have
been used in recent years to understand a variety of questions about children’s visual input,
including the prevalence of social signals (Fausey et al., 2016), how children’s bodies and
hands shape their attention (Bambach et al., 2016), how children interact with adults (Yu &
Smith, 2013), which objects are in their visual field (Cicchino et al., 2011), and how their
motor development shapes their visual input (Kretch et al., 2014; Sanchez et al., 2019). These
datasets also let researchers access some of the "source data" for children's generalizations
about object identity or early word learning (Clerkin et al., 2017). Data from headcams are
also an important dataset for studies of unsupervised learning in computer vision (Bambach
et al., 2016). Yet there are relatively few headcam datasets available publicly for reuse, and
those that exist almost exclusively report cross-sectional, in-lab data (Franchak et al., 2018;
Sanchez et al., 2018), or sample from one developmental time point (Bergelson, 2017).

The current project attempts to fill this gap by describing a new, openly accessible dataset
of more than 415 hours of naturalistic, longitudinal recordings from three children. The SAYCam
corpus contains longitudinal videos of approximately two hr per week for three children,
spanning from approximately six months to two and a half years of age. The data include
unstructured interactions in a variety of contexts, both indoor and outdoor, as well as a variety of
individuals and animals. The data also include structured annotations of context together with
full transcripts for a subsample (described below), and are accompanied by monthly parent
reports on vocabulary and developmental status. Together, these data present the densest look
into the visual experience of individual children currently available.

METHOD

Participants

Three families participated in recording head-camera footage (see Table 1). Two families lived
in the United States during recording, while one family lived in Australia. All three families
spoke English exclusively. In all three families, the mother was a psychologist. All three children
had no siblings during the recording window. This research was approved by the institutional
review boards at Stanford University and Skidmore College. All individuals whose faces appear
in the dataset provided verbal assent, and, when possible, written consent. All videos have been
screened for ethical content.

Sam is the child of the family who lived in Australia. He wore the headcam from
6 months to 30 months of age. The family owned two cats and lived in a semirural neigh-
borhood approximately 20 miles from the capital city of Adelaide. Sam was diagnosed with

Table 1. Participant information

Participant   Location                                                          Age at first recording (months)   Age at last recording (months)
Sam (S)       Adelaide, Australia                                               6                                 30
Alice (A)     San Diego, California, USA and Saratoga Springs, New York, USA    8                                 31
Asa (Y)       Saratoga Springs, New York, USA                                   7                                 24


autism spectrum disorder at age 3; as of this writing (at age 7), Sam is fully mainstreamed, has
friends, and does not require any special support.

Alice is the child of a family who lived in the United States. She wore the headcam from
8 months to 31 months of age. The family also owned two cats throughout the recording period.
For the first half of the recordings, the family lived on a rural farm and kept other animals on or
near their property, including chickens; the family frequently visited a major city (San Diego).
For the second half of the recordings, the family lived in the suburbs in New York State.

Asa is the child of a family who lived in the United States. He wore the headcam from
7 months to 24 months of age. Data collection for this child terminated prematurely due to
the onset of the COVID-19 pandemic and the birth of a younger sibling. The family owns no
pets, and lives in the suburbs in New York State.

Materials

Parents completed the MacArthur-Bates Communicative Development Inventories (MCDI) en-
ventory each month (Fenson et al., 2007), as well as the Ages and Stages Questionnaire (escuderos
et al., 1997) during the time periods required by the instrument. These materials are included
in the corpus, and can be used to identify the age at which each participant acquired particular
palabras, particular motor capacities, and particular social and cognitive capacities. These data
may be useful for characterizing the children (p.ej., for their representativeness), but also for
identifying particular videos (described below) recorded immediately before or after particu-
lar developmental milestones (p.ej., for finding videos from before a child learned a particular
palabra, or became able to stand independently).
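For researchers who want to script this kind of lookup, the sketch below shows one way it might be done, assuming the monthly MCDI forms are exported as spreadsheets with one row per checklist word; the file name and column names ("word", "produces", "age_months") are hypothetical and would need to be matched to the actual files on Databrary.

# Hypothetical sketch (not part of the released tooling): given monthly MCDI
# spreadsheets with one row per word, find the first age (in months) at which
# a word was reported as produced.
import pandas as pd

def first_production_age(mcdi_files, target_word):
    """Return the earliest age (months) at which target_word is marked as produced."""
    ages = []
    for path in mcdi_files:
        df = pd.read_excel(path)  # one monthly MCDI checklist per file
        row = df[df["word"].str.lower() == target_word.lower()]
        if not row.empty and bool(row["produces"].iloc[0]):
            ages.append(int(row["age_months"].iloc[0]))
    return min(ages) if ages else None

# e.g., first_production_age(["mcdi_08mo.xlsx", "mcdi_09mo.xlsx"], "cat")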

Videos were recorded using a Veho head-mounted camera on a custom mounting headband
(Figure 1). Each Veho MUVI Pro micro DV camcorder was equipped with a magnetic mount
that allowed the attachment of a wide-angle fisheye lens to broaden the MUVI Pro's native
47- by 36-degree viewing angle to 109 by 70 degrees. We selected the Veho MUVI camera
after testing several headcams in-lab and determining that it provided the largest vertical
visual angle and had physical dimensions that allowed for comfortable mounting near the
center of the forehead (giving a perspective that would be most similar to the child's; see Long
et al., 2020, for a comparison of headcams). According to the specifications provided at purchase,
the camera captures video at a resolution of 480p and at up to 30 frames per second, although
in practice the frame rate sometimes dipped to approximately 20 frames per second. Audio
quality from the cameras was highly variable, and in some cases the camera's default audio
recordings are low quality.

On occasions where a child persistently refused to wear the headcam, the camera was
used as a normal camera (i.e., placed nearby and carried to new locations if necessary).
A spreadsheet detailing which videos are third person accompanies our data on Databrary
(https://nyu.databrary.org/volume/564/slot/47832/edit?asset=273993).

PROCEDURE

Videos were recorded naturalistically, in and around the homes, cars, neighborhoods, and
workplaces where the child spent time. The intention was to record videos twice per week:
once at a fixed time and once at a randomly chosen time. Given the practical constraints of
a multiyear project like this one, at times there were deviations from the planned procedure.
For example, in the Alice dataset, there were some weeks where both recording times were
determined pseudo-randomly because the fixed recording time was not feasible.


Figure 1. Participant (7 months old) wearing Veho camera with fisheye lens.

Similarly, in the Sam dataset, scheduling constraints occasionally meant that there was less
variation in the timing of recordings than one might otherwise expect. In the Asa dataset, the
pseudo-random time often occurred at a fixed time due to feasibility issues (e.g., attendance at
day care), and occasional weeks were skipped due to travel or other logistical barriers; additionally,
most recordings for the Asa dataset are sampled from between 7 a.m. and 10 a.m. Technical
issues, such as memory card or camera failure, although infrequent, resulted in some skipped
recording sessions.

When recording, we obtained written consent or verbal assent (in accordance with institutional
review board requirements) for participants whose faces and voices were featured in the videos.
Videos for which we were unable to secure assent/consent from people who were potentially
identifiable are omitted from the dataset.

Each recording session lasted until the battery on the Veho camera failed, or once
90 min had elapsed. In most cases, this resulted in recording sessions that lasted approximately
60 to 80 min. Each recording session typically resulted in multiple video files, with a
maximum single-file duration of 30 min.

Coding

One goal of our coding efforts was to tag each video with its contents, so that researchers
could search for content relevant to their research interests. To do this, videos were skimmed at
an effective 10x-speed playback and visually assessed for locations and location changes.
Coding occurred by playing each video in VLC (VideoLAN Client) and coding using custom
spreadsheets with macros. Coders were highly trained and deeply familiar with the dataset.

The following pieces of information were coded for each video. First, the coder briefly
described the contents of the video (e.g., "Picked grass to play with during walk"; "On floor
looking at mirror with mom"; "watching parents in kitchen"). The coder also recorded the
location (e.g., hallway, bedroom, kitchen, outdoors, car, …), the activity or activities (e.g., being
held, walking, cooking, drinking, eating, listening to music, …), the individual(s) present (e.g.,
child, dad, mom, grandparent, animal, …), the body parts visible (e.g., arms/hands, legs/feet,
head/face), and the most salient objects (e.g., flora, drink, tool, food, bed, book, …). See Table 2
for a full list of coding classifications. Criteria for each coding decision are outlined in our Read
Me document on Databrary (https://nyu.databrary.org/volume/564/slot/47832/-?asset=254904).
The coder also noted any potential ethical concerns for further screening, and any additional
notes.

Every time there was a significant change in location or in activity, we coded each component
of the video anew. For example, if the child moved from the bedroom to the bathroom,
we coded all elements of the video (e.g., objects, activity, body parts, individuals) once for the
bedroom and once for the bathroom.
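As a concrete illustration of the resulting structure, the sketch below shows one way a single coded segment could be represented programmatically; the field names are ours, chosen to mirror the categories above, and do not correspond to the headers of the released coding spreadsheets.

# Illustrative sketch only: an in-memory representation of one coded segment.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CodedSegment:
    video_id: str                  # which video file this segment belongs to
    description: str               # brief free-text summary of the segment
    location: str                  # e.g., "bedroom", "kitchen", "car"
    activities: List[str] = field(default_factory=list)   # e.g., ["being held", "eating"]
    individuals: List[str] = field(default_factory=list)  # e.g., ["mom", "cat"]
    body_parts: List[str] = field(default_factory=list)   # e.g., ["arms/hands"]
    objects: List[str] = field(default_factory=list)      # e.g., ["food", "book"]
    notes: str = ""                # ethical flags or other coder notes

# A location or activity change starts a new segment, so a single video can
# contribute several records (e.g., one for the bedroom, one for the bathroom).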

The intention for coding videos was to allow other researchers to identify videos of
interest, not for analysis of the codes themselves. While such an analysis may be both possible
and extremely interesting in the future, we encourage caution at this point for several reasons.
First, relative frequencies of activities cannot be accurately computed until the entire dataset is
coded. Second, while our coding scheme remained stable throughout the majority of coding,
there are a few items (e.g., the location "Laundry Room," the item "cream," the activity "patting")
that were added to our coding scheme after coding began. Their dates of addition are
described in the Read Me file. Finally, some objects, activities, or locations may appear briefly
in particular videos but not be tagged (e.g., if the child is in the bedroom, throws a ball, and
briefly looks into the hallway to see if the ball landed there, only "location: bedroom" may be
tagged). The decision of whether or not to tag a particular entity within the video was subjective,
and was guided by the intention of coding the videos in a way that would help future
researchers identify videos useful for their particular research projects.
For example, in the current dataset, the tag "Object: Clothing" labels 114 videos (i.e., the
videos where clothing was a salient feature of the child's experience), and researchers who are
interested in children's interactions with clothing will find at least one salient child–clothing
interaction in each of those videos. However, the absence of the "Object: Clothing" tag for the
remaining videos certainly does not imply that everyone was naked.

Researchers interested in contributing to the ongoing coding effort should contact the
corresponding author.

Transcription

One goal of this project was to transcribe all of the utterances in each video. Videos were
assigned to college students and transcribed. Interested readers can contribute to the ongoing
transcription effort by contacting the corresponding author; our original training materials are
provided here: https://nyu.databrary.org/volume/564/slot/47832/-?asset=254904.


Table 2. Locations, Activities, Living Things, and Objects that are searchable in our corpus

Locations: Bathroom, Bedroom, California Room, Car, Closet, Deck/Porch, Garage, Hallway, Kitchen, Laundry Room, Living Room, Off Property, Office, Outside On Property, Piano Room, Many Locations, Stairway

Activities: Being Held, Cleaning, Cooking, Conversing, Crawling, Crying, Drawing, Drinking, Eating, Examining, Exploring, Gardening, Getting Changed/Dressed, Getting Parental Attention, Imitating, Listening to Music, Lying Down, Nursing, Overhearing Speech, Painting, Playing Music, Preparing for Outing, Reading, Running, Sitting, Standing, Taking a Walk, Tidying, Walking, Watching

Living Things: Bird(s), Cat (with specific cat), Dad, Dog, Fish, Grandma, Grandpa, Horse, Mom, Other Adult, Other Child, Participant

Objects: Appliance, Bag, Bed, Book, Building, Car, Chair, Clothing, Computer, Container, Cream, Crib, Diapers/Wipes/Potty, Doll/Stuffed Toy, Door, Drawing, Drawing/Writing Implements, Drink, Flora, Food, Improvised Toy, Laundry Machine, Linen/Cloths, Mirror, Musical Instrument, Outdoors, Phone, Picture, Play Equipment, Puzzle, Table, Tool, Toy, Wagon, Window

Body Parts: Present/Absent, Head/Face, Body, Arms/Hands, Legs/Feet, Other, Phone*


Note. In some cases, individuals were present in the dataset via phone and therefore were not embodied; for those cases, we
indicated the individual's presence via "Phone" in the "Body Parts" section.


Transcribers noted the timestamp for the onset of each utterance, the contents of the utterance
itself, the speaker, any notable, significant, or contextually important actions, and any objects
that were relevant, visually salient, or interacted with. Transcribers were instructed to record
any utterances they heard verbatim. A new entry was created every time there was a change in
speaker, a meaningful pause in conversation, or a meaningful shift between speaking turns.
Each entry begins by recording the time in the video at which the utterance in question begins.
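To make the entry format concrete, here is a small invented example of what two consecutive entries might look like; the exact column layout of the transcription spreadsheets may differ.

# Illustrative example of the entry structure described above (values invented):
# each entry records the utterance onset time, the speaker, the verbatim
# utterance, and any salient actions or objects.
example_entries = [
    {"onset": "00:03:12", "speaker": "Mom", "utterance": "Do you see the cat?",
     "actions_objects": "points to cat on couch"},
    {"onset": "00:03:20", "speaker": "Child", "utterance": "Cat!",
     "actions_objects": "reaches toward cat"},
]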

RESULTS

There are over 1,800 videos from more than 460 unique recording sessions, yielding well over
500 hr of footage. Currently, over 235,800 words have been transcribed, representing 16.5%
of the present dataset; 28.3% of the videos have been coded for location, activity, living things,
body parts, and objects and are now searchable.

Researchers can access our dataset, which contains AVI and MP4 videos, PDFs, and Excel
sheets representing our data, at the Supplementary Materials online, and can search for relevant
videos here: https://skidmoreheadcam.shinyapps.io/SAYcam-TagSearch/. For example, a
researcher interested in studying the auditory input surrounding infant nursing could visit our
search tool and search for segments containing nursing. Currently, this would yield 53 videos;
the output table automatically includes a link to the relevant video and a link to a Google Sheet
containing any available transcriptions. The researcher could then choose to analyze those 53
videos, or to further restrict their search (e.g., by age, other individuals present, or other activities
surrounding the event). Or, suppose a researcher was interested in a child's exposure to cats
prior to acquiring the word "cat." The researcher would then access the child's MCDI vocabulary
measures on Databrary in order to identify the month at which the child acquired the word
"cat." They could then use our search tool to search for videos from that child from before the
child learned the word, and access those videos via the provided Databrary link.
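For researchers who prefer to work with exported metadata rather than the web interface, a filtering workflow along these lines is possible; the CSV file name and column names below are placeholders for whatever schema the downloaded coding sheets actually use, and the age threshold is an invented value standing in for the MCDI lookup described under Materials.

# Hypothetical sketch of the kind of filtering the search tool performs,
# assuming the coding spreadsheet has been exported as a CSV with columns
# such as "child", "age_months", "activities", and "databrary_link".
import pandas as pd

tags = pd.read_csv("saycam_tags.csv")

# All segments tagged with nursing, for any child.
nursing = tags[tags["activities"].str.contains("nursing", case=False, na=False)]

# Segments from Sam recorded before he produced the word "cat"
# (threshold age invented for illustration).
cat_age = 14
pre_cat = tags[(tags["child"] == "S") & (tags["age_months"] < cat_age)]

print(nursing[["databrary_link"]].head())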

Individuals can apply to access the data using Databrary’s (https://nyu.databrary.org/)
standard application process. Researchers will need to demonstrate that they are authorized
investigators from their institution, that they completed ethics training, and that their institution
has an institutional review board.

DISCUSSION

Beginning with early diary studies and continuing with video recordings of parent–child
interaction, naturalistic corpora have helped researchers both ensure the ecological validity of
theories of learning and development and generate new research questions (e.g., Brown, 1973;
MacWhinney, 2000). The current dataset is a novel contribution to naturalistic developmental
corpora in several ways. First, the use of head-mounted cameras constrains the video recordings
to the child's point of view, allowing researchers not simply to observe the visual stimuli
surrounding a child, but also to access the visual stimuli that are actually available to the child (see
Yurovsky et al., 2013, for an example of how this first-person perspective can lead to novel
insights into learning mechanisms). Second, the twice-weekly recording schedule led to the creation
of a dataset with finer-grained sampling than in most other longitudinal corpora (see Adolph
& Robinson, 2011, for a discussion of the importance of sampling rate). Third, the two-year
span of recordings is longer than most longitudinal datasets, allowing researchers to track
development in a way that is not possible for shorter projects. Lastly, the open sharing of the
data on Databrary gives researchers (and their students) the rare opportunity to freely and easily
access the entire corpus.


We anticipate our data will be a valuable tool for researchers in developmental science,
and in particular to those interested in categorization and language learning. One of the most
important questions in language acquisition is how children connect words with their mean-
ings. This is a difficult problem for the language learner, because for any amount of evidence
about the use of a word, there is an infinite range of meanings that are consistent with that
evidencia (Quine, 1960). Viewed as a categorization problem, the challenge posed to children
is to identify the extension of categories in the world and, simultaneously, to infer which labels
map onto those categories. For example, a child must both determine that the word "cat" refers
to this particular cat and also that more generally it refers to a concept that includes other cats,
but not dogs or horses. This process may depend critically on the distribution of the objects
and entities in the child’s environment, the use of object labels in the child’s environment, y
the distributional relationship of the overlap between objects and labels—the present dataset
uniquely allows us to track children’s naturalistic audio and visual input, and to relate it to the
trajectory of their word learning.

Additionally, this dataset will be of use to those interested in quantifying children's early
naturalistic environments. Previous work (Yoshida & Smith, 2008) has shown that headcam
data are a viable proxy for infants' eye movements, and that headcam footage is valuable for
exploring the composition of a child's visual field (Aslin, 2009). We have previously used
headcam data to investigate how changing motor abilities affect children's experience (Frank
et al., 2013), while others have used headcams to show the changing composition of children's
visual input in early development (Fausey et al., 2016). Our longitudinal, naturalistic, first-person
dataset will allow researchers to ask and answer questions about the statistics of dyadic
interactions across time: our archive provides researchers with access to data on joint attention,
infant gaze, parental dyadic engagement, and attachment, in addition to the basic statistics of the
infant's perceptual experience. Indeed, early work using our data already suggests that interesting
insights can be extracted by considering how social information is modulated by different
activity contexts (Long et al., 2020).

Finally, the density of our dataset is extremely important for the use of new neural network
algorithms, which tend to be "data hungry." Already, researchers have explored what learning
progress is possible when new unsupervised visual learning algorithms are applied to our
dataset (Orhan et al., 2020; Zhuang et al., 2020). In fact, as Orhan et al. (2020) note, even
though our dataset is quite large by conventional standards, a child's visual experience from
birth to age two and a half would in fact be two orders of magnitude larger still, and the machine
learning literature suggests that such an increase in data would be likely to lead to higher
performance and new innovations. Thus, from a broader perspective, we hope our dataset is
part of the "virtuous cycle" of greater data access leading to algorithmic innovation—in turn
leading to new datasets being created.

ACKNOWLEDGMENTS

Our grateful acknowledgment goes to the students who coded and transcribed these data,
and to the individuals who were filmed in the recording sessions. Special thanks to Julia
Iannucci for work managing elements of this dataset.

FUNDING INFORMATION

JS, National Institutes of Health (http://dx.doi.org/10.13039/100000002), Award ID: R03 #HD09147.


AUTHOR CONTRIBUTIONS

JS: Conceptualization: Supporting; Methodology: Equal; Data Collection: Equal; Funding: Lead;
Supervision: Lead; Writing – Original Draft: Equal; Writing – Review & Editing: Lead. MM:
Analysis and Coding: Lead; Data Collection: Supporting; Supervision: Supporting; Writing –
Original Draft: Equal; Writing – Review & Editing: Equal. AP: Conceptualization: Equal; Data
Collection: Equal; Methodology: Equal; Supervision: Supporting; Writing – Original Draft: Equal;
Writing – Review & Editing: Supporting. EW: Conceptualization: Supporting; Data Collection:
Equal; Methodology: Supporting; Supervision: Supporting; Visualization: Supporting; Writing –
Original Draft: Equal; Writing – Review & Editing: Supporting. MCF: Conceptualization: Equal;
Data Collection: Supporting; Funding: Supporting; Methodology: Equal; Supervision: Supporting;
Writing – Original Draft: Equal; Writing – Review & Editing: Supporting.

REFERENCES

Adolph, K. E., & Robinson, S. R. (2011). Sampling development. Journal of Cognition and Development, 12(4), 411–423. DOI: https://doi.org/10.1080/15248372.2011.608190, PMID: 22140355, PMCID: PMC3226816

Aslin, R. (2009). How infants view natural scenes gathered from a head-mounted camera. Optometry and Vision Science, 86(6), 561–565. DOI: https://doi.org/10.1097/OPX.0b013e3181a76e96, PMID: 19417702, PMCID: PMC2748119

Aslin, R. (2012). Infant eyes: A window on cognitive development. Infancy, 17(1), 126–140. DOI: https://doi.org/10.1111/j.1532-7078.2011.00097.x, PMID: 22267956, PMCID: PMC3259733

Bambach, S., Smith, L., Crandall, D., & Yu, C. (2016). Objects in the center: How the infant's body constrains infant scenes. Proceedings of the Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (pp. 132–137). IEEE. https://ieeexplore.ieee.org/document/7846804, DOI: https://doi.org/10.1109/DEVLRN.2016.7846804

Bergelson, E. (2017). SEEDLingS 6 Month. Databrary. Retrieved October 8, 2020 from http://doi.org/10.17910/B7.330

Brown, R. (1973). A first language: The early stages. Harvard University Press. DOI: https://doi.org/10.4159/harvard.9780674732469

Cicchino, J., Aslin, R., & Rakison, D. (2011). Correspondence between what infants see and know about causal and self-propelled motion. Cognition, 118(2), 171–192. DOI: https://doi.org/10.1016/j.cognition.2010.11.005, PMID: 21122832, PMCID: PMC3038602

Clerkin, E. M., Hart, E., Rehg, J. M., Yu, C., & Smith, L. B. (2017). Real-world visual statistics and infants' first-learned object names. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711), Article 20160055. DOI: https://doi.org/10.1098/rstb.2016.0055, PMID: 27872373, PMCID: PMC5124080

Fausey, C., Jayaraman, S., & Smith, L. (2016). From faces to hands: Changing visual input in the first two years. Cognition, 152(2), 101–107. DOI: https://doi.org/10.1016/j.cognition.2016.03.005, PMID: 27043744, PMCID: PMC4856551

Fenson, L., Marchman, V. A., Thal, D. J., Dale, P. S., Reznick, J. S., & Bates, E. (2007). MacArthur-Bates Communicative Development Inventories: User's guide and technical manual (2nd ed.). Brookes Publishing. DOI: https://doi.org/10.1037/t11538-000

Fernald, A., Marchman, V., & Weisleder, A. (2012). SES differences in language processing skill and vocabulary are evident at 18 months. Developmental Science, 16(2), 234–248. DOI: https://doi.org/10.1111/desc.12019, PMID: 23432833, PMCID: PMC3582035

Franchak, J., Kretch, K., & Adolph, K. (2018). See and be seen: Infant–caregiver social looking during locomotor free play. Developmental Science, 21(4), Article e12626. DOI: https://doi.org/10.1111/desc.12626, PMID: 29071760, PMCID: PMC5920801

Franchak, J., Kretch, K., Soska, K., & Adolph, K. (2011). Head-mounted eye-tracking: A new method to describe infant looking. Child Development, 82(6), 1738–1750. DOI: https://doi.org/10.1111/j.1467-8624.2011.01670.x, PMID: 22023310, PMCID: PMC3218200

Frank, M. C., Simmons, K., Yurovsky, D., & Pusiol, G. (2013). Developmental and postural changes in children's visual access to faces. In M. Knauff (Ed.), Proceedings of the 35th Annual Conference of the Cognitive Science Society (pp. 454–459). Cognitive Science Society. http://langcog.stanford.edu/papers/FSYP-cogsci2013.pdf

Gilmore, R., & Adolph, K. (2017). Video can make behavioural science more reproducible. Nature Human Behaviour, 1(7), 1–2. DOI: https://doi.org/10.1038/s41562-017-0128, PMID: 30775454, PMCID: PMC6373476

Hart, B., & Risley, T. (1992). American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments. Developmental Psychology, 28(6), 1096–1105. DOI: https://doi.org/10.1037/0012-1649.28.6.1096

Kretch, K., Franchak, J., & Adolph, K. (2014). Crawling and walking infants see the world differently. Child Development, 85(4), 1503–1518. DOI: https://doi.org/10.1111/cdev.12206, PMID: 24341362, PMCID: PMC4059790


Long, B., Kachergis, G., Agrawal, K., & Frank, M. C. (2020). Detecting social information in a dense dataset of infants' natural visual experience. PsyArXiv. https://psyarxiv.com/z7tdg/. DOI: https://doi.org/10.31234/osf.io/z7tdg

MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk. Transcription format and programs (Vol. 1). Psychology Press.

Oller, D., Niyogi, P., Gray, S., Richards, J., Gilkerson, J., Xu, D., Yapanel, U., & Warren, S. F. (2010). Automated vocal analysis of naturalistic recordings from children with autism, language delay, and typical development. Proceedings of the National Academy of Sciences, 107(30), 13354–13359. DOI: https://doi.org/10.1073/pnas.1003882107, PMID: 20643944, PMCID: PMC2922144

Orhan, A. E., Gupta, V. V., & Lake, B. M. (2020). Self-supervised learning through the eyes of a child. ArXiv. https://arxiv.org/abs/2007.16189

Pusiol, G., Soriano, L., Fei-Fei, L., & Frank, M. (2014). Discovering the signatures of joint attention in child-caregiver interaction. Proceedings of the Annual Meeting of the Cognitive Science Society, 36, 2805–2810.

Quine, W. V. (1960). Word and object. MIT Press.

Rozin, P. (2001). Social psychology and science: Some lessons from Solomon Asch. Personality and Social Psychology Review, 5(1), 2–14. DOI: https://doi.org/10.1207/S15327957PSPR0501_1

Sanchez, A., Long, B., Kraus, A. M., & Frank, M. C. (2018). Postural developments modulate children's visual access to social information. In Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 2412–2417). Cognitive Science Society. http://langcog.stanford.edu/papers_new/sanchez-long-2018-cogsci.pdf. DOI: https://doi.org/10.31234/osf.io/th92b

Sanchez, A., Meylan, S., Braginsky, M., MacDonald, K., Yurovsky, D., & Frank, M. (2019). CHILDES-db: A flexible and reproducible interface to the child language data exchange system. Behavior Research Methods, 51(4), 1928–1941. DOI: https://doi.org/10.3758/s13428-018-1176-7, PMID: 30623390

Smith, L., Yu, C., Yoshida, H., & Fausey, C. (2015). Contributions of head-mounted cameras to studying the visual environments of infants and young children. Journal of Cognition and Development, 16(3), 407–419. DOI: https://doi.org/10.1080/15248372.2014.933430, PMID: 26257584, PMCID: PMC4527180

Sperry, D. E., Sperry, L. L., & Miller, P. J. (2019). Reexamining the verbal environments of children from different socioeconomic backgrounds. Child Development, 90(4), 1303–1318. DOI: https://doi.org/10.1111/cdev.13072, PMID: 29707767

Squires, J., Bricker, D., & Potter, L. (1997). Revision of a parent-completed developmental screening tool: Ages and Stages Questionnaires. Journal of Pediatric Psychology, 22(3), 313–328. DOI: https://doi.org/10.1093/jpepsy/22.3.313, PMID: 9212550

Yoshida, H., & Smith, L. (2008). What's in view for toddlers? Using a head camera to study visual experience. Infancy, 13(3), 229–248. DOI: https://doi.org/10.1080/15250000802004437, PMID: 20585411, PMCID: PMC2888512

Yu, C., & Smith, L. (2013). Joint attention without gaze following: Human infants and their parents coordinate visual attention to objects through eye-hand coordination. PLoS ONE, 8, Article e79659. DOI: https://doi.org/10.1371/journal.pone.0079659, PMID: 24236151, PMCID: PMC3827436

Yurovsky, D., Smith, L. B., & Yu, C. (2013). Statistical word learning at scale: The baby's view is better. Developmental Science, 16(6), 959–966. DOI: https://doi.org/10.1111/desc.12036, PMID: 24118720, PMCID: PMC4443688

Zhuang, C., Yan, S., Nayebi, A., Schrimpf, M., Frank, M. C., DiCarlo, J. J., & Yamins, D. L. K. (2020). Unsupervised neural network models of the ventral visual system. BioRxiv. DOI: https://doi.org/10.1101/2020.06.16.155556
