RESEARCH ARTICLE
“Um…, It’s Really Difficult to… Um…
Speak Fluently”: Neural Tracking
of Spontaneous Speech
Galit Agmon1,2, Manuela Jaeger3, Reut Tsarfaty4, Martin G. Bleichner3,5, and Elana Zion Golumbic1
1The Gonda Center for Multidisciplinary Brain Research, Bar-Ilan University, Ramat Gan, Israel
2Frontotemporal Degeneration Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
3Neurophysiology of Everyday Life Group, Department of Psychology, University of Oldenburg, Oldenburg, Germany
4Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
5Research Center for Neurosensory Science, University of Oldenburg, Oldenburg, Germany
Keywords: disfluencies, speech rate, spontaneous speech, syntactic boundaries, TRF analysis
ABSTRACT
Spontaneous real-life speech is imperfect in many ways. It contains disfluencies and ill-formed
utterances and has a highly variable rate. When listening to spontaneous speech, the brain
needs to contend with these features in order to extract the speaker’s meaning. Here, we studied
how the neural response is affected by three specific factors that are prevalent in spontaneous
colloquial speech: (1) the presence of fillers, (2) the need to detect syntactic boundaries in
disfluent speech, and (3) variability in speech rate. Neural activity was recorded (using
electroencephalography) from individuals as they listened to an unscripted, spontaneous
narrative, which was analyzed in a time-resolved fashion to identify fillers and detect syntactic
boundaries. When considering these factors in a speech-tracking analysis, which estimates a
temporal response function (TRF) to describe the relationship between the stimulus and the
neural response it generates, we found that the TRF was affected by all of them. This response
was observed for lexical words but not for fillers, and it had an earlier onset for opening words
vs. closing words of a clause and for clauses with slower speech rates. These findings broaden
ongoing efforts to understand neural processing of speech under increasingly realistic
conditions. They highlight the importance of considering the imperfect nature of real-life
spoken language, linking past research on linguistically well-formed and meticulously
controlled speech to the type of speech that the brain actually deals with on a daily basis.
INTRODUCTION
Neural speech tracking has become an increasingly useful tool for studying how the brain
encodes and processes continuous speech (Brodbeck & Simon, 2020; Obleser & Kayser,
2019). Importantly, characterizing linguistic attributes of speech on a continuous basis gives
a new angle to auditory attention and neurolinguistics, as researchers have been able to dis-
sociate neural responses driven by the acoustics of speech from those capturing higher-order
processes in a dynamically changing speech signal, such as phonological identity and seman-
tic expectations (Brodbeck et al., 2018; Gillis et al., 2021; Inbar et al., 2020; Keitel et al.,
2018). And yet, the speech stimuli used in these studies are generally highly scripted and edi-
ted, taken—for example—from audiobooks or TED talks, which are also extensively rehearsed
and are delivered by professionals. These stimuli are in many ways different from colloquial
An open access journal
Citation: Agmon, G., Jaeger, M.,
Tsarfaty, R., Bleichner, M. G., & Zion
Golumbic, E. (2023). “Um…, it’s really
difficult to… um… speak fluently”:
Neural tracking of spontaneous
speech. Neurobiology of Language,
4(3), 435–454. https://doi.org/10.1162
/nol_a_00109
DOI:
https://doi.org/10.1162/nol_a_00109
Supporting Information:
https://doi.org/10.1162/nol_a_00109
Received: 14 July 2022
Accepted: 5 May 2023
Competing Interests: The authors have
declared that no competing interests
exist.
Corresponding Author:
Galit Agmon
galit.agmon@pennmedicine.upenn.edu
Handling Editor:
Sonja A. Kotz
Copyright: © 2023
Massachusetts Institute of Technology
Published under a Creative Commons
Attribution 4.0 International
(CC BY 4.0) license
The MIT Press
Neural tracking of spontaneous speech
speech that we hear every day (Blaauw, 1994; Face, 2003; Goldman-Eisler, 1968, 1972;
Haselow, 2017; Huber, 2007; Mehta & Cutler, 1988). Spontaneous speech is also markedly
less fluent than scripted speech, peppered with fillers and pauses, and often includes partial or
grammatically incorrect syntactic structures (Auer, 2009; Haselow, 2017; Linell, 1982;
Shriberg, 2001). Spontaneous speech is also more variable than scripted speech in terms of
its speech rate (Goldman-Eisler, 1961; Miller et al., 1984). Silent pauses in spontaneous speech
occur less consistently on syntactic boundaries compared to reading (Goldman-Eisler, 1972;
Wang et al., 2010). These differences may render the syntactic analysis of spoken speech less
trivial compared to the planned, well-structured sentences in scripted speech materials. By
focusing mostly on scripted speech materials, past research might have overlooked important
processes that are essential for understanding naturalistic speech.
Addressing this gap, in the current electroencephalography (EEG) study we assess neural
speech tracking of spontaneously generated speech and focus on the specific challenges the
brain has to cope with when processing spontaneous speech: the abundant presence of fillers,
online segmentation and detection of syntactic boundaries, and variability in speech rate.
Disfluency and Fillers
A prominent feature of spontaneous speech is that it contains frequent pauses, self-corrections
and repetitions (Bortfeld et al., 2001; Clark & Wasow, 1998; Fox Tree, 1995). These disfluencies
are generally accompanied by fillers, which are nonlexical utterances (or filled pauses) such as
“um” or “uh,” or discourse markers such as “you know” and “I mean” (Fox Tree & Schrock,
2002; Tottie, 2014). Fillers can take on different forms; however, their prevalence across a
multitude of spoken languages (e.g., Tian et al., 2017; Wieling et al., 2016) suggests they are a core
feature of spontaneous speech. Although fillers do not, in and of themselves, contribute specific
lexical information, they are also not mere glitches in speech production. Rather, they likely
serve several important communicative goals, helping the speaker in transforming their internal
thoughts to speech and helping the listener interpret this speech. Specific roles that have been
attributed to fillers include signaling hesitation in speech planning (Corley & Stewart, 2008),
conveying lack of certainty in the content (Brennan & Williams, 1995; Smith & Clark, 1993),
serving as a cue to focus attention on upcoming words or complex syntactic phrases (Clark &
Fox Tree, 2002; Fox Tree, 2001; Fraundorf & Watson, 2011; Watanabe et al., 2008), signaling
unexpected information to come (Arnold et al., 2004; Barr & Seyfeddinipur, 2010; Corley et al.,
2007), and disambiguating syntactic structures (Bailey & Ferreira, 2003). Moreover, studies
have shown that the presence of fillers improves accuracy and memory of speech content
(Brennan & Schober, 2001; Corley et al., 2007; Fox Tree, 2001; Fraundorf & Watson, 2011)
and that surprise-related neural responses (event-related potentials; ERPs) to target words are
reduced if they are preceded by a filler (Collard et al., 2008; Corley et al., 2007). And yet,
despite their clearly important role in the production and perception of spontaneous speech,
fillers and other disfluencies are generally absent in planned speech that is used in most lab
speech-tracking experiments. Therefore, how the brain processes fillers has not been studied
extensively.
The Unorganized Nature of Spontaneous Sentences
Unlike written text or highly edited spoken scripts, spontaneous speech is constructed “on the
fly,” and represents the speaker’s unedited and somewhat unpolished internal train of thought.
As a consequence, spontaneous speech does not always contain clear sentence endings, is not
always grammatically correct, and sentences can seem extremely long and less concise (e.g.,
Neurobiology of Language
Syntactic parsing:
The process of analyzing a string of
words into a coherent structure. This
term can refer either to the cognitive
process of analyzing speech input
and representing its syntactic
structure, or to the computational
process performed by natural
language processing tools.
“and, so, then we went to the bus, but it, like, didn’t come, the bus back home I mean, so we
had to wait, I really don’t know for how long”; Auer, 2009; Halliday, 1989; Haselow, 2017;
Linell, 1982). This poses a challenge to the listener of how to parse the continuous input
stream into meaningful syntactic units correctly.
Syntactic parsing is the process of analyzing the string of words into a coherent structure
and is critical for speech comprehension, even under assumptions of a “good-enough” or
noisy parse (Ferreira & Patson, 2007; Traxler, 2014). To detect syntactic boundaries in spoken
language, listeners rely on their accumulated syntactic analysis of an utterance as well as on
prosodic cues such as pauses and changes in pitch and duration (Ding et al., 2016; Fodor &
Bever, 1965; Garrett et al., 1966; Har-Shai Yahav & Zion Golumbic, 2021; Hawthorne &
Gerken, 2014; Kaufeld et al., 2020; Langus et al., 2012; Strangert, 1992; Strangert & Strangert,
1993). There are some behavioral and neural indications that words occurring at the final posi-
tion of syntactic structures have a special status. In production, words in final positions tend to
be prosodically marked (Cooper & Paccia-Cooper, 1980; Klatt, 1975). In comprehension,
reading studies show that sentence-final words have prolonged reading and fixation times
as well as increased ERP responses. These effects are known as wrap up effects, which is a
general name for the integrative processes triggered by the final word (Just & Carpenter,
1980; Stowe et al., 2018, for review). Independently, there is also evidence for neural responses
associated with detecting prosodic breaks (Pannekamp et al., 2005; Peter et al., 2014;
Steinhauer et al., 1999), which often coincide with syntactic boundaries. However, the neural
correlates of syntactic boundaries have seldom been studied in the context of spoken
language, and particularly not for spontaneous speech, where sentence boundaries are not
as well-formed as in scripted language materials.
Speech Rate
Another characteristic of spontaneous speech studied here is speech rate across different sen-
tences. Spontaneous speech is produced “on-the-fly,” which can yield speech that at times is
highly coherent, fast and excited, and at times is prolonged and interspersed with pauses and
hesitations (Goldman-Eisler, 1961, 1972; Miller et al., 1984). Generally, a higher speech rate
means that information needs to be integrated in a shorter amount of time, which imposes
higher cognitive load on the listener and can affect the processing of speech in many ways.
For example, processing compressed speech decreases speech comprehension and
intelligibility (Ahissar & Ahissar, 2005; Ahissar et al., 2001; Chan & Lee, 2005; Vaughan & Letowski,
1997; Verschueren et al., 2022). Additionally, phonological and lexical decoding are affected
by speech rate and dynamically adjusted to local changes in speech rate (Dilley & Pitt, 2010;
Dupoux & Green, 1997; Miller et al., 1986). Several studies have shown that neural tracking of
continuous speech can be affected by artificially manipulating speech rate (Ahissar et al.,
2001; Müller et al., 2019; Verschueren et al., 2022). However, few have looked at the natural
variations in speech rate in spontaneous discourse.
To summarize, spontaneous speech differs in many ways from scripted speech. Here we
focused on three key characteristics of spontaneous speech and asked how the presence of
fillers, the need to detect syntactic boundaries online, and the natural variations in speech rate
affect listeners’ neural response to speech. To do so, we analyzed EEG-recorded neural
responses from individuals listening to a 6 min long monologue rich in those features,
recorded from a speaker spontaneously recounting a personal experience. We first analyzed
the monologue to identify fillers and syntactic boundaries and estimate speech rate, then used
these as data-driven regressors for analyzing the neural activity, using multivariate speech-
tracking analysis of the EEG data. In doing so, this study aimed to bridge the gap between the
vast literature studying brain responses to meticulously controlled speech and language mate-
rials and the type of speech materials that the brain deals with on a regular daily basis.
MATERIALS AND METHODS
Participants
Twenty participants took part in this experiment. All participants were native Hebrew speakers,
right-handed (16:4 F:M; age range 20–30, average 23.1 ± 2.65). Prior to their participation,
participants signed informed consent approved by the internal review board of Bar-Ilan Uni-
versity and were compensated for their participation with credit or payment.
Stimuli and Procedure
The stimulus was a single 6 min recording of a personal narrative, told in the first person by a
female native Hebrew speaker. The narrative was neutral, unscripted, and spontaneously gen-
erated, and described the speaker’s participation in a Facebook group called (translated from
the Hebrew) “Questions With No Point” and a social face-to-face meet-up organized by mem-
bers of the group. The only instruction given to participants was to listen passively to the story
presented without breaks. This 6 min session was used as an interlude between two parts of
another experiment by Paz Har-Shai Yahav and colleagues, currently in progress, that focused
on changes in low-level auditory responses over time and is orthogonal in its goals to the data
reported here.
Linguistic Analysis of the Speech Stimulus
The speech narrative was transcribed manually by two independent annotators who were
native speakers of Hebrew, and verified by an expert linguist (GA). The onset and duration
of each word and of fillers were identified and time-stamped by the two annotators and con-
firmed by the linguist, using the software Praat (Boersma & Weenink, 2021).
Based on the transcription, an expert linguist (GA) parsed the speech stimulus into major
syntactic units, most notably clauses. We marked boundaries of all main clauses, defined as
the minimal unit containing a predicate and all its complements and modifiers. This includes
clauses in coordinate constructions (starting with “and” or “but”). In many cases, a clause can
contain a subordinate clause, whose boundaries we also marked. These included complement
clauses (e.g., “we decided [that we would meet in one of the gardens]”), adverbial clauses
(e.g., “we met [because we needed to talk]”), and relative clauses (e.g., “we met a delegation
[that came from Japan]”). We also marked the boundaries of heavy phrases such as appositives
(e.g., “we decided to go there, [me and my friends]”) or ellipses (e.g., “I was fifteen years old
then, [in the tenth grade]”).
Although syntactic parsing was done primarily based on the speech transcript, it was
double-checked relative to the audio-recording in search of cases where the spoken prosody
suggested a different intended parsing than the textual syntactic analysis. For example, the text
“I went home with my friends” could be considered a single clause from a purely textual
perspective. However, in the audio recording, the speaker inserted a pause after the word
“home,” and hence, a listener would likely have identified the word “home” as the final word
in the clause, before the speaker decided to continue it with a prepositional phrase (“with my
friends”). Due to this perceptual consideration, in such cases, we marked both the word
“home” and the word “friends” as the final word in the clause.
The time-stamped transcription and syntactic parsing were used to annotate the speech
stimulus according to the following four word-level features, which were used to analyze
the neural response (described below and summarized in Figure 1):
• Fillers vs. non-fillers: Fillers were identified and time-stamped as part of the transcription
process. The definition of fillers included both filled pauses (e.g., “um,” “uh”) and filler
discourse markers, which are words that are lexical units, but their use in the utterance is
not related to their original lexical meaning (e.g., “like,” “well”). A total of 92 fillers were
detected in the speech stimulus.
• Words at syntactic boundaries (opening vs. closing words): Opening and closing words
were identified based on the marking of syntactic clause boundaries described above.
There were a total of 166 opening words and 142 closing words (since some clauses
were syntactically incomplete and did not have a clear closing word).
Figure 1. Summary of speech stimulus used to analyze the neural response. (A) Excerpt from the
speech stimulus used, demonstrating the disfluencies of spontaneous speech. (B) Example of the
four word-level features of spontaneous speech analyzed here. (C) Example of the quantification
of speech rate at the clause level.
• Word length (short vs. long words): The length of each word was evaluated based on the
time-stamped transcription and the median length was used to distinguish between short
and long words (median length: 319 ms).
• Information content (function vs. content words): We also differentiated between words
that carry the most information (i.e., content words, defined as nouns, verbs, adjectives,
and adverbs) and words that mostly play a syntactic role (i.e., function words, defined as
pronouns, auxiliary verbs, prepositions, conjunctions, and determiners). In this data set,
there were 413 content and 250 function words.
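The median split used for the word-length feature above can be illustrated with a short Python sketch. This is only an illustration, not the authors' code: the function name split_by_median_duration, the input layout (a plain list of word durations in ms), and the treatment of words exactly at the median (labeled "short") are assumptions, since the article does not say how ties were handled. The example durations are hypothetical and are not taken from the stimulus.

```python
from statistics import median


def split_by_median_duration(durations_ms):
    """Label each word 'short' or 'long' relative to the median duration.

    Assumption (not stated in the article): a word whose duration equals
    the median is labeled 'short'.
    """
    cutoff = median(durations_ms)
    labels = ["long" if d > cutoff else "short" for d in durations_ms]
    return cutoff, labels


# Hypothetical word durations in milliseconds, for illustration only.
example_durations = [120, 250, 319, 400, 610]
example_cutoff, example_labels = split_by_median_duration(example_durations)
```

A median split of this kind yields two groups of roughly equal size, which is presumably why it is a convenient way to contrast short and long words.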
Analysis of Speech Rate
Potential effects of variability in speech rate cannot be assessed at a single-word level, but
require assessing the rate of speech over longer periods of time. Here we chose to quantify
the mean speech rate within each main clause (including any embedded complement clauses
and restrictive relative clauses), following the rationale that a clause is the basic unit over
which information needs to be integrated during online listening.
We tested two metrics for operationalizing speech rate: syllable rate and word rate. These
metrics are highly correlated with each other (r = 0.76, p < 10⁻¹¹ in the current data set) but
emphasize slightly different aspects of information transfer, with syllable rate capturing the rate
of acoustic input and word rate capturing the rate of linguistic input and general fluency. The
word rate of each clause was quantified as the number of words in a clause (not including
fillers) divided by its length. Similarly, the syllable rate was quantified as the number of sylla-
bles in a clause (not including fillers) divided by its length.
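The two clause-level metrics described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the Token record and the clause_rates function are hypothetical names, and clause length is assumed to run from the onset of the first token to the offset of the last token (so time occupied by fillers and pauses counts toward the length, while fillers are excluded from the word and syllable counts). The article does not spell out these details.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One time-stamped token of a clause (hypothetical layout)."""
    onset: float       # seconds from stimulus start
    offset: float      # seconds from stimulus start
    n_syllables: int
    is_filler: bool


def clause_rates(tokens):
    """Return (word_rate, syllable_rate) in units per second for one clause.

    Assumption: clause length = offset of last token - onset of first token.
    Fillers are excluded from the counts, as described in the text.
    """
    duration = tokens[-1].offset - tokens[0].onset
    if duration <= 0:
        raise ValueError("clause duration must be positive")
    lexical = [t for t in tokens if not t.is_filler]
    word_rate = len(lexical) / duration
    syllable_rate = sum(t.n_syllables for t in lexical) / duration
    return word_rate, syllable_rate


# Hypothetical clause: three lexical words and one filler over 2 seconds.
example_clause = [
    Token(0.0, 0.4, 2, False),
    Token(0.5, 0.8, 1, True),   # filler, e.g. "um"
    Token(1.0, 1.5, 3, False),
    Token(1.5, 2.0, 1, False),
]
example_word_rate, example_syllable_rate = clause_rates(example_clause)
```

In this hypothetical clause the filler lowers both rates, because it adds time without adding to either count, which fits the description of word rate as also reflecting general fluency.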
EEG Recordings
EEG was recorded using a 64-channel Active-Two system (BioSemi) with Ag-AgCl electrodes, placed
according to the 10–20 system, at a sampling rate of 1024 Hz. Additional external electrodes
were used to record from the mastoids bilaterally and both vertical and horizontal electrooc-
ulography electrodes were used to monitor eye movements. The experiment was conducted in
a dimly lit, acoustically and electrically shielded booth. Participants were seated on a com-
fortable chair and were instructed to keep as still as possible and breathe and blink naturally.
Experiments were programmed and presented to participants using PsychoPy (Open Science
Tools, 2019; Peirce et al., 2019).
EEG Preprocessing and Speech-Tracking Analysis
EEG preprocessing and analysis were performed using the MATLAB-based FieldTrip toolbox
(Oostenveld et al., 2011) as well as custom-written scripts. Raw data were first visually
inspected, and time points with gross artifacts exceeding ±50 μV (that were not eye move-
ments) were removed. Independent component analysis was performed to identify and
remove components associated with horizontal or vertical eye movements as well as heart-
beats (Onton et al., 2006). Any remaining noisy electrodes that exhibited either extreme
high-frequency activity (>40 Hz) or low-frequency activity/drifts (<1 Hz), were replaced with
the weighted average of their neighbors using an interpolation procedure. The clean EEG data
were filtered between 1 and 10 Hz. The broadband envelope of the speech was extracted
using an equally spaced filterbank between 100 and 10000 Hz based on Liberman’s cochlear
frequency map (Liberman, 1982). The narrowband signals were summed across bands
after taking the absolute value of the Hilbert transform for each one, resulting in a broadband