RESEARCH ARTICLE

“Um…, It’s Really Difficult to… Um…
Speak Fluently”: Neural Tracking
of Spontaneous Speech

Galit Agmon1,2, Manuela Jaeger3, Reut Tsarfaty4, Martin G. Bleichner3,5, and Elana Zion Golumbic1

1The Gonda Center for Multidisciplinary Brain Research, Bar-Ilan University, Ramat Gan, Israel
2Frontotemporal Degeneration Center, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
3Neurophysiology of Everyday Life Group, Department of Psychology, University of Oldenburg, Oldenburg, Germany
4Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
5Research Center for Neurosensory Science, University of Oldenburg, Oldenburg, Germany

Keywords: disfluencies, speech rate, spontaneous speech, syntactic boundaries, TRF analysis

ABSTRACT

Spontaneous real-life speech is imperfect in many ways. It contains disfluencies and ill-formed
utterances and has a highly variable rate. When listening to spontaneous speech, the brain
needs to contend with these features in order to extract the speaker’s meaning. Here, we studied
how the neural response is affected by three specific factors that are prevalent in spontaneous
colloquial speech: (1) the presence of fillers, (2) the need to detect syntactic boundaries in
disfluent speech, and (3) variability in speech rate. Neural activity was recorded (using
electroencephalography) from individuals as they listened to an unscripted, spontaneous
narrative, which was analyzed in a time-resolved fashion to identify fillers and detect syntactic
boundaries. When considering these factors in a speech-tracking analysis, which estimates a
temporal response function (TRF) to describe the relationship between the stimulus and the
neural response it generates, we found that the TRF was affected by all of them. This response
was observed for lexical words but not for fillers, and it had an earlier onset for opening words
vs. closing words of a clause and for clauses with slower speech rates. These findings broaden
ongoing efforts to understand neural processing of speech under increasingly realistic
conditions. They highlight the importance of considering the imperfect nature of real-life
spoken language, linking past research on linguistically well-formed and meticulously
controlled speech to the type of speech that the brain actually deals with on a daily basis.

An open access journal

Citation: Agmon, G., Jaeger, M., Tsarfaty, R., Bleichner, M. G., & Zion Golumbic, E. (2023).
“Um…, it’s really difficult to… um… speak fluently”: Neural tracking of spontaneous speech.
Neurobiology of Language, 4(3), 435–454. https://doi.org/10.1162/nol_a_00109

DOI: https://doi.org/10.1162/nol_a_00109

Supporting Information: https://doi.org/10.1162/nol_a_00109

Received: 14 July 2022
Accepted: 5 May 2023

Competing Interests: The authors have declared that no competing interests exist.

Corresponding Author: Galit Agmon, galit.agmon@pennmedicine.upenn.edu

Handling Editor: Sonja A. Kotz

Copyright: © 2023 Massachusetts Institute of Technology. Published under a Creative Commons
Attribution 4.0 International (CC BY 4.0) license.

The MIT Press

INTRODUCTION

Neural speech tracking has become an increasingly useful tool for studying how the brain
encodes and processes continuous speech (Brodbeck & Simon, 2020; Obleser & Kayser,
2019). Importantly, characterizing linguistic attributes of speech on a continuous basis gives
a new angle to auditory attention and neurolinguistics, as researchers have been able to
dissociate neural responses driven by the acoustics of speech from those capturing higher-order
processes in a dynamically changing speech signal, such as phonological identity and semantic
expectations (Brodbeck et al., 2018; Gillis et al., 2021; Inbar et al., 2020; Keitel et al.,
2018). However, the speech stimuli used in these studies are generally highly scripted and
edited, taken, for example, from audiobooks or TED talks, which are also extensively rehearsed
and are delivered by professionals. These stimuli are in many ways different from colloquial
speech that we hear every day (Blaauw, 1994; Face, 2003; Goldman-Eisler, 1968, 1972;
Haselow, 2017; Huber, 2007; Mehta & Cutler, 1988). Spontaneous speech is also markedly
less fluent than scripted speech, peppered with fillers and pauses, and often includes partial or
grammatically incorrect syntactic structures (Auer, 2009; Haselow, 2017; Linell, 1982;
Shriberg, 2001). Spontaneous speech is also more variable than scripted speech in terms of
its speech rate (Goldman-Eisler, 1961; Miller et al., 1984). Silent pauses in spontaneous speech
occur less consistently on syntactic boundaries compared to reading (Goldman-Eisler, 1972;
Wang et al., 2010). These differences may render the syntactic analysis of spoken speech less
trivial compared to the planned, well-structured sentences in scripted speech materials. By
focusing mostly on scripted speech materials, past research might have overlooked important
processes that are essential for understanding naturalistic speech.

Addressing this gap, in the current electroencephalography (EEG) study we assess neural
speech tracking of spontaneously generated speech and focus on the specific challenges the
brain has to cope with when processing spontaneous speech: the abundant presence of fillers,
online segmentation and detection of syntactic boundaries, and variability in speech rate.

Disfluency and Fillers

A prominent feature of spontaneous speech is that it contains frequent pauses, self-corrections
and repetitions (Bortfeld et al., 2001; Clark & Wasow, 1998; Fox Tree, 1995). These disfluencies
are generally accompanied by fillers, which are nonlexical utterances (or filled pauses) such as
“um” or “uh,” or discourse markers such as “you know” and “I mean” (Fox Tree & Schrock,
2002; Tottie, 2014). Fillers can take on different forms; however, their prevalence across a
multitude of spoken languages (e.g., Tian et al., 2017; Wieling et al., 2016) suggests that they
are a core feature of spontaneous speech. Although fillers do not, in and of themselves,
contribute specific lexical information, they are also not mere glitches in speech production.
Rather, they likely
serve several important communicative goals, helping the speaker in transforming their internal
thoughts to speech and helping the listener interpret this speech. Specific roles that have been
attributed to fillers include signaling hesitation in speech planning (Corley & Stewart, 2008),
conveying lack of certainty in the content (Brennan & Williams, 1995; Smith & Clark, 1993),
serving as a cue to focus attention on upcoming words or complex syntactic phrases (Clark &
Fox Tree, 2002; Fox Tree, 2001; Fraundorf & Watson, 2011; Watanabe et al., 2008), signaling
unexpected information to come (Arnold et al., 2004; Barr & Seyfeddinipur, 2010; Corley et al.,
2007), and disambiguating syntactic structures (Bailey & Ferreira, 2003). Moreover, studies
have shown that the presence of fillers improves accuracy and memory of speech content
(Brennan & Schober, 2001; Corley et al., 2007; Fox Tree, 2001; Fraundorf & Watson, 2011)
and that surprise-related neural responses (event-related potentials; ERPs) to target words are
reduced if they are preceded by a filler (Collard et al., 2008; Corley et al., 2007). However,
despite their clearly important role in the production and perception of spontaneous speech,
fillers and other disfluencies are generally absent in planned speech that is used in most lab
speech-tracking experiments. Therefore, how the brain processes fillers has not been studied
extensively.

The Unorganized Nature of Spontaneous Sentences

Unlike written text or highly edited spoken scripts, spontaneous speech is constructed “on the
fly,” and represents the speaker’s unedited and somewhat unpolished internal train of thought.
As a result, spontaneous speech does not always contain clear sentence endings, is not
always grammatically correct, and sentences can seem extremely long and less concise (e.g.,
“and, so, then we went to the bus, but it, like, didn’t come, the bus back home I mean, so we
had to wait, I really don’t know for how long”; Auer, 2009; Halliday, 1989; Haselow, 2017;
Linell, 1982). This poses a challenge to the listener of how to parse the continuous input
stream into meaningful syntactic units correctly.

Syntactic parsing:
The process of analyzing a string of words into a coherent structure. This term can refer
either to the cognitive process of analyzing speech input and representing its syntactic
structure, or to the computational process performed by natural language processing tools.

Syntactic parsing is the process of analyzing the string of words into a coherent structure
and is critical for speech comprehension, even under assumptions of a “good-enough” or
noisy parse (Ferreira & Patson, 2007; Traxler, 2014). To detect syntactic boundaries in spoken
language, listeners rely on their accumulated syntactic analysis of an utterance as well as on
prosodic cues such as pauses and changes in pitch and duration (Ding et al., 2016; Fodor &
Bever, 1965; Garrett et al., 1966; Har-Shai Yahav & Zion Golumbic, 2021; Hawthorne &
Gerken, 2014; Kaufeld et al., 2020; Langus et al., 2012; Strangert, 1992; Strangert & Strangert,
1993). There are some behavioral and neural indications that words occurring at the final
position of syntactic structures have a special status. In production, words in final positions
tend to be prosodically marked (Cooper & Paccia-Cooper, 1980; Klatt, 1975). In comprehension,
reading studies show that sentence-final words have prolonged reading and fixation times
as well as increased ERP responses. These effects are known as wrap-up effects, which is a
general name for the integrative processes triggered by the final word (Just & Carpenter,
1980; Stowe et al., 2018, for review). Independently, there is also evidence for neural responses
associated with detecting prosodic breaks (Pannekamp et al., 2005; Peter et al., 2014;
Steinhauer et al., 1999), which often coincide with syntactic boundaries. However, the neural
correlates of syntactic boundaries have seldom been studied in the context of spoken
language, and particularly not for spontaneous speech, where sentence boundaries are not
as well-formed as in scripted language materials.

Speech Rate

Another characteristic of spontaneous speech studied here is variability in speech rate across
different sentences. Spontaneous speech is produced “on-the-fly,” which can yield speech that
at times is highly coherent, fast and excited, and at times is prolonged and interspersed with
pauses and hesitations (Goldman-Eisler, 1961, 1972; Miller et al., 1984). In general, a higher
speech rate means that information needs to be integrated in a shorter amount of time, which
imposes higher cognitive load on the listener and can affect the processing of speech in many ways.
For example, processing compressed speech decreases speech comprehension and
intelligibility (Ahissar & Ahissar, 2005; Ahissar et al., 2001; Chan & Lee, 2005; Vaughan & Letowski,
1997; Verschueren et al., 2022). In addition, phonological and lexical decoding are affected
by speech rate and dynamically adjusted to local changes in speech rate (Dilley & Pitt, 2010;
Dupoux & Green, 1997; Miller et al., 1986). Several studies have shown that neural tracking of
continuous speech can be affected by artificially manipulating speech rate (Ahissar et al.,
2001; Müller et al., 2019; Verschueren et al., 2022). However, few have looked at the natural
variations in speech rate in spontaneous discourse.

To summarize, spontaneous speech differs in many ways from scripted speech. Here we
focused on three key characteristics of spontaneous speech and asked how the presence of
fillers, the need to detect syntactic boundaries online, and the natural variations in speech rate
affect listeners’ neural response to speech. To do so, we analyzed EEG-recorded neural
responses from individuals listening to a 6 min long monologue rich in those features,
recorded from a speaker spontaneously recounting a personal experience. We first analyzed
the monologue to identify fillers and syntactic boundaries and estimate speech rate, then used
these as data-driven regressors for analyzing the neural activity, using multivariate speech-
tracking analysis of the EEG data. In doing so, this study aimed to bridge the gap between the
vast literature studying brain responses to meticulously controlled speech and language mate-
rials and the type of speech materials that the brain deals with on a regular daily basis.

MATERIALS AND METHODS

Participants

Twenty participants took part in this experiment. All participants were native Hebrew speakers
and right-handed (16:4 F:M; age range 20–30, average 23.1 ± 2.65). Prior to their participation,
participants signed informed consent approved by the internal review board of Bar-Ilan Uni-
versity and were compensated for their participation with credit or payment.

Stimuli and Procedure

The stimulus was a single 6 min recording of a personal narrative, told in the first person by a
female native Hebrew speaker. The narrative was neutral, unscripted, and spontaneously gen-
erated, and described the speaker’s participation in a Facebook group called (translated from
the Hebrew) “Questions With No Point” and a social face-to-face meet-up organized by mem-
bers of the group. The only instruction given to participants was to listen passively to the story
presented without breaks. This 6 min session was used as an interlude between two parts of
another experiment by Paz Har-Shai Yahav and colleagues, currently in progress, that focused
on changes in low-level auditory responses over time and is orthogonal in its goals to the data
reported here.

Linguistic Analysis of the Speech Stimulus

The speech narrative was transcribed manually by two independent annotators who were
native speakers of Hebrew, and verified by an expert linguist (GA). The onset and duration
of each word and of fillers were identified and time-stamped by the two annotators and con-
firmed by the linguist, using the software Praat (Boersma & Weenink, 2021).

Based on the transcription, an expert linguist (GA) parsed the speech stimulus into major
syntactic units, most notably clauses. We marked boundaries of all main clauses, defined as
the minimal unit containing a predicate and all its complements and modifiers. This includes
clauses in coordinate constructions (starting with “and” or “but”). In many cases, a clause can
contain a subordinate clause, whose boundaries we also marked. These included complement
clauses (e.g., “we decided [that we would meet in one of the gardens]”), adverbial clauses
(e.g., “we met [because we needed to talk]”), and relative clauses (e.g., “we met a delegation
[that came from Japan]”). We also marked the boundaries of heavy phrases such as appositives
(e.g., “we decided to go there, [me and my friends]”) or ellipses (e.g., “I was fifteen years old
then, [in the tenth grade]”).

Although syntactic parsing was done primarily based on the speech transcript, it was
double-checked relative to the audio recording in search of cases where the spoken prosody
suggested a different intended parsing than the textual syntactic analysis. For example, the text
“I went home with my friends” could be considered a single clause from a purely textual
perspective. However, in the audio recording, the speaker inserted a pause after the word
“home,” and hence, a listener would likely have identified the word “home” as the final word
in the clause, before the speaker decided to continue it with a prepositional phrase (“with my
friends”). Due to this perceptual consideration, in such cases, we marked both the word
“home” and the word “friends” as the final word in the clause.

The time-stamped transcription and syntactic parsing were used to annotate the speech
stimulus according to the following four word-level features, which were used to analyze
the neural response (described below and summarized in Figure 1):

• Fillers vs. non-fillers: Fillers were identified and time-stamped as part of the transcription
process. The definition of fillers included both filled pauses (e.g., “um,” “uh”) and filler
discourse markers, which are lexical units whose use in the utterance is
not related to their original lexical meaning (e.g., “like,” “well”). A total of 92 fillers were
detected in the speech stimulus.

• Words at syntactic boundaries (opening vs. closing words): Opening and closing words
were identified based on the marking of syntactic clause boundaries described above.
There were a total of 166 opening words and 142 closing words (since some clauses
were syntactically incomplete and did not have a clear closing word).

Figure 1. Summary of speech stimulus used to analyze the neural response. (A) Excerpt from the
speech stimulus used, demonstrating the disfluencies of spontaneous speech. (B) Example of the
four word-level features of spontaneous speech analyzed here. (C) Example of the quantification
of speech rate at the clause level.

• Word length (short vs. long words): The length of each word was evaluated based on the
time-stamped transcription and the median length was used to distinguish between short
and long words (median length: 319 ms).

• Information content (function vs. content words): We also differentiated between words
that carry the most information (i.e., content words, defined as nouns, verbs, adjectives,
and adverbs) and words that mostly play a syntactic role (i.e., function words, defined as
pronouns, auxiliary verbs, prepositions, conjunctions, and determiners). In this data set,
there were 413 content and 250 function words.
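
The word-length and information-content features described above could be derived from a time-stamped transcript roughly as follows. This is only an illustrative sketch in Python, not the authors' pipeline: the record keys (`text`, `duration_ms`, `pos`, `is_filler`), the part-of-speech tag names, the exclusion of fillers from the median, and the choice to label words exactly at the median as short are all assumptions made for the example.

```python
from statistics import median

# Hypothetical part-of-speech tags; the grouping follows the definitions in the text.
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}
FUNCTION_POS = {"pronoun", "auxiliary", "preposition", "conjunction", "determiner"}


def annotate_words(words):
    """Label non-filler words as short/long (median split) and content/function.

    `words` is a list of dicts with the hypothetical keys
    'text', 'duration_ms', 'pos', and 'is_filler'.
    Returns the median duration and one label record per non-filler word.
    """
    # Assumption: fillers are left out before computing the median split.
    lexical = [w for w in words if not w["is_filler"]]
    if not lexical:
        raise ValueError("no lexical words to annotate")

    cutoff_ms = median(w["duration_ms"] for w in lexical)

    annotated = []
    for w in lexical:
        if w["pos"] in CONTENT_POS:
            info = "content"
        elif w["pos"] in FUNCTION_POS:
            info = "function"
        else:
            info = "other"  # tags outside both lists stay unclassified
        annotated.append({
            "text": w["text"],
            # Assumption: words exactly at the median count as short.
            "length": "long" if w["duration_ms"] > cutoff_ms else "short",
            "info": info,
        })
    return cutoff_ms, annotated
```

Durations are kept as integer milliseconds so that the median comparison does not depend on floating-point rounding.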

Analysis of Speech Rate

Potential effects of variability in speech rate cannot be assessed at a single-word level, but
require assessing the rate of speech over longer periods of time. Here we chose to quantify
the mean speech rate within each main clause (including any embedded complement clauses
and restrictive relative clauses), following the rationale that a clause is the basic unit over
which information needs to be integrated during online listening.
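
A clause-level rate of this kind amounts to a count divided by the clause duration. The Python sketch below illustrates one way to compute it, with fillers excluded from the counts. It is an illustration under stated assumptions rather than the authors' code: the clause and word record keys (`onset`, `offset`, `words`, `syllables`, `is_filler`) are hypothetical, and clause length is taken to mean duration in seconds.

```python
def clause_rates(clauses):
    """Return a (word_rate, syllable_rate) pair per clause, in units per second.

    Each clause is a dict with the hypothetical keys 'onset' and 'offset'
    (seconds) and 'words', a list of dicts with 'syllables' (int) and
    'is_filler' (bool).
    """
    rates = []
    for clause in clauses:
        # Assumption: clause "length" is its duration in seconds.
        duration = clause["offset"] - clause["onset"]
        if duration <= 0:
            raise ValueError("clause offset must be later than its onset")

        # Fillers are excluded from both the word and the syllable counts.
        lexical = [w for w in clause["words"] if not w["is_filler"]]
        n_words = len(lexical)
        n_syllables = sum(w["syllables"] for w in lexical)

        rates.append((n_words / duration, n_syllables / duration))
    return rates
```

For instance, a 2 s clause containing four lexical words with eight syllables in total, plus one filler, would yield a word rate of 2 words/s and a syllable rate of 4 syllables/s.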

We tested two metrics for operationalizing speech rate: syllable rate and word rate. These
metrics are highly correlated with each other (r = 0.76, p < 10⁻¹¹ in the current data set) but
emphasize slightly different aspects of information transfer, with syllable rate capturing the
rate of acoustic input and word rate capturing the rate of linguistic input and general fluency.
The word rate of each clause was quantified as the number of words in a clause (not including
fillers) divided by its length. Similarly, the syllable rate was quantified as the number of
syllables in a clause (not including fillers) divided by its length.

EEG Recordings

EEG was recorded using a 64-channel Active-Two system (BioSemi) with Ag-AgCl electrodes, placed
according to the 10–20 system, at a sampling rate of 1024 Hz. Additional external electrodes
were used to record from the mastoids bilaterally, and both vertical and horizontal
electrooculography electrodes were used to monitor eye movements. The experiment was conducted in
a dimly lit, acoustically and electrically shielded booth. Participants were seated on a
comfortable chair and were instructed to keep as still as possible and breathe and blink naturally.
Experiments were programmed and presented to participants using PsychoPy (Open Science
Tools, 2019; Peirce et al., 2019).

EEG Preprocessing and Speech-Tracking Analysis

EEG preprocessing and analysis were performed using the MATLAB-based FieldTrip toolbox
(Oostenveld et al., 2011) as well as custom-written scripts. Raw data were first visually
inspected, and time points with gross artifacts exceeding ±50 μV (that were not eye
movements) were removed. Independent component analysis was performed to identify and
remove components associated with horizontal or vertical eye movements as well as
heartbeats (Onton et al., 2006). Any remaining noisy electrodes that exhibited either extreme
high-frequency activity (>40 Hz) or low-frequency activity/drifts (<1 Hz) were replaced with
the weighted average of their neighbors using an interpolation procedure.
The clean EEG data were filtered between 1 and 10 Hz. The broadband envelope of the speech
was extracted using an equally spaced filterbank between 100 and 10000 Hz based on Liberman’s
cochlear frequency map (Liberman, 1982). The narrowband signals were summed across bands
after taking the absolute value of the Hilbert transform for each one, resulting in a broadband