High-Order Areas and Auditory Cortex Both Represent
the High-Level Event Structure of Music

Jamal A. Williams1, Elizabeth H. Margulis1, Samuel A. Nastase1, Janice Chen2,
Uri Hasson1, Kenneth A. Norman1, and Christopher Baldassano3

Abstract

■ Recent fMRI studies of event segmentation have found that
default mode regions represent high-level event structure during
movie watching. In these regions, neural patterns are relatively
stable during events and shift at event boundaries. Music, like
narratives, contains hierarchical event structure (e.g., sections
are composed of phrases). Here, we tested the hypothesis that
brain activity patterns in default mode regions reflect the high-
level event structure of music. We used fMRI to record brain
activity from 25 participants (male and female) as they listened
to a continuous playlist of 16 musical excerpts and additionally
collected annotations for these excerpts by asking a separate
group of participants to mark when meaningful changes occurred

in each one. We then identified temporal boundaries between
stable patterns of brain activity using a hidden Markov model
and compared the location of the model boundaries to the loca-
tion of the human annotations. We identified multiple brain
regions with significant matches to the observer-identified
boundaries, including auditory cortex, medial prefrontal cortex,
parietal cortex, and angular gyrus. From these results, we con-
clude that both higher-order and sensory areas contain informa-
tion relating to the high-level event structure of music. Furthermore,
the higher-order areas in this study overlap with areas found in
previous studies of event perception in movies and audio narra-
tives, including regions in the default mode network.

INTRODUCTION

Recent work has demonstrated that the brain processes
information using a hierarchy of temporal receptive win-
dows, such that sensory regions represent relatively short
events (e.g., milliseconds to seconds) and higher-order
regions represent longer events (e.g., minutes) while
inheriting some of the lower-level structure from sensory
regions (Baldassano et al., 2017; Chen et al., 2017; Hasson,
Chen, & Honey, 2015). For example, Baldassano et al.
(2017) used a hidden Markov model (HMM) to find tran-
sitions between stable patterns of neural activity in BOLD
data acquired from participants who watched an episode
of the TV series Sherlock. The HMM temporally divides
data into “events” with stable patterns of activity, punctu-
ated by “event boundaries” where activity patterns rapidly
shift to a new stable pattern. They found that, in sensory
regions such as early visual cortex, the data were best-fit by
a model with short-lasting chunks, presumably corre-
sponding to low-level perceptual changes in the episode;
by contrast, when they applied the model to data from
a higher-order area such as posterior medial cortex, the
best-fitting model segmented the data into longer-
lasting chunks corresponding to more semantically
meaningful scene changes. Critically, human annotations

1Princeton University, 2Johns Hopkins University, 3Columbia
University

© 2022 Massachusetts Institute of Technology

of important scene changes most closely resembled the
model-identified boundary structure found in frontal and
posterior medial cortex, which are key hubs in the brain’s
default mode network (DMN; Raichle et al., 2001; Shulman
et al., 1997). Studies have also found that the same event-
specific neural patterns are activated in default-mode
regions by audiovisual movies and by verbal narratives
describing these events (Baldassano, Hasson, & Norman,
2018; Baldassano et al., 2017; Zadbood, Chen, Leong,
Norman, & Hasson, 2017), providing further evidence
that these regions represent the underlying meanings
of the events and not only low-level sensory information.
Jackendoff and Lerdahl (2006) suggest that music and
language are structured into meaningful events that help
people comprehend moments of tension and relaxation
between distant events. If music resembles language in
this way, then the representation of hierarchical event
structure in music (e.g., at the level of phrases, sections,
and entire songs) and in verbal and audiovisual narratives
may be supported by similar neural substrates. Indeed,
some evidence already exists for shared neural resources
for processing music and language (Asano, Boeckx, &
Seifert, 2021; Lee, Jung, & Loui, 2019; Jantzen, Large, &
Magne, 2016; Peretz, Vuvan, Lagrois, & Armony, 2015;
Tillmann, 2012; Koelsch, 2011; Patel, 2011; Fedorenko,
Patel, Casasanto, Winawer, & Gibson, 2009; Tallal & Gaab,
2006; Koelsch et al., 2002). This connection between
music and language is also supported by recent behavioral

studies showing that instrumental music has the capacity
to drive shared narrative engagement across people
(Margulis, Wong, Turnbull, Kubit, & McAuley, 2021;
McAuley, Wong, Mamidipaka, Phillips, & Margulis, 2021;
Margulis, Wong, Simchy-Gross, & McAuley, 2019). In the
current work, we test the hypothesis that DMN regions,
which represent high-level event structure in narratives,
also play a critical role in representing high-level event
structure in music.

In our paradigm, we presented fMRI participants with
examples of complex real-world music belonging to
genres familiar to our participant population: jazz and
classical. A separate group of behavioral participants were
asked to annotate meaningful events within each of the
excerpts. Using a whole-brain searchlight method, we
applied HMMs to measure event structure represented in
cortical response patterns throughout the brain. The goal
of this analysis was to identify brain regions that chunk
the stimuli in a way that matched the human annotations.
By fitting the model at each ROI and then comparing the
observed boundary structure to that of the annotators, we
show that—in a group of passive listeners—regions in the
DMN and also sensory areas are involved in representing
the high-level event structure in music (i.e., these regions
show neural pattern shifts that line up with human annota-
tions of event boundaries). We also show that these event
representations become coarser as they propagate up the
cortical processing hierarchy.

METHODS

Participants

We collected fMRI data from a total of 25 participants
(12 women, ages 21–33 years), which is roughly equal to
the number of participants recruited in recent studies on
event perception for narratives (per esempio., Baldassano et al.,
2018; Chen et al., 2017). We also recruited seven human
annotators for a separate behavioral task (described
below). Thirteen of the fMRI participants were native
English speakers. The experimental protocol was
approved by the institutional review board of Princeton
Università, and all participants gave their written informed
consent.

Stimuli

Sixteen musical excerpts were selected based on the crite-
rion that changes between subsections would likely be
recognized by people without formal music training
(e.g., change from piano solo to drum solo). Excerpts also
had to be instrumental (i.e., lack vocals). Excerpts were
drawn from two different genres (eight classical and eight
jazz). Excerpts were then randomly selected to be trun-
cated (with the introductions kept intact) to one of four
different durations (90, 135, 180, and 225 sec), such that
there were four excerpts of each length. In addition,

two excerpts of each duration were sampled from each
genre. For example, only two classical excerpts had a
duration of 90 sec, and only two jazz excerpts had a
duration of 90 sec. The total duration of the playlist was
approximately 45 min, and there were no breaks between
excerpts.

Experimental Design and Statistical Analysis

The experiment took place over three consecutive days
(Figure 1): On the first 2 days, participants heard a play-
list of 16 musical excerpts (once for each day), and on the
third day, they heard the same playlist for two separate
runs while we recorded changes in their BOLD activity
using fMRI. Altogether, each participant heard the playlist
four times. Each time that a given participant heard the
playlist, the excerpts were presented in a different order.
However, within a given phase of the experiment (e.g.,
the first scanner run on Day 3), the order of excerpts
was kept the same across participants. To promote stable
representations of the music, participants listened to the
playlist on each of the 2 days before scanning. During
these listening sessions, we collected ratings from partic-
ipants about their enjoyment, engagement, and familiar-
ity with each piece (only familiarity ratings are discussed
in this article); these ratings were collected immediately
after hearing each piece. Answers for each rating category
were given on a 5-point Likert scale where 1 = very
unfamiliar and 5 = very familiar. We found an increase
in average familiarity from Day 1 to Day 2, t(22) = 9.04,
P < .0001, indicating that participants remembered the music played in the first prescan session. Two partici- pants were excluded from this analysis because their Day 2 ratings were lost. After each of these listening sessions, participants took a short recognition test where they heard 32 randomly drawn 3-sec clips of a piece that were either from the actual listening session or a lure (i.e., different piece by the same artist) and made a response using a 5-point Likert scale indicating whether they recognized the excerpt as having been presented previously. In addition to the familiarity ratings across the two prescan days, this measure helped us determine if participants had learned the music after each behavioral listening session. Partici- pants showed above-chance discrimination (i.e., higher recognition scores for presented excerpts vs. lures) on both days: Day 1: t(24) = 12.2, p < .0001; Day 2: t(24) = 15.1, p < .0001 (Figure 2). On the third day, participants returned for the scanning session in which they listened to the playlist twice (with excerpts played in a different order for the two scanning runs; as noted above, the order of excerpts within a run was the same across participants). During each run, partic- ipants were asked to perform a white noise detection task. Specifically, during each excerpt, a brief (1 sec) white noise pulse was played at a randomly chosen time point within the middle 60% of each excerpt. The onset of 700 Journal of Cognitive Neuroscience Volume 34, Number 4 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / j / o c n a r t i c e - p d l f / / / / 3 4 4 6 9 9 1 9 9 6 6 9 5 / j o c n _ a _ 0 1 8 1 5 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / j / o c n a r t i c e - p d l f / / / / 3 4 4 6 9 9 1 9 9 6 6 9 5 / j o c n _ a _ 0 1 8 1 5 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Figure 1. Top: Example of a 45-min scanning run, with classical excerpts depicted in pink and jazz excerpts in blue. Each block in the timeline represents an excerpt, and block lengths reflect excerpt durations. Bottom: Overview of experiment. Participants heard the playlist four times (once on each of the 2 days before scanning and twice on the third day while being scanned). The excerpts were presented in a different order each of the four times that a given participant heard the playlist, but—within a given phase of the experiment (e.g., Run 1 on Day 3)—the order of excerpts was kept the same across participants. each noise pulse was also randomized across participants. Participants were told to make a button response to indi- cate that they heard the noise. This manipulation served to keep participants attentive throughout each excerpt. Following both scanning runs, participants took a final recognition test and then completed a brief demo- graphic survey. Event Annotations by Human Observers In a separate behavioral experiment, we asked seven dif- ferent raters (only one rater reported having extensive musical training) to listen to our stimuli one at a time, with the task of pressing a button when a “meaningful” transi- tion occurred within each piece (similar to the method used by Sridharan, Levitin, Chafe, Berger, & Menon, 2007). The number of event boundaries identified by the observers varied across excerpts ranging from 3 to 17 boundaries (with a mean of 7.06 and a standard deviation of 0.91 across excerpts). 
It is worth noting that excerpt durations also varied, with a range of 90–225 sec (durations were 90, 135, 180, or 225 sec) and an average duration of 157.5 sec and a standard deviation of 50.3 sec across excerpts. A time point was considered to be an event boundary when at least five annotators marked a boundary within 3 sec before or after a given time point (method used from Baldassano et al., 2017). The mean number of consensus boundaries across excerpts acquired using this method roughly matched the mean number of boundaries assigned by individual participants across all of the excerpts (with a mean of 7.98 and a standard deviation of 2.98 across excerpts).

Figure 2. Recognition test scores for both prescan days. Plot shows that presented excerpts were given higher recognition scores than lures. The y-axis represents a 5-point Likert scale where 1 = not studied and 5 = studied. Error bars represent SEM.

Scanning Parameters and Preprocessing

Imaging data were acquired on a 3-T full-body scanner (Siemens Prisma) with a 64-channel head coil. Data were collected using a multiband accelerated T2-weighted EPI sequence (release R015) provided by a C2P agreement with University of Minnesota (Cauley, Polimeni, Bhat, Wald, & Setsompop, 2014; Auerbach, Xu, Yacoub, Moeller, & Uğurbil, 2013; Sotiropoulos et al., 2013; Xu et al., 2013; Setsompop et al., 2012; Moeller et al., 2010): 72 interleaved transverse slices, in-plane resolution = 2.0 mm, slice thickness = 2.0 mm with no interslice gap, field of view = 208 mm, base resolution = 104, repetition time (TR) = 1000 msec, echo time (TE) = 37 msec, flip angle (FA) = 60°, phase-encoding (PE) direction = anterior to posterior, multiband acceleration factor = 8. Three spin-echo volume pairs were acquired matching the BOLD EPI slice prescription and resolution in opposing PE directions (anterior to posterior and posterior to anterior) for susceptibility distortion correction: TR/TE = 8000/66.60 msec, FA/refocus FA = 90°/180°, acquisition time = 32 sec (Andersson, Skare, & Ashburner, 2003). Additionally, a whole-brain T1-weighted volume was collected: 3-D magnetization-prepared rapid gradient-echo sequence, 176 sagittal slices, 1.0 mm3 resolution, field of view = 256 mm, base resolution = 256, TR/TE = 2300/2.88 msec, inversion time = 900 msec, FA = 9°, PE direction = anterior to posterior, IPAT mode = GRAPPA 2×, acquisition time = 5 min 20 sec.

The EPI volumes were realigned using a six-parameter rigid-body registration (MCFLIRT; Jenkinson, Bannister, Brady, & Smith, 2002). Given the short effective TR of 1 sec, slice time correction was not performed. Susceptibility-induced distortions were modeled in the opposing spin-echo volume pairs using the FSL topup tool, and the resulting off-resonance field output was provided as input to distortion correct the time series of fMRI data using the FSL applywarp tool (Andersson et al., 2003). The susceptibility distortion correction and realignment were applied in a single interpolation step to minimize blurring. Remaining preprocessing and coregistration steps were performed using FEAT (Woolrich, Behrens, Beckmann, Jenkinson, & Smith, 2004; Woolrich, Ripley, Brady, & Smith, 2001). This included linear detrending, high-pass filtering (330 sec cutoff), and spatial normalization to the MNI152 template released with FSL.

Whole-brain Searchlight Procedure

We conducted our primary analysis using a whole-brain searchlight approach (Figure 3A).
First, all participants' volumetric data were averaged together and divided into overlapping spherical searchlights, each with a radius of 10 voxels and a stride of 5 voxels (Figure 3B). This resulted in 2483 searchlights that spanned the whole cortex in MNI space. Only searchlights containing at least 30 voxels were included in the analysis, and the mean number of voxels per searchlight was 381.76 voxels, with a standard deviation of 168.09 voxels. We assigned the output value for a given searchlight to all voxels within a 5-voxel radius to account for the stride and then averaged the values for voxels where overlap occurred. All analyses below were run separately within each searchlight.

Event Segmentation Analysis

For each searchlight, we fit an HMM (Baldassano et al., 2017) to the timeseries for each excerpt, setting the number of states in the HMM equal to the number of segments specified by our human annotators for each excerpt. Furthermore, although we provide the HMM with a specific number of events, we do not give it any information about where these events are in the data. Therefore, the model is unsupervised in terms of locating the boundaries between events. We used a specialized HMM variant developed by Baldassano et al. (2017) that is optimized for event segmentation (i.e., identifying jumps in neural patterns). This HMM variant seeks to model the fMRI time series as a set of successive transitions between stable states, where—in our variant of the HMM—the model is not permitted to return to a state once it leaves that state. Fitting the model to the data involves estimating the voxel pattern for each stable event state as well as the timing of transitions between these patterns; this HMM variant was implemented using the EventSegment function in BrainIAK (Kumar et al., 2022).

Figure 3. Diagram of analysis pipeline. From left to right: (A) For each participant (n = 25), voxels from an ROI were selected using a searchlight approach; we then extracted song-specific time courses (Voxels × TRs [TR = 1 sec]) from the selected voxels (black circle). Inflated brain image was created using PySurfer (https://github.com/nipy/PySurfer/). (B) Time courses were averaged across participants (aligned in volumetric MNI space). (C) An HMM was used to identify boundary time points, when there was a change in the spatial pattern of activity across voxels. HMM boundaries (white dashed lines) and human annotations (black lines) were considered to match (downward arrows) when HMM boundaries fell within three TRs (3 sec) of a human annotation. Then, true match scores were compared with a null distribution constructed by comparing shuffled HMM boundaries to human annotations, resulting in a z score for each ROI.

For our primary analysis, we were interested in finding brain regions whose transition structure most closely resembled the event boundary structure given by our annotators (Figure 3C). After acquiring boundary estimates from the HMM, we evaluated how closely in time the boundaries found by the model matched the boundaries supplied by our annotators.
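To make this step concrete, the sketch below shows how a group-averaged searchlight time series might be fed to BrainIAK's EventSegment model, how boundary TRs can be read off the fitted state probabilities, and how the resulting boundaries can be scored against the annotations using the tolerance and permutation scheme described in the next paragraph. The helper names and the `bold`/`annot_bounds` inputs are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' released code): fit BrainIAK's
# event-segmentation HMM to one searchlight's group-averaged time series,
# extract boundary TRs, and z-score the match to annotated boundaries
# against a null of randomly placed boundaries.
import numpy as np
from scipy.stats import zscore
from brainiak.eventseg.event import EventSegment

def hmm_boundaries(bold, n_events):
    """bold: (n_TRs, n_voxels) group-averaged data for one excerpt."""
    ev = EventSegment(n_events)
    ev.fit(zscore(bold, axis=0))                  # z-score each voxel over time
    states = np.argmax(ev.segments_[0], axis=1)   # most likely event per TR
    return np.where(np.diff(states) > 0)[0] + 1   # TRs where the event index jumps

def match_score(model_bounds, annot_bounds, tol=3):
    """Number of annotated boundaries with a model boundary within `tol` TRs."""
    return sum(np.min(np.abs(model_bounds - b)) <= tol for b in annot_bounds)

def boundary_match_z(bold, annot_bounds, n_perm=1000, tol=3, seed=0):
    """annot_bounds: annotated boundary TRs (already shifted ~5 TRs for the HRF)."""
    rng = np.random.default_rng(seed)
    n_trs = bold.shape[0]
    model_bounds = hmm_boundaries(bold, n_events=len(annot_bounds) + 1)
    true = match_score(model_bounds, annot_bounds, tol)
    null = np.empty(n_perm)
    for p in range(n_perm):
        fake = np.sort(rng.choice(np.arange(1, n_trs), len(model_bounds), replace=False))
        null[p] = match_score(fake, annot_bounds, tol)
    return (true - null.mean()) / null.std()
```

In this sketch the number of HMM states is set to the number of annotated segments (boundaries + 1), mirroring the fixed-event-count analysis described above.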
To quantify the degree of match, we counted the number of human-annotated boundaries for which there was an HMM boundary within three TRs (3 sec) of that human-annotated boundary. Note that all human boundaries were shifted later by five TRs (5 sec) to account for the hemodynamic lag. We created a null model by randomly selecting time points as bound- aries (keeping the number of events the same, as in Baldassano et al., 2017) and computed the number of matches for these null boundaries, repeating this process 1000 times to produce a null distribution. We computed a z value of the real result versus the null distribution by subtracting the average of the permuted match scores from the true match score and dividing this difference by the standard deviation of the permuted scores. This procedure was repeated at every searchlight. By acquiring z scores at each searchlight for all 32 excerpts (16 distinct excerpts × 2 runs), we obtained 32 separate spatial maps of z scores. Next, we averaged the two z maps corre- sponding to each distinct excerpt (one from each run), resulting in 16 total z maps. To summarize across the z scores for the 16 distinct excerpts, we ran a one-sample t test against zero to see which voxels had the most reli- able matches across all excerpts. The resulting t values were converted to p values and then adjusted for multi- ple tests to control for the false discovery rate (FDR) at a value q (Benjamini & Hochberg, 1995). To visualize the results, each spatial map of t values was displayed on the cortical surface (masked to include only vertices that exhibited a significant effect). Because each analysis was performed in volumetric space, volume data were pro- jected to the cortical surface using the automatic volume to surface rendering algorithm within PySurfer (https:// github.com/nipy/PySurfer/). Controlling for Acoustic Features To further determine whether regions of the DMN repre- sent high-level musical event structure, as opposed to surface-level acoustic information, we repeated the searchlight analysis, this time regressing out musical fea- tures extracted from each auditory stimulus before fitting the HMM. All feature extraction was performed using Librosa (McFee et al., 2015), a Python package developed for audio and music analysis. These features consisted of mel-frequency cepstral components (MFCCs; i.e., timbre information), chromagrams (tonal information), tempo- grams (rhythmic information), and spectrograms. For MFCCs, the top 12 channels were extracted because these lower-order coefficients contain most of the information about the overall spectral shape of the source-filter trans- fer function (Poorjam, 2018). Chromagrams consisted of 12 features, each corresponding to a distinct key in the chromatic scale. Tempograms initially consisted of 383 features, each representing the prevalence of certain tempi (in beats per minute) at each moment in time. Because most of the tempo-related variance was explained by a much smaller set of features, we reduced the 383 fea- tures to 12 features using PCA (variance explained was 99%) to match the number of features used for MFCCs and chromagrams. Spectrograms were extracted using the short-time Fourier transform (STFT) and then con- verted to a decibel-scaled spectrogram. Then, we also used PCA to reduce the dimensionality of the spectro- grams to 12 components, which explained 98% of the frequency-related variance. 
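As an illustration of this feature pipeline, the sketch below extracts the four feature sets with Librosa, reduces the tempogram and spectrogram to 12 principal components each, and shows one simple way of regressing the resulting feature matrix out of a voxel time series (as in the analysis described next). The sample rate, hop length, and least-squares residualization are assumptions made for the example; the article does not specify them.

```python
# Sketch with assumed parameters (roughly one feature frame per second, i.e., per TR).
import numpy as np
import librosa
from sklearn.decomposition import PCA

def excerpt_features(wav_path, sr=22050):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = sr                                                  # ~1-sec frames (assumption)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    tempo = librosa.feature.tempogram(y=y, sr=sr, hop_length=hop)
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(y, hop_length=hop)))
    tempo12 = PCA(n_components=12).fit_transform(tempo.T)     # rhythm -> 12 PCs
    spec12 = PCA(n_components=12).fit_transform(spec.T)       # spectrum -> 12 PCs
    # (n_frames, 48) feature matrix; in practice this must be aligned to the TR grid.
    return np.column_stack([mfcc.T, chroma.T, tempo12, spec12])

def regress_out(bold, features):
    """Residualize bold (n_TRs, n_voxels) against features (n_TRs, n_features)."""
    X = np.column_stack([features, np.ones(len(features))])   # add an intercept
    beta, *_ = np.linalg.lstsq(X, bold, rcond=None)
    return bold - X @ beta
```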
For the final step of this analysis, we applied the HMM to the residuals after the musical features were regressed out of the neural data.

Identifying Preferred Event Timescales

After identifying brain regions with neural event boundaries that matched the human annotations (using the procedures described in the Event Segmentation Analysis section above), we ran a follow-up analysis to further probe the properties of four such regions (bilateral auditory cortex, bilateral angular gyrus, bilateral medial prefrontal cortex [mPFC], and bilateral precuneus). Specifically, the goal of this follow-up analysis was to assess the preferred timescales of these regions. Angular gyrus, mPFC, and precuneus were selected (in addition to auditory cortex) because activity patterns in these regions have been found to exhibit high-level event structure in recent studies using naturalistic stimuli such as movies (Geerligs, van Gerven, Campbell, & Güçlü, 2021; Ben-Yakov & Henson, 2018; Baldassano et al., 2017; Honey et al., 2012) and spoken narratives (Lerner, Honey, Silbert, & Hasson, 2011). In contrast to our primary event segmentation analysis (which used a fixed number of events for each excerpt, matching the number of human-annotated events for that excerpt), here we tried models with different numbers of events and assessed how well the model fit varied as a function of the number of events. The measure of model fit we used was the average pattern similarity between pairs of time point-specific multivoxel patterns falling "within" the same event, minus the average pattern similarity between patterns falling "across" events (Baldassano et al., 2017). We call this measure the "WvA score" (short for "Within vs. Across"); higher WvA scores indicate a better fit of the event boundaries to the data. The ROIs for this analysis were defined by selecting voxels within functionally defined parcellations (Schaefer et al., 2018) corresponding to bilateral auditory cortex, bilateral angular gyrus, bilateral mPFC, and bilateral precuneus and then (for extra precision) intersecting these parcels with voxels that were also significant in our primary searchlight analysis looking for neural boundaries that matched human-annotated boundaries (q < 0.01). For each ROI, we fit HMMs to each song with differing numbers of events ranging from 3 to 45. For each HMM fit, we measured the maximum event duration and then identified all pairs of time points whose temporal distance was less than this duration. The constraint of using time points whose distance was less than the maximum event duration was used so that the number of within- and across-event pairs would be roughly equal (regardless of the number of events). The WvA score was computed as the average spatial pattern correlation for pairs of time points falling in the same (HMM-derived) event minus the average correlation for pairs of time points falling in different events. We then averaged the results across excerpts.
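A compact version of this WvA computation, for one ROI and one excerpt, is sketched below; the inputs (a TR-by-voxel matrix and the HMM's per-TR event labels) and the function name are illustrative rather than taken from the authors' code.

```python
# Sketch of the "within vs. across" (WvA) fit score for one ROI and one excerpt,
# given the HMM-derived event label of each TR.
import numpy as np

def wva_score(bold, event_labels):
    """bold: (n_TRs, n_voxels); event_labels: (n_TRs,) integer event index per TR."""
    corr = np.corrcoef(bold)                       # TR-by-TR spatial pattern correlations
    n = len(event_labels)
    # Only compare time points closer together than the longest event, so that
    # within- and across-event pairs are roughly balanced in number.
    max_len = np.max(np.bincount(event_labels))
    within, across = [], []
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n)):
            if event_labels[i] == event_labels[j]:
                within.append(corr[i, j])
            else:
                across.append(corr[i, j])
    return np.mean(within) - np.mean(across)
```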
Note that, because the excerpts are different lengths, a given number of events might correspond to different average event lengths for different excerpts (e.g., a three-event model applied to a 180-sec excerpt has an average event length of 60 sec, but a three-event model applied to a 90-sec excerpt would have an average event length of 30 sec). Because our goal was to find each area’s preferred event length, we converted our WvA results for each excerpt to be a function of the average event length (in seconds) rather than the number of events and averaged these results across excerpts. Finally, to compute the preferred event length for each ROI, we identified the range of event lengths that were within 5% of the maximum WvA score for that ROI; we report the midpoint of this range as the preferred event length. To test whether the preferred event length in auditory cortex was shorter than that of angular gyrus, precuneus, and mPFC, we performed a bootstrap analysis, repeating the above analysis 1000 times for different bootstrap resamples of the original data set. At each iteration of the bootstrap, we applied the analysis to a sample of par- ticipants drawn randomly with replacement from the orig- inal data. We computed p values by finding the proportion of bootstraps where the preferred length for auditory cor- tex was greater than the preferred length for angular gyrus, precuneus, and mPFC. RESULTS Neural Boundary Match to Behavioral Annotations We wanted to test the hypothesis that behaviorally defined event boundaries could be identified in higher-order corti- cal regions, especially those overlapping with the DMN. For this analysis, we fit an HMM to BOLD data averaged across both runs and then compared the HMM boundaries to the specific time points labeled as boundaries by the annota- tors. We found significant matches between model bound- aries and human annotations in auditory cortex, angular gyrus, precuneus, and mPFC, with a greater number of model boundaries and human boundaries having low tem- poral distance than expected by chance (Figure 4). Results for this analysis are split by Run 1 and Run 2 in Appendix B. Figure 4. Distance to boundary searchlight results. For 2483 searchlights spanning the entire cortex, we tested whether the average match between neural and annotated boundaries across all songs was significantly greater than zero. Significant voxels overlapped with auditory cortex as well as areas of the DMN such as precuneus, mPFC, and angular gyrus. Results are thresholded via FDR (q < 0.01). Influence of Acoustic Features To determine the extent to which the neural event bound- aries were driven by acoustic features, we also performed a version of the searchlight analysis in which we controlled for spectral, timbral, harmonic, and rhythmic information. Overall, this reduced the number of searchlights passing the q < 0.01 FDR threshold (Figure 5) compared with the original searchlight analysis. However, searchlights in DMN regions (precuneus, angular gyrus, and mPFC) did pass the q < 0.01 threshold, with voxels in mPFC being (numerically) least affected by the feature removal. When we set a more liberal FDR threshold (q < 0.05; results shown in Appendix A), the relationship between neural event boundaries and human annotations was still largely conserved in precuneus, angular gyrus, and auditory cor- tex. 
This suggests that, although voxels in precuneus and angular gyrus are more sensitive to acoustic features than mPFC, event boundaries found in these regions do not directly correspond to simple changes in the acoustic features and may instead be related to more complex representations of the event structure (e.g., a nonlinear combination of acoustic features). Notably, significant searchlights in auditory cortex were also observed (particularly in right auditory cortex), indicating that—even in sensory areas—the event boundaries were being driven (at least in part) by more high-level aspects of the music.

Figure 5. Searchlight results accounting for acoustic features. We recomputed the match between HMM-derived neural boundaries and human annotations after regressing out acoustic features from each participant's BOLD data before fitting the HMM. Significant effects were still observed in parts of the DMN as well as auditory cortex, suggesting that boundaries detected in these areas do not necessarily depend on acoustic features. Results are thresholded via FDR (q < 0.01).

Comparing Annotated Event Boundaries to Changes in Acoustic Features

In a follow-up analysis, we sought to further investigate the relationship between the event boundaries and changes in the acoustic features by assessing how often the behaviorally defined event boundaries occurred at the same time as changes in each of the acoustic features. In other words, how often does a change in an acoustic feature generate a human boundary? To estimate the number and locations of state changes within each of the excerpts, we applied the Greedy State Boundary Search (GSBS) event segmentation model (Geerligs, van Gerven, & Güçlü, 2021) to each of the acoustic features (i.e., MFCC, chromagram, tempogram, and spectrogram) extracted from each of the excerpt audio files. One advantage of using the GSBS algorithm for this analysis is that GSBS can automatically identify the optimal number of states that maximizes the difference between within- versus across-event similarity. After acquiring the optimal set of GSBS event boundaries for each excerpt, we compared them to the human annotations by computing the probability that a shift in an acoustic feature generated a matching human annotation (within 3 sec). Additionally, we assessed whether this probability was greater than what would be predicted by chance by establishing a null distribution whereby we shuffled the feature boundaries for each excerpt while preserving the distances between boundaries. We found that feature boundaries did align with human-annotated boundaries more often than in the null model, but that most feature changes did not result in a human-annotated boundary (p(annotation | chroma boundary) = 0.143 vs. null value of 0.115 [p < .001], p(annotation | MFCC boundary) = 0.493 vs. null value of 0.299 [p < .001], p(annotation | tempo boundary) = 0.198 vs. null value of 0.179 [p < .05], p(annotation | spectrogram boundary) = 0.160 vs. null value of 0.128 [p < .001]; illustrated using an example excerpt in Figure 6A).

We also computed the distribution (across human-annotated boundaries) of the number of acoustic feature types that changed within 3 sec of each annotated boundary (e.g., if chroma and tempo both changed, that would be two feature types). We compared this distribution to a null model that we obtained by shuffling the human-annotated boundaries for each excerpt while preserving the distances between boundaries. The results of this analysis are shown in Figure 6B. The fact that the observed distribution is shifted upward relative to the null tells us that the probability of human boundaries coinciding with the acoustic feature changes is higher than would be expected due to chance (χ² = 19.54, p < .001 by permutation test). The figure also shows that, though the majority of human boundaries occurred at points where two or more acoustic feature changes were present, some human boundaries occurred at time points where no acoustic feature changes were present.

Figure 6. (A) Example of acoustic features (from My Favorite Things by John Coltrane) showing overlap between human annotations (red) and feature boundaries (white dashed lines). For each acoustic feature, we identified time points at which changes occurred using the GSBS event segmentation model (white dashed lines). We then compared the locations of these feature boundaries to the locations of the human annotations (red lines); see text for results. (B) Number of acoustic features that change at human-annotated event boundaries. Counting how many acoustic features exhibit a boundary at the same time as a human-annotated boundary (blue) versus a null distribution (orange), we find that the observed distribution is shifted upward relative to the null distribution, such that human-annotated boundaries are more likely to occur in response to two or more feature changes. Furthermore, some human annotations occur in the absence of any feature change.

Preferred Event Lengths across ROIs

How do we reconcile the role of auditory cortex in high-level event representation (as shown in the above analyses) with its well-known role in representing low-level auditory features? Importantly, these claims are not mutually exclusive. Our analyses, which set the number of event states in the model to equal the number of human-annotated boundaries, show that auditory cortex has some (statistically reliable) sensitivity to high-level events, but this does not mean that this is the "only" event information coded in auditory cortex or that it is the "preferred" level of event representation.

We defined the preferred timescale of each region (ROI selection is discussed in the Experimental Design and Statistical Analysis section) by running HMMs with different numbers of event states and finding the average event length (in seconds) that produced the best model fits across songs (Figure 7A). Using a bootstrap analysis, we found that auditory cortex's preferred event length (13.81 sec) was significantly shorter than the preferred event length of mPFC (25.59 sec; p = .009) but was not significantly shorter than the preferred length of angular gyrus (13.36 sec; p = .664) or precuneus (14.61 sec; p = .338). The preferred event length in mPFC was also significantly longer than the preferred event length for precuneus (p = .017) and angular gyrus (p = .004).
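The bootstrap behind these timescale comparisons can be sketched as follows; `preferred_length` is an assumed helper standing in for the full pipeline above (average the resampled participants' data, fit HMMs over a range of event counts, and take the midpoint of the event lengths within 5% of the peak WvA score).

```python
# Sketch: one-sided bootstrap comparison of preferred event lengths for two ROIs,
# resampling participants with replacement (helper functions are assumptions).
import numpy as np

def bootstrap_timescale_p(roi1_subject_data, roi2_subject_data,
                          preferred_length, n_boot=1000, seed=0):
    """roiX_subject_data: list of per-participant data sets for one ROI."""
    rng = np.random.default_rng(seed)
    n_subj = len(roi1_subject_data)
    count = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n_subj, size=n_subj)         # resample participants
        len1 = preferred_length([roi1_subject_data[i] for i in idx])
        len2 = preferred_length([roi2_subject_data[i] for i in idx])
        count += (len1 > len2)       # e.g., auditory cortex longer than a DMN ROI
    return count / n_boot            # proportion of bootstraps = one-sided p value
```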
In addition to varying the timescale (i.e., in the best-fitting number of events), regions could differ in the quality of this fit; some regions may exhibit sharper event transitions, with large pattern changes across HMM event boundaries and highly stable patterns within events. We therefore tested whether the model fit (maximum WvA score) was different between the four ROIs (Figure 7B). We found that the model fit for angular gyrus was significantly greater than auditory cortex (p < .001), precuneus (p < .001), and mPFC (p < .001), indicating that the temporal event structure was strongest in angular gyrus. For analyses of preferred event length and model fit in a more complete set of DMN ROIs and in hippocampus, see Appendices C and D, respectively.

Figure 7. (A) Longer states were preferred in mPFC (average event length 25.59 sec) than in auditory cortex (13.81 sec), precuneus (14.61 sec), and angular gyrus (13.36 sec). The preferred event length did not significantly differ between auditory cortex, precuneus, and angular gyrus. (B) The overall within-event pattern similarity was highest in angular gyrus, suggesting that the stability of musical event representations was higher than in other ROIs. There was no difference in within-event pattern similarity between precuneus and auditory cortex; however, pattern similarity was significantly less in mPFC than in auditory cortex (p < .05). (C) Similarity matrices (for the first 90 sec of the excerpt Capriccio Espagnol by Nikolai Rimsky-Korsakov) showing pattern similarity over time for each ROI, with human-annotated boundaries shown in black. mPFC exhibits the coarsest event structure relative to auditory cortex, precuneus, and angular gyrus.

DISCUSSION

In this study, we sought to determine whether brain areas that have been implicated in representing high-level event structure for narrative-based stimuli, such as movies and spoken narratives, are also involved in representing the high-level event structure of music in a group of passive listeners. We provide evidence that regions of the DMN are involved in representing the event structure of music as characterized by human annotators. The durations of these human-annotated events lasted on the order of a few seconds up to over a minute.

Our results indicate that high-level structure is represented not only in high-level DMN regions but also in auditory cortex. Auditory cortex, however, may not explicitly represent high-level events at the level of human annotators; that is, the behaviorally identified event boundaries are likely a subset of the finer-grained event boundaries encoded in auditory cortex. When we force the HMM to match the number of human-annotated boundaries, the HMM finds them, demonstrating that coding in auditory cortex is modulated by high-level event structure. However, when we remove this constraint and allow the number of events to vary, auditory cortex prefers shorter events on average relative to mPFC but not precuneus and angular gyrus (Figure 7A), whereas mPFC preferred the longest events compared with the other three ROIs.
The finding that the preferred event length of auditory cor- tex was not significantly different from that of precuneus and angular gyrus was surprising given the prediction that auditory cortex, which is generally thought to respond to fast-changing aspects of a stimulus, would represent shorter events than higher-order brain areas (Baldassano et al., 2017; Farbood, Heeger, Marcus, Hasson, & Lerner, 2015; Lerner et al., 2011; Hasson, Yang, Vallines, Heeger, & Rubin, 2008); we discuss this point further in the limita- tions section below. In addition to measuring each area’s preferred timescale, we also measured within-event stabil- ity across the four ROIs; here, we found that angular gyrus exhibits the strongest within-event activity relative to pre- cuneus, mPFC, and auditory cortex. Next, we showed that—when we regress out acoustic features corresponding to timbre, harmony, rhythm, and frequency amplitude and rerun the analysis—voxels in higher-order areas (mPFC, angular gyrus, and precuneus), as well as auditory cortex, still significantly match with the annotations. These results suggest that event boundaries in these regions are not purely driven by acoustic changes in the music but are also tracking more complex event structure in musical pieces. These findings are consistent with findings from Abrams et al. (2013), who found that naturalistic music elicited reliable synchronization in auditory cortex as well as higher-order cortical areas after controlling for acoustic features; they concluded that this synchronization was not purely driven by low-level acous- tical cues and that it was likely driven by structural ele- ments of the music that occurred over long timescales. To further determine how much event boundaries were driven by changes in acoustic features, we ran a follow-up analysis where we first identified event transitions in each of the acoustic features corresponding to timbre, tonality, rhythm, and frequency amplitudes for each excerpt using an unsupervised algorithm (GSBS); then, we computed the probability that a human annotation was generated by changes in each of the different types of acoustic fea- tures. We found that the probability of human-annotated boundaries coinciding with acoustic feature changes was higher than the rate expected because of chance, but the relationship was complex: Although changes in each of the individual acoustic feature types were significantly related to the occurrence of annotated boundaries, none of these features came close to fully predicting the anno- tated boundaries, and although the majority of annotated boundaries occurred at time points where two or more acoustic features changed, some annotated boundaries did not correspond to changes in any of the acoustic features that we tracked. This adds further support to the possibility that boundaries marking the shift between large-scale segments within the DMN and auditory areas could be driven by a complex shift in a combination of the acoustic properties and/or possibly emotional (Daly et al., 2015) or narrative (Margulis et al., 2019, 2021; McAuley et al., 2021) changes within the excerpts, rather than a change in a single feature. 
Importantly, our findings of high-level coding in auditory cortex converge with other recent work demonstrating that hierarchical neural representations of music are distributed across primary and nonprimary auditory cortex (Landemard et al., 2021) and that higher-order representations of music in these areas may even support complex behaviors such as genre recognition in humans (Kell, Yamins, Shook, Norman-Haignere, & McDermott, 2018). Our study contributes to this growing literature by showing that auditory cortex codes for musical event representations at intermediate timescales (∼14 sec). Note also that auditory cortex coding for these intermediate-scale events is not mutually exclusive with it "also" coding for shorter-timescale events. When discussing limitations of our study below (limitation point number 4), we provide some reasons why our design was not ideal for detecting neural coding of short-timescale events.

In our study, we provide strong evidence for the involvement of mPFC in representing high-level musical event structure. Recent fMRI studies of naturalistic stimulus processing (i.e., audiovisual movies) have shown that mPFC may perform event segmentation and integration during continuous memory formation (Antony et al., 2021; Liu, Shi, Cousins, Kohn, & Fernández, 2021) and that events in this region can last up to hundreds of seconds (Geerligs et al., 2021; Chen et al., 2017; Hasson et al., 2015). We also show that the preferred event length in mPFC was ∼25 sec (which was roughly equal to the preferred timescale found for mPFC in the study by Geerligs et al., 2021, in which a movie was used rather than music), adding further support to the hypothesis that mPFC plays an important role in representing long-timescale information in naturalistic stimuli. Furthermore, our findings go beyond the assumption that areas of the DMN only represent long-timescale information for narrative-based stimuli and instead suggest that areas of the DMN represent long-timescale information across a range of naturalistic stimuli, including music. The recruitment of mPFC during music processing has also been found in a previous study (Blood & Zatorre, 2001). Specifically, Blood and Zatorre showed that activity in vmPFC was correlated with pleasure response ratings to music, suggesting that frontal areas, which represent long-timescale event structure for music, may also play a role in processing reward and affect in response to music.

Our findings that precuneus, mPFC, and angular gyrus were involved in representing high-level musical event structure contrast with those in Farbood et al. (2015), who found that regions that responded reliably to stories did not respond reliably to music. Furthermore, in their study, there was minimal overlap between voxels in angular gyrus and mPFC that responded to stories and voxels that responded to music. In our study, we show that, at a regional level, these areas are indeed involved in representing the high-level event structure in music. One major way in which our studies differed was our use of an HMM to detect evidence of musical event structure in higher-order areas.
The HMM is optimized to detect periods of relative stability punctuated by shifts in response patterns, which one would expect for an area encoding high-level event structure (i.e., there should be stability within events and changes across events). Temporal intersubject corre- lation analysis (the analysis method used in the study by Farbood et al., 2015) is designed to pick up on “any” kind of reliable temporal structure and is not specifically designed to detect the “stability punctuated by shifts” structure that we associate with event cognition, making it less sensitive to this kind of structure when it is present. This highlights one of the advantages of using HMMs for detecting meaningful brain activity related to the temporal dynamics of naturalistic stimuli, such as music. Our study had several limitations: 1. In our feature regression analysis, the acoustic fea- tures we selected may not represent the full range of acoustic dynamics occurring throughout each excerpt. Previous studies using encoding models to examine brain activity evoked by music employed a range of acoustic features, such as the modulation transfer function (Norman-Haignere, Kanwisher, & McDermott, 2015; Patil, Pressnitzer, Shamma, & Elhilali, 2012) as well as music-related models repre- senting mode, roughness, root mean square energy, and pulse clarity (Nakai, Koide-Majima, & Nishimoto, 2021; Toiviainen, Alluri, Brattico, Wallentin, & Vuust, 2014; Alluri et al., 2012). However, the types of infor- mation captured by these features are also roughly captured by the features used in this study. For exam- ple, features representing roughness and root mean square capture timbral information, whereas pulse clarity captures rhythmic information. On the other hand, although these features capture some informa- tion related to the ones used in this study, they may nonetheless still be useful for capturing additional information not fully captured by our features. Future work is needed to determine how higher-order areas are affected by a larger set of acoustic features. 2. Another caveat is that we only scanned participants listening to prefamiliarized musical stimuli—as such, it is unclear whether the observed pattern of DMN results (showing engagement of these regions in long-timescale segmentation) would extend to unfa- miliar musical stimuli. Consistent with this view, the work by Castro et al. (2020) showed that familiar music engaged DMN more strongly than unfamiliar music. However, a study by Taruffi, Pehrs, Skouras, and Koelsch (2017) showed that DMN was engaged for unfamiliar music, particularly for sad music com- pared with happy music. Future work investigating high-level musical event structure representation can address this by scanning participants while they listen to both unfamiliar and familiar stimuli. 3. The white noise detection task that participants per- formed may have influenced DMN responding. The DMN has been shown to activate during mind- wandering or stimulus-independent thought (Mason et al., 2007). Because the white noise was spectrally distinct from the music, participants could conceiv- ably perform the white noise detection task without attending to the music, leaving room for them to mind-wander in between white noise bursts; conse- quently, some of the DMN responding could (in prin- ciple) have been driven by mind-wandering instead of music listening. 
However, stimulus-independent mind-wandering cannot explain our key finding that neural event boundaries in DMN regions align with the annotated event boundaries—this result clearly demonstrates that these DMN areas are tracking structural aspects of the music.

4. It is possible that our estimates of preferred event length for different ROIs were biased by the range of event lengths present in our stimulus set. In particular, a lack of short (vs. long) events may have resulted in an upward bias in our estimates of preferred event length. This bias, however, cannot explain the relative differences that we observed between ROIs' preferred timescales, such as mPFC preferring longer events than auditory cortex, precuneus, and angular gyrus. However, the relative scarcity of short events may have impaired our ability to resolve timescale differences between regions at the short end of the timescale continuum; in particular, this might help to explain why we did not observe significant differences in preferred timescales between primary auditory cortex (which, based on prior work, we expected to have a short timescale preference) and DMN regions. Future work can shed light on this by using stimuli with a broader range of event lengths. However, even if we include stimuli with shorter events, our ability to detect these more rapid event transitions may be inherently limited by the slow speed of the fMRI BOLD response.

Conclusion

In this study, we sought to determine whether certain regions in the DMN, which have been shown to be involved in representing the high-level event structure in narratives, were also involved in representing the high-level event structure in real-world music. Recent fMRI work, not using music, has shown that HMMs can help us understand how the brain represents large-scale event structure. By using HMMs to segment fMRI response patterns over time according to the event structure provided by a separate group of human annotators, we found that areas of the DMN were indeed involved in representing the high-level event structure (e.g., phrases, sections) in music in a group of passive listeners. Of particular importance are the findings that mPFC has a chunking response that is close to that of human observers and survives the boundary alignment searchlight analysis even after controlling for acoustic features. This suggests that mPFC plays an important role in high-level event representation not only for movies and stories (Geerligs et al., 2021; Baldassano et al., 2017; Chen et al., 2017; Hasson et al., 2015; Lerner et al., 2011) but also for instrumental music.

APPENDIX A

Figure A1. Distance to boundary regression results at q < 0.05. Plots show distance to boundary regression results in which we regress out MFCCs, chromagrams, tempograms, and spectrograms. Results are FDR corrected at q < 0.05.
These results show that, although many voxels in the DMN are not significant at the q < 0.01 threshold (Figure 5), many DMN voxels do survive when we threshold the regression results at q < 0.05. This suggests that, though many voxels in the DMN are somewhat sensitive to acoustic features (because many of these voxels do not survive at q < 0.01 in the nonregression distance to boundary results), activity in these areas is not solely driven by low-level acoustic features.

APPENDIX B

Figure B1. (A) Distance to boundary searchlight Run 1 and Run 2. Searchlight maps for each run separately showing regions where significant matches between human annotations and HMM boundaries were observed (FDR corrected q < 0.01). (B) Distance to boundary regression searchlight Run 1 and Run 2. Searchlight maps for each run separately showing regions where significant matches between human annotations and HMM boundaries were observed after regressing out acoustic features (FDR corrected q < 0.01).

APPENDIX C

Figure C1. (A) Preferred event lengths across finer set of DMN and auditory ROIs. We sought to further determine the set of event lengths preferred within each ROI using a finer set of parcellations (Schaefer 300 as opposed to Schaefer 100). We attempted to threshold this image by only including ROIs with significant model fits (determined via bootstrapping). Nothing survived our threshold criteria; therefore, we are reporting unthresholded results. Subregions of DMN preferred a variety of event lengths, which was not obvious when using a coarser set of parcellations. For example, although mPFC obtained from the Schaefer 100 parcellation set shows a preference for the longest event lengths (∼25 sec), when evaluating this for a finer set of mPFC ROIs (Schaefer 300), we can see that mPFC subregions prefer a variety of event lengths ranging from 6 to 40 sec. (B) Model fits also vary greatly for the same set of DMN and auditory ROIs.

APPENDIX D

Figure D1. (A) Anterior hippocampus preferred event length did not significantly differ from auditory cortex, precuneus, mPFC, angular gyrus, or posterior hippocampus. Posterior hippocampus preferred event length did not significantly differ from auditory cortex, precuneus, angular gyrus, or anterior hippocampus, but was significantly less than mPFC (p < .05). (B) Our measure of model fit (i.e., the difference between within-event and across-event pattern similarity) was significantly lower in hippocampal ROIs than in other DMN ROIs (auditory cortex, p < .001; precuneus, p < .001; mPFC, p < .001; angular gyrus, p < .001), whereas model fit in posterior hippocampus was greater than in anterior hippocampus (p < .05).

Acknowledgments

We thank Mark A.
Acknowledgments

We thank Mark A. Pinsk for contributing to the Scanning Parameters and Preprocessing section of the article, Benson Deverett for helping with the stimulus presentation script in Python, Elizabeth McDevitt for suggestions on the figures, Sara Chuang for helping with stimulus selection, and the members of the Hasson, Pillow, and Norman labs for their comments and support.

Reprint requests should be sent to Jamal A. Williams, Princeton Neuroscience Institute and Department of Psychology, Princeton University, Princeton, NJ 08544, or via e-mail: jamalawilliams@gmail.com.

Author Contributions

Jamal A. Williams: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Validation, Visualization, Writing—Original Draft, Writing—Review & Editing. Elizabeth H. Margulis: Supervision, Writing—Original Draft, Writing—Review & Editing. Samuel A. Nastase: Writing—Original Draft, Writing—Review & Editing. Janice Chen: Conceptualization, Methodology, Supervision, Writing—Review & Editing. Uri Hasson: Conceptualization, Funding acquisition, Methodology, Supervision. Kenneth A. Norman: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing—Original Draft, Writing—Review & Editing. Christopher Baldassano: Conceptualization, Formal analysis, Methodology, Project administration, Software, Supervision, Writing—Original Draft, Writing—Review & Editing.

Funding Information

This work was supported by the National Institute of Mental Health (https://dx.doi.org/10.13039/100000025), grant number R01 MH112357-01, to U. H. and K. A. N., and the National Institute of Neurological Disorders and Stroke (https://dx.doi.org/10.13039/100000065), grant number F99 NS118740-01, to J. W.

Data Availability

The fMRI data used in this study have been publicly released on OpenNeuro (https://openneuro.org/datasets/ds004007/versions/1.0.2).

Diversity in Citation Practices

Retrospective analysis of the citations in every article published in this journal from 2010 to 2021 reveals a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .407, W(oman)/M = .32, M/W = .115, and W/W = .159, the comparable proportions for the articles that these authorship teams cited were M/M = .549, W/M = .257, M/W = .109, and W/W = .085 (Postle and Fulvio, JoCN, 34:1, pp. 1–3). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance. The authors of this article report its proportions of citations by gender category to be as follows: M/M = .612; W/M = .224; M/W = .122; W/W = .041.

REFERENCES

Abrams, D. A., Ryali, S., Chen, T., Chordia, P., Khouzam, A., Levitin, D. J., et al. (2013). Inter-subject synchronization of brain responses during natural music listening. European Journal of Neuroscience, 37, 1458–1469. https://doi.org/10.1111/ejn.12173, PubMed: 23578016

Alluri, V., Toiviainen, P., Jääskeläinen, I. P., Glerean, E., Sams, M., & Brattico, E. (2012). Large-scale brain networks emerge from dynamic processing of musical timbre, key and rhythm. Neuroimage, 59, 3677–3689. https://doi.org/10.1016/j.neuroimage.2011.11.019, PubMed: 22116038
Andersson, J. L. R., Skare, S., & Ashburner, J. (2003). How to correct susceptibility distortions in spin-echo echo-planar images: Application to diffusion tensor imaging. Neuroimage, 20, 870–888. https://doi.org/10.1016/S1053-8119(03)00336-7, PubMed: 14568458

Antony, J. W., Hartshorne, T. H., Pomeroy, K., Gureckis, T. M., Hasson, U., McDougle, S. D., et al. (2021). Behavioral, physiological, and neural signatures of surprise during naturalistic sports viewing. Neuron, 109, 377–390. https://doi.org/10.1016/j.neuron.2020.10.029, PubMed: 33242421

Asano, R., Boeckx, C., & Seifert, U. (2021). Hierarchical control as a shared neurocognitive mechanism for language and music. Cognition, 216, 104847. https://doi.org/10.1016/j.cognition.2021.104847, PubMed: 34311153

Auerbach, E. J., Xu, J., Yacoub, E., Moeller, S., & Uğurbil, K. (2013). Multiband accelerated spin-echo echo planar imaging with reduced peak RF power using time-shifted RF pulses. Magnetic Resonance in Medicine, 69, 1261–1267. https://doi.org/10.1002/mrm.24719, PubMed: 23468087

Baldassano, C., Chen, J., Zadbood, A., Pillow, J. W., Hasson, U., & Norman, K. A. (2017). Discovering event structure in continuous narrative perception and memory. Neuron, 95, 709–721. https://doi.org/10.1016/j.neuron.2017.06.041, PubMed: 28772125

Baldassano, C., Hasson, U., & Norman, K. A. (2018). Representation of real-world event schemas during narrative perception. Journal of Neuroscience, 38, 9689–9699. https://doi.org/10.1523/JNEUROSCI.0251-18.2018, PubMed: 30249790

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B: Methodological, 57, 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

Ben-Yakov, A., & Henson, R. N. (2018). The hippocampal film editor: Sensitivity and specificity to event boundaries in continuous experience. Journal of Neuroscience, 38, 10057–10068. https://doi.org/10.1523/JNEUROSCI.0524-18.2018, PubMed: 30301758

Blood, A. J., & Zatorre, R. J. (2001). Intensely pleasurable responses to music correlate with activity in brain regions implicated in reward and emotion. Proceedings of the National Academy of Sciences, U.S.A., 98, 11818–11823. https://doi.org/10.1073/pnas.191355898, PubMed: 11573015

Castro, M., L'héritier, F., Plailly, J., Saive, A.-L., Corneyllie, A., Tillmann, B., et al. (2020). Personal familiarity of music and its cerebral effect on subsequent speech processing. Scientific Reports, 10, 14854. https://doi.org/10.1038/s41598-020-71855-5, PubMed: 32908227

Cauley, S. F., Polimeni, J. R., Bhat, H., Wald, L. L., & Setsompop, K. (2014). Interslice leakage artifact reduction technique for simultaneous multislice acquisitions. Magnetic Resonance in Medicine, 72, 93–102. https://doi.org/10.1002/mrm.24898, PubMed: 23963964

Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature Neuroscience, 20, 115–125. https://doi.org/10.1038/nn.4450, PubMed: 27918531

Daly, I., Williams, D., Hallowell, J., Hwang, F., Kirke, A., Malik, A., et al. (2015). Music-induced emotions can be predicted from a combination of brain activity and acoustic features. Brain and Cognition, 101, 1–11. https://doi.org/10.1016/j.bandc.2015.08.003, PubMed: 26544602
Farbood, M. M., Heeger, D. J., Marcus, G., Hasson, U., & Lerner, Y. (2015). The neural processing of hierarchical structure in music and speech at different timescales. Frontiers in Neuroscience, 9, 157. https://doi.org/10.3389/fnins.2015.00157, PubMed: 26029037

Fedorenko, E., Patel, A., Casasanto, D., Winawer, J., & Gibson, E. (2009). Structural integration in language and music: Evidence for a shared system. Memory & Cognition, 37, 1–9. https://doi.org/10.3758/MC.37.1.1, PubMed: 19103970

Geerligs, L., van Gerven, M., Campbell, K. L., & Güçlü, U. (2021). A nested cortical hierarchy of neural states underlies event segmentation in the human brain. bioRxiv, 2021.02.05.429165. https://doi.org/10.1101/2021.02.05.429165

Geerligs, L., van Gerven, M., & Güçlü, U. (2021). Detecting neural state transitions underlying event segmentation. Neuroimage, 236, 118085. https://doi.org/10.1016/j.neuroimage.2021.118085, PubMed: 33882350

Hasson, U., Chen, J., & Honey, C. J. (2015). Hierarchical process memory: Memory as an integral component of information processing. Trends in Cognitive Sciences, 19, 304–313. https://doi.org/10.1016/j.tics.2015.04.006, PubMed: 25980649

Hasson, U., Yang, E., Vallines, I., Heeger, D. J., & Rubin, N. (2008). A hierarchy of temporal receptive windows in human cortex. Journal of Neuroscience, 28, 2539–2550. https://doi.org/10.1523/jneurosci.5487-07.2008, PubMed: 18322098

Honey, C. J., Thesen, T., Donner, T. H., Silbert, L. J., Carlson, C. E., Devinsky, O., et al. (2012). Slow cortical dynamics and the accumulation of information over long timescales. Neuron, 76, 668. https://doi.org/10.1016/j.neuron.2012.10.024

Jackendoff, R., & Lerdahl, F. (2006). The capacity for music: What is it, and what's special about it? Cognition, 100, 33–72. https://doi.org/10.1016/j.cognition.2005.11.005, PubMed: 16384553

Jantzen, M. G., Large, E. W., & Magne, C. (2016). Overlap of neural systems for processing language and music. Frontiers in Psychology, 7, 876. https://doi.org/10.3389/fpsyg.2016.00876, PubMed: 27378976

Jenkinson, M., Bannister, P., Brady, M., & Smith, S. (2002). Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage, 17, 825–841. https://doi.org/10.1006/nimg.2002.1132, PubMed: 12377157

Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 630–644. https://doi.org/10.1016/j.neuron.2018.03.044, PubMed: 29681533

Koelsch, S. (2011). Toward a neural basis of music perception—A review and updated model. Frontiers in Psychology, 2, 110. https://doi.org/10.3389/fpsyg.2011.00110, PubMed: 21713060

Koelsch, S., Gunter, T. C., v. Cramon, D. Y., Zysset, S., Lohmann, G., & Friederici, A. D. (2002). Bach speaks: A cortical "language-network" serves the processing of music. Neuroimage, 17, 956–966. https://doi.org/10.1006/nimg.2002.1154, PubMed: 12377169

Kumar, M., Anderson, M. J., Antony, J. W., Baldassano, C., Brooks, P. P., Cai, M. B., et al. (2022). BrainIAK: The brain imaging analysis kit. Aperture Neuro, 1. https://doi.org/10.52294/31bb5b68-2184-411b-8c00-a1dacb61e1da

Landemard, A., Bimbard, C., Demené, C., Shamma, S., Norman-Haignere, S., & Boubenec, Y. (2021). Distinct higher-order representations of natural sounds in human and ferret auditory cortex. eLife, 10, e65566. https://doi.org/10.7554/eLife.65566, PubMed: 34792467
Lee, D. J., Jung, H., & Loui, P. (2019). Attention modulates electrophysiological responses to simultaneous music and language syntax processing. Brain Sciences, 9, 305. https://doi.org/10.3390/brainsci9110305, PubMed: 31683961

Lerner, Y., Honey, C. J., Silbert, L. J., & Hasson, U. (2011). Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. Journal of Neuroscience, 31, 2906–2915. https://doi.org/10.1523/jneurosci.3684-10.2011, PubMed: 21414912

Liu, W., Shi, Y., Cousins, J. N., Kohn, N., & Fernández, G. (2021). Hippocampal-medial prefrontal event segmentation and integration contribute to episodic memory formation. Cerebral Cortex. https://doi.org/10.1101/2020.03.14.990002, PubMed: 34398213

Margulis, E. H., Wong, P. C. M., Simchy-Gross, R., & McAuley, J. D. (2019). What the music said: Narrative listening across cultures. Palgrave Communications, 5, 146. https://doi.org/10.1057/s41599-019-0363-1

Margulis, E. H., Wong, P. C. M., Turnbull, C., Kubit, B. M., & McAuley, J. D. (2021). Narratives imagined in response to music reveal culture-bounded intersubjectivity. Proceedings of the National Academy of Sciences, U.S.A., 119, e2110406119.

Mason, M. F., Norton, M. I., Van Horn, J. D., Wegner, D. M., Grafton, S. T., & Macrae, C. N. (2007). Wandering minds: The default network and stimulus-independent thought. Science, 315, 393–395. https://doi.org/10.1126/science.1131295, PubMed: 17234951

McAuley, J. D., Wong, P. C. M., Mamidipaka, A., Phillips, N., & Margulis, E. H. (2021). Do you hear what I hear? Perceived narrative constitutes a semantic dimension for music. Cognition, 212, 104712. https://doi.org/10.1016/j.cognition.2021.104712, PubMed: 33848700

McFee, B., Raffel, C., Liang, D., Ellis, D., McVicar, M., Battenberg, E., et al. (2015). librosa: Audio and music signal analysis in Python. Proceedings of the 14th Python in Science Conference. https://doi.org/10.25080/majora-7b98e3ed-003

Moeller, S., Yacoub, E., Olman, C. A., Auerbach, E., Strupp, J., Harel, N., et al. (2010). Multiband multislice GE-EPI at 7 tesla, with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magnetic Resonance in Medicine, 63, 1144–1153. https://doi.org/10.1002/mrm.22361, PubMed: 20432285

Nakai, T., Koide-Majima, N., & Nishimoto, S. (2021). Correspondence of categorical and feature-based representations of music in the human brain. Brain and Behavior, 11, e01936. https://doi.org/10.1002/brb3.1936, PubMed: 33164348

Norman-Haignere, S., Kanwisher, N. G., & McDermott, J. H. (2015). Distinct cortical pathways for music and speech revealed by hypothesis-free voxel decomposition. Neuron, 88, 1281–1296. https://doi.org/10.1016/j.neuron.2015.11.035, PubMed: 26687225

Patel, A. D. (2011). Why would musical training benefit the neural encoding of speech? The OPERA hypothesis. Frontiers in Psychology, 2, 142. https://doi.org/10.3389/fpsyg.2011.00142, PubMed: 21747773

Patil, K., Pressnitzer, D., Shamma, S., & Elhilali, M. (2012). Music in our ears: The biological bases of musical timbre perception. PLoS Computational Biology, 8, e1002759. https://doi.org/10.1371/journal.pcbi.1002759, PubMed: 23133363
Peretz, I., Vuvan, D., Lagrois, M.-E., & Armony, J. L. (2015). Neural overlap in processing music and speech. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 370, 20140090. https://doi.org/10.1098/rstb.2014.0090, PubMed: 25646513

Poorjam, A. H. (2018). Re: Why we take only 12–13 MFCC coefficients in feature extraction? Retrieved from https://www.researchgate.net/post/Why_we_take_only_12-13_MFCC_coefficients_in_feature_extraction/5b0fd2b7cbdfd4b7b60e9431/citation/download

Raichle, M. E., MacLeod, A. M., Snyder, A. Z., Powers, W. J., Gusnard, D. A., & Shulman, G. L. (2001). A default mode of brain function. Proceedings of the National Academy of Sciences, U.S.A., 98, 676–682. https://doi.org/10.1073/pnas.98.2.676, PubMed: 11209064

Schaefer, A., Kong, R., Gordon, E. M., Laumann, T. O., Zuo, X.-N., Holmes, A. J., et al. (2018). Local–global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cerebral Cortex, 28, 3095–3114. https://doi.org/10.1093/cercor/bhx179, PubMed: 28981612

Setsompop, K., Gagoski, B. A., Polimeni, J. R., Witzel, T., Wedeen, V. J., & Wald, L. L. (2012). Blipped-controlled aliasing in parallel imaging for simultaneous multislice echo planar imaging with reduced g-factor penalty. Magnetic Resonance in Medicine, 67, 1210–1224. https://doi.org/10.1002/mrm.23097, PubMed: 21858868

Shulman, G. L., Fiez, J. A., Corbetta, M., Buckner, R. L., Miezin, F. M., Raichle, M. E., et al. (1997). Common blood flow changes across visual tasks: II. Decreases in cerebral cortex. Journal of Cognitive Neuroscience, 9, 648–663. https://doi.org/10.1162/jocn.1997.9.5.648, PubMed: 23965122

Sotiropoulos, S. N., Moeller, S., Jbabdi, S., Xu, J., Andersson, J. L., Auerbach, E. J., et al. (2013). Effects of image reconstruction on fiber orientation mapping from multichannel diffusion MRI: Reducing the noise floor using SENSE. Magnetic Resonance in Medicine, 70, 1682–1689. https://doi.org/10.1002/mrm.24623, PubMed: 23401137

Sridharan, D., Levitin, D. J., Chafe, C. H., Berger, J., & Menon, V. (2007). Neural dynamics of event segmentation in music: Converging evidence for dissociable ventral and dorsal networks. Neuron, 55, 521–532. https://doi.org/10.1016/j.neuron.2007.07.003, PubMed: 17678862

Tallal, P., & Gaab, N. (2006). Dynamic auditory processing, musical experience and language development. Trends in Neurosciences, 29, 382–390. https://doi.org/10.1016/j.tins.2006.06.003, PubMed: 16806512

Taruffi, L., Pehrs, C., Skouras, S., & Koelsch, S. (2017). Effects of sad and happy music on mind-wandering and the default mode network. Scientific Reports, 7, 14396. https://doi.org/10.1038/s41598-017-14849-0, PubMed: 29089542

Tillmann, B. (2012). Music and language perception: Expectations, structural integration, and cognitive sequencing. Topics in Cognitive Science, 4, 568–584. https://doi.org/10.1111/j.1756-8765.2012.01209.x, PubMed: 22760955

Toiviainen, P., Alluri, V., Brattico, E., Wallentin, M., & Vuust, P. (2014). Capturing the musical brain with Lasso: Dynamic decoding of musical features from fMRI data. Neuroimage, 88, 170–180. https://doi.org/10.1016/j.neuroimage.2013.11.017, PubMed: 24269803

Woolrich, M. W., Behrens, T. E. J., Beckmann, C. F., Jenkinson, M., & Smith, S. M. (2004). Multilevel linear modelling for fMRI group analysis using Bayesian inference. Neuroimage, 21, 1732–1747. https://doi.org/10.1016/j.neuroimage.2003.12.023, PubMed: 15050594
Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal autocorrelation in univariate linear modeling of fMRI data. Neuroimage, 14, 1370–1386. https://doi.org/10.1006/nimg.2001.0931, PubMed: 11707093

Xu, J., Moeller, S., Auerbach, E. J., Strupp, J., Smith, S. M., Feinberg, D. A., et al. (2013). Evaluation of slice accelerations using multiband echo planar imaging at 3T. Neuroimage, 83, 991–1001. https://doi.org/10.1016/j.neuroimage.2013.07.055, PubMed: 23899722

Zadbood, A., Chen, J., Leong, Y. C., Norman, K. A., & Hasson, U. (2017). How we transmit memories to other brains: Constructing shared neural representations via communication. Cerebral Cortex, 27, 4988–5000. https://doi.org/10.1093/cercor/bhx202, PubMed: 28922834