SPECIAL ISSUE:
Cognitive Computational Neuroscience of Language
Computational Language Modeling and the
Promise of in Silico Experimentation
Shailee Jain1, Vy A. Vo3, Leila Wehbe4,5, and Alexander G. Huth1,2
1Department of Computer Science, University of Texas at Austin, Austin, TX, USA
2Department of Neuroscience, University of Texas at Austin, Austin, TX, USA
3Brain-Inspired Computing Lab, Intel Labs, Hillsboro, OR, USA
4Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
5Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Keywords: computational neuroscience, deep learning, encoding models, experimental design, natural language processing, naturalistic stimuli

Citation: Jain, S., Vo, V. A., Wehbe, L., & Huth, A. G. (2023). Computational language modeling and the promise of in silico experimentation. Neurobiology of Language. Advance publication. https://doi.org/10.1162/nol_a_00101

Received: 28 February 2022; Accepted: 18 January 2023

Competing Interests: The authors have declared that no competing interests exist.

Corresponding Author: Alexander G. Huth (huth@cs.utexas.edu)

Handling Editor: Alessandro Lopopolo

Copyright: © 2023 Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. The MIT Press.
ABSTRACT
Language neuroscience currently relies on two major experimental paradigms: controlled
experiments using carefully hand-designed stimuli, and natural stimulus experiments. These
approaches have complementary advantages which allow them to address distinct aspects of
the neurobiology of language, but each approach also comes with drawbacks. Here we
discuss a third paradigm—in silico experimentation using deep learning-based encoding
models—that has been enabled by recent advances in cognitive computational neuroscience.
This paradigm promises to combine the interpretability of controlled experiments with the
generalizability and broad scope of natural stimulus experiments. We show four examples of
simulating language neuroscience experiments in silico and then discuss both the advantages
and caveats of this approach.
INTRODUCTION
One major goal of language neuroscience is to characterize the function of different brain
regions and networks that are engaged in language processing. A large body of work has inves-
tigated different aspects of language processing—such as semantic knowledge representation
(Binder et al., 2009; Huth et al., 2016; Mitchell et al., 2008), syntactic processing (Friederici
et al., 2000), and phonological mapping (Chang et al., 2010)—and characterized properties of the language network, such as its processing timescales (Lerner et al., 2011), convergence with different sensory systems (Popham et al., 2021), role in bilingual representations (Chan et al., 2008), and more. To study these questions, language neuroscientists have developed a
suite of experimental designs, ranging from highly specific controlled experiments to natural
stimulus experiments and, more recently, deep learning-based approaches for computational
modeling.
Each experimental design can be thought of as an investigative tool for understanding the
brain’s response Rv = fv(S), where fv is the function that some brain element v (e.g., a single neuron, voxel, brain area, or magnetoencephalography [MEG] sensor) computes over a given language stimulus S to produce the response Rv. Some experimental designs—like contrast-based studies—aim to directly compare certain aspects of fv, such as the response to different word categories. Others—like experiments with complex stimuli that are paired with encoding
models—approximate fv using computational tools, which allows the prediction of activity elicited by new stimuli. In this paper we describe an alternative to existing paradigms: in silico controlled experimentation using computational models of naturalistic language processing. This hybrid approach combines the strengths of the controlled and naturalistic paradigms to achieve high ecological generalizability, high experimental efficiency and reusability, high interpretability, and sensitivity to individual participant effects.

In silico experimentation: Simulating experiments by predicting brain responses with a computational model, in order to test generalizability and enable efficient hypothesis testing.

We first compare and contrast experimental designs based on their effectiveness and efficiency for revealing fv. Then we introduce the in silico experimentation paradigm with deep learning models. We discuss four neuroimaging studies that use this paradigm to investigate different linguistic phenomena in the brain. Finally, we discuss the potential of this approach to alleviate the problems of reproducibility in language neuroscience, as well as caveats and pitfalls of in silico experimentation.
EXPERIMENTAL DESIGNS IN LANGUAGE NEUROSCIENCE
Controlled Experimental Design: Contrast-Based Studies
Language is a rich and complex modality that humans are uniquely specialized to process.
Given this complexity, neuroscientists have traditionally broken language down into specific
processes and properties and then designed controlled experiments to test each separately
(Binder et al., 2009; Friederici et al., 2000). Consider the example of investigating which areas
of the brain are responsible for encoding specific types of semantic categories like “actions”
(Kable et al., 2002; Noppeney et al., 2005; Wallentin et al., 2005). A simple and effective
approach is to collect and compare brain responses to action words and to minimally different control words—for example, “object” words of similar length and frequency. If some brain
element v responds more to stimuli containing the property being tested than the control
stimuli—that is, fv(“action” words) > fv(“object” words)—the experimenter concludes that v
is involved in processing action words. Similarly, the N400 effect (Kutas & Hillyard, 1984) is assessed by testing whether an element’s fv reflects surprise with respect to some context: if fv(expected word | context) < fv(unexpected word | context), this would suggest that the brain element captures word surprisal.
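To make this logic concrete, the sketch below assesses a hypothetical contrast with a two-sample t-test. The simulated responses, group sizes, and effect size are all illustrative assumptions, not data from any study discussed here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated responses of one brain element v to 40 matched stimuli per
# condition. In a real study these would be measured responses (e.g.,
# fMRI beta weights) to "action" and "object" words.
r_action = rng.normal(loc=1.2, scale=1.0, size=40)  # fv("action" words)
r_object = rng.normal(loc=0.8, scale=1.0, size=40)  # fv("object" words)

# Contrast: does v respond more to action words than to object words?
t, p = stats.ttest_ind(r_action, r_object, alternative="greater")
print(f"t = {t:.2f}, p = {p:.4f}")
# A small p suggests fv("action") > fv("object"), i.e., v is selective for
# the contrasted property (assuming confounds have been controlled).
```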
In order for a contrast-based study to be interpretable, it is vital to remove any confounds
that could corrupt observed responses and lead to false positives. Binder et al. (2009) charac-
terize three types of confounds: the main and control conditions could differ in low-level pro-
cessing demands (phonological/orthographic); the main and control conditions could differ in
working memory demands, attention demands, and so forth; and, in passive tasks, the partic-
ipants might engage in different mental imagery or task-related thoughts in the two conditions.
If such confounds are controlled effectively, one can assume that the observed brain response
will be identical in all respects unless v specifically captures the property being studied. For
example, if the action and object words are matched on all other properties, fv(“action” words)
and fv (“object” words) will only differ if v selectively encodes action or object concepts. Con-
sequently, the contrast-based paradigm has high interpretability, as any variations in observed
response can be attributed to the hypothesis. This clear and direct relationship between
hypothesis and result ensures that the experiment has scientific value even when a hypothesis
or theory is incorrect. The controlled experimental design has thus been fundamental in
revealing many important aspects of brain function, such as the specialization of parts of tem-
poral cortex for speech processing (reviewed in S. K. Scott, 2019) and distinct neural systems
for concrete versus abstract concepts (Binder et al., 2005, 2009).
Naturalistic stimuli:
Stimuli that subjects could be
exposed to in real life; not artificially
constructed for an experiment.
While this paradigm has been hugely influential and effective in language neuroscience, it
is not without flaws. Perhaps the biggest drawback of most contrast-based designs is the lack of
ecological generalizability (Hamilton & Huth, 2018; Matusz et al., 2019). To avoid confounds,
controlled experiments often employ the simplest linguistic constructions required to demon-
strate an effect, such as single words in the action versus object contrast. While we are fully
capable of identifying action words in isolation, it is not necessary that the brain employs the
same networks to understand such words in real-world settings (Matusz et al., 2019), for exam-
ple, as used in a conversation or a story. In contrast to such studies, those using naturalistic
stimuli have found more engagement and activation in higher order cortical regions, likely due
to the incorporation of long-range structure (Deniz et al., 2021; Lerner et al., 2011). Further-
more, due to practical limitations, controlled studies typically use small stimulus sets that span
a limited domain. For example, neuroimaging studies of the action contrast often use fewer
than 100 words in each condition. This raises the probability that there is something peculiar
or nonrepresentative about the experimental stimuli, making it more difficult to reproduce the
effect or establish generalizability to a broader stimulus domain (Yarkoni, 2022). Small stimu-
lus sets can also artificially inflate the observed statistical significance (Westfall et al., 2017).
While controlled studies offer a very clear and direct relationship between the hypothesis
and experimental result, their value depends entirely on the quality of the hypothesis. In many
cases, narrowing the experimental hypothesis to focus on contrasts of a particular stimulus property may be misleading, since it can fail to account for interactions with other stimulus properties. For example, standard statistical models for assessing the “action” contrast
assume that brain response is identically distributed for any subcategorization of this semantic
concept. However, studies such as Hauk et al. (2004) have found that different regions across
cortex selectively encode hand-related, foot-related, or mouth-related actions. This type of
subcategory specificity decreases the statistical power of the overall action contrast, thereby
increasing the probability of false negatives. Worse, if the overall action contrast has unevenly
sampled these subcategories, the statistical power to detect action selectivity will vary in an
unexpected and unknown fashion between brain areas. This issue can occur in any contrast-
based experiment and is difficult or even impossible to detect by the experimenter. One poten-
tial solution would be to combine data across different contrast-based experiments, which
could reveal interactions between effects. However, separate controlled experiments often
do not share analysis methods, stimulus sets, or subjects, making it difficult to combine data
or compare effect sizes across experiments. Lastly, for each language property that one wishes
to investigate using a controlled experiment, one needs to design specific controls and repeat-
edly measure Rv. This results in limited reusability of experimental data, slowing down the
process of scientific discovery.
Naturalistic Stimuli
To combat the lack of stimulus generalization and limited reusability, there has been a rising
trend toward naturalistic experimental paradigms (Brennan et al., 2012; Hamilton & Huth,
2018; Hasson et al., 2008; Lerner et al., 2011; Regev et al., 2013; Shain et al., 2020). With
the development of better neuroimaging/recording technology, we now have access to high
quality brain recordings of humans while they perceive engaging, ecologically valid stimuli
like podcasts (Huth et al., 2016; Lerner et al., 2011; J. Li et al., 2022; Nastase et al., 2021;
S. Wang et al., 2022), fictional books (Bhattasali et al., 2020; Wehbe, Murphy, et al., 2014),
and movies (J. Chen et al., 2017)—all examples of stimuli humans encounter or seek out in
their everyday lives. Recent work has further developed this naturalistic paradigm to incorpo-
rate communication and social processing, beyond passive perception (Bevilacqua et al.,
2019; Redcay & Moraczewski, 2020). Naturalistic stimulus data sets are easier to construct and often larger than controlled stimulus sets. For example, J. Chen et al. (2017) publicly released a data set collected on a 50 min movie; Wehbe, Murphy, et al. (2014) released data collected on an entire chapter of a Harry Potter book, comprising more than 5,000 words; and LeBel et al. (2022) released data collected on over 5 hr of English podcasts per participant. These stimuli also provide a diverse test bed of linguistic phenomena—from a broad array of semantic concepts to rich temporal structure capturing discourse-level information. Furthermore, they do not directly constrain the hypotheses the experimenter can test and thus facilitate high reusability of the data. However, this also means that natural stimulus data have low statistical power with respect to any specific hypothesis, and analyses must be carefully designed to control for confounding effects. This makes interpretation of the observed effects much more challenging than in contrast-based experiments.

Ecological validity: Determination of whether an experiment is likely to faithfully reflect and generalize to situations encountered in real life.

Linearized encoding model: Model that learns to predict the elicited response in a brain element as a linear function of features of interest extracted from stimuli.
Naturalistic Experimental Design: Controlled Manipulations of Naturalistic Stimuli
To reap the benefits of both interpretable controlled experiments and generalizable naturalistic
stimuli, some studies have deployed a hybrid experimental design (Chien & Honey, 2020;
Deniz et al., 2019; Lerner et al., 2011; Overath et al., 2015; Yeshurun et al., 2017). Here,
natural stimuli are manipulated to change or remove some specific language cue or property
(e.g., scrambling the words in a story) and the sensitivity of different brain regions to this
manipulation is measured, for example, fv(intact story) vs. fv(scrambled story). This can reveal
properties across the brain like the timescale of information represented (Lerner et al., 2011,
2014) or specificity to the type of naturalistic stimulus, such as human speech (Overath et al.,
2015). This experimental design accounts for ecological validity by restricting analyses to
brain regions that robustly respond to the naturalistic stimuli. Furthermore, it has the same
advantage as controlled experiments when it comes to interpretation: Assuming effective con-
trol of confounds, any observed change in brain activity is likely to be an effect of the stimulus
manipulation. However, this approach also has disadvantages: The manipulated stimuli are
often unnatural (like reversed or scrambled speech) and restrict the types of interactions the
experimenter can observe. For example, the scrambled story experiment assumes that all
regions processing short timescale information will behave identically. The manipulated stim-
uli also limit the reusability of the experiment, meaning that a new experiment needs to be
designed for each effect of interest.
Naturalistic Experimental Design: Predictive Computational Modeling
Encoding models are an alternative computational approach for leveraging naturalistic exper-
imental data (Bhattasali et al., 2019; Caucheteux & King, 2022; Goldstein et al., 2021; Huth
et al., 2016; Jain et al., 2020; Jain & Huth, 2018; Schrimpf et al., 2021; Wehbe, Vaswani,
et al., 2014). These predictive models learn to simulate elicited brain responses Rv = fv(S ) to
natural language stimuli S by building a computational approximation to the function fv for
each brain element v, typically in every participant individually. Here, R can be captured
by any neuroimaging or neural recording technique. Given limitations on data set sizes, the
search for fv is typically constrained to linearized encoding models, gv(Ls(S )) (M. C.-K. Wu
et al., 2006), where gv is a linear combination of features extracted from the stimulus by a
function Ls. While gv is termed a linear model, of particular interest is the linearizing transform
Ls. Contrast-based experimental designs test a hypothesis by comparing responses elicited by
different conditions. Each condition is composed of stimuli that share some features (e.g., all
words that describe actions). Encoding models can test the same hypothesis by incorporating
these features into Ls. For example, for every word in the natural stimulus, one could create an
indicator feature Iaction that is 1 if the word describes an action and 0 otherwise. Feature spaces
consisting of 1s and 0s are equivalent to a contrast-based experimental design, assuming other
confounds have been eliminated.
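As a rough sketch of this equivalence (with made-up words and responses rather than data from any cited study), the following shows that the fitted weight on a 0/1 indicator feature recovers exactly the condition difference that a contrast would test:

```python
import numpy as np

# Hypothetical word-level stimuli and a binary "action" indicator feature.
words = ["run", "table", "jump", "lamp", "throw"]
action_words = {"run", "jump", "throw"}

# Feature matrix Ls(S): one row per word, one indicator column.
# (A real design would also include confound regressors.)
X = np.array([[1.0 if w in action_words else 0.0] for w in words])
r = np.array([2.1, 0.9, 1.8, 1.1, 2.3])  # simulated voxel responses

# With an intercept plus a single 0/1 feature, the least-squares weight
# equals the mean response to action words minus the mean response to
# the remaining words -- exactly the contrast a controlled study tests.
X1 = np.column_stack([np.ones(len(words)), X])
beta = np.linalg.lstsq(X1, r, rcond=None)[0]
print(beta[1])
print(r[X[:, 0] == 1].mean() - r[X[:, 0] == 0].mean())  # identical value
```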
Encoding models can also adopt much more complex and high-dimensional functions for
Ls. This makes it possible to account for multiple, interacting stimulus properties that may
affect the response Rv. For example, Ls could indicate multiple levels of semantic categories.
In the example of action and object words, the feature space could indicate that hand-related,
foot-related, and mouth-related words were all types of actions, and distinguish all action
words from multiple subcategories of objects. One recent example of such a high-dimensional
feature space that captures semantic similarity (Mikolov et al., 2013; Pennington et al., 2014)
is word embeddings, which have been used
to characterize semantic language representations
across the human brain (de Heer et al., 2017; Huth et al., 2016; Wehbe, Murphy, et al., 2014;
Wehbe, Vaswani, et al., 2014). With a suitably rich linearizing transform Ls, this approach
vastly expands the set of hypotheses that can be reasonably explored with a limited data
set. The expandable feature space also allows encoding models great flexibility to test addi-
tional hypotheses without collecting new data, leading to high reusability. Estimating the brain
response as a function of the nonlinear feature space is made possible by collecting large data
sets that are partitioned into a portion for training (estimating) the model and a portion for
testing the model on unseen data. Typically, regularized linear regression is used to estimate
the linear relationship gv based on the feature space Ls. This is used to predict new responses
R̂v = gv(Ls(S))
to unseen stimuli. Finally, the model is evaluated by measuring how well it predicts brain
responses, ρ(fv(Snew), gv(Ls(Snew))). Thus, unlike other approaches, encoding models explicitly
measure generalizability by testing on new, naturalistic stimuli. In contrast-based designs, a gen-
eralization test is usually achieved through replication with an independent data set, often from a
different lab where protocols and analysis details may differ. With encoding models, the same
experimenter usually runs their own generalization test and directly estimates how much of
the neural response Rv is explained by the model, holding all other variables constant. Encoding
models can also be used to investigate whether the same brain region has the same tuning under different tasks. For example, Deniz et al. (2019) show that semantic tuning is preserved between reading and listening, while Çukur et al. (2013) show that the tuning of different regions in visual cortex is biased toward the attended category when attention is directed to it. Encoding models can also be used to compare the tuning of two different regions (Toneva, Williams, et al., 2022).
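A minimal sketch of this train/test workflow is shown below. The feature matrix Ls(S) and responses R are random stand-ins; a real analysis would use stimulus-derived features (e.g., word embeddings) aligned to measured BOLD timecourses.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stand-ins: Ls(S) is a (time points x features) matrix and R is the
# (time points x voxels) response matrix from the training stimuli.
n_train, n_test, n_feat, n_vox = 1000, 200, 50, 10
Ls_train = rng.normal(size=(n_train, n_feat))
true_w = rng.normal(size=(n_feat, n_vox))
R_train = Ls_train @ true_w + rng.normal(scale=5.0, size=(n_train, n_vox))

# Fit gv: a regularized linear map from features to each voxel's response.
g = Ridge(alpha=100.0).fit(Ls_train, R_train)

# Evaluate generalization on held-out stimuli: the correlation
# rho(fv(Snew), gv(Ls(Snew))) computed per voxel.
Ls_test = rng.normal(size=(n_test, n_feat))
R_test = Ls_test @ true_w + rng.normal(scale=5.0, size=(n_test, n_vox))
R_pred = g.predict(Ls_test)
rho = [np.corrcoef(R_test[:, v], R_pred[:, v])[0, 1] for v in range(n_vox)]
print(np.round(rho, 2))
```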
Artificial Neural Networks as a Rich Source of Linguistic Features

Neural language models: Types of artificial neural networks that learn to predict the next word in a sequence from past context.
The most important choice that an experimenter makes when using encoding models is that of
the linearizing transform. To find useful linearizing transforms, neuroscience has mostly
followed advances in computational linguistics or natural language processing (NLP) where,
in recent years, deep learning (DL) models trained using self-supervision have seen great suc-
cess. One such cornerstone model is the neural language model (LM)—a self-supervised artificial neural network (ANN) that learns to predict the next word in a sequence, wt+1, from the context provided by the previous words (w1, w2 … wt). Several recent studies have shown that representa-
tions derived from LMs capture many linguistic properties of the preceding sequence (w1, w2 …
wt) like dependency parse structure, semantic roles, and sentiment (see Mahowald et al., 2020,
for a review; Clark et al., 2019; Conneau et al., 2018; Gulordava et al., 2018; Haber & Poesio,
2021; Hewitt & Liang, 2019; Hewitt & Manning, 2019; Lakretz et al., 2019; B. Z. Li et al., 2021;
Linzen & Leonard, 2018; Marvin & Linzen, 2018; Prasad et al., 2019; Tenney, Das, et al., 2019;
Tenney, Xia, et al. 2019). While this by
no means is a complete representation of phrase mean-
ing (Bender & Koller, 2020), using a language model as a linearizing transform has been shown
to effectively predict natural language responses in both the cortex and cerebellum, with dif-
ferent neuroimaging techniques and stimulus presentation modalities (Abnar et al., 2019;
Anderson et al., 2021; Caucheteux & King, 2022; Goldstein et al., 2021; Jain et al., 2020; Jain
& Huth, 2018; Kumar et al., 2022; LeBel et al., 2021; Reddy & Wehbe, 2020; Schrimpf et al.,
2021; Toneva, Mitchell, et al., 2022; Toneva & Wehbe, 2019; Wehbe, Murphy, et al., 2014;
Wehbe, Vaswani, et al., 2014; S. Wang et al., 2020). Moreover, these models easily outperform
earlier word embedding encoding models that use one static feature vector for each word in the
stimulus and thus ignore the effects of context (Antonello et al., 2021; Caucheteux & King,
2022; Jain & Huth, 2018). Deep LMs have also been used to investigate the mapping between
ANN layer depth and hierarchical language processing (Jain & Huth, 2018). Along similar lines and at a lower level, supervised and self-supervised models of speech acoustics have been used to develop the best models of auditory processing in human cortex to date (Kell et al., 2018; Y. Li et al., 2022; Millet et al., 2022; Millet & King, 2021; Vaidya et al., 2022).
The unprecedented success of DL-based approaches over earlier encoding models can
likely be attributed to several important factors. First, features extracted from the DL-based
models have the ability to represent many different types of linguistic information, as discussed
above. Second, DL-based models serially process words from a language stimulus to generate
incremental features. This mimics causal processing in humans and thus offers an important
advantage over static representations like word embeddings, which cannot encode contextual
properties. Third, recent work has shown that these models often recapitulate human errors
and judgments, such as effectively predicting behavioral data of human reading times
(Aurnhammer & Frank, 2018; Futrell et al., 2019; Goodkind & Bicknell, 2018; Merkx & Frank,
2021; Wilcox et al., 2021). This again suggests some isomorphism between human language
processing and DL-based models. The next word prediction objective also enables language
models to perform well on psycholinguistic diagnostics like the cloze task, although there is
substantial room for improvement (Ettinger, 2020; Pandia & Ettinger, 2021). Finally, self-
supervised ANNs, that is, networks that predict the next word or speech frame, transfer well
to downstream language tasks like question answering and coreference resolution, and to
speech tasks like speaker verification and translation across languages (Z. Chen et al., 2022;
A. Wu et al., 2020). This suggests that the self-supervised networks are learning representations
of language that are useful for many tasks that humans may encounter.
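For illustration, contextual features of the kind these encoding models consume can be extracted from a pretrained LM with the Hugging Face transformers library. The choice of GPT-2 and of an intermediate layer below is an arbitrary example, not the configuration of any particular study cited above.

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast

# Load a small pretrained neural language model (illustrative choice).
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

text = "the quick brown fox jumps over the lazy dog"
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds one (1, n_tokens, 768) tensor per layer (plus
# the embedding layer). Each token's vector summarizes the context
# (w1 ... wt) and can serve as the linearizing transform Ls.
layer = 9                               # intermediate layers often predict well
features = out.hidden_states[layer][0]  # (n_tokens, 768)
print(features.shape)
```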
These factors have contributed to the increasing popularity of DL-based encoding models
as an investigative tool of brain function. This approach has revealed aspects of how the brain
represents compositional meaning (Toneva, Mitchell, et al., 2022), provided fine-grained esti-
mates of processing timescales across cortex (Jain et al., 2020), and uncovered new evidence
for the cerebellum’s role in language understanding (LeBel et al., 2021).
Yet despite these successes, DL-based encoding models are hard to interpret. The represen-
tations produced by language models are entirely learned by black-box neural networks, and
thus cannot be understood with the same ease as the indicator features described in the section above. While the representations themselves are opaque, one potential avenue is to interpret
the success of a DL-based model at predicting some brain area as suggesting a commonality
between that brain area and the objective that model was trained for (e.g., word identification
(Kell et al., 2018) or 3D vision tasks (A. Wang et al., 2019)). However, the fact that similar
representations can be derived from DL-based models that are trained for different objectives
puts this type of interpretation on shaky ground (Antonello & Huth, 2022; Guest & Martin,
2023). These difficulties have left the field at something of an impasse: We know that DL-based
models are extremely effective at predicting brain responses, but we are unsure why and
unsure what these models can tell us about how the brain processes language.
Pièce De Résistance: In Silico Experimentation With DL-Based Encoding Models
Controlled experiments and encoding models using naturalistic stimuli both have distinct
advantages and disadvantages. However, it may be possible to combine these paradigms in
a way that avoids the disadvantages and retains the advantages. To this end, we present an
experimental design that combines these two paradigms: in silico controlled experimentation
using encoding models. This paradigm first trains encoding models on an ecologically valid,
highly generalizable naturalistic experiment. Then, it uses the encoding models to simulate
brain activity to controlled stimulus variations or contrasts. Notably, this does not require addi-
tional data to be collected for every condition.
The first use of in silico experimentation is to test if effects discovered in controlled, non-
ecologically valid setups generalize to naturalistic stimuli. This experimental design also facil-
itates quick and efficient hypothesis testing. Experimenters can prototype new controlled
experiments and narrow down the desired contrasts or stimuli without having to repeatedly
measure in vivo. While this is a complement to, and not a substitute for, the in vivo experiments that should follow the prototyping phase, in silico experimentation can greatly reduce the cost of generalizability testing and hypothesis testing, and accelerate scientific discovery.
In Figure 1, we present a controlled experimental design with its in silico counterpart.
Figure 1A shows an experimental paradigm that was designed to understand linguistic
composition of two-word phrases (Bemis & Pylkkänen, 2011). Participants were presented
with phrases in which meaning can be composed across constituent words and contrasting
conditions where it cannot (word list and non-word). This experiment can be conceptually
simulated in silico, as shown in Figure 1B (Jain & Huth, 2023). Instead of collecting separate
neuroimaging data for each type of phrase construction, the in silico experiment was done
with DL-based encoding models trained on two-word sequences. The learned models were
first used to predict brain responses to a large, diverse corpus of phrases that contained both
noun–noun and adjective–noun constructs among others. Next, the non-word condition was
simulated by replacing the first word in the phrase with a non-word, extracting a new ablated
feature, and finally predicting each functional magnetic resonance imaging (fMRI) voxel’s
response to the ablated phrase. Assuming that the DL-based encoding model captures com-
positional effects, this in silico experiment can ameliorate the disadvantages of both controlled
and encoding model-based experimental designs. First, since simulating responses is trivial in
both time and cost, the simulated experiment can use thousands or even millions of two-word
phrases instead of the hundreds that can be tested in vivo. This ameliorates problems that arise
with limited stimulus sets that may fail to account for key properties or generalize to natural-
istic contexts. Second, by simulating and then comparing responses under conditions that are
derived from linguistic theory (composition vs. single word, or word list), this in silico exper-
iment provides results that are easily and explicitly interpretable, unlike encoding models with
natural stimuli. However, one major concern raised by this approach is whether the encoding model can capture how the brain responds to the language properties of interest. To address this, it is important to verify both that the encoding model is highly effective at predicting brain activity and that it is sufficiently complex to capture the desired property.

Figure 1. Example of an in silico adaptation of a controlled experiment. (A) The original MEG study investigated composition over two-word phrases (Bemis & Pylkkänen, 2011). This was done by presenting three different types of phrases to participants to solve a picture matching task. By contrasting the elicited brain responses in the composition condition with the responses in the list and non-word conditions, the authors could infer which brain regions are engaged in compositional processing of two-word phrases. (B) This experimental paradigm can be conceptually simulated with LM-based fMRI encoding models of naturalistic stimuli. The composition and list conditions can be tested by using the learned encoding model to predict each voxel’s response to a large, diverse corpus of phrases. The non-word condition can be simulated by replacing the first word in a phrase with a non-word, extracting new ablated features of the phrase from the LM, and using the encoding model to predict the brain’s response to the ablated phrase. If a voxel’s response is highly sensitive to the removal of the first word, it would suggest that the voxel combines information over both words to arrive at meaning. This provides a data-efficient way to test for compositional processing across diverse types of phrase constructions. fMRI = functional magnetic resonance imaging; LM = language model; MEG = magnetoencephalography.
Similar in silico experimentation has recently become popular in vision neuroscience.
There, DL-based encoding models of the visual system are first trained on ethologically valid
tasks like object recognition. Then they are probed to understand brain function (Yamins &
DiCarlo, 2016). For example, Bashivan et al. (2019) used DL models to synthesize images that
maximally drive neural responses. This provided a noninvasive in silico technique to control
and manipulate internal brain states. Similarly, Ratan Murty et al. (2021) synthesized images
from end-to-end DL models trained on brain data to provide stronger evidence for the cate-
gorical selectivity of different brain regions. In silico experimentation with explicit computa-
tional models has also been used in studies of the medial temporal lobe. In Nayebi et al.
(2021), computational models of the medial entorhinal cortex were used to investigate the
functional specialization of heterogeneous neurons that do not have stereotypical response
profiles. By performing ablations in silico, they found that these types of cells are as important for downstream processing as grid and border cells. Each of these studies first relied on the
encoding model’s ability to generalize to new stimuli. This was an indication that the features
learned by the DL-based models encoded similar information to the brain regions that they
predicted well. Second, these studies leveraged the predictive ability of encoding models to
simulate brain activity in new, controlled conditions as a lens into brain function. This enabled
the researchers to explore aspects of brain function that would otherwise be highly data intensive or impossible to study.
In language, in silico experimentation is a promising area that is under development, bol-
stered by the successes in vision neuroscience and growing efforts to understand artificial lan-
guage systems. One of its earliest uses is the BOLDpredictions simulation engine (Wehbe
et al., 2018, 2021), an online tool that allows the user to simulate language experiments that
contrast two conditions, each defined by a list of isolated words. BOLDpredictions relies on an
encoding model from a natural listening experiment that predicts brain activity as a function of
individual word embeddings (Huth et al., 2016). In the following sections, we review in silico
adaptations of four different language experiments based on four separate data sets. Each of
these in silico experiments uses a single naturalistic experiment to train the encoding models,
illustrating how a single data set and experimental approach can provide a flexible test bed for
many different hypotheses about natural language. The first experiment uses the BOLDpredic-
tions engine to simulate a semantic contrast comparing concrete and abstract words (Binder
et al., 2005), testing its generalizability to naturalistic settings. The next experiment focuses on
a contrast-based study of composition in two-word phrases (Bemis & Pylkkänen, 2011), testing
generalizability over a broader, more diverse stimulus set. The third experiment adopts con-
trasts from a group-level study investigating the temporal hierarchy for language processing
across cortex by manipulating naturalistic stimuli (Lerner et al., 2011). This simulation checks
if effects persist at the individual level and demonstrates how a successful replication can be
used to validate computational model constructs themselves. Finally, the last experiment con-
ceptually replicates a study on forgetting behavior in the cortex that also uses controlled
manipulations of naturalistic stimuli (Chien & Honey, 2020). This simulation demonstrates
the possibility of misinterpretation with the in silico approach, arising from fundamental computational differences between neural language models and the human brain.

Generalizability testing: Testing to see if effects observed on a particular data set extend to a new data set not used for model estimation.
In the experimental simulations described below, voxelwise encoding models were fit to
fMRI data collected from a naturalistic speech perception experiment. Participants listened
to natural, narrative stories from The Moth Radio Hour (Allison, 2009–) while their whole-
brain BOLD responses were recorded (N = 8 for study 1; N = 7 for studies 2 and 3). In each
study, encoding models were fit for each voxel in each subject individually using ridge regres-
sion. The learned models were then tested on one held-out story that was not used for model
estimation, and encoding performance was measured as the linear correlation between pre-
dicted and true BOLD responses. Statistical significance of the encoding performance was
measured using temporal blockwise permutation tests (p < 0.001, false discovery rate (FDR)
corrected; Benjamini & Hochberg, 1995). Finally, in silico analyses were conducted on voxels
that were significantly predicted by the encoding model, broadly covering the temporal, pari-
etal, and frontal lobes.
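A rough sketch of this significance procedure is shown below. The block length, permutation count, and simulated data are illustrative assumptions, not the exact parameters used in the studies described here.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

def block_permute(x, block_len, rng):
    """Shuffle a timecourse in contiguous temporal blocks, preserving
    local autocorrelation within each block."""
    blocks = [x[i:i + block_len] for i in range(0, len(x), block_len)]
    rng.shuffle(blocks)
    return np.concatenate(blocks)

def perm_pvalue(true_resp, pred_resp, n_perm=2000, block_len=10, rng=rng):
    """Permutation p-value for one voxel's prediction correlation."""
    r_obs = np.corrcoef(true_resp, pred_resp)[0, 1]
    null = np.array([
        np.corrcoef(block_permute(true_resp, block_len, rng), pred_resp)[0, 1]
        for _ in range(n_perm)])
    return (np.sum(null >= r_obs) + 1) / (n_perm + 1)

# Simulated held-out responses and predictions for a handful of voxels.
n_time, n_vox = 300, 5
true = rng.normal(size=(n_time, n_vox))
pred = 0.3 * true + rng.normal(size=(n_time, n_vox))   # weakly predictive
pvals = [perm_pvalue(true[:, v], pred[:, v]) for v in range(n_vox)]

# Correct across voxels with the Benjamini-Hochberg FDR procedure.
rejected, p_adj, _, _ = multipletests(pvals, alpha=0.001, method="fdr_bh")
print(rejected, np.round(p_adj, 4))
```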
Semantic contrasts: Wehbe et al. (2018)
Binder et al. (2005) investigated the brain regions responsive to abstract and concrete con-
cepts. Subjects read individual stimulus strings and pressed one of two buttons to indicate
whether each one was a word or a non-word. The study reported that concrete words acti-
vated bilateral language regions such as the angular gyrus more than abstract words, and
abstract words activated left inferior frontal regions more than concrete words. In total, the
authors found 15 cluster peaks.
Wehbe et al. (2018) evaluated the reproducibility of these results using an encoding model
trained on naturalistic stimuli. They simulated a contrast between the lists of concrete words
and abstract words that were kindly shared by Binder et al. (2005). Figure 2 shows the signif-
icance map for subject 1 and the group-averaged significance map showing the number of
subjects for which the null hypothesis is rejected. The reported regions of interest (ROIs) are
shown as an overlay on the flattened cortical maps. Each ROI is originally reported as a single
coordinate in brain space and is estimated to have a radius of 10 mm. For every one of the
eight subjects, many voxels were significantly more activated for concrete words over abstract
words (with p < 0.05, FDR corrected permutation test over the words in each condition), spe-
cifically in areas bordering the visual cortex and parts of the inferior frontal gyrus. Some
reported ROIs had a high overlap with the significance map (specifically in the angular gyri,
in the posterior cingulate gyri, the right precuneus, and the middle temporal gyri). The signif-
icant effect in those ROIs can be considered to be replicated by BOLDpredictions. However,
the reported ROIs and the significance map did not always agree, with the effect in some
regions being reported only by Binder et al. (2005) or only by Wehbe et al. (2018).
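In spirit, this kind of simulated contrast reduces to predicting a voxel's response to every word in each list and running a permutation test on the condition difference. The sketch below uses random stand-ins for the word embeddings and the learned encoding weights; in BOLDpredictions these come from published word lists and a model fit to naturalistic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: 300-d embeddings for each condition's words and
# one voxel's learned encoding weights.
n_concrete, n_abstract, dim = 80, 80, 300
emb_concrete = rng.normal(size=(n_concrete, dim))
emb_abstract = rng.normal(size=(n_abstract, dim))
w_voxel = rng.normal(size=dim)

# Simulate the voxel's response to every word in each condition.
pred_concrete = emb_concrete @ w_voxel
pred_abstract = emb_abstract @ w_voxel

# Permutation test over words: is the mean simulated response larger
# for concrete than for abstract words in this voxel?
obs = pred_concrete.mean() - pred_abstract.mean()
pooled = np.concatenate([pred_concrete, pred_abstract])
null = []
for _ in range(5000):
    rng.shuffle(pooled)
    null.append(pooled[:n_concrete].mean() - pooled[n_concrete:].mean())
p = (np.sum(np.array(null) >= obs) + 1) / (len(null) + 1)
print(f"difference = {obs:.3f}, p = {p:.3f}")
```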
There are many possible reasons for non-generalizability of individual reported ROIs,
including the stochasticity of brain activity, variations in experimental paradigms and analysis
techniques, and lack of reproducibility. The authors of BOLDpredictions (Wehbe et al., 2018,
2021) note that any scientific finding needs to be reproduced in a series of experiments that
would create a body of evidence toward this finding, and the in silico experimentation using
BOLDpredictions is one additional piece of evidence. The authors also note that expanding
the engine to different data sets, models, and so forth will establish the robustness of the in
silico effects and help determine if the original contrast-based experiment lacks reproducibility
(Wehbe et al., 2018).

Figure 2. Generalizability test using BOLDpredictions of the concrete vs. abstract contrast of Binder et al. (2005). The authors compared fMRI activity when subjects processed concrete and abstract words. Wehbe et al. (2016) used the published stimulus to simulate the contrast for each subject and run a permutation test. After MNI space transformation, the number of subjects for which the null hypothesis was rejected is computed at each voxel. The simulated statistical maps are shown on flattened maps and inflated 3D hemispheres. Results for subject 1 are shown in subject 1’s native cortical space. Results for the average of eight subjects are shown in MNI space. Published ROIs are estimated as 10 mm radius spheres, shown in red hatch on the flatmaps (distortion due to the flattening process). A comparison of the overlap of the reported ROIs and the statistical maps reveals that Wehbe et al. (2016) achieve a relatively high overlap for specific ROIs (in the angular gyri, in the posterior cingulate gyri, the right precuneus, and the middle temporal gyri) and not for others. Therefore, BOLDpredictions predicts that the contrast from Binder et al. (2005) generalizes to naturalistic conditions, to a certain extent. MNI = Montreal Neurological Institute; ROIs = regions of interest.
Semantic composition contrasts: Jain and Huth (2023)
In the second in silico experiment, Jain and Huth (2023) simulated and expanded on studies of
combinatorial processing in two-word phrases across cortex, first described in Bemis and Pylkkänen (2011). In the original controlled experiment, participants read two-word adjective–noun phrases (“red boat”) and performed a picture matching task while brain
responses were recorded using MEG. To contrast this compositional condition, a list control
was introduced wherein participants were presented with two-word noun–noun phrases (“cup
boat”) along with a non-word control consisting of a non-word and a word (“xkq boat”). Note
that participants were instructed to avoid composing meaning in the word list, but no explicit
control was introduced. To isolate regions involved in two-word composition, the study con-
trasted the adjective–noun condition with the controls. The experimenters tested 25 base
nouns, six color adjectives, and six non-words. Overall, they found that areas in ventral medial
prefrontal cortex (vmPFC) and left anterior temporal lobe both selectively responded to the
composition condition.
Jain and Huth (2023) conceptually replicated the original study by building encoding
models that approximate every voxel’s response to a naturalistic two-word phrase as a non-
linear function of the words in the phrase. For each (overlapping) two-word phrase in the
natural language stimuli, features were first extracted from a powerful language model, the
generative pretrained transformer (GPT; Radford et al., 2018). Then, voxelwise encoding
models were trained to learn a linear function from the phrase features to the elicited response
after the second word. Using the encoding models, each voxel’s response to a large corpus of
over 147,000 two-word phrases was predicted and ranked. This stimulus set comprised both
adjective–noun phrases like “red boat” and noun–noun phrases like “picnic table.” Next, for a
given phrase selected by a voxel, the first word was replaced with a non-word (i.e., the word
was ablated) and the ablated phrase feature was extracted from GPT. Using the learned encod-
ing model, the voxel’s response to the ablated phrase was predicted. Finally, the sensitivity of
the voxel to the presence of the ablated word was measured. If the ablated word is important
for the voxel to process the phrase, removing it should notably change its response and give
high sensitivity. This was done to simulate the compositional versus non-word condition in the
original study.
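A much-simplified sketch of this ablation procedure is shown below, using off-the-shelf GPT-2 features as a stand-in for the GPT features used in the study and random placeholder weights for a single voxel's encoding model:

```python
import numpy as np
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2Model.from_pretrained("gpt2")
lm.eval()

def phrase_features(phrase: str) -> np.ndarray:
    """Feature vector for a two-word phrase: the LM hidden state at the
    final (second) word, which integrates context from the first word."""
    ids = tok(phrase, return_tensors="pt")
    with torch.no_grad():
        h = lm(**ids).last_hidden_state[0]   # (n_tokens, 768)
    return h[-1].numpy()

# Placeholder encoding weights for one voxel (in practice, learned by
# regressing the voxel's fMRI responses on phrase features).
rng = np.random.default_rng(0)
w_voxel = rng.normal(scale=0.01, size=768)

phrase = "red boat"
ablated = "xkq boat"                          # first word -> non-word

r_full = phrase_features(phrase) @ w_voxel    # predicted response to phrase
r_ablt = phrase_features(ablated) @ w_voxel   # predicted response after ablation

# Sensitivity: how much the predicted response changes when the first
# word is ablated. Large values suggest the voxel combines both words.
print(f"ablation sensitivity: {abs(r_full - r_ablt):.4f}")
```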
The resultant ablation sensitivity of voxels across cortex is visualized in Figure 3. Overall,
the in silico experimentation produced similar results to the original study in vmPFC and left
anterior temporal lobe—both of these regions exhibit sensitivity to the presence of a
compositional word. Beyond the areas reported originally, other regions like right inferior parietal and dorsal prefrontal cortex also showed high sensitivity. This finding corroborates other studies of
phrase composition (e.g., Boylan et al., 2015; Graves et al., 2010). The in silico study was able
to analyze two-word composition in broader regions of cortex by simulating activity for each
voxel independently and over a much larger stimulus set that comprises diverse concepts and
constructions. While the simulation does not guarantee causal involvement of any region in
two-word composition, it demonstrates the utility of broadly sampling stimuli and raises the
possibility that many more regions are involved in this process. Moreover, in the same in silico study (Jain & Huth, 2023), this paradigm was extended to much longer phrases (10 words) to
understand the relationship between semantic representations and word-level integration
across cortex. This would be difficult to implement in real-world settings, as doing single-word ablations on increasingly longer phrases is combinatorially explosive.

Figure 3. In silico adaptation of a study examining compositional processing in two-word phrases. The original study compared the MEG responses of participants to three different types of two-word phrases: adjective–noun, noun–noun, and non-word–noun (Bemis & Pylkkänen, 2011). The in silico simulation of the first two conditions was done by constructing a larger, diverse corpus of phrases and using LM-based encoding models to predict fMRI voxel responses to each phrase. The non-word–noun condition was simulated by replacing the first word in a phrase with a non-word (i.e., word ablation), extracting new phrase features from the LM, and then predicting the voxel’s response to the ablated phrase. A large change in a voxel’s response upon word ablation indicated its sensitivity to the first word in the phrase and suggested that the voxel relied on the first word to process the phrase. Similar to the original study, the in silico experiment revealed high sensitivity in ventral medial prefrontal cortex and left anterior temporal lobe. However, the experiment also found that several other areas across cortex combined meaning over the two words in a phrase and, moreover, captured diverse semantic concepts arising from the composition. fMRI = functional magnetic resonance imaging; LM = language model; MEG = magnetoencephalography.
Construction timescale contrast: Vo et al. (2023)
In the third in silico experiment, Vo et al. (2023) tested whether voxelwise encoding models
based on features from a neural LM can capture the timescale hierarchy observed during
human natural language comprehension. In Lerner et al. (2011), subjects listened to a first-
person narrative story that was either intact, reversed, or temporally scrambled at the word
level, sentence level, or paragraph level.
The scrambling manipulations altered the temporal coherence of the natural language stim-
ulus, and allowed the researchers to measure the reliability of fMRI responses to each condi-
tion using intersubject correlation. This revealed an apparently hierarchical organization of
temporal receptive windows, with information over short timescales processed in auditory cor-
tex and long timescales processed in parietal and frontal cortex. For the in silico adaptation,
the authors trained a multi-timescale long short-term memory (MT-LSTM) network as a lan-
guage model. Then they used the features from the MT-LSTM to predict fMRI responses for
each voxel using the data set described above. To mimic the manipulations of the original
study, they generated 100 scrambled versions of a held-out test story. This enabled the authors
to examine the predicted fMRI responses within each voxel in each subject. Rather than measuring intersubject reliability, they measured an analogous intrasubject reliability value and tested whether scrambling caused a significant drop in this value across conditions. The authors show through simulations that their metric (based on the variance in
the simulated fMRI response) is directly analogous to intersubject correlation measures, which
is supported by other work (Blank & Fedorenko, 2017; Hasson et al., 2009).
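One way to picture this simulated-scrambling analysis is sketched below. The scrambling function, the encoding-model stand-in, and the variance summary are illustrative assumptions rather than the exact metric of Vo et al. (2023).

```python
import numpy as np

rng = np.random.default_rng(0)

n_versions, n_time, n_feat = 100, 200, 20
w = rng.normal(size=n_feat)                 # one voxel's encoding weights
intact = rng.normal(size=(n_time, n_feat))  # features of the held-out story

def scramble(stim, chunk_len, rng):
    """Scramble a stimulus feature timecourse in contiguous chunks
    (word/sentence/paragraph scrambling = different chunk lengths)."""
    chunks = [stim[i:i + chunk_len] for i in range(0, len(stim), chunk_len)]
    rng.shuffle(chunks)
    return np.concatenate(chunks)

# Predicted single-voxel responses to 100 scrambled versions of the story.
preds = np.stack([scramble(intact, 10, rng) @ w for _ in range(n_versions)])

# Variance across versions at each time point: a voxel whose predicted
# response barely changes under scrambling integrates information over
# timescales shorter than the scrambling scale. Comparing this summary
# across scrambling conditions maps timescale selectivity.
print("mean variance across versions:", preds.var(axis=0).mean())
```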
The results of this experiment compared to a schematized version of the original results are
shown in Figure 4. This in silico experiment reproduced the pattern of the temporal hierarchy
along the temporoparietal axis. It did find that some regions in frontal cortex appear to inte-
grate over shorter timescales than the original work, similar to a later replication of the work
(Blank & Fedorenko, 2020) and to a different in-silico replication of the experiment that used
GPT-2, rather than a MT-LSTM language model (Caucheteux et al., 2021). Furthermore, the
fine-grained resolution of the single-voxel analyses revealed substantial variability across sub-
jects. Taken together, the in silico results suggest that timescales are not as uniform across
broad regions as previously reported. This is in agreement with single-neuron studies that show
a heterogeneity of intrinsic timescales within a brain region (Cavanagh et al., 2020).

Figure 4. In silico adaptation of a study mapping the hierarchy of temporal receptive windows. (Top) Original results adapted from Lerner et al. (2011). The authors played an audio clip of a narrative story, either intact, reversed, or scrambled at different temporal scales. The figure shows an overlay of several intersubject correlation maps, which measured the cross-subject reliability of the fMRI response in each condition. (Bottom) Results from the in silico experiment of temporal receptive windows, shown for every significantly predicted voxel on a single subject. The in silico experiment suggests that temporal processing windows for different brain regions are not as uniform as previously reported. CS = central sulcus; IPS = intraparietal sulcus; LS = lateral sulcus; mPFC = medial prefrontal cortex; TPJ = temporoparietal junction.
Forgetting timescale contrast: Vo et al. (2023)
In the last experiment, the authors used the same MT-LSTM encoding models as experiment 3
to simulate how different brain regions forget information in naturalistic narrative stimuli
(Chien & Honey, 2020). While Chien and Honey
found that all brain regions forget informa-
tion at the same rate (Figure 5A), the in silico results suggested that low-level regions such as
auditory cortex forget information at a faster rate than high-level regions like the precuneus
(Figure 5B). To better understand this discrepancy, the authors investigated forgetting behavior
in the MT-LSTM itself. The results first indicated that every unit in MT-LSTM forgot information
at a specific rate tied to its processing timescale (Figure 5C). The authors further hypothesized
that the discrepancy could stem from the MT-LSTM’s inability to forget information, even if the
preceding context is noisy/uninformative (Figure 5D). To test this, they measured the language
model’s cross entropy (lower is better) for a paragraph in three conditions: preceded by the
correct paragraph (actual context), preceded by no paragraph (no context) and preceded by
random paragraphs in the story (box plot of 100 different incorrect contexts). The story was
scrambled by dividing it into non-overlapping chunks of 9, 55, 70, 80, or 200 words or at
the actual sentence and paragraph boundaries (hand-split). Overall, smaller differences were
observed between the conditions as the scrambled context became longer (increased chunk
size) and closer to the intact story. With fixed-size chunks, the model performed better when it
had no context than when it had access to incorrect information. In contrast, with actual sentences/paragraphs, the model performed better with incorrect context than with no context at all. In both cases, the type of context influences model performance, suggesting that the model retains information from the past. Moreover, it retains this context even when it is not useful, as in the fixed-chunk conditions. The model could have simply ignored the wrong con-
text to perform better but it did not (or was unable to). This highlights the language model’s
inability to forget information that is then reflected in the encoding model results. The authors
hypothesized that with hand-split sentences and paragraphs, the incorrect context still provides relevant information for predicting the next word, leading to better performance than no context at all.

Figure 5. In silico adaptation of a study on forgetting behavior during natural language comprehension. In the original study, Chien and Honey (2020) scrambled paragraphs in a story and analyzed how quickly different brain regions forgot the incorrect context preceding each paragraph. The in silico adaptation used the MT-LSTM based encoding model to predict brain activity at different points in a paragraph when it was preceded by incorrect context. (A) The original study reported that each brain region (denoted by different colored lines) forgot information at a similar rate, despite differences in construction timescales. (B) In contrast, the in silico replication estimated that regions with longer construction timescales also forgot information more slowly. (C) Within the MT-LSTM itself, the forgetting rate of different units was related to their attributed timescales. (D) Next, the MT-LSTM’s language modeling abilities were tested on shuffled sentences or paragraphs. The DL model achieved better performance at next-word prediction by using the incoherent, shuffled context as opposed to no context at all. This shows that the DL model retains the incoherent information, possibly because it helps with the original language modeling task it was trained on or because the model has no explicit mechanism to flush out information when context changes (at sentence/paragraph boundaries). The computational model’s forgetting behavior thus differs from the brain’s, revealing specific flaws in the in silico model that could be improved in future versions, such as a modified MT-LSTM. DL = deep learning; MT-LSTM = multi-timescale long short-term memory; rSI = correlation between scrambled and intact chunks.
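The three context conditions used in this analysis (Figure 5D) can be sketched with any causal language model. The example below uses GPT-2 and toy sentences (the original analysis used the MT-LSTM and story paragraphs), masking the context tokens so that cross entropy is computed only over the paragraph itself.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def paragraph_cross_entropy(context: str, paragraph: str) -> float:
    """Mean cross entropy of the paragraph's tokens, conditioned on the
    given context (an empty string gives the no-context condition)."""
    par_ids = tok(paragraph, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, par_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100  # score only paragraph tokens
    else:
        input_ids, labels = par_ids, par_ids.clone()
    with torch.no_grad():
        return float(lm(input_ids, labels=labels).loss)

paragraph = "The dog finally found its way home after the storm."
actual_ctx = "A storm scattered the animals across the town."
wrong_ctx = "The recipe calls for two cups of flour and an egg."

print("actual context:", paragraph_cross_entropy(actual_ctx, paragraph))
print("no context    :", paragraph_cross_entropy("", paragraph))
print("wrong context :", paragraph_cross_entropy(wrong_ctx, paragraph))
```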
DISCUSSION
Advantages of the in Silico Experimental Design
In the following sections, we discuss the advantages of using the in silico experimental design with DL-based encoding models and its potential impact on language neuroscience.
Hypothesis development and testing
Each of the studies above conceptually replicated controlled analyses of different linguistic
properties using voxelwise encoding models fit on a single naturalistic fMRI data set. Overall,
the studies reproduced several effects reported in the original experiments. Interestingly, how-
ever, the in silico experiments also found new effects that had not been explored originally. For
example, the first experiment suggested that regions in inferior frontal gyrus were more active
for concrete words than abstract words. In the second experiment, the investigation of com-
position in phrases was expanded to a much larger stimulus set and longer phrases. This cor-
roborates earlier results, but the in silico paradigm also enables experimenters to explore
interactions between the effect of interest (here, linguistic composition) and other important
linguistic properties, such as semantic category. The third experiment found more diversity
in timescales in regions like prefrontal cortex than previously reported, closely matching more
recent studies of timescale distribution across cortex (Blank & Fedorenko, 2020; Caucheteux
et al., 2021) and single-neuron studies (Cavanagh et al., 2020). This demonstrates how the in
silico experimental design can be used not only to reproduce and test the generalizability of
controlled studies, but also to conduct large-scale exploratory and data-driven analyses that
can reveal new aspects of brain function.
Beyond these examples, it is possible to test new hypotheses using in silico experiments
before collecting data for a controlled experiment. Wehbe et al. (2016) showcase how
BOLDpredictions and in silico experimentation can be used to design new experiments. While the in
silico results may not precisely match the eventual data collected on a human population, they
would reveal areas where the underlying DL model has failed to match human-level process-
ing and present possible areas of improvement. This has the potential to advance our under-
standing of both neural network language models and biological language processing. In
particular, the in silico paradigm can both draw upon large-scale multidisciplinary efforts to
build tools and methods for interpreting neural network language models (Ettinger et al., 2018;
Hewitt & Manning, 2019; Ravfogel et al., 2020; Sundararajan et al., 2017), as well as con-
tribute to them by providing a human neural benchmark. Furthermore, more interpretable
models allow for novel causal intervention experiments that perturb and control ANNs in
ways that biological neural networks cannot be perturbed (Zhang et al., 2022).
Testing for generalizability of effects and experimental construct validity
One way to ensure the observed effects of the in silico experiments are not due to the specific
task design is to test the generalizability of effects across model architectures, training tasks,
neuroimaging data sets and modalities. Unlike reproducibility tests in traditional neuroimaging
experiments, these tests do not rely on laborious and time-consuming data collection. More-
over, there is a growing number of tools and techniques to interpret DL models (Clark et al.,
2019; Ettinger, 2020), and we can target investigations to these. For example, in the forgetting
experiment, the authors checked how the model represented the cognitive process itself. We
note that some drawbacks of DL models persist across architectures and tasks. For instance,
current language models still perform poorly on common sense reasoning and struggle with
capturing long-term dependencies in language. However, with technological advancements,
the types of inferences we can make with the in silico paradigm will greatly improve. A case in
point is the modeling of contextual processing in the brain. Until recently, language encoding
models were largely restricted to static word embeddings that made it difficult to analyze how
the brain processed word sequences. However, with the advent of neural language models,
this has changed dramatically.
In vision neuroscience, the functional profile of the fusiform face area was established
through contrast experiments that evolved over a long period of time (Anzellotti et al.,
2014; Gauthier et al., 2000; Kanwisher et al., 1997; Liu et al., 2010; Sergent et al., 1992;
Suzanne Scherf et al., 2008; Xu, 2005). Each new experiment was designed to address a con-
found that was not accounted for previously. Today, however, in silico experiments with vision
models have enabled neuroscientists to efficiently contrast large, diverse sets of stimuli and
establish the functional specificity of different regions (Ratan Murty et al., 2021). Similarly,
in language neuroscience, encoding models have been used to evaluate the semantic selec-
tivity of many regions, going beyond semantic contrasts that are tested for a handful of condi-
tions at a time (Huth et al., 2016; Mitchell et al., 2008). This demonstrates how the in silico
paradigm allows scientists to quickly design and test multiple experiments that get at the same underlying question. This means that in silico experiments, despite using similar manipulations to controlled experiments, can provide an additional way to test the construct validity of the experiment (Yarkoni, 2022). When coupled with generalizability testing, we run a lower risk of over-claiming or over-generalizing.

Construct validity:
Determination of whether a theoretical, experimental, or computational construct faithfully reflects the true phenomena.
Establishing the validity of model constructs
The in silico approach uniquely enables experimenters to evaluate and improve the design
of computational models based on observed in silico behavior, going beyond good prediction
performance. For example, in the forgetting experiment, the authors identified that the
MT-LSTM language model does not instantaneously switch context, and this could influence
the observed effects. One possible solution to the nonreplication would be to then train the
language model on a new task that encourages forgetting. Alternatively, it could prompt the
need for designing alternate architectures that have a built-in forgetting mechanism closer to
observed human behavior. Artificial neural networks can be investigated through causal inter-
vention experiments and perturbations, whereas such interventions are difficult or impossible in human language systems. By analyzing the behavior of DL models in many in silico experi-
ments, we can create a check-and-correct cycle to build better computational models of the
brain and establish the validity of model constructs.
An analogous paradigm has also risen in popularity in NLP. Moving beyond better perfor-
mance with larger language models, there has been a growing effort toward curating diverse
language tasks like syntactic reasoning and multistep inference to understand the limitations of
current models and establish a benchmark for future innovation. In the same vein, we believe
that many different in silico experiments can be used together to establish the validity of dif-
ferent model constructs and provide a benchmark to test future innovations in computational
language modeling. We hope that this pushes the field past solely testing encoding model
performance on different architectures. Ultimately, this paradigm is a bridge between compu-
tational models and experimental designs in neuroscience, such that we can make joint infer-
ences on both and improve them in tandem.
Preserving individual participant differences
One potential advantage of in silico encoding model experiments is that the models are typically
estimated with single-subject data, allowing experiments to test for effects in individuals rather
than over a group average. While group averaging is a common method to improve the signal-
to-noise ratio (SNR) of neuroimaging studies, it can lead to an underestimation of effect size
(Fedorenko, 2021; Handwerker et al., 2004) and hide effects that can be seen with more fine-
grained functional anatomical data (Popham et al., 2021). Finally, individual participant analysis
does not preclude the estimation of how prevalent an effect is at the population level (Ince et al.,
2021); however, it does enable experimenters to account for individual differences, which can be
critical to establish links between brain and behavior (Hedge et al., 2018). Consequently, there
has been a rising trend toward language studies that analyze participants individually and report
consistency of effects across the group (Blank & Fedorenko, 2020; Huth et al., 2016; Wehbe,
Vaswani, et al., 2014). While this requires the experimenter to collect more samples per subject
to improve the SNR, this approach does not make assumptions about anatomical alignment
and preserves spatial resolution important for inferring brain function. The improved sensitivity
provides better control for Type 1 errors (by allowing the experimenter to see which effects
replicate across participants) and Type 2 errors (by allowing a flexible mapping that can identify
important regions in each participant, even if they do not match perfectly in anatomy).
However, the individual-participant analytic approach raises important questions about
how to isolate functionally consistent regions across participants and infer consistency of
effects. One solution is to use a common set of functional localizer stimuli across participants
to isolate functionally homologous networks. For example, the auditory and visual language
localizers developed by Fedorenko et al. (2010) and T. L. Scott et al. (2017) have been shown
to robustly identify regions across cortex that are important for language processing. This
approach enables future studies to consistently isolate language processing regions and char-
acterize their function. Modeling approaches such as hyperalignment (Haxby et al., 2020) and
probabilistic mapping of the cortical surface (Huth et al., 2015) offer solutions to compute
group-level maps from functional data of individual participants. Nevertheless, these
approaches do not provide a computational framework to model individual-participant effects.
Encoding models, on the other hand, learn a different function for each brain element in each
subject. This enables them to effectively model individual participants and retain high spatial
resolution.
Improving reproducibility in language neuroscience
There has been an increasing concern in the sciences about the lack of reproducibility for
many types of experiments (Pashler & Harris, 2012; Simmons et al., 2011), a problem to which
neuroscience is not immune. Several papers have discussed the prevalence of analysis vari-
ability, software errors, nontransparent reporting of methods, and lack of data/code sharing as
primary causes for low reproducibility and generalizability in neuroscience (see Barch &
Yarkoni, 2013, and Poldrack et al., 2020, for introductions to special issues on reproducibility
in neuroimaging; Button et al., 2013; Evans, 2017). These studies have also identified issues in
statistical analyses, like low statistical power (and, consequently, inflated effect sizes), HARK-
ing, and p-hacking. We believe that the in silico experimentation paradigm can help alleviate
some of these issues by providing access to and encouraging open tools for scientific research.
When combined with open access to naturalistic data, preprocessing methods, and analysis
code, the in silico paradigm can enable scientists to use a standard setup as they test a variety
of different hypotheses and thus reduce the “researcher degrees of freedom.” Platforms such as
BOLDpredictions can help with this. Indeed, BOLDpredictions is intended as a community
tool to allow easy in silico experimentation and generalization testing, and to let other researchers contribute their encoding models for other experiments (even outside of language) so that in silico experiments can be available to all. Furthermore, competitions such
as Brain-Score (https://www.brain-score.org/competition/) and the SANS’22 Naturalistic fMRI
data analysis challenge (https://compsan.org/sans_data_competition/content/intro.html) can
align scientific work toward a common goal and facilitate verifiability. Since naturalistic exper-
iments broadly sample the stimulus space, the in silico paradigm can also act as a test bed for
generalizability.
Caveats and the Possibility of Overinterpretation
The in silico paradigm leverages the advantages of both controlled and naturalistic experimen-
tal designs with DL-based encoding models. However, it is important to recognize the caveats
of this approach so as to minimize the risk of overinterpretation. Here we discuss a number of
potential issues.
Limitations in the natural language stimulus
One critical advantage of naturalistic stimuli over controlled designs is the access to many
diverse examples of language use. However, this also means that the experimenter has little
control over the rate of occurrence of different types of linguistic features. Word frequency is an example of uncontrolled variation in natural stimuli (e.g., high-frequency words describe everyday objects like “table” and “book,” as opposed to low-frequency words like “artillery” and “democracy”). This presents an important challenge in naturalistic paradigms, as the rare variables will have low power and could lead to incorrect or incomplete inferences of brain function. For example, if a voxel encodes semantic concepts related to politics and governance, but this category is not well represented in the naturalistic stimuli, the experimenter runs the risk of incorrectly inferring the voxel function. This can be addressed by building larger, freely available data sets collected from diverse sources and encouraging replication of effects on them.

Fine-tuning:
A secondary learning procedure that modifies an already trained artificial neural network to adapt to a new task or data set.
Another issue with naturalistic paradigms is that they currently rely on the passive percep-
tion of language. Many studies have shown that turn-taking in conversation is an important,
universal aspect of communication and has implications for how we learn, understand and
generate language (Levinson, 2016). Despite a rising trend toward stimuli that take into
account social and contextual information, we are still far from studying truly natural use of
language with neuroimaging. Some work has investigated aspects of conversational commu-
nication (Bögels et al., 2015; Gisladottir et al., 2015; Magyari et al., 2014; Sievers et al., 2020),
but the field is still behind in modeling these effects with encoding models or ANNs. Richer
data sets will be key to developing these approaches, such as the real-world communication
data collected in Bevilacqua et al. (2019) or multimodal movie stimuli discussed in Redcay
and Moraczewski (2020). This is an important future direction for the naturalistic paradigm to
understand the brain mechanisms of language processing in ethological settings.
Limitations in the DL-based feature space
Perhaps the most important factor guiding the in silico experimental design is the success of
DL models at predicting brain activity. This paradigm allows neuroscientists to inspect brain
function by conducting simulations on the computational model instead, which is easier to
perturb, interpret, and control. However, this also means that the types of effects we can
observe are limited by the capabilities of the DL model. For example, the forgetting experiment
by Vo et al. (2023) demonstrates how the computational model has different behavior than the
human brain, affecting the observed in silico behavior. Domain shift presents another common
issue for neural networks, although recent studies have proposed that fine-tuning on the target
domain/task (Radford et al., 2018) and dynamic data selection during training (Aharoni &
Goldberg, 2020; van der Wees et al., 2017) can greatly alleviate this problem for language
models. Several encoding model studies explicitly fine-tuned the language model on the stimulus set (Jain et al., 2020) or trained the language model on a corpus specifically curated to
resemble the experimental task (Jain & Huth, 2018; Wehbe, Vaswani, et al., 2014). Further-
more, while ANNs like language models have been successfully employed for a wide range
of tasks, their syntactic, common sense, and logical reasoning abilities are still far from those of
humans (Ettinger, 2020; Linzen, 2020; Pandia & Ettinger, 2021; Wilcox et al., 2021). Overall, it
is important to note that building good encoding models of brain activity and understanding
brain function with the in silico paradigm are both contingent on better artificial models of
language processing.
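As a sketch of what such an adaptation step could look like, the snippet below fine-tunes a pretrained causal language model on in-domain text with the standard language modeling objective. The model choice, hyperparameters, and the stimulus_transcripts variable are illustrative assumptions, not the exact setup of the studies cited above.

```python
# Minimal domain-adaptation sketch; all choices here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(3):
    for text in stimulus_transcripts:  # hypothetical list of in-domain strings
        batch = tokenizer(text, return_tensors="pt",
                          truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```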
Limitations in computational modeling
Another source of confounds in encoding models and the in silico paradigm is incorrect
modeling assumptions. For example, Jain et al. (2020) highlight that many fMRI encoding
models rely on a downsampling technique that incorrectly transforms slowly varying features,
making them highly correlated with local word rate. Consequently, an experimenter may (incorrectly) conclude that a brain region that is well predicted by the downsampled features is selective for the slowly varying information (e.g., discourse) it captures, when, in fact, the brain region merely responds to the rate of words. In other cases, it may be important to model several different sources of noise, which has been pursued in other work simulating fMRI data (Ellis et al., 2020). Current neuroimaging modalities also have low SNR, limiting the predictive performance of computational models. Because all modeling approaches likely have caveats and edge cases for which their assumptions fail, it is important to clearly articulate and discuss these issues in future work.

Zone generalization:
Determination of whether two brain regions process stimuli similarly.
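As one concrete guard against the word-rate confound described above, an experimenter could check how strongly a downsampled feature's time course correlates with a simple word-rate regressor before interpreting its selectivity. The sketch below assumes word onset times in seconds and a fixed TR; the threshold of 0.5 is an arbitrary illustration, not a field standard.

```python
# Illustrative diagnostic: does a downsampled feature simply track word rate?
import numpy as np

def word_rate(word_times, tr_times, tr):
    """Count the words whose onsets fall inside each TR window."""
    return np.array([np.sum((word_times >= t) & (word_times < t + tr))
                     for t in tr_times])

# feature_tr: hypothetical (n_TRs,) downsampled feature time course.
rate = word_rate(word_times, tr_times, tr=2.0)
r = np.corrcoef(feature_tr, rate)[0, 1]
if abs(r) > 0.5:  # arbitrary illustrative threshold
    print(f"Feature correlates with word rate (r = {r:.2f}); apparent "
          "selectivity for slow information may reflect word rate alone.")
```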
Inappropriate causality and zone generalization
Unlike contrast-based experiments and encoding models with simple interpretable features
like indicator variables, DL-based encoding models rely on ANNs that are themselves hard
to interpret. As a result, any significant correlation observed between brain activity and model
predictions leaves many possibilities for interpretation. An experimenter may conclude that
the task or objective function the DL model was trained on closely resembles a task the brain
solves, when this may not be the case. For example, one might falsely infer that the brain does
predictive coding for language because it is well predicted by features from a language model
that is trained to predict the next word in a sequence. Guest and Martin (2023) elaborate on
this issue by discussing the logical fallacies in inferences drawn between brain behavior or
activity, and DL models of language. Specifically, they highlight that studies analyzing paral-
lels between the brain and computational models of language often attribute inappropriate
causality by assuming that predictive ability is sufficient to claim task similarity or model
equivalence. On the contrary, the valid direction of inference runs the other way: if an artificial model closely resembles the brain, then it can mimic brain behavior and activity; hence, a lack of predictive ability clearly indicates a lack of model equivalence. This is a pertinent issue for in silico
experimentation as the paradigm uses computational models of language processing in the
brain to simulate its behavior. However, it is important to note that in all of the in silico exam-
ples presented here, the authors were using the generalizability of the encoding models to
predict brain responses in different conditions. This only suggests that the encoding models
can effectively capture the brain’s behavior for language tasks but is not a sufficient account
to conclude model equivalence.
Another issue with logical inference in DL-based encoding models relates to the functional
equivalence of two brain regions that are both well predicted by a given feature space. In their
recent study, Toneva, Williams, et al. (2022) discuss this issue in detail for language encoding
models and provide a computational framework to analyze the extent to which brain regions
share computational mechanisms solely based on their encoding performance.
The three main sources of confounds—naturalistic stimuli, DL-based feature spaces, and
modeling assumptions—can intersect in interesting ways and raise the probability of incorrect
interpretation. False causality stemming from spurious correlations is a problem for in silico experiments, much as it is for controlled experiments. For this reason, it is important to emphasize trans-
parent analysis methods, better interpretability tools for DL models, and rigorous tests of repro-
ducibility with diverse data sources.
Although reproducibility is traditionally viewed as the replication of effects across partici-
pant pools/data sets, with in silico experimentation we can add another layer of replicability,
across different models that have different intrinsic biases (architectures, training objectives,
etc.) and learn different types of representations.
Important Factors of Consideration
Before doing in silico experimentation, one important consideration is determining if the
encoding model is “good enough.” While there is no quantitative threshold above which a
model can be considered suitable, we suggest the following.
Statistical significance of encoding model performance on a diverse, held-out test set
It is imperative that experimenters test whether encoding model performance is statistically
significant at the individual brain element level. Any in silico experimentation should only
be done on brain elements that are significantly predicted by the computational model. A
well-established approach in language encoding models is to correlate the true and predicted
responses for a held-out test set. Following this, a permutation test can be done to check if the
correlation is statistically significant.
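A minimal sketch of this test is shown below. It permutes the held-out time course in contiguous blocks rather than at single time points, a common way to respect the temporal autocorrelation of fMRI; the block length and permutation count are illustrative choices.

```python
# Illustrative block permutation test for held-out encoding performance.
import numpy as np

def block_permutation_pvals(y_true, y_pred, n_perm=1000, block=10, seed=0):
    """y_true, y_pred: (n_timepoints, n_voxels). Returns per-voxel p-values."""
    rng = np.random.default_rng(seed)
    T = (y_true.shape[0] // block) * block          # trim to whole blocks
    y_true, y_pred = y_true[:T], y_pred[:T]

    def corr(a, b):                                 # per-voxel Pearson r
        a = (a - a.mean(0)) / a.std(0)
        b = (b - b.mean(0)) / b.std(0)
        return (a * b).mean(0)

    r_obs = corr(y_true, y_pred)
    exceed = np.zeros(y_true.shape[1])
    for _ in range(n_perm):
        order = rng.permutation(T // block)         # shuffle block order
        idx = (order[:, None] * block + np.arange(block)).ravel()
        exceed += corr(y_true[idx], y_pred) >= r_obs
    return (exceed + 1) / (n_perm + 1)              # add-one p-value estimate
```

The resulting p-values should then be corrected for multiple comparisons across brain elements, for example with a false discovery rate procedure (Benjamini & Hochberg, 1995).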
We also emphasize the importance of using diverse test sets to effectively gauge generali-
zation. If a brain element is selective for features that are not present in the test set, then it may
falsely be labeled as poorly predicted. One feasible solution is to use a leave-one-out testing
procedure. This can be done by fitting an ensemble of encoding models, each of which
excludes one unique set of training data in the model estimation. Statistical significance can
then be measured for encoding model predictions on all held-out data. This procedure
increases diversity in the test set and improves statistical power.
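The sketch below illustrates one way to implement this procedure, under the assumptions that the stimuli are grouped into stories and that the encoding model is a ridge regression (a common choice; the regularization strength shown is arbitrary). The pooled held-out predictions can then be passed to a significance test such as the permutation procedure sketched above.

```python
# Illustrative leave-one-story-out ensemble of ridge encoding models.
import numpy as np
from sklearn.linear_model import Ridge

def loo_predictions(features_by_story, responses_by_story, alpha=1.0):
    """Fit one model per held-out story; pool the held-out predictions."""
    preds, trues = [], []
    n = len(features_by_story)
    for k in range(n):
        X_train = np.vstack([features_by_story[i] for i in range(n) if i != k])
        Y_train = np.vstack([responses_by_story[i] for i in range(n) if i != k])
        model = Ridge(alpha=alpha).fit(X_train, Y_train)
        preds.append(model.predict(features_by_story[k]))
        trues.append(responses_by_story[k])
    return np.vstack(trues), np.vstack(preds)
```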
Feature-space selection
Given the diversity of function across the brain, it is possible that no one feature space or
computational model best predicts all brain regions. Thus, experimenters should test several
different feature spaces and models (Nunez-Elizalde et al., 2019) and individually choose the one with the best held-out performance for each brain element. This is especially important for
DL models as different neural language models or their layers predict different brain regions
well. In this case, we would use the neural language model (layer) that best predicts held-out
stories for each element and, further, passes construct validity tests (i.e., has well-understood behavior in response to the controlled manipulation). For example, in the in silico semantic composition
experiment, Jain and Huth (2023) found that the lower layers of the neural language model
were generally indifferent to ablating words farther in the past (Khandelwal et al., 2018). Con-
sequently, these layers cannot be used to conduct the ablation study, as they do not respond to
the manipulation in the first place.
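Operationally, this selection step can be as simple as the sketch below: given held-out prediction performance for several candidate feature spaces that have already passed the construct validity check, each brain element is assigned the space that predicts it best. The candidate names here are hypothetical.

```python
# Illustrative per-element feature-space selection; names are hypothetical.
import numpy as np

# Held-out correlations per voxel for each validated candidate space,
# e.g., language model layers that actually respond to the manipulation.
r_by_space = {"lm_layer_6": r_layer6, "lm_layer_9": r_layer9,
              "word_rate": r_rate}                 # each: (n_voxels,) array
names = list(r_by_space)
scores = np.stack([r_by_space[n] for n in names])  # (n_spaces, n_voxels)
best_space = np.array(names)[scores.argmax(axis=0)]  # winning space per voxel
best_score = scores.max(axis=0)
```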
Interpreting the DL models
To establish the validity of computational model constructs, we suggested the use of interpretability tools and techniques to understand how the DL model itself represents a cognitive process. This would allow the experimenter to directly investigate sources of confounds.
It is also important to consider the types of questions the in silico paradigm is most suited to
answer. As demonstrated here, this paradigm can be used to estimate functional properties in
the brain, such as selectivity to different word categories or the processing timescale. It cannot,
however, be used to test the causal involvement of a brain area or the exact computational
mechanism. For example, many regions in the experiments above are shown to capture
semantic properties in language. Whether these regions play a causal role in semantic tasks can only be determined by an in vivo measurement.
Conclusion and Future Directions
In this article, we highlight the promises of in silico experimentation and detail how it brings
together the advantages of controlled experiments and naturalistic experiments paired with
encoding models. We showcase four in silico experiments that all rely on naturalistic language experiments to simulate four previous studies. We survey the advantages
and potential caveats of in silico experimentation and highlight how it can take advantage
of recent work in DL to simulate experiments with diverse types of language stimuli.
Current work on DL-based encoding models for language is largely restricted to self-
supervised models. This is expected since self-supervised models have been trained on large
amounts of data and consequently learn highly useful and transferable linguistic representa-
tions. However, it remains to be seen if task-based experimental designs in neuroscience can
be simulated and adapted with more goal-directed artificial language networks. Additionally, it is important to investigate and characterize which types of neuroscientific results can be
explored with self-supervised models and what aspects of language meaning are beyond the
scope of the next-word-prediction objective.
Lastly, DL-based language encoding models rely on feature extraction from language or
speech ANNs (linearizing transform) and learn a linear function atop the features. We believe
that the in silico paradigm can become more powerful if language encoding models directly
update the parameters of the ANN itself, resulting in an end-to-end system. While this has
been popularized in vision (e.g., Bashivan et al., 2019), it is yet to be explored for language.
This approach can potentially introduce diversity into the computational mechanisms of the
ANNs, such as recurrence, linear readout from a memory store, and so forth, to integrate pro-
cessing in different brain structures (hippocampus, cortex, etc.). This could allow us to under-
stand parallel mechanisms like linguistic function, working memory access, and attention
using this same approach.
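The contrast between the two regimes can be sketched in a few lines of PyTorch. In the standard linearizing-transform setup, the ANN is frozen and only a linear readout is trained; the end-to-end variant, shown as a commented alternative, would also update the ANN's own parameters. All names here (language_model, loader, hidden_dim, n_voxels) are hypothetical placeholders.

```python
# Conceptual sketch only; names are placeholders, not a worked pipeline.
import torch

readout = torch.nn.Linear(hidden_dim, n_voxels)  # features -> predicted BOLD

# Standard encoding model: freeze the language model, train the readout alone.
for p in language_model.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(readout.parameters(), lr=1e-3)

# End-to-end variant (yet to be explored for language): also update the ANN.
# optimizer = torch.optim.Adam(
#     list(language_model.parameters()) + list(readout.parameters()), lr=1e-5)

for stimulus, bold in loader:                    # placeholder data loader
    features = language_model(stimulus)          # assumed to return features
    loss = torch.nn.functional.mse_loss(readout(features), bold)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```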
ACKNOWLEDGMENTS
Research reported in this article was also supported by the National Institute on Deafness and
Other Communication Disorders of the National Institutes of Health as part of the Collabora-
tive Research in Computational Neuroscience (CRCNS) program. The content is solely the
responsibility of the authors and does not necessarily represent the official views of the
National Institutes of Health. We thank Nicole Beckage and Javier Turek for useful discussions
on this work.
FUNDING INFORMATION
Shailee Jain, Foundations of Language Fellowship, William Orr Dingwall Foundation. Alexander G. Huth, Burroughs Wellcome Fund (https://dx.doi.org/10.13039/100000861). Alexander G. Huth, Intel Corporation (https://dx.doi.org/10.13039/100002418). Alexander G. Huth, National Institute on Deafness and Other Communication Disorders (https://dx.doi.org/10.13039/100000055), Award ID: R01DC020088. Leila Wehbe, National Institute on Deafness and Other Communication Disorders (https://dx.doi.org/10.13039/100000055), Award ID: R01DC020088.
AUTHOR CONTRIBUTIONS
Shailee Jain: Conceptualization: Lead; Writing – original draft: Lead; Writing – review &
editing: Lead. Vy A. Vo: Conceptualization: Supporting; Writing – original draft: Supporting;
Writing – review & editing: Supporting. Leila Wehbe: Conceptualization: Supporting; Funding
acquisition: Equal; Supervision: Supporting; Writing – original draft: Supporting; Writing –
review & editing: Supporting. Alexander G. Huth: Conceptualization: Supporting; Funding
acquisition: Equal; Supervision: Lead; Writing – original draft: Supporting; Writing – review
& editing: Supporting.
REFERENCES
Abnar, S., Beinborn, L., Choenni, R., & Zuidema, W. (2019). Black-
box meets blackbox: Representational similarity & stability anal-
ysis of neural language models and brains. In Proceedings of the
2019 ACL workshop BlackboxNLP: Analyzing and interpreting
neural networks for NLP (pp. 191–203). Association for Compu-
tational Linguistics. https://doi.org/10.18653/v1/W19-4820
Aharoni, R., & Goldberg, Y. (2020). Unsupervised domain clusters
in pretrained language models. In Proceedings of the 58th annual
meeting of the Association for Computational Linguistics
(pp. 7747–7763). Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.692
Allison, J. (Producer). (2009–). The moth radio hour [Radio pro-
gram]. The Moth; Atlantic Public Media/PRX.
Anderson, A. J., Kiela, D., Binder, J. R., Fernandino, L., Humphries,
C. J., Conant, L. L., Raizada, R. D. S., Grimm, S., & Lalor, E. C.
(2021). Deep artificial neural networks reveal a distributed corti-
cal network encoding propositional sentence-level meaning.
Journal of Neuroscience, 41(18), 4100–4119. https://doi.org/10
.1523/JNEUROSCI.1152-20.2021, PubMed: 33753548
Antonello, R., & Huth, A. (2022). Predictive coding or just feature
discovery? An alternative account of why language models fit
brain data. Neurobiology of Language, 1–16. https://doi.org/10
.1162/nol_a_00087
Antonello, R., Turek, J. S., Vo, V. A., & Huth, A. (2021). Low-
dimensional structure in the space of language representations
is reflected in brain responses. In A. Beygelzimer, Y. Dauphin,
P. Liang, & J. W. Vaughan (Eds.), Advances in neural information
processing systems. NeurIPS. https://openreview.net/forum?id
=UYI6Sk_3Nox
Anzellotti, S., Fairhall, S. L., & Caramazza, A. (2014). Decoding
representations of face identity that are tolerant to rotation. Cere-
bral Cortex, 24(8), 1988–1995. https://doi.org/10.1093/cercor
/bht046, PubMed: 23463339
Aurnhammer, C., & Frank, S. L. (2018). Comparing gated and sim-
ple recurrent neural network architectures as models of human
sentence processing. PsyArXiv. https://doi.org/10.31234/osf.io
/wec74
Barch, D. M., & Yarkoni, T. (2013). Introduction to the special issue
on reliability and replication in cognitive and affective neurosci-
ence research. Cognitive, Affective, & Behavioral Neuroscience,
13(4), 687–689. https://doi.org/10.3758/s13415-013-0201-7,
PubMed: 23922199
Bashivan, P., Kar, K., & DiCarlo, J. J. (2019). Neural population
control via deep image synthesis. Science, 364(6439), Article
eaav9436. https://doi.org/10.1126/science.aav9436, PubMed:
31048462
Bemis, D. K., & Pylkkänen, L. (2011). Simple composition: A mag-
netoencephalography investigation into the comprehension of
minimal linguistic phrases. Journal of Neuroscience, 31(8),
2801–2814. https://doi.org/10.1523/JNEUROSCI.5003-10.2011,
PubMed: 21414902
Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On
meaning, form, and understanding in the age of data. In
Proceedings of the 58th annual meeting of the Association for
Computational Linguistics (pp. 5185–5198). Association for
Computational Linguistics. https://doi.org/10.18653/v1/2020.acl
-main.463
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discov-
ery rate: A practical and powerful approach to multiple testing.
Journal of the Royal Statistical Society B (Methodological), 57(1),
289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Bevilacqua, D., Davidesco, I., Wan, L., Chaloner, K., Rowland, J.,
Ding, M., Poeppel, D., & Dikker, S. (2019). Brain-to-brain syn-
chrony and learning outcomes vary by student–teacher dynamics:
Evidence from a real-world classroom electroencephalography
study. Journal of Cognitive Neuroscience, 31(3), 401–411. https://
doi.org/10.1162/jocn_a_01274, PubMed: 29708820
Bhattasali, S., Brennan, J., Luh, W.-M., Franzluebbers, B., & Hale, J.
(2020). The Alice datasets: fMRI & EEG observations of natural
language comprehension. In Proceedings of the 12th language
resources and evaluation conference (pp. 120–125). European
Language Resources Association. https://aclanthology.org/2020
.lrec-1.15
Bhattasali, S., Fabre, M., Luh, W.-M., Al Saied, H., Constant, M.,
Pallier, C., Brennan, J. R., Spreng, R. N., & Hale, J. (2019). Loca-
lising memory retrieval and syntactic composition: An fMRI study
of naturalistic language comprehension. Language, Cognition
and Neuroscience, 34(4), 491–510. https://doi.org/10.1080
/23273798.2018.1518533
Binder, J. R., Desai, R. H., Graves, W. W., & Conant, L. L. (2009).
Where is the semantic system? A critical review and meta-
analysis of 120 functional neuroimaging studies. Cerebral Cortex,
19(12), 2767–2796. https://doi.org/10.1093/cercor/bhp055,
PubMed: 19329570
Binder, J. R., Westbury, C. F., McKiernan, K. A., Possing, E. T., &
Medler, D. A. (2005). Distinct brain systems for processing con-
crete and abstract concepts. Journal of Cognitive Neuroscience,
17(6), 905–917. https://doi.org/10.1162/0898929054021102,
PubMed: 16021798
Blank, I. A., & Fedorenko, E. (2017). Domain-general brain regions do
not track linguistic input as closely as language-selective regions.
Journal of Neuroscience, 37(41), 9999–10011. https://doi.org/10
.1523/JNEUROSCI.3642-16.2017, PubMed: 28871034
Blank, I. A., & Fedorenko, E. (2020). No evidence for differences
among language regions in their temporal receptive windows.
NeuroImage, 219, Article 116925. https://doi.org/10.1016/j
.neuroimage.2020.116925, PubMed: 32407994
Bögels, S., Magyari, L., & Levinson, S. C. (2015). Neural signatures
of response planning occur midway through an incoming ques-
tion in conversation. Scientific Reports, 5(1), Article 12881.
https://doi.org/10.1038/srep12881, PubMed: 26242909
Boylan, C., Trueswell, J. C., & Thompson-Schill, S. L. (2015).
Compositionality and the angular gyrus: A multi-voxel similarity
analysis of the semantic composition of nouns and verbs. Neuropsy-
chologia, 78, 130–141. https://doi.org/10.1016/j.neuropsychologia
.2015.10.007, PubMed: 26454087
Brennan, J., Nir, Y., Hasson, U., Malach, R., Heeger, D. J., &
Pylkkänen, L. (2012). Syntactic structure building in the anterior
temporal lobe during natural story listening. Brain and Language,
120(2), 163–173. https://doi.org/10.1016/j.bandl.2010.04.002,
PubMed: 20472279
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J.,
Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why
small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org
/10.1038/nrn3475, PubMed: 23571845
Caucheteux, C., Gramfort, A., & King, J.-R. (2021). Model-based
analysis of brain activity reveals the hierarchy of language in
305 subjects. In Findings of the Association for Computational
Linguistics: EMNLP 2021 (pp. 3635–3644). Association for Computational Linguistics. https://hal.archives-ouvertes.fr/hal-03361430. https://doi.org/10.18653/v1/2021.findings-emnlp.308
Caucheteux, C., & King, J.-R. (2022). Brains and algorithms partially
converge in natural language processing. Communications
Biology, 5(1), Article 134. https://doi.org/10.1038/s42003-022
-03036-1, PubMed: 35173264
Cavanagh, S. E., Hunt, L. T., & Kennerley, S. W. (2020). A diversity
of intrinsic timescales underlie neural computations. Frontiers in
Neural Circuits, 14, 615626. https://doi.org/10.3389/fncir.2020
.615626, PubMed: 33408616
Chan, A. H. D., Luke, K.-K., Li, P., Yip, V., Li, G., Weekes, B., & Tan, L. H.
(2008). Neural correlates of nouns and verbs in early bilinguals.
Annals of the New York Academy of Sciences, 1145(1), 30–40.
https://doi.org/10.1196/annals.1416.000, PubMed: 19076387
Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro,
N. M., & Knight, R. T. (2010). Categorical speech representation
in human superior temporal gyrus. Nature Neuroscience, 13(11),
1428–1432. https://doi.org/10.1038/nn.2641, PubMed:
20890293
Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., &
Hasson, U. (2017). Shared memories reveal shared structure in
neural activity across individuals. Nature Neuroscience, 20(1),
115–125. https://doi.org/10.1038/nn.4450, PubMed: 27918531
Chen, Z., Chen, S., Wu, Y., Qian, Y., Wang, C., Liu, S., Qian, Y., &
Zeng, M. (2022). Large-scale self-supervised speech representa-
tion learning for automatic speaker verification. In ICASSP 2022
—IEEE international conference on acoustics, speech and signal
processing (ICASSP) (pp. 6147–6151). IEEE. https://doi.org/10
.1109/ICASSP43922.2022.9747814
Chien, H.-Y. S., & Honey, C. J. (2020). Constructing and forgetting
temporal context in the human cerebral cortex. Neuron, 106(4),
675–686. https://doi.org/10.1016/j.neuron.2020.02.013,
PubMed: 32164874
Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019).
What does BERT look at? An analysis of BERT’s attention. In Pro-
ceedings of the 2019 ACL workshop BlackboxNLP: Analyzing
and interpreting neural networks for NLP (pp. 276–286). Associ-
ation for Computational Linguistics. https://doi.org/10.18653/v1
/W19-4828
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., & Baroni, M.
(2018). What you can cram into a single $&!#* vector: Probing
sentence embeddings for linguistic properties. In Proceedings of
the 56th annual meeting of the Association for Computational
Linguistics ( Volume 1: Long papers) (pp. 2126–2136). Associa-
tion for Computational Linguistics. https://doi.org/10.18653/v1
/P18-1198
Çukur, T., Nishimoto, S., Huth, A. G., & Gallant, J. L. (2013). Atten-
tion during natural vision warps semantic representation across
the human brain. Nature Neuroscience, 16(6), 763–770. https://
doi.org/10.1038/nn.3381, PubMed: 23603707
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., &
Theunissen, F. E. (2017). The hierarchical cortical organization
of human speech processing. Journal of Neuroscience, 37(27),
6539–6557. https://doi.org/10.1523/JNEUROSCI.3267-16.2017,
PubMed: 28588065
Deniz, F., Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L.
(2019). The representation of semantic information across human
cerebral cortex during listening versus reading is invariant to
stimulus modality. Journal of Neuroscience, 39(39), 7722–7736.
https://doi.org/10.1523/JNEUROSCI.0675-19.2019, PubMed:
31427396
Deniz, F., Tseng, C., Wehbe, L., & Gallant, J. L. (2021). Semantic
representations during language comprehension are affected by
context. bioRxiv. https://doi.org/10.1101/2021.12.15.472839
Ellis, C. T., Baldassano, C., Schapiro, A. C., Cai, M. B., & Cohen,
J. D. (2020). Facilitating open-science with realistic fMRI simula-
tion: Validation and application. PeerJ, 8, Article e8564. https://
doi.org/10.7717/peerj.8564, PubMed: 32117629
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of
psycholinguistic diagnostics for language models. Transactions of
the Association for Computational Linguistics, 8, 34–48. https://
doi.org/10.1162/tacl_a_00298
Ettinger, A., Elgohary, A., Phillips, C., & Resnik, P. (2018). Assessing
composition in sentence vector representations. In Proceedings
of the 27th international conference on computational linguistics
(pp. 1790–1801). Association for Computational Linguistics.
https://aclanthology.org/C18-1152
Evans, S. (2017). What has replication ever done for us? Insights
from neuroimaging of speech perception. Frontiers in Human
Neuroscience, 11, 41. https://doi.org/10.3389/fnhum.2017.00041,
PubMed: 28203154
Fedorenko, E. (2021). The early origins and the growing popularity
of the individual-subject analytic approach in human neurosci-
ence. Current Opinion in Behavioral Sciences, 40, 105–112.
https://doi.org/10.1016/j.cobeha.2021.02.023
Fedorenko, E., Hsieh, P.-J., Nieto-Castañón, A., Whitfield-Gabrieli,
S., & Kanwisher, N. (2010). New method for fMRI investigations
of language: Defining ROIs functionally in individual subjects.
Journal of Neurophysiology, 104(2), 1177–1194. https://doi.org
/10.1152/jn.00032.2010, PubMed: 20410363
Friederici, A. D., Opitz, B., & von Cramon, D. Y. (2000). Segregat-
ing semantic and syntactic aspects of processing in the human
brain: An fMRI investigation of different word types. Cerebral
Cortex, 10(7), 698–705. https://doi.org/10.1093/cercor/10.7
.698, PubMed: 10906316
Futrell, R., Wilcox, E., Morita, T., Qian, P., Ballesteros, M., & Levy,
R. (2019). Neural language models as psycholinguistic subjects:
Representations of syntactic state. In Proceedings of the 2019
conference of the North American chapter of the Association
for Computational Linguistics: Human language technologies
( Volume 1: Long and short papers) (pp. 32–42). Association for
Computational Linguistics. https://doi.org/10.18653/v1/N19-1004
Gauthier, I., Skudlarski, P., Gore, J. C., & Anderson, A. W. (2000).
Expertise for cars and birds recruits brain areas involved in face
recognition. Nature Neuroscience, 3(2), 191–197. https://doi.org
/10.1038/72140, PubMed: 10649576
Gisladottir, R. S., Chwilla, D. J., & Levinson, S. C. (2015). Conver-
sation electrified: ERP correlates of speech act recognition in
underspecified utterances. PLOS ONE, 10(3), Article e0120068.
https://doi.org/10.1371/journal.pone.0120068, PubMed:
25793289
Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey,
B., Nastase, S. A., Feder, A., Emanuel, D., Cohen, A., Jansen, A.,
Gazula, H., Choe, G., Rao, A., Kim, S. C., Casto, C., Fanda, L.,
Doyle, W., Friedman, D., … Hasson, U. (2021). Thinking ahead:
Spontaneous prediction in context as a keystone of language in
humans and machines. bioRxiv. https://doi.org/10.1101/2020.12
.02.403477
Goodkind, A., & Bicknell, K. (2018). Predictive power of word sur-
prisal for reading times is a linear function of language model
quality. In Proceedings of the 8th workshop on cognitive model-
ing and computational linguistics (CMCL 2018) (pp. 10–18).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/W18-0102
Graves, W. W., Binder, J. R., Desai, R. H., Conant, L. L., & Seidenberg,
M. S. (2010). Neural correlates of implicit and explicit combinato-
rial semantic processing. NeuroImage, 53(2), 638–646. https://doi
.org/10.1016/j.neuroimage.2010.06.055, PubMed: 20600969
Guest, O., & Martin, A. E. (2023). On logical inference over brains,
behaviour, and artificial neural networks. Computational Brain &
Behavior. https://doi.org/10.1007/s42113-022-00166-x
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., & Baroni, M.
(2018). Colorless green recurrent networks dream hierarchically.
In Proceedings of the 2018 conference of the North American
chapter of the Association for Computational Linguistics: Human
language technologies ( Volume 1: Long papers) (pp. 1195–1205).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/N18-1108
Haber, J., & Poesio, M. (2021). Patterns of polysemy and homonymy
in contextualised language models. In Findings of the Association
for Computational Linguistics: EMNLP 2021 (pp. 2663–2676).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/2021.findings-emnlp.226
Hamilton, L. S., & Huth, A. G. (2018). The revolution will not be
controlled: Natural stimuli in speech neuroscience. Language,
Cognition and Neuroscience, 35(5), 573–582. https://doi.org/10
.1080/23273798.2018.1499946, PubMed: 32656294
Handwerker, D. A., Ollinger, J. M., & D’Esposito, M. (2004). Vari-
ation of BOLD hemodynamic responses across subjects and
brain regions and their effects on statistical analyses. Neuro-
Image, 21(4), 1639–1651. https://doi.org/10.1016/j.neuroimage
.2003.11.029, PubMed: 15050587
Hasson, U., Avidan, G., Gelbard, H., Vallines, I., Harel, M.,
Minshew, N., & Behrmann, M. (2009). Shared and idiosyncratic
cortical activation patterns in autism revealed under continuous
real-life viewing conditions. Autism Research, 2(4), 220–231.
https://doi.org/10.1002/aur.89, PubMed: 19708061
Hasson, U., Yang, E., Vallines, I., Heeger, D. J., & Rubin, N. (2008).
A hierarchy of temporal receptive windows in human cortex.
Journal of Neuroscience, 28(10), 2539–2550. https://doi.org/10
.1523/JNEUROSCI.5487-07.2008, PubMed: 18322098
Hauk, O., Johnsrude, I., & Pulvermüller, F. (2004). Somatotopic
representation of action words in human motor and premotor
cortex. Neuron, 41(2), 301–307. https://doi.org/10.1016/S0896
-6273(03)00838-9, PubMed: 14741110
Haxby, J. V., Guntupalli, J. S., Nastase, S. A., & Feilong, M. (2020).
Hyperalignment: Modeling shared information encoded in idio-
syncratic cortical topographies. ELife, 9, Article e56601. https://
doi.org/10.7554/eLife.56601, PubMed: 32484439
Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox:
Why robust cognitive tasks do not produce reliable individual
differences. Behavior Research Methods, 50(3), 1166–1186.
https://doi.org/10.3758/s13428-017-0935-1, PubMed:
28726177
Hewitt, J., & Liang, P. (2019). Designing and interpreting probes
with control tasks. In Proceedings of the 2019 conference on
empirical methods in natural language processing and the 9th
international joint conference on natural language processing
(EMNLP–IJCNLP) (pp. 2733–2743). Association for Computa-
tional Linguistics. https://doi.org/10.18653/v1/D19-1275
Hewitt, J., & Manning, C. D. (2019). A structural probe for finding
syntax in word representations. In Proceedings of the 2019
conference of the North American chapter of the Association
for Computational Linguistics: Human language technologies
( Volume 1: Long and short papers) (pp. 4129–4138). Association
for Computational Linguistics. https://doi.org/10.18653/v1/N19
-1419
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., &
Gallant, J. L. (2016). Natural speech reveals the semantic maps
that tile human cerebral cortex. Nature, 532(7600), 453–458.
https://doi.org/10.1038/nature17637, PubMed: 27121839
Huth, A. G., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2015).
PrAGMATiC: A probabilistic and generative model of areas tiling
the cortex. arXiv. https://doi.org/10.48550/arXiv.1504.03622
Ince, R. A., Paton, A. T., Kay, J. W., & Schyns, P. G. (2021). Bayesian
inference of population prevalence. ELife, 10, Article e62461.
https://doi.org/10.7554/eLife.62461, PubMed: 34612811
Jain, S., & Huth, A. (2018). Incorporating context into language
encoding models for fMRI. In S. Bengio, H. Wallach, H.
Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.),
Advances in Neural Information Processing Systems 31 (10 pp.).
NeurIPS.
Jain, S., & Huth, A. G. (2023). Discovering distinct patterns of
semantic integration across cortex using natural language encod-
ing models for fMRI [Manuscript in preparation]. Departments of
Computer Science & Neuroscience, University of Texas at
Austin.
Jain, S., Vo, V. A., Mahto, S., LeBel, A., Turek, J. S., & Huth, A.
(2020). Interpretable multi-timescale models for predicting fMRI
responses to continuous natural speech. In H. Larochelle, M.
Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in
Neural Information Processing Systems 33 (pp. 13738–13749).
NeurIPS.
Kable, J. W., Lease-Spellmeyer, J., & Chatterjee, A. (2002). Neural
substrates of action event knowledge. Journal of Cognitive
Neuroscience, 14(5), 795–805. https://doi.org/10.1162
/08989290260138681, PubMed: 12167263
Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform
face area: A module in human extrastriate cortex specialized for
face perception. Journal of Neuroscience, 17(11), 4302–4311.
https://doi.org/10.1523/JNEUROSCI.17-11-04302.1997,
PubMed: 9151747
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V.,
& McDermott, J. H. (2018). A task-optimized neural network
replicates human auditory behavior, predicts brain responses,
and reveals a cortical processing hierarchy. Neuron, 98(3),
630–644. https://doi.org/10.1016/j.neuron.2018.03.044,
PubMed: 29681533
Khandelwal, U., He, H., Qi, P., & Jurafsky, D. (2018). Sharp nearby,
fuzzy far away: How neural language models use context. In Pro-
ceedings of the 56th annual meeting of the Association for Com-
putational Linguistics ( Volume 1: Long papers) (pp. 284–294).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/P18-1027
Kumar, S., Sumers, T. R., Yamakoshi, T., Goldstein, A., Hasson, U.,
Norman, K. A., Griffiths, T. L., Hawkins, R. D., & Nastase, S. A.
(2022). Reconstructing the cascade of language processing in the
brain using the internal computations of a transformer-based
language model. bioRxiv. https://doi.org/10.1101/2022.06.08
.495348
Kutas, M., & Hillyard, S. A. (1984). Brain potentials during reading
reflect word expectancy and semantic association. Nature,
307(5947), 161–163. https://doi.org/10.1038/307161a0,
PubMed: 6690995
Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene,
S., & Baroni, M. (2019). The emergence of number and syntax
units in LSTM language models. In Proceedings of the 2019 con-
ference of the North American chapter of the Association for
Computational Linguistics: Human language technologies
( Volume 1: Long and short papers) (pp. 11–20). Association for
Computational Linguistics. https://doi.org/10.18653/v1/N19-1002
LeBel, A., Jain, S., & Huth, A. G. (2021). Voxelwise encoding
models show that cerebellar language representations are highly
conceptual. Journal of Neuroscience, 41(50), 10341–10355.
https://doi.org/10.1523/JNEUROSCI.0118-21.2021, PubMed:
34732520
LeBel, A., Wagner, L., Jain, S., Adhikari-Desai, A., Gupta, B.,
Morgenthal, A., Tang, J., Xu, L., & Huth, A. G. (2022). A natural
language fMRI dataset for voxelwise encoding models. bioRxiv.
https://doi.org/10.1101/2022.09.22.509104
Lerner, Y., Honey, C. J., Katkov, M., & Hasson, U. (2014). Temporal
scaling of neural responses to compressed and dilated natural
speech. Journal of Neurophysiology, 111(12), 2433–2444.
https://doi.org/10.1152/jn.00497.2013, PubMed: 24647432
Lerner, Y., Honey, C. J., Silbert, L. J., & Hasson, U. (2011).
Topographic mapping of a hierarchy of temporal receptive
windows using a narrated story. Journal of Neuroscience, 31(8),
2906–2915. https://doi.org/10.1523/JNEUROSCI.3684-10.2011,
PubMed: 21414912
Levinson, S. C. (2016). Turn-taking in human communication:
Origins and implications for language processing. Trends in
Cognitive Sciences, 20(1), 6–14. https://doi.org/10.1016/j.tics
.2015.10.010, PubMed: 26651245
Li, B. Z., Nye, M., & Andreas, J. (2021). Implicit representations of
meaning in neural language models. In Proceedings of the 59th
annual meeting of the Association for Computational Linguistics
and the 11th international joint conference on natural language
processing ( Volume 1: Long papers) (pp. 1813–1827). Associa-
tion for Computational Linguistics. https://doi.org/10.18653/v1
/2021.acl-long.143
Li, J., Bhattasali, S., Zhang, S., Franzluebbers, B., Luh, W.-M.,
Spreng, R. N., Brennan, J. R., Yang, Y., Pallier, C., & Hale, J.
(2022). Le Petit Prince multilingual naturalistic fMRI corpus.
Scientific Data, 9(1), Article 530. https://doi.org/10.1038
/s41597-022-01625-7, PubMed: 36038567
Li, Y., Anumanchipalli, G. K., Mohamed, A., Lu, J., Wu, J., & Chang,
E. F. (2022). Dissecting neural computations of the human audi-
tory pathway using deep neural networks for speech. bioRxiv.
https://doi.org/10.1101/2022.03.14.484195
Linzen, T. (2020). How can we accelerate progress towards
human-like linguistic generalization? In Proceedings of the 58th
annual meeting of the Association for Computational Linguistics
(pp. 5210–5217). Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.465
Linzen, T., & Leonard, B. (2018). Distinct patterns of syntactic
agreement errors in recurrent networks and humans. arXiv.
https://doi.org/10.48550/arXiv.1807.06882
Liu, J., Harris, A., & Kanwisher, N. (2010). Perception of face parts
and face configurations: An fMRI study. Journal of Cognitive
Neuroscience, 22(1), 203–211. https://doi.org/10.1162/jocn
.2009.21203, PubMed: 19302006
Magyari, L., Bastiaansen, M. C. M., de Ruiter, J. P., & Levinson, S. C.
(2014). Early anticipation lies behind the speed of response in
conversation. Journal of Cognitive Neuroscience, 26(11),
2530–2539. https://doi.org/10.1162/jocn_a_00673, PubMed:
24893743
Mahowald, K., Kachergis, G., & Frank, M. C. (2020). What counts
as an exemplar model, anyway? A commentary on Ambridge
(2020). First Language, 40(5–6), 608–611. https://doi.org/10
.1177/0142723720905920
Marvin, R., & Linzen, T. (2018). Targeted syntactic evaluation of
language models. In Proceedings of the 2018 conference on
empirical methods in natural language processing (pp. 1192–1202).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/D18-1151
Matusz, P. J., Dikker, S., Huth, A. G., & Perrodin, C. (2019). Are we
ready for real-world neuroscience? Journal of Cognitive Neuro-
science, 31(3), 327–338. https://doi.org/10.1162/jocn_e_01276,
PubMed: 29916793
Merkx, D., & Frank, S. L. (2021). Human sentence processing:
Recurrence or attention? In Proceedings of the workshop on
cognitive modeling and computational linguistics (pp. 12–22).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/2021.cmcl-1.2
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient esti-
mation of word representations in vector space. arXiv. https://doi
.org/10.48550/arXiv.1301.3781
Millet, J., Caucheteux, C., Orhan, P., Boubenec, Y., Gramfort, A.,
Dunbar, E., Pallier, C., & King, J.-R. (2022). Toward a realistic
model of speech processing in the brain with self-supervised
learning. arXiv. https://doi.org/10.48550/arXiv.2206.01685
Millet, J., & King, J.-R. (2021). Inductive biases, pretraining and
fine-tuning jointly account for brain responses to speech. arXiv.
https://doi.org/10.48550/arXiv.2103.01032
Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M.,
Malave, V. L., Mason, R. A., & Just, M. A. (2008). Predicting
human brain activity associated with the meanings of nouns.
Science, 320(5880), 1191–1195. https://doi.org/10.1126/science
.1152876, PubMed: 18511683
Nastase, S. A., Liu, Y.-F., Hillman, H., Zadbood, A., Hasenfratz, L.,
Keshavarzian, N., Chen, J., Honey, C. J., Yeshurun, Y., Regev, M.,
Nguyen, M., Chang, C. H. C., Baldassano, C., Lositsky, O.,
Simony, E., Chow, M. A., Leong, Y. C., Brooks, P. P., Micciche,
E., … Hasson, U. (2021). The “Narratives” fMRI dataset for eval-
uating models of naturalistic language comprehension. Scientific
Data, 8(1), Article 250. https://doi.org/10.1038/s41597-021
-01033-3, PubMed: 34584100
Nayebi, A., Attinger, A., Campbell, M. G., Hardcastle, K., Low,
I. I. C., Mallory, C. S., Mel, G. C., Sorscher, B., Williams, A. H.,
Ganguli, S., Giocomo, L. M., & Yamins, D. L. K. (2021). Explaining
heterogeneity in medial entorhinal cortex with task-driven neural
networks. bioRxiv. https://doi.org/10.1101/2021.10.30.466617
Noppeney, U., Josephs, O., Kiebel, S., Friston, K. J., & Price, C. J.
(2005). Action selectivity in parietal and temporal cortex. Cogni-
tive Brain Research, 25(3), 641–649. https://doi.org/10.1016/j
.cogbrainres.2005.08.017, PubMed: 16242924
Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). Voxel-
wise encoding models with non-spherical multivariate normal
priors. NeuroImage, 197, 482–492. https://doi.org/10.1016/j
.neuroimage.2019.04.012, PubMed: 31075394
Overath, T., McDermott, J. H., Zarate, J. M., & Poeppel, D. (2015).
The cortical analysis of speech-specific temporal structure
revealed by responses to sound quilts. Nature Neuroscience, 18(6),
903–911. https://doi.org/10.1038/nn.4021, PubMed: 25984889
Pandia, L., & Ettinger, A. (2021). Sorting through the noise: Testing
robustness of information processing in pre-trained language
models. In Proceedings of the 2021 conference on empirical
methods in natural language processing (pp. 1583–1596). Asso-
ciation for Computational Linguistics. https://doi.org/10.18653
/v1/2021.emnlp-main.119
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown?
Three arguments examined. Perspectives on Psychological Science,
7(6), 531–536. https://doi.org/10.1177/1745691612463401,
PubMed: 26168109
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global
vectors for word representation. In Proceedings of the 2014 con-
ference on empirical methods in natural language processing
(EMNLP) (pp. 1532–1543). Association for Computational
Linguistics. https://doi.org/10.3115/v1/D14-1162
Poldrack, R. A., Whitaker, K., & Kennedy, D. (2020). Introduction to
the special issue on reproducibility in neuroimaging. NeuroImage,
218, Article 116357. https://doi.org/10.1016/j.neuroimage.2019
.116357, PubMed: 31733374
Popham, S. F., Huth, A. G., Bilenko, N. Y., Deniz, F., Gao, J. S.,
Nunez-Elizalde, A. O., & Gallant, J. L. (2021). Visual and linguis-
tic semantic representations are aligned at the border of human
visual cortex. Nature Neuroscience, 24(11), 1628–1636. https://
doi.org/10.1038/s41593-021-00921-6, PubMed: 34711960
Prasad, G., van Schijndel, M., & Linzen, T. (2019). Using priming to
uncover the organization of syntactic representations in neural
language models. In Proceedings of the 23rd conference on
computational natural language learning (CoNLL) (pp. 66–76).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/K19-1007
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018).
Improving language understanding by generative pre-training
[Preprint]. Papers With Code.
Ratan Murty, N. A., Bashivan, P., Abate, A., DiCarlo, J. J., &
Kanwisher, N. (2021). Computational models of category-
selective brain regions enable high-throughput tests of selectivity.
Nature Communications, 12(1), 5540. https://doi.org/10.1038
/s41467-021-25409-6, PubMed: 34545079
Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., & Goldberg, Y.
(2020). Null it out: Guarding protected attributes by iterative null-
space projection. In Proceedings of the 58th annual meeting of
the Association for Computational Linguistics (pp. 7237–7256).
Association for Computational Linguistics. https://doi.org/10
.18653/v1/2020.acl-main.647
Redcay, E., & Moraczewski, D. (2020). Social cognition in context:
A naturalistic imaging approach. NeuroImage, 216, Article
116392. https://doi.org/10.1016/j.neuroimage.2019.116392,
PubMed: 31770637
Reddy, A. J., & Wehbe, L. (2020). Can fMRI reveal the representa-
tion of syntactic structure in the brain? bioRxiv. https://doi.org/10
.1101/2020.06.16.155499
Regev, M., Honey, C. J., Simony, E., & Hasson, U. (2013). Selective
and invariant neural responses to spoken and written narratives.
Journal of Neuroscience, 33(40), 15978–15988. https://doi.org
/10.1523/JNEUROSCI.1580-13.2013, PubMed: 24089502
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A.,
Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The
neural architecture of language: Integrative modeling converges
on predictive processing. Proceedings of the National Academy
of Sciences, 118(45), Article e2105646118. https://doi.org/10
.1073/pnas.2105646118, PubMed: 34737231
Scott, S. K. (2019). From speech and talkers to the social world:
The neural processing of human spoken language. Science,
366(6461), 58–62. https://doi.org/10.1126/science.aax0288,
PubMed: 31604302
Scott, T. L., Gallée, J., & Fedorenko, E. (2017). A new fun and robust
version of an fMRI localizer for the frontotemporal language sys-
tem. Cognitive Neuroscience, 8(3), 167–176. https://doi.org/10
.1080/17588928.2016.1201466, PubMed: 27386919
Sergent, J., Ohta, S., & MacDonald, B. (1992). Functional neuro-
anatomy of face and object processing. A positron emission
tomography study. Brain: A Journal of Neurology, 115(1),
15–36. https://doi.org/10.1093/brain/115.1.15, PubMed: 1559150
Shain, C., Blank, I. A., van Schijndel, M., Schuler, W., & Fedorenko,
E. (2020). fMRI reveals language-specific predictive coding
during naturalistic sentence comprehension. Neuropsychologia,
138, Article 107307. https://doi.org/10.1016/j.neuropsychologia
.2019.107307, PubMed: 31874149
Sievers, B., Welker, C., Hasson, U., Kleinbaum, A. M., & Wheatley,
T. (2020). How consensus-building conversation changes our
minds and aligns our brains. PsyArXiv. https://doi.org/10.31234
/osf.io/562z7
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-
positive psychology: Undisclosed flexibility in data collection
and analysis allows presenting anything as significant. Psycho-
logical Science, 22(11), 1359–1366. https://doi.org/10.1177
/0956797611417632, PubMed: 22006061
Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribu-
tion for deep networks. In D. Precup & Y. W. Teh (Eds.), Proceed-
ings of the 34th international conference on machine learning
(Volume 70) (pp. 3319–3328). PMLR.
Suzanne Scherf, K., Behrmann, M., Minshew, N., & Luna, B.
(2008). Atypical development of face and greeble recognition
in autism. Journal of Child Psychology and Psychiatry, 49(8),
838–847. https://doi.org/10.1111/j.1469-7610.2008.01903.x,
PubMed: 18422548
Tenney, I., Das, D., & Pavlick, E. (2019). BERT rediscovers the
classical NLP pipeline. arXiv. https://doi.org/10.48550/arXiv
.1905.05950
Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T.,
Kim, N., Durme, B. V., Bowman, S. R., Das, D., & Pavlick, E.
(2019). What do you learn from context? Probing for sentence
structure in contextualized word representations [Preprint]. Open
Review. https://openreview.net/forum?id=SJzSgnRcKX
Toneva, M., Mitchell, T. M., & Wehbe, L. (2022). Combining com-
putational controls with natural text reveals new aspects of mean-
ing composition. Nature Computational Science, 2(1), 745–757.
https://doi.org/10.1038/s43588-022-00354-6, PubMed: 36777107
Toneva, M., & Wehbe, L. (2019). Interpreting and improving
natural-language processing (in machines) with natural
language-processing (in the brain). In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.),
Advances in Neural Information Processing Systems 32. NeurIPS.
Toneva, M., Williams, J., Bollu, A., Dann, C., & Wehbe, L. (2022).
Same cause; different effects in the brain. Proceedings of the First
Conference on Causal Learning and Reasoning, 177, 787–825.
Vaidya, A. R., Jain, S., & Huth, A. (2022). Self-supervised models of
audio effectively explain human cortical responses to speech.
Proceedings of the 39th International Conference on Machine
Learning, 162, 21927–21944. https://proceedings.mlr.press
/v162/vaidya22a.html
van der Wees, M., Bisazza, A., & Monz, C. (2017). Dynamic data
selection for neural machine translation. In Proceedings of the
2017 conference on empirical methods in natural language
processing (pp. 1400–1410). Association for Computational
Linguistics. https://doi.org/10.18653/v1/D17-1147
Vo, V. A., Jain, S., Beckage, N., Chien, H.-Y. S., Obinwa, C., & Huth,
A. G. (2023). A unifying computational account of temporal
processing in natural speech across cortex [Manuscript in prepara-
tion]. Departments of Computer Science & Neuroscience, Univer-
sity of Texas at Austin.
Wallentin, M., Østergaard, S., Lund, T. E., Østergaard, L., & Roepstorff,
A. (2005). Concrete spatial language: See what I mean? Brain
and Language, 92(3), 221–233. https://doi.org/10.1016/j.bandl
.2004.06.106, PubMed: 15721955
Wang, A., Tarr, M., & Wehbe, L. (2019). Neural taskonomy: Infer-
ring the similarity of task-derived representations from brain
activity. In H. Wallach, H. Larochelle, A. Beygelzimer, F.
d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural
Information Processing Systems 32. NeurIPS. https://proceedings.neurips.cc/paper/2019/hash/f490c742cd8318b8ee6dca10af2a163f-Abstract.html
Wang, S., Zhang, J., Lin, N., & Zong, C. (2020). Probing brain
activation patterns by dissociating semantics and syntax in
sentences. Proceedings of the AAAI Conference on Artificial
Intelligence, 34(5), 9201–9208. https://doi.org/10.1609/aaai
.v34i05.6457
Wang, S., Zhang, X., Zhang, J., & Zong, C. (2022). A synchronized
multimodal neuroimaging dataset for studying brain language
processing. Scientific Data, 9(1), Article 590. https://doi.org/10
.1038/s41597-022-01708-5, PubMed: 36180444
Wehbe, L., Huth, A. G., Deniz, F., Gao, J., Kieseler, M.-L., &
Gallant, J. L. (2018). BOLD predictions: Automated simulation
of fMRI experiments [Poster]. 2018 Conference on Cognitive
Computational Neuroscience, Philadelphia, Pennsylvania.
https://doi.org/10.32470/CCN.2018.1123-0
Wehbe, L., Huth, A. G., Deniz, F., Gao, J., Kieseler, M.-L., &
Gallant, J. L. (2021). BOLDpredictions [Software]. https://github.com/boldprediction
Wehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., &
Mitchell, T. (2014). Simultaneously uncovering the patterns of
brain regions involved in different story reading subprocesses.
PLOS ONE, 9(11), Article e112575. https://doi.org/10.1371
/journal.pone.0112575, PubMed: 25426840
Wehbe, L., Vaswani, A., Knight, K., & Mitchell, T. (2014). Aligning
context-based statistical models of language with brain activity
during reading. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP) (pp. 233–243).
Association for Computational Linguistics. https://doi.org/10.3115
/v1/D14-1030
Westfall, J., Nichols, T. E., & Yarkoni, T. (2017). Fixing the
stimulus-as-fixed-effect fallacy in task fMRI. Wellcome Open
Research, 1, 23. https://doi.org/10.12688/wellcomeopenres
.10298.2, PubMed: 28503664
Wilcox, E., Vani, P., & Levy, R. (2021). A targeted assessment of
incremental processing in neural language models and humans.
In Proceedings of the 59th annual meeting of the association for
computational linguistics and the 11th international joint confer-
ence on natural language processing (Volume 1: Long papers)
(pp. 939–952). Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.acl-long.76
Wu, A., Wang, C., Pino, J., & Gu, J. (2020). Self-supervised represen-
tations improve end-to-end speech translation. Interspeech 2020,
1491–1495. https://doi.org/10.21437/Interspeech.2020-3094
Wu, M. C.-K., David, S. V., & Gallant, J. L. (2006). Complete func-
tional characterization of sensory neurons by system identifica-
tion. Annual Review of Neuroscience, 29, 477–505. https://doi
.org/10.1146/annurev.neuro.29.051605.113024, PubMed:
16776594
Xu, Y. (2005). Revisiting the role of the fusiform face area in visual
expertise. Cerebral Cortex, 15(8), 1234–1242. https://doi.org/10
.1093/cercor/bhi006, PubMed: 15677350
Yamins, D. L., & DiCarlo, J. J. (2016). Eight open questions in the
computational modeling of higher sensory cortex. Current Opin-
ion in Neurobiology, 37, 114–120. https://doi.org/10.1016/j.conb
.2016.02.001, PubMed: 26921828
Yarkoni, T. (2022). The generalizability crisis. Behavioral and
Brain Sciences, 45, Article e1. https://doi.org/10.1017
/S0140525X20001685, PubMed: 33342451
Yeshurun, Y., Nguyen, M., & Hasson, U. (2017). Amplification of
local changes along the timescale processing hierarchy. Proceed-
ings of the National Academy of Sciences, 114(35), 9475–9480.
https://doi.org/10.1073/pnas.1701652114, PubMed: 28811367
Zhang, X., Wang, S., Lin, N., Zhang, J., & Zong, C. (2022). Probing
word syntactic representations in the brain by a feature elimina-
tion method [Poster]. Proceedings of the 36th AAAI conference
on artificial intelligence. Association for the Advancement of
Artificial Intelligence. https://aaai-2022.virtualchair.net/poster
_aaai7935