TESTING, STRESS, AND PERFORMANCE: HOW

TESTING, STRESS, AND PERFORMANCE: HOW

STUDENTS RESPOND PHYSIOLOGICALLY TO

HIGH-STAKES TESTING

Abstracto
We examine how students’ physiological stress differs between a
regular school week and a high-stakes testing week, and we raise
questions about how to interpret high-stakes test scores. Y noche-
tential contributor to socioeconomic disparities in academic per-
formance is the difference in the level of stress experienced by
students outside of school. Chronic stress—due to neighborhood
violence, poverty, or family instability—can affect how individu-
als’ bodies respond to stressors in general, including the stress
of standardized testing. Este, Sucesivamente, can affect whether perfor-
mance on standardized tests is a valid measure of students’ actual
capacidad. We collect data on students’ stress responses using corti-
sol samples provided by low-income students in New Orleans. Nosotros
measure how their cortisol patterns change during high-stakes
testing weeks relative to baseline weeks. We find that high-stakes
testing is related to cortisol responses, and those responses are
related to test performance. Those who responded most strongly,
with either increases or decreases in cortisol, scored 0.40 estan-
dard deviations lower than expected on the high-stakes exam.

https://doi.org/10.1162/edfp_a_00306
Sin derechos reservados. This work was authored as part of the Contributor’s official duties as an Employee of the
United States Government and is therefore the work of the United States Government. De acuerdo con
17 USC. 105, no hay protección de derechos de autor disponible para dichas obras bajo los EE.UU.. law.

Jennifer A.. Heissel

(Autor correspondiente)
Graduate School of Defense

Management

Naval Postgraduate School
Monterey, California 93943
jaheisse@nps.edu

Emma K. Adán

School of Education and

Social Policy

Northwestern University

Evanston, IL 60208
ek-adam@northwestern.edu

Jennifer L. Doleac

Departamento de Economía

Texas A&Universidad M
College Station, Texas 77845
jdoleac@tamu.edu

David N. Figlio

School of Education and

Social Policy

Northwestern University

Evanston, IL 60208
figlio@northwestern.edu

Jonathan Meer

Departamento de Economía

Texas A&Universidad M
College Station, Texas 77845
jmeer@tamu.edu

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

F

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

.

/

F

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

183

Testing, Stress, y rendimiento

INTRODUCCIÓN

1 .
The results of high-stakes standardized tests determine course placement, graduation,
and college admission for students, result in sanctions or rewards for schools, and in-
form education policy. There is substantial resistance to testing regimes, often pred-
icated on the notion that students are “stressed” by tests.1 Yet, a nuestro conocimiento, No
evidence exists on test-induced physiological stress among K–12 students in a real-world
setting.2 Understanding variation in test-induced stress responses and implications for
performance is important for determining whether scores on high-stakes tests are re-
liable measures of ability and knowledge, or if they are biased by “stress disparities”
between children (see review in Heissel, Exacción, and Adam 2017).

This study raises important questions about the use of high-stakes testing. We doc-
ument how high-stakes testing affects low-income children’s stress biology in one char-
ter school network, and we show how changes in children’s physiological responses to
high-stakes tests relate to performance on the standardized test. Our goal in this study
is to identify these patterns in one setting, and to call for more research into this area in
the field. Expanding our understanding of the relationship between stress and test per-
formance will affect our understanding of how high-stakes test results should be used
and interpreted. Throughout this paper, our footnotes include suggestions for future
investigación en esta área.

We use saliva-based measures of cortisol—a primary stress hormone that indicates
how the biological stress system is functioning—among low-income students in New
Orleans to document how cortisol levels change in response to a high-stakes standard-
ized test administered to students in grades 3–8, relative to a regular baseline school
week. We call this change “cortisol reactivity.” We find that students have 18 por ciento
higher cortisol levels in the homeroom period just before taking the high-stakes test,
relative to that same timeframe during weeks without testing. These differences are
driven by boys, whose homeroom cortisol is 35 percent higher during testing weeks
than regular weeks.3

Everyone has a natural cortisol rhythm over the course of the day (described in more
detail in section 2). Acute stressors are associated with increases in cortisol above these
natural rhythms. An increase in cortisol is not necessarily bad—in the best case, él
can provide the energetic boost one needs to respond to a challenge with attention and
focus. How an individual responds to a given stressor is based on what is adaptive in that
person’s particular context (Del Giudice, Ellis, and Shirtcliff 2011; Shirtcliff et al. 2014).

1. The Center for American Progress found that 49 percent of parents thought there was too much testing in
escuelas (Lazarín 2014), and the New York Association of School Psychologists provides an overview of many
reported parent concerns (Heiser et al. 2015). These concerns are not unfounded: grade 3–5 students reported
higher anxiety and stress symptoms following No Child Left Behind-required testing, relative to lower-stakes
classroom testing (Segool et al. 2013).

2. A variety of studies have examined cognitive tests in lab settings (p.ej., Lupien et al. 2002; stroud, Salovey, y
Epel 2002) or with researcher-administered tests in schools that did not matter for student or school outcomes
(p.ej., Blair, Granjero, and Razza 2005; Lindahl, Theorell, and Lindblad 2005). Though these studies provide
evidence of potential responses, they do not include baseline, non-testing weeks in their analysis. Other studies
have looked at adult responses in undergraduate and medical students (Malarkey et al. 1995; Weekes et al.
2006).

3. This is consistent with previous evidence that males show larger cortisol responses to achievement-related

stressors than females (stroud, Salovey, and Epel 2002; Weekes et al. 2006).

184

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

F

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

F

/

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

What is adaptive to a given individual may differ by various background characteristics,
and it is not necessarily adaptive in the context of an academic test. Although our entire
sample can be considered economically disadvantaged, we find suggestive evidence of
differences in cortisol changes by level of disadvantage, with the largest cortisol effects
for those living in high-poverty and high-crime neighborhoods.

We next examine whether differences in cortisol reactivity are associated with test
performance on a subsample of students for whom we have test score data. Large in-
creases in cortisol can make concentration difficult, while reduced cortisol may be a
sign of disengagement with a task. We show that both increases and decreases in cor-
tisol from the baseline week to the high-stakes testing week are associated with lower
test scores on the high-stakes test, relative to how we would expect students to perform
based on other in-school academic performance (es decir., Los grados).

Descriptive studies show that children from low socioeconomic status and
racial/ethnic minority groups have lower average scores on standardized academic tests
relative to high socioeconomic status and white families (Reardon 2011; Bradbury et al.
2015). Low socioeconomic status and racial/ethnic minority individuals are also more
likely to be exposed to stressful life events relative to higher income or white individu-
como (see review in Hatch and Dohrenwend 2007). These patterns are correlated, pero el
physiological stress response may provide a link between them. En particular, estudiantes
who experience chronic stress may respond differently to new stressors, such as high-
stakes tests. Persistent socioeconomic gaps in academic performance could be due in
part to different responses to the stress of testing disparately affecting test performance.
This disparate effect, Sucesivamente, has implications for whether standardized tests are a fair
means of evaluating student ability and school quality.

This study makes several contributions. For one, we document cortisol patterns for
a low-income 7-to-15-year-old student population about which there is limited evidence.
This is the first study to take cortisol samples from such young students during the
timeframe surrounding high-stakes testing, and our experience provides guidance for
researchers interested in measuring cortisol levels in similar populations. Segundo, nosotros
document how cortisol patterns change for this population in response to a stressful
evento. This is relevant to understanding how students respond to tests in their actual
school settings. Tercero, we provide the first evidence on how differences in cortisol re-
sponses are related to performance on real-world standardized tests. This is crucial
for understanding the validity of those tests themselves and the interpretation of indi-
vidual differences in test results, which can have important real-world consequences.
Our analysis of test performance is necessarily correlational—who has larger cortisol
responses is not randomly assigned. Changes to cortisol could be attributed to other
outside shocks (p.ej., changes to family income) that are also correlated with test scores.
Still, after conditioning on in-school performance (Los grados), demographics, and neigh-
borhood characteristics, we find that large changes to cortisol are associated with worse
outcomes on the test. We use this evidence as a call for more research into the relation-
ship between stress and test performance, across a variety of settings.

This paper proceeds as follows: En la sección 2 we provide more background on the
science of biological stress responses and the cortisol hormone. Sección 3 describes our
datos. Sección 4 describes our analytic strategy. Sección 5 presents our results. Sección 6
discusses the results and conclusions.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

F

/

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

/

.

F

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

185

Testing, Stress, y rendimiento

2 . B AC K G RO U N D O N B I O L O G I C A L S T R E S S R E S P O N S E S A N D C O R T I S O L
Biological stress responses include multiple systems, but this paper focuses on the
hypothalamic-pituitary-adrenal (HPA) axis and its primary hormonal product, cortisol.
Cortisol levels show a strong circadian rhythm across the day, known as the diurnal cor-
tisol rhythm, with the highest cortisol levels occurring shortly after waking and the low-
est levels occurring about thirty minutes after sleep begins (see Gunnar and Quevedo
2007 for more details). Two key measures in cortisol research are the waking cortisol
level and the daily cortisol slope (es decir., the rate at which cortisol levels drop from wake to
bedtime). The cortisol awakening response (CAR), a sharp increase in cortisol thirty to
forty minutes after waking, is an additional measure. The CAR provides an energetic
boost to help individuals meet the expected demands of the upcoming day (see review
in Clow et al. 2010).

Real or perceived stressors can increase cortisol above typical diurnal levels.4 For
routine stressors (p.ej., momentary loneliness [Doane and Adam 2010]), cortisol levels
return to their usual daily pattern approximately an hour after the stressor has passed.
According to the adaptive calibration model, the stress response is generally adaptive;
por ejemplo, the HPA axis may mobilize psychological and physiological responses
when presented with a stressor (Del Giudice, Ellis, and Shirtcliff 2011; Shirtcliff et al.
2014). One at-home study had twenty-four participants (aged 21–42 years) recruited
from a university community provide hourly cortisol samples over a 48-hour period.
Rising cortisol was associated with subsequent-hour increases in positive emotions
such as activeness, alertness, and relaxation, and marginally significant decreases in
nervousness (Hoyt et al. 2016).

Broadly, high or rising cortisol occurs when individuals are in personally relevant
situations, are engaged with their environment, and are facing a difficult (but not im-
posible) tarea. Low or diminishing cortisol occurs if an individual is disengaged from
the environment, a task is impossible, or a task is no longer novel.5 The HPA axis
can also be anticipatory, with rising cortisol levels before an expected stressful event
or changes to the CAR if the prior day was particularly stressful.6 In the context of
high-stakes testing, we may expect moderately increased cortisol before the test, par-
ticularly if the student expects the test to be difficult but manageable, with stakes that
matter for them. Limited (or lowered) cortisol responses to stressors may be related to

4. This pattern has been consistently demonstrated in the psychology and endocrinology literature (see reviews

in Sapolsky, Romero, and Munck 2000; Molinero, Chen, and Zhou 2007; Adán 2012).

5. The adaptive calibration model attempts to build a model of the development of stress responsivity in general
(Del Giudice, Ellis, and Shirtcliff 2011), and Shirtcliff et al. (2014) specifically focus on the cost/benefit of cortisol
responsivity in individuals’ particular contexts. This latter model specifically argues against the popular notion
of cortisol as detrimental to health and well-being, and instead argues that cortisol responses can be beneficial
in certain contexts. A large meta-analysis of 208 studies found that stressors that were uncontrollable or had a
social-evaluative component (meaning that performance could be negatively judged by others) led to the largest
increase in cortisol in laboratory settings (Dickerson and Kemeny 2004).

6. See Engert et al. (2013) for a summary of anticipatory cortisol in lab-based settings. The effect has also been
demonstrated in the field: por ejemplo, seventeen young men set to participate in a judo competition had
higher cortisol on the day of the competition (but before the competition began) than at the same time on non-
competition days (Salvador et al. 2003). For the CAR, Doane and Adam (2010) found that prior-day loneliness
(a stressful experience) was associated with higher next-day cortisol in young adults; similarmente, Heissel et al.
(2018) demonstrated that nearby violent crime is associated with a larger CAR the following day in a sample of
adolescents in a large Midwestern city, perhaps as the body anticipates a more stressful day ahead.

186

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

F

/

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

/

.

F

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

disengagement or “shutting down” in the face of the test; large increases in cortisol may
reflect feeling threatened or overwhelmed in a way that is likely to prevent productive
focus.

Stress patterns also differ by sex. Females’ CARs tend to peak later in the day
than males’ CARs (Stalder et al. 2016). Además, males may be more responsive to
achievement-related stressors, whereas females may be more responsive to social rejec-
ción (stroud, Salovey, and Epel 2002). A meta-analysis of twenty-eight studies similarly
found larger cortisol responses to stressors in males than females (Sauro, Jorgensen,
and Pedlow 2003). In the context of high-stakes testing, we may then expect larger cor-
tisol responses to high-stakes testing from male students.

Of particular concern in this context, long-term stress exposure can lead to changes
in the HPA axis that can be maladaptive in some contexts, incluyendo la escuela. Para en-
postura, hypocortisolism is a condition that can follow a period of chronic stress, wherein
the HPA axis shows low levels of cortisol and no longer responds to stressors (see sum-
maries in McEwen 1998; McEwen and Gianaros 2010). This is one reason we might
expect that children with high-stress backgrounds respond less-optimally (physiologi-
cally) to a high-stakes test. Sin embargo, our results are more consistent with a story that
chronic stress is associated with high cortisol reactivity in this population.

HPA axis activity may affect cognitive performance during test-taking by affecting
memory recall. Associations between cortisol and memory recall generally displays an
inverse-U pattern in laboratory-based studies.7 In particular, inducing large increases
or decreases in cortisol results in worse memory recall. If cortisol and memory recall
are related, then differences in stress response may lead to different test outcomes even
among students with equal ability who have learned the same amount of material. Si
the students most likely to be “stressed testers” come from already-disadvantaged back-
grounds, this pattern may exacerbate the observed achievement gaps on high-stakes
pruebas.

Two previous studies compared a baseline week of normal activity against a stress-
ful testing week. Weekes et al. (2006) found that male undergraduate students had
an increase in examination-week cortisol levels, while female undergraduates did not.
The authors found no link between psychological (self-reported) stress and physiolog-
ical stress as measured by cortisol. A diferencia de, Malarkey et al. (1995) collected cortisol
and other measures on medical students one month before, durante, and two weeks
after examinations. They found increases in cortisol during the test week but only for
those students who perceived the test as stressful. Neither set of authors examined per-
formance on the tests and its relationship to cortisol.

Other research has not included baseline stress levels but instead examined same-
day changes in cortisol in response to stressors. Perceiving a researcher-administered

7. When cortisol is administered synthetically before a lab-based memory assessment, humans generally have
worse memory recall, relative to participants who did not receive a dose of synthetic cortisol (see review in
Het, Ramlow, and Wolf 2005). Sin embargo, randomly varying the levels of synthetically administered cortisol
(de 0 a 24 mg) across participants was associated with an inverse-U shaped pattern, with the best memory
recall at moderate elevations (Schilling et al. 2013). Another study pharmacologically decreased cortisol levels,
then restored baseline cortisol levels with hydrocortisone replacement treatment, for treated participants. El
researchers tested memory function after each manipulation, finding impaired recall after the induced cortisol
decrease. Subsequent hydrocortisone replacement restored memory recall to the placebo level (Lupien et al.
2002).

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

F

/

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

F

/

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

187

Testing, Stress, y rendimiento

test during the school day as stressful was correlated with higher same-day cortisol
and lower test performance in Swedish adolescents (Lindahl, Theorell, and Lindblad
2005). En cambio, among young, low-income children in a Head Start program, teniendo
a larger same-day cortisol response to a stressor was correlated with better cognition
and behavioral outcomes than those without a cortisol response (Blair, Granjero, y
Razza 2005). Adults with higher anxiety had larger increases in cortisol in response to
performance tasks than those who did not (Malarkey et al. 1995; Schlotz et al. 2006).
Whether cortisol improves or detracts from performance may depend on anxiety about
the task at hand (Mattarella-Micke et al. 2011).

En general, the relationships between perceived stress, stress hormones, and perfor-
mance on a task are complicated and related to a wide variety of background char-
caracteristicas. These relationships highlight the importance of accounting for baseline
differences in cortisol patterns for individual students: Do students perform poorly be-
cause of elevated cortisol, or do the students who perform poorly in general also tend
to have high cortisol levels in regular, non-tested weeks? Además, it is not obvi-
ous that a real-world high-stakes test will lead to a physiological reaction in a group
of young, low-income students. If reactions do occur, it is not obvious who would be
most affected, or how such reactions might correspond to performance on the test. Este
study contributes to our understanding of these dynamics by measuring how cortisol
changes in response to a high-stakes test for grade-school students from disadvantaged
backgrounds.

3 . DATA
Our data consist of cortisol measures, student diaries, and administrative data on stu-
dent demographics and academic performance, for students from a charter school
network in New Orleans. Descriptive statistics are in table 1. The participants were al-
most all black (95 por ciento), economically disadvantaged (97 por ciento),8 and from high-
poverty neighborhoods (con 40 percent of block group households in poverty, significar
block group income of $27,000, and mean block group unemployment of 13 por ciento). The households were also in neighborhoods with a great deal of police activity, with a mean of 416 high-priority emergency (911) calls within a quarter-mile of their home in the prior year. Sin embargo, these averages mask heterogeneity: The fraction of neighbor- hood households in poverty ranged from 14 a 91 por ciento, mean neighborhood incomes ranged from $9,000 a $58,000, neighborhood unemployment rates ranged from 0 a 74 por ciento, and the number of nearby high-priority 911 calls ranged from 0 a 1,380 in the prior year. De término medio, the participants are disadvantaged relative to the overall population, but there is significant variation within the sample.9 Cortisol Data We collected salivary cortisol samples from ninety-three pre-adolescent and adolescent volunteers in grades 3–8, across three schools from the charter school network. Nosotros 8. Economic disadvantage is indicated by eligibility for free or reduced-price lunch. 9. The median household income in the United States was $57,000 en 2015, con 13.5 percent of households in
poverty (Proctor, Semega, and Kollar 2016). New Orleans had a median household income of $39,000, with nearly 25 percent of households in poverty in this period; the mean of $27,000 income in our sample is similar
to the $26,000 median black family income in New Orleans (Litten 2016).

188

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

F

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

/

.

F

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

Mesa 1. Estadísticas descriptivas

Calificación

Age (caer 2015)

Female

Limited English proficiency

Exceptional child

Gifted

Negro

Economically disadvantaged

Sección 504 plan

McKinney-Vento Act

Significar
(1)

5.77

11.59

0.55

0.03

0.13

0.03

0.95

0.97

0.29

0.08

Dakota del Sur
(2)

1.84

2.06

0.50

0.18

0.34

0.18

0.23

0.16

0.45

0.28

Priority 1 911 calls within 0.1 mi of home

83.94

73.99

Priority 1 911 calls within 0.25 mi of home

415.81

311.47

Neighborhood fraction houses in poverty

0.40

0.17

mín.
(3)

3.00

7.90

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.00

0.14

máx.
(4)

8.00

15.60

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

351.00

1,380.00

0.91

Neighborhood median income

26,830

11,246

9,327

58,194

Neighborhood fraction unemployed

norte

0.13

93

0.11

0.00

0.74

Count
(5)

93

93

93

93

92

92

93

84

84

84

85

85

86

80

86

Notas: Sección 504 is a civil rights law that prohibits discrimination against individuals with disabili-
corbatas. Sección 504 ensures that the child with a disability has equal access to an education. The child
may receive accommodations and modifications. The McKinney-Vento Education of Homeless Children
and Youth Assistance Act is a federal law that ensures immediate enrollment and educational stability
for homeless children and youth. McKinney-Vento provides federal funding to states for the purpose of
supporting district programs that serve homeless students. SD = standard deviation.

recruited participants through flyers distributed by their school, obtained parental con-
sent and participant assent, and briefed participants on the protocol during homeroom
on their first day of collection. Some participants joined the study late (norte = 13 joined
in week 2) and were briefed on the protocol individually. These students were mainly
from school 2.10

To provide the samples, participants let saliva collect in their mouth, then used a
small straw to drain the saliva into a small vial; this is called the passive drool tech-
nique. Participants watched a saliva sample demonstration at the first collection, had
a video demonstration available, and received reminder texts from the research team
during the data collection to ensure that they followed protocol. Participants were in-
structed to avoid eating, drinking, and brushing their teeth 30 minutes prior to each
sample collection. A kitchen timer preset to 30 minutes was provided to aid in the
timing of sample 2. Participants were instructed to refrigerate their home samples as
soon as possible after collection and return their home samples to the research team
in homeroom every day.

10. Based on a linear probability model regressing an indicator for joining in week 2 on the demographic charac-
teristics of the ninety-three students, those with a 10 percentage point higher in-school science grade were 12.2
percentage points more likely to join in week 2 (relative to week 1), every one year increase in age was associated
con un 10.8 percentage point decrease of joining in week 2, and being in school 2 was associated with a 58.8
percentage point increase in joining in week 2. Many students in school 2 mistook our real cash incentives
as the points-based behavioral “dollars” the school used; they decided to join the study upon learning that the
compensation was in fact real dollars.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

F

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

F

/

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

189

Testing, Stress, y rendimiento

Nota: Testing occurred between homeroom and lunch in weeks 2 y 3. CAR = cortisol awakening response.

Cifra 1. Research Design by Sample, Day, and Week

Cifra 1 displays the experimental design. Saliva sample collection occurred during
three weeks of the 2015–16 academic year: a baseline week (no testing; late August), a
low-stakes testing week (internal school testing; early September); and a high-stakes
testing week (statewide testing; late April). During each week, participants provided
saliva samples at six points over a 24-hour period: at wake (sample 1), 30 minutes af-
ter wake (to capture the CAR; sample 2), during homeroom11 (sample 3), before lunch
(sample 4), after school (sample 5), and at bedtime (sample 6). Data were collected over
a 48-hour period each week, beginning in homeroom on the first day, such that day 1
included samples 3–6, día 2 included samples 1–6, and day 3 included samples 1 y
2.12 Bajo- and high-stakes testing (during testing weeks) occurred just after homeroom
and ended before lunch. Homeroom (before the test, sample 3) and before-lunch (después
the test, sample 4) saliva samples were collected under the supervision of the research
equipo, and timing was verified by the team.13 Sample 3 had the most consistent tim-
ing and the highest completion rate across days and is the focus of the majority of our
análisis.

We dropped week 1 samples for two students who had extremely high cortisol levels
(mean = 15.43 µg/dL and 3.50 µg/dL, relative to the overall mean of 0.15 µg/dL), likely
indicating that they were taking cortisol-containing medication. Per typical protocol for
cortisol, we top-code cortisol levels to 1.80 µg/dL.14 Online appendix table A.1 contains
descriptive statistics for the homeroom cortisol samples by school.

11. Homeroom started at 7:00 soy. in school 1 y 8:00 soy. in schools 2 y 3.
12. Seventy-eight percent of individual-week-sample number combinations had at least one sample, although com-
pletion rates varied by sample (see figure A.1, which is available in a separate online appendix that can be
accessed on Education Finance and Policy’s Web site at https://doi.org/10.1162/edfp_a_00306). Samples were
stored at −20°C before shipment to Trier, Alemania, where they were assayed in duplicate using time-resolved
fluorescent-detection immunoassay (Dressendörfer et al. 1992).

13. Samples 1, 2, 5, y 6 were taken out of school, and timing was reported by students and verified against
diary entries. The in-school compliance rate was 89 por ciento; out-of-school compliance was 72 por ciento. Future
researchers working in this area could consider focusing only on in-school sampling in order to ensure compli-
ance and reduce costs. Changes in school lunch scheduling meant that sample 4 timing had a wider variance
than sample 3. Figure A.2 in the online appendix displays the distribution of sample timing by sample and
escuela.

14. Online appendix figure A.3 displays an exercise using alternative means of limiting potentially contaminated

muestras, from including all available cortisol to dropping anything above the 90th percentile.

190

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

F

/

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

F

.

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

Student Diaries
Participants filled out diaries at each saliva collection. The sample 1 diary included ques-
tions about the prior night’s sleep and that morning’s wake time. We coded wake time
for each day as the minimum reported timing across the sample 1 cortisol sample, el
sample 1 diary entry, and diary-reported daily waking time. Students took sample 1 en
días 2 y 3; by design day 1 did not have a reported wake time. If students were miss-
ing the wake time measure, we imputed it using the mean wake time by individual by
week, entonces (if still missing) the mean wake time by individual, entonces (if still missing)
the mean wake time by school by week.

We calculated time since wake for each sample as the length of time between that
day’s wake time and the reported timing of the cortisol sample. If missing sample tim-
En g, we imputed it using that sample’s diary time, entonces (if still missing) the mean of
the sample timing by individual by sample number, entonces (if still missing) the mean of
the sampling timing by school by sample number.

Administrative Data
The charter network provided administrative data including participants’ scores on low-
stakes math, ciencia, English Language Arts (ELLA), and social studies tests and high-
stakes math, ciencia, and ELA tests. The administrative data also included in-school
Los grados (on a 0–100 scale) for each academic quarter in math, ciencia, ELLA, y entonces-
cial studies. In the test score analysis, we dropped students’ missing test score data or
missing cortisol data in the baseline week or high-stakes testing week, leaving us with
norte = 68 students in the test score subsample. We lose fifteen students who did not
have baseline week cortisol (thirteen who joined in week 2 and two who did not have
homeroom-specific data), eight students who appear to have moved out of the charter
network,15 and two with missing test score data (but with cortisol data, meaning they
were in school at least part of the week). We used a linear probability model to regress
an indicator for being in the final test score sample on the demographic characteristics
and in-school grades for the ninety-three participants. Those with 10-point higher sci-
ence grades had a 13.1-percentage point higher probability of being in the final sample;
all other demographic characteristics were not statistically related to being in the final
test score sample.

We converted each test score into standardized Z-score units by grade; la resultante
scores should thus be interpreted as the distance from the average score in standard
deviations.16

The results of the high-stakes test in our study mattered for the school, as they con-
tributed to the letter grade (A–F) rating given to the school by Louisiana’s Department
of Education. Sin embargo, the test had no direct repercussions for individual students. En
addition, the students took a variety of other tests in their school system throughout
the year, including a series of tests that were only used for internal assessment but

15. The students who appeared to move out of the school system did not have cortisol data in the high-stakes testing
week, test score data, or quarter 4 Los grados. In robustness testing, we examine how changes in cortisol relate to
performance on the low-stakes test, which was only two weeks after the baseline week and thus had less time
for students to switch schools.

16. Z-scores are calculated by subtracting the mean score on a given test in a given grade from the individual’s

score on that test, then dividing by the standard deviation of that test in that grade.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

F

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

.

F

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

191

Testing, Stress, y rendimiento

mimicked the structure of the year-end high-stakes test.17 Given how often these stu-
dents were tested, we might expect them to be so accustomed to the process that even
high-stakes tests would not be perceived as stressful. This will reduce the likelihood of
finding any effect of testing on cortisol responses.

4 . A N A LY T I C S T R AT E G Y
Given the greater control the research team had over the before-test homeroom sample
collection, and because our main objective is understanding the high-stakes testing
período, we focus most of our attention on sample 3, taken in the homeroom period just
before the test was administered. This time period is particularly important given that
it reflects the level of cortisol that students bring into the test setting. Our first analysis
examines whether the level of cortisol in the homeroom period changed from baseline
to the low-stakes and high-stakes testing weeks with the following specification:

ln(cortisoliwd ) = β0 + β1LowStakesw + β2HighStakesw + β3Timeiwd + β4Time2
+ β5Waketimeiwd + β6CARiwd + γi + εiwd,

iwd

(1)

where LowStakesw is equal to 1 in the low-stakes testing week and zero otherwise,
HighStakesw is equal to 1 in the high-stakes PARCC testing week and zero otherwise,
Timeiwd is time of the sample relative to the end of homeroom for individual i in week
w on day d, Waketimeiwd is that day’s approximate wake time for the individual (cosa-
sured in hours relative to midnight), and CARiwd is an indicator for whether the home-
room sample was 15–60 minutes after the individual’s wake time that day. A control for
CAR may be necessary if a student woke up late relative to school start and took their
homeroom sample 15–60 minutes post-waking. We control for a quadratic of Timeiwd
because the level of cortisol falls at a decreasing rate throughout the day; not including
the quadratic does not change the results. The individual fixed effects γi account for any
observed and unobserved factors that are constant across an individual over time (p.ej.,
sexo, intelligence, personality, constant health) and allows us to isolate within-student
changes in cortisol from week to week. Standard errors are clustered at the individual
nivel. The analysis indicates whether, holding other individual-specific factors constant,
homeroom cortisol levels change from baseline to the testing weeks.18

Supplementary analyses test for variation based on proxies for chronic stress—
specifically, poverty rates and crime rates in students’ neighborhoods. We might expect
students’ responsiveness to the stress of the test to differ if they are chronically stressed.
We also tested for differences by gender.

Finalmente, we examined whether cortisol reactivity to high-stakes testing was as-
sociated with performance on the high-stakes test. We controlled for participant

17. This amount of testing is not atypical; por ejemplo, Chicago Public Schools had a testing schedule comparable

to our charter school network (Chicago Public Schools 2020).

18. One concern could be that seasonality in cortisol levels could lead us to falsely attribute changes in cortisol
to the test. Alternativamente, cortisol sampling itself could be stressful, but habituation could occur as individuals
take more cortisol samples. If anything, prior research suggests that both habituation and seasonality in cortisol
would work against finding increased cortisol, as both more sampling exposure and springtime are associated
with lower cortisol levels than less sampling exposure and fall, respectivamente (King et al. 2000). We could not test
this theory under a schedule that worked for our charter school network, though future studies should consider
adding an additional “control” sample just before or after the high-stakes testing period.

192

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

F

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

.

F

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

demographics, academic grades in the school in the first three quarters of the year,
sample timing, and school characteristics. Estimated effects on high-stakes test perfor-
mance can be interpreted as differences relative to how we would expect participants
to perform based on their academic performance in daily school settings. We estimate
the following model:

TestZScorei = β0 + Responsivityi
+ Tiα + Xiδ + εi,

γ + β1CurrentCortisoli + β2CurrentCortisol2
i

(2)

where TestZScorei is the average Z-score of the math, ciencia, and ELA high-stakes tests;
CurrentCortisoli is the mean individual homeroom cortisol in the high-stakes testing
week; CurrentCortisol2
i allows the marginal effect of cortisol to change as the cortisol
level increases; Ti is a vector of grades in school (on a 0–100 scale) in academic quarters
1–3 in math, ciencia, ELLA, and social studies; and Xi is a vector of other individual
characteristics from school administrative data (edad, género, exceptional child status,
whether the student had a Section 504 plan, homelessness, and school controls).

The primary variable of interest is Responsivityi, which is a vector of indicator vari-
ables representing 20 percentage-point bins for the change in homeroom cortisol levels
from baseline to the high-stakes testing week. Bins are grouped as follows: −30 percent
or lower, −10 to −30 percent, −10 to 10 por ciento (the reference bin), 10 por ciento a 30
por ciento, 30 por ciento a 50 por ciento, 50 por ciento a 70 por ciento, y 70 percent or higher.
We will show that alternative bin cutoffs lead to qualitatively similar conclusions. Sta-
tistically significant coefficients for CurrentCortisoli would indicate that the same-day
level of cortisol is related to test performance; statistically significant coefficients for
Responsivityi would indicate that the change in cortisol level from baseline to the test
week is related to performance.

The vector of in-school grades accounts for regular, non-high-stakes student per-
rendimiento, and any observed effects of responsivity or current cortisol would indicate
underperformance or overperformance on the test beyond what is predicted by those
test scores and demographic factors. To the extent that some demographics (p.ej., home-
lessness) themselves cause chronic stress that in turn could affect cortisol responses,
our main estimates could underestimate the effect of cortisol responsivity.19 Still, otro
unobserved shocks to individuals (p.ej., parental job loss) could conceivably be related
to both changes in cortisol and test performance.20 Moreover, some participants were
missing either requisite cortisol or testing data, and the N in the final analysis is 68.21
Given potential omitted variable bias and this smaller sample, we interpret the aca-
demic performance results as suggestive and do not conduct subgroup analyses.

19. There is no statistical difference in our estimates if we do not include the demographic controls in this analysis;

we include them for completeness.

20. In future research efforts, an additional cortisol data collection closer to the time of the high-stakes test could
reduce this concern. Practically, one challenge researchers may face is that a lot of testing (both high- and low-
stakes) occurs in the spring, so it may be difficult to find a “control” week near the high-stakes testing week.
Además, each sample collection is a burden on the school and students, and schools may be hesitant to allow
too much access during the spring testing period.

21. Results were similar when we imputed responsivity for those missing baseline cortisol measures, using the

change in cortisol from the low-stakes to the high-stakes testing week.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

F

/

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

/

F

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

193

Testing, Stress, y rendimiento

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

F

/

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

.

F

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Notas: We use locally weighted scatter plot smoothing to display the data, which does not impose parameters on the pattern. Boxes
include the cortisol awakening response (CAR, 15—60 minutes post-waking) and the interquartile range (IQR) of timing for the
before-test (homeroom) and post-test (before lunch) muestras. norte = 93 individuals included over multiple days.

Cifra 2. Cortisol Patterns from Wake to Eight Hours Post-Wake for Baseline, Low-Stakes, and High-Stakes Weeks

5 . RESULTADOS
Changes in Cortisol Daily Rhythms
Cifra 2 displays the cortisol patterns from wake to eight hours post-wake for baseline,
low-stakes, and high-stakes weeks using locally weighted scatter plot smoothing, cual
does not impose parameters on the pattern. Cortisol followed the expected diurnal pat-
tern in the baseline week. We see the sharp rise in the cortisol awakening response
(15–60 minutes after waking), following by falling cortisol as time passes.

The pattern visibly differs in the high-stakes testing week, with a less-pronounced
CAR and much higher levels of cortisol during the homeroom period. In the base-
line week, cortisol levels were not elevated above the expected slope during homeroom,
which provides an important test on our hypothesis: We would not expect elevated cor-
tisol during homeroom in a regular school week. Cortisol elevations during the low-
stakes test week were in between the baseline and high-stakes test weeks in the home-
room period.

Changes in Before-Test Cortisol
The estimates in table 2 show whether, within individuals, homeroom cortisol levels dif-
fered from baseline to the testing weeks. All columns include individual fixed effects

194

Jennifer A.. Heissel, Emma K. Adán, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer

Mesa 2. Changes in Level of Before-Testing Homeroom Period Cortisol by Week

Todo
(1)

0.123
(0.087)
0.204**
(0.075)

Todo
(2)

By Gender
(3)

By Poverty
(4)

By Local 911 Calls
(5)

By Ability
(6)

0.069
(0.156)
0.295*
(0.123)

0.267*
(0.118)
0.320**
(0.121)

0.269**
(0.091)
0.273**
(0.092)

0.101
(0.087)
0.176*
(0.076)

0.310**
(0.113)
0.327**
(0.119)
−0.353*
(0.166)
−0.259+
(0.150)

Low-stakes testing

High-stakes testing

Low-stakes × female

High-stakes × female

Low-stakes × lower poverty

High-stakes × lower poverty

Low-stakes × lower crime

High-stakes × lower crime

Low-stakes × higher ability

High-stakes × higher ability

Time of day

Time of day-squared

Wake time

CAR timeframe

−0.149
(0.612)

0.150
(0.619)

0.034
(0.644)

0.349
(0.682)
−0.183
(0.139)
−0.101
(0.175)

pag(sum low-stakes testing = 0)
pag(sum high-stakes testing sum = 0)
Observaciones

Participantes

489

93

489

93

0.018
(0.188)
−0.240
(0.164)

0.133
(0.713)

0.409
(0.749)
−0.184
(0.138)
−0.151
(0.194)

0.404

0.610

454

86

0.038
(0.646)

0.345
(0.689)
−0.192
(0.136)
−0.085
(0.175)

0.721

0.469

489

93

−0.369+
(0.196)
−0.266
(0.171)

0.261
(0.721)

0.445
(0.750)
−0.191
(0.136)
−0.118
(0.191)

0.487

0.644

448

85

−0.329*
(0.163)
−0.190
(0.142)

0.047
(0.642)

0.345
(0.678)
−0.178
(0.141)
−0.077
(0.175)

0.668

0.467

489

93

Notas: Robust standard errors clustered by student identification. Analysis conducted at the student-day level. Outcome is the natural
log of cortisol. Data come from saliva collected in homeroom. Each column represents a different regression estimate. Model limits the
comparison to within-individuals, accounting for any constant observed and unobserved characteristics. Wake time is the approximate
wakeup time for the day, measured with error. Columna 2 is the preferred overall model. Columns 3—6 conduct the analysis by interacting
the test with indicator variables for the given group. Columna 4 is separated by median neighborhood poverty level (40 por ciento), columna 5
is separated by median number of 911 calls within 0.25 of home address within a year (median = 240 calls), and column 6 is separated
by median first quarter grades expressed in Z-scores (median = −0.15 standard deviations). Table includes p-values of the estimated
difference in these groups for the change in cortisol for the low- and high-stakes weeks.
+pag < 0.10; *p < 0.05; **p < 0.01. and a quadratic of time relative to waking for a given day. The coefficient on low-stakes (high-stakes) testing approximates whether the level of cortisol differs from the base- line week to the low-stakes (high-stakes) testing week. Column 1 does not include wake time or controls for whether the sample was taken during the CAR period (15-60 min- utes post-wake), as these were necessarily imputed on day 1 of each week. However, later wake times were associated with higher waking cortisol, so we added controls for wake time and CAR in column 2.22 The homeroom estimates were similar whether 22. For reference, online appendix table A.2 displays the analysis without fixed effects but with demographics controls, as well as post-double-selection LASSO methods (Belloni, Chernozhukov, and Hansen 2014) to select l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d / f . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 195 Testing, Stress, and Performance controlling for these variables or not, and going forward we prefer the more conserva- tive estimate that controls for wake time and CAR. On average, student levels of cortisol were 18 percent higher in homeroom in the high-stakes week relative to the same stu- dents’ homeroom cortisol at baseline. There was no statistical difference in cortisol in the low-stakes week relative to the baseline week, though as expected the coefficients are positive. The high-stakes and low-stakes weeks do not statistically differ from one another; future analyses should explore this relationship further.23 Columns 3 through 6 of table 2 examine subgroups. All estimates within a column come from the same regression, and the bottom rows of the table test whether the sum of the main effect and the interaction for a given test differs from zero. Male students had large average increases in homeroom cortisol in the low-stakes testing week (31 percent) and high-stakes testing week (33 percent), relative to the baseline week. The female effect sizes statistically differed from the male estimates; the difference rela- tive to baseline was −4 percent in the low-stakes week (calculated as 0.310–0.353) and +7 percent in the high-stakes week. Neither of the female estimates statistically dif- fered from zero, as indicated by the p-values at the bottom of the table. Turning to neighborhood characteristics, we first divide individuals by the median neighborhood poverty rate observed in our sample (40 percent); lower-poverty neighborhood ranged from 14 percent to 40 percent (with a mean of 28 percent) and higher-poverty neigh- borhoods ranged from 41 percent to 91 percent (with a mean of 53 percent). Those from higher-poverty neighborhoods had larger average increases in homeroom cortisol than those from lower-poverty neighborhoods in the high-stakes week (30 percent versus 6 percent), relative to baseline, though the difference between groups was not statisti- cally significant. Similarly, those from neighborhoods with an above-median number of high-priority 911 calls within 0.25 miles (median = about 340 calls) had larger av- erage increases in homeroom cortisol levels than those from below-median neighbor- hoods in the high-stakes week (32 percent versus 5 percent).24 The difference between groups was again not statistically significant. While the stress responses of those fac- ing chronic stress do differ from their less-stressed peers, we do not find evidence of pervasive hypocortisolism (the lack of ability to respond to a stressor), per se; indeed, those participants have moderately larger increases in cortisol. Note that our sample size is a bit smaller in the neighborhood analysis due to missing or difficult-to-geocode addresses. Our final subgroup analysis examines changes by student ability, based on median first-observed-quarter grades expressed in Z-scores (median = −0.15 standard devia- tion).25 Increases in cortisol were driven by lower-achieving students; there is no aver- age change for above-median students. 23. a set of controls to avoid over-fitting but minimize omitted variable bias. All models indicate broadly similar results. In a larger sample, it would be useful to know if students acclimate to low-stakes testing, high-stakes testing, both, or neither. We only had one school with students in grades 3–4 and two with students in grades 6–8, which means we cannot separate out school-specific effects from age- or grade-specific effects. 24. Lower-crime addresses ranged from 0 to 338 high-priority calls within 0.25 miles in a year (with a mean of 191 calls) and higher-crime addresses ranged from 343 to 1,380 annual calls (with a mean of 645 calls). 25. Scores in the lower-achieving group ranged from −1.55 to −0.15 standard deviations from the mean (mean of this group = –0.80 standard deviation), while the higher-achieving group ranged from −0.11 to 2.14 standard deviation (with a mean of 0.82 standard deviation). 196 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / f / e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d . f / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer There was higher cortisol before the test, on average, relative to the baseline week, but there was also substantial variation in reactivity. Figure 3 displays the density of the change (“responsivity”) from baseline to the low-stakes testing week and from baseline to the high-stakes testing week. Although, on average, cortisol was higher in testing weeks, some individuals had little change and others actually had lower cortisol in the testing weeks—either due to the noisiness of the cortisol sampling or perhaps due to disengagement from the stressful situation. We next test whether these different re- sponses were associated with different performances on the test. Differences in Academic Outcomes Figure 4 examines how cortisol reactivity was related to test outcomes. It is unclear how best to measure this relationship, and we include several approaches for transparency. First, panel A breaks the subgroup of participants with the requisite data into quintiles based on the percentage change in their homeroom cortisol from baseline to the high- stakes testing week. Quintile 1, the reference group, includes those whose cortisol fell 22 to 78 percent relative to baseline during high-stakes testing. Quintile 2 includes those with little change, ranging from −21 percent to +12 percent. Quintile 3 participants had moderate increases, from 13 percent to 52 percent. The final two quintiles cover those with large increases, from 53 to 119 percent in quintile 4 and over 119 percent increases in quintile 5. Quintile 2 differs significantly from quintile 1, with test scores 0.38 standard devi- ation higher (conditional on demographic controls, concurrent cortisol, and in-school grades; p-value = 0.027). The final three quintiles do not differ significantly from quin- tile 1. We reject that the five quintiles are the same at the 10 percent level with an F-test (F(4, 42) = 2.15; p-value = 0.091), but we cannot reject that quintiles 1, 3, 4, and 5 are statistically the same (F(3, 42) = 0.87; p-value = 0.465). In other words, it appears that participants in quintile 2, who have the least amount of change from baseline to the high-stakes test, outperform the other quintiles, conditional on the other control vari- ables (although it is only marginally statistically significant). An alternative, parametric approach to the estimate is displayed in panel B. Prior re- search has found an inverse-U shape in the relationship between cortisol and outcomes (Het, Ramlow, and Wolf 2005; Schilling et al. 2013). Here, conditional on demographic controls and in-school grades, we model a quadratic estimate of the relationship be- tween test score (the outcome) and the raw level of change in cortisol in micrograms per deciliter.26 Panel B plots this estimate and its 95 percent confidence interval, as well as binned scatter plots for test scores and change in cortisol (with five to six observations per bin).27 The pattern appears as an inverse-U, but contrary to prior work, we find no evidence of an improvement in outcomes for moderate increases in cortisol. Moreover, the quadratic term is not statistically significant, at least at our level of power.28 26. The model also includes a quadratic control for concurrent cortisol, but we find no relationship between con- current cortisol and outcomes on the test. 27. Predicted line and 95 percent confidence interval estimated using Stata’s adjust command, based on the mean of all other control variables. 28. The quadratic term for responsivity is β = −1.446; p-value = 0.132. The linear estimate is null in this model, and it is actually slightly negative (β = −0.405; p-value = 0.504). We also tested cubic and quartic functions but they were also not statistically significant. Future tests with more power may confirm the inverse-U pattern. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d . f / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 197 Testing, Stress, and Performance l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d / . f f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Note: Figure includes estimates by gender and poverty, with above-median poverty indicating more poverty. Figure 3. Distribution of the Change (“Responsivity”) from Baseline to the Low-Stakes Testing Week and from Baseline to the High-Stakes Testing Week 198 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer Notes: Models regressed mean Z-score on different ways to measure of change in cortisol. All models control for quarters 1—3 grades for math, English Language Arts, science, and social studies; time of day; time-squared; age; indicators for female, exceptional child status, Section 504 status, and homelessness; and school indicator variables. N = 68 individuals. Analysis conducted at the student level. Results are similar when imputing cortisol changes for those missing baseline data. Panel A groups participants by quintile based on their percentage change in cortisol, in quintile 1 (−78% to −22%, N = 13), quintile 2 (−21% to +12%, N = 13), quintile 3 (+13 to +52%, N = 14), quintile 4 (+53% to +119%, N = 14), and quintile 5 (>119%, norte = 13). The model also controls
for five quintiles of concurrent (in the high-stakes week) cortisol. Panel B does not group participants into categories, but instead
includes variables for responsivity and responsivity-squared term to measure whether an inverse-U pattern occurs. Displayed solid
line maps this predicted Z-score; dashed lines provide 95% confidence intervals, based on predicted values and standard errors
from Stata’s adjust command. The model also controls for concurrent cortisol and concurrent cortisol-squared, as well as the typical
demographics. Display includes a binned scatter plot of the raw data, with 5—6 observations per bin. Panel C groups participants by
percentage-change in cortisol. Bins grouped by decreases greater than 30% (norte = 10), −30% to −10% (norte = 5), reference group at
−10% to +10% (norte = 8), +10% a +30% (norte = 8), +30% a +50% (norte = 7), 50% a 70% (norte = 8), and increases greater than
70% (norte = 21).

Cifra 4. Change in Predicted Mean Z-Score (across Math, Ciencia, and English Language Arts Tests) on the High-Stakes Test by Change in
Cortisol from the Baseline to the High-Stakes Testing Week

Finalmente, our preferred model groups the estimates into 20-percentage point bins.
The estimates are somewhat noisy, but relative to those in the low reactivity group (de
−10 percent to +10 percent homeroom cortisol change from baseline to the high-stakes
week), those with either large increases or decreases in cortisol from the baseline week
performed worse on the standardized test. En otras palabras, decreases and increases in
cortisol were associated with underperformance on the high-stakes test. Grouping the
“change” bins together, an increase of more than 10 percent or a decrease of more than
10 percent was associated with a 0.443 standard deviation decrease in the test score
(p-value = 0.009), relative to those with little cortisol responsivity (−10 percent to +10
por ciento), holding school-year academic grades, demographic characteristics, y estafa-
current cortisol constant. Mesa 3 contains these results. The estimates are fairly similar
when broken up by those who increase more than 10 por ciento (0.437 standard deviation
lower scores relative to those with −10 percent to +10 percent change, p-value = 0.015)

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

/

/

F

mi
d
tu
mi
d
pag
a
r
t
i
C
mi

pag
d

yo

F
/

/

/

/

1
6
2
1
8
3
1
9
1
0
6
6
4
mi
d
pag
_
a
_
0
0
3
0
6
pag
d

.

F

/

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

199

Testing, Stress, y rendimiento

Mesa 3. Changes in Test Scores by Cortisol Responsivity to the Test

Low-stakes
Scores as a
Control
(2)

No
Concurrent
Cortisol
(3)

Adding
Low-stakes
Week
(4)

Preferred
(1)

Placebo:
Within-
base
Cambiar
(5)

Placebo:
Low-stakes
Cambiar
(6)

Placebo:
High-
stakes
Cambiar
(7)

Grupo A: Change in Test Scores for 10% Above/Below Baseline Cortisol in Testing Week

±10% from baseline

−0.443**
(0.164)

−0.558*
(0.214)

−0.439**
(0.157)

−0.261*
(0.109)

−0.114
(0.196)

−0.129
(0.193)

−0.159
(0.204)

Grupo B: Change in Test Scores for 10% Above or 10% Below Baseline Cortisol in Testing Week

10% above baseline

10% below baseline

Control S:

Q1—3 grades

Low-stakes test scores

Concurrent cortisol

Cortisol change from

base

Test outcome

−0.437*
(0.172)
−0.458*
(0.192)

−0.550*
(0.222)
−0.589+
(0.292)

−0.431*
(0.163)
−0.458*
(0.176)

Y

norte

Y

norte

Y

Y

Y

norte

norte

A
high-stakes

A
high-stakes

A
high-stakes

High-stakes

High-stakes

High-stakes

−0.263*
(0.115)
−0.258*
(0.127)

Q1 only

norte

Y

To low- O
high-stakes

Bajo- O
high-stakes

−0.125
(0.198)
−0.046
(0.218)

−0.077
(0.194)
−0.238
(0.207)

−0.163
(0.208)
−0.154
(0.217)

Y

norte

Y

Y

norte

Y

Y

norte

Y

Within
base

A
low-stakes

A
high-stakes

High-stakes

High-stakes

Low-stakes

norte

67

62

67

136

82

63

63

Notas: Robust standard errors. Analysis conducted at the student level. Outcome is the Z-score of the indicated test. Cortisol data comes
from saliva collected in homeroom. Each column represents a different regression estimate. All models also control for school fixed effects; a
quadratic of time relative to wake; an indicator for whether the cortisol sample occurred within the CAR timeframe; wake time; edad; femenino;
and indicators for economically disadvantaged, Sección 504, and McKinney-Vento Act (ver tabla 1 details). Columna 1 is the preferred overall
modelo. Columns 2—4 test alternate specifications by changing the measure of baseline ability (Columna 2), removing controls for concurrent
cortisol (Columna 3), or adding the low-stakes week as an additional observation (Columna 4). Columna 4 uses quarter 1 grades as the control
for baseline ability. Columns 5—7 test placebos of whether general cortisol variability is associated with test scores. Columna 5 assesses within-
baseline week changes of cortisol predict scores on the high-stakes test. Columna 6 (7) tests whether cortisol changes from baseline to the
low-stakes (high-stakes) test week affect high-stakes (low-stakes) test outcomes.
+pag < 0.10; *p < 0.05; **p < 0.01. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d . f / and those who decrease more than 10 percent (0.458 standard deviation lower scores, p-value = 0.021). A potential concern is that grades do not adequately account for student testing abil- ity. Thus, column 2 of table 3 uses low-stakes test scores instead of quarters 1–3 grades to control for baseline ability. Results were similar (−0.558 standard deviation lower for large reactivity participants, relative to the ±10 percent group, p-value = 0.005; N = 62), though the N was also slightly lower because some students were missing low- stakes test results. Results were also similar without controlling for concurrent cortisol (−0.439 standard deviation, p-value = 0.007; N = 67) and when adding the low-stakes test as an additional outcome (−0.261 standard deviations, p-value = 0.019; N = 136 tests for 73 participants).29 The estimates are based on the average score across the math, ELA, and science high-stakes tests to decrease variability in scores; post hoc anal- yses demonstrated that the effects were negative for all three individual tests, with the f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 29. Here, we add the low-stakes week as an additional observation, and we define reactivity based on the change from baseline to the given testing week. 200 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer largest estimate in science.30 There was no relationship between a quadratic of concur- rent level of homeroom cortisol during the testing week itself and outcomes on the test, with or without including reactivity from baseline. One concern with the analysis is that cortisol variability, rather than responsivity to the high-stakes test, is associated with worse outcomes on the test. We test this with three placebo measures in columns 5–7 of table 3. These all test whether changes be- tween days unrelated to a given test predict performance on the test. Column 5 uses whether changes from day 1 to day 2 (or day 2 to day 3 for three individuals who joined on day 2) in the baseline week predict performance on the high-stakes test. It does not. As a second placebo in column 6, we test whether cortisol responsivity to the low-stakes test predicted performance months later on the high-stakes test. It did not, nor did re- sponsivity to the high-stakes test predict performance on the low-stakes test in column 7. Thus, it does not appear that cortisol variability in general predicts performance on the high-stakes test. Instead, it is specifically changes from baseline to the high-stakes test week that are associated with performance on the high-stakes test itself. We do not conduct the full binning exercise by subgroups due to small sample size. However, when we compared lower-reactivity participants (±10 percent cortisol change) to higher-reactivity participants (greater than ±10 percent change), we found no statistically significant differences in the patterns by gender, neighborhood poverty, neighborhood crime, or prior grades.31 So, although some groups are more likely to be high-reactivity than others, the relationship with test scores is similar among all high- reactivity participants, at least to the extent that we can test it in this setting. Although we prefer a bin-based specification for flexibility, the choice of −10 per- cent to +10 percent as a reference group range is arbitrary. Thus, figure 5 displays the estimated effect for being above and below different cut points. The graph includes 95 percent confidence intervals. The x-axis starts at 10 percent to match the estimates above, showing that a change of more than 10 percent above or 10 percent below base- line cortisol levels is associated with statistically significantly lower test scores, relative to those with cortisol responsivity between −10 percent and 10 percent. If, instead, we set the reference range to be ±15 percent, those whose cortisol dropped 15 percent or more had 0.340 standard deviation lower test scores (p-value = 0.051) and those whose cortisol increased 15 percent or more had 0.362 standard deviation lower test scores (p-value = 0.025), relative to those in the −15 percent to +15 percent range. Neither of the differences is statistically significant at the 5 percent level when we reach the ±17 percent range; neither is statistically significant at the 10 percent level once we reach the ±29 percent range. 30. The high-reactivity scores were lower than the low-reactivity (±10 percent) scores in science (−0.625, p-value = 0.009), reading (−0.493, p-value = 0.121), and math (−0.212, p-value = 0.240). Hausman tests indicated these coefficients sizes did not statistically differ across the three models (p-value = 0.237) and they jointly differed from zero (p-value = 0.001). 31. When interacting demographic indicator variables with an indicator for reactivity, the coefficient was −0.639 standard deviation for high-reactivity male participants and −0.965 standard deviation for high-reactivity fe- male participants, relative to non-reactors in the ±10 percent range (p-value of male–female difference = 0.317). The coefficient was −0.372 for participants from lower-poverty neighborhoods and −0.145 for higher-poverty participants (p-value of difference = 0.492). The coefficient was −0.557 for participants from lower-crime neighborhoods and −0.707 for higher-crime participants (p-value of difference = 0.638). The coefficient was −0.600 for participants who had below-median course grades and −1.000 for participants who had above- median course grades (p-value of difference = 0.182) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . f / / e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d / . f f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 201 Testing, Stress, and Performance l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . f / / e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d f . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Notes: Each distance is a separate regression; two coefficients per regression displayed. Coefficients displayed are for a variable that is equal to 1 if the change from baseline to the high-stakes testing week is greater than the indicated level. N = 67. Figure 5. Estimated Effect Size by Different Bounding Distances (±10% to ±50%) Figures 4 and 5 show that the estimates are noisy, with considerable unexplained fluctuation in test scores, and that the best outcomes appear around where there is little cortisol change. Overall, we take this as suggestive evidence that large changes in cortisol in response to high-stakes tests are associated with worse performance on the test, but there is much more to be done in this area. Misbehavior as a Potential Mechanism One hypothesis is that a cortisol spike could be associated with “acting out” and mis- behavior during the test, which could inhibit performance. We can assess this hypoth- esis because the charter network tracked behavior using a daily points-based system.32 Throughout the year, the average student got into at least some trouble on 35 percent of school days. Relative to a regular day, there were no differences in the probability of getting in trouble on a low-stakes test day. However, for the most important week of high- stakes testing, students were 26 percentage points less likely to get in trouble than on 32. Observed values for behavior infractions and rewards ranged from −30 to +10, with positive outcomes in areas such as “scholarship” (+5 points, 775 observed instances over the academic year across the 83 students with observed data) and being a “reading rockstar” (+10 points, 60 instances observed), and negative outcomes in areas such as “instigating and/or fighting/fronting (including play fighting)” (−20 points, 65 instances), a category called “bathroom” (−10 points, 833 instances), “major violations” (−10 points, 828 instances), “talking out of turn” (−5 points, 1,847 instances), and “line” (−2 points, 973 instances). 202 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer regular school days (p-value = 0.000).33 We do not take these estimates as a measure of acting out, necessarily, given the discretion that teachers have in assigning points to students.34 Perhaps teachers were more lenient in general on test days, or perhaps students had fewer opportunities to get in trouble. However, we did test whether those with large increases in cortisol had different drops in infractions than those who had decreased cortisol or those who did not have a strong cortisol response.35 We found no evidence of a difference in the probability of getting in trouble by those with very large increases (or decreases) in cortisol level. As best as we can measure, then, we find no evidence that misbehavior is driving the results on the tests. Instead, we hypothesize that the ability to focus and recall information relevant to the test is affected. 6 . D I S C U S S I O N This study examined whether children responded physiologically to high-stakes test- ing in a naturalistic setting, and how any responses were associated with performance on a high-stakes test. Children in one charter school network displayed a statistically significant increase in cortisol level in anticipation of high-stakes testing; this pattern was driven by male students. We also find some evidence that, among a sample of disadvantaged students, the most-disadvantaged students had the largest increase in cortisol in anticipation of the high-stakes test. These changes were driven by the oc- currence of a test that mattered for schools but had limited consequence for individual students. Moderate decreases and increases in cortisol were associated with underperfor- mance on the high-stakes test, relative to what we would have expected from students given their in-school academic performance and other characteristics. Even the average increase in cortisol shown in table 2 (18 percent) was associated with lower test scores, relative to those with little change in cortisol. An increase of more than 10 percent or a decrease of more than 10 percent was associated with a 0.4 standard deviation decrease in test scores, relative to those with little change. This is equivalent to approximately 80 points on the 1,600-point SAT scale. Concurrent cortisol measured as linear, quadratic, or bins during the test was not a statistically significant predictor of performance; it was cortisol change relative to baseline that predicted outcomes. Of course, one study on a small, nonrandom sample of students may not give us the true population-level effect of high-stakes tests on cortisol or how cortisol relates to test 33. We identified every low- and high-stakes test day during the academic year. Using student fixed effects, we regressed an indicator for these day types, indicators for day of the week, and a continuous variable measuring the day of the year on an indicator for the probability that a student got in trouble on a given day. Students were less likely to get in trouble as the year went on, with the daily probability of getting in trouble dropping about 0.47 percentage points every ten calendar days. Tuesdays were the most likely day to get in trouble, followed by Wednesday, Monday, Thursday, and (much less likely) Friday. 34. Anecdotally, during our data collection we observed multiple instances of students acting out or acting differ- ently during the high-stakes testing period than during the other data collection weeks. For example, a student was throwing up in the back of the room after the test; we were told protocol was to allow the students to leave their seats if they had to throw up. Another student “made a run for it” and led the staff on a chase through the school when we brought him to the hallway for his saliva sampling; they found him hiding in the kitchen. The behavior of students—and how that behavior might affect test scores—is an area in need of further systematic study for those who want to use test scores to make high-stakes decisions about students and school. 35. We interacted test type with an indicator for a responsivity greater than 10 percent and an indicator for a re- sponsivity of less than −10 percent. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d . f / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 203 Testing, Stress, and Performance performance. We view this study as a call for additional work in this area. Specifically, we identify four non-exclusive themes in need of future research. First, future analyses should replicate that students do indeed have an increase in cortisol and that it differs by various attributes. Such research should examine a more diverse population of stu- dents, rather than the largely low-income, mostly black population we examined here. A larger sample size would permit a greater degree of heterogeneity analysis than is possible in the present study in order to more robustly test whether different groups respond differentially to high-stakes testing. One possibility is collecting cortisol sam- ples around the time of SAT or ACT testing, as these tests have real-world implications for students. Second, research must confirm whether moderate changes in cortisol are associ- ated with worse performance in other settings. Researchers rely on high-stakes tests as a measure of academic performance to evaluate various education and social policies. Such research may accept that high-stakes tests are noisy measures of ability or knowl- edge, but it generally assumes that the noise is evenly distributed across the socioeco- nomic spectrum. If, however, certain groups are systematically “stressed testers”—that is, they have large physiological reactions to the high-stakes testing setting—the poli- cies recommended by such research may be suboptimal. As an extreme example, con- sider a world where all children learn the same amount of material during the year but group A has a bigger physiological reaction, and subsequently lower scores, than group B. Examining test scores would lead researchers to conclude there is an achievement gap between these groups and that group A needs intervention. But in reality, both groups learn the same amount of material and can perhaps even apply that material similarly in the real world. The policy solution in this case would be much different than if learning differed between groups. Such test-day stress deficits are not the only cause of achievement gaps, but they may explain part of existing disparities. Future research should examine how large a role they play. A key consideration in any such research is causality. Researchers cannot assign stress responses to students in real- world tests, but real-world tests may be more stressful to students than lab-based tests. Carefully designed research, which includes measures of performance outside of the high-stakes tests, will be necessary to move understanding forward. Third, researchers should consider how school policies that use test scores may exacerbate or alleviate disparities among groups. If certain groups are more likely to be stressed testers, then, holding baseline knowledge constant, those stressed testers will be disadvantaged by admission or graduation policies based on high-stakes tests. Researchers should carefully consider how policy decisions interact with biological re- sponses to testing. Finally, if new work confirms that testing causes stress for students in ways that impact their performance, a logical question is what schools can do to mitigate the stress response—or at least the effects of the stress response on performance. Po- tential options include mindfulness programs (Zenner, Herrnleben-Kurz, and Walach 2014), integrated mental health interventions (Fazel et al. 2014), or yoga (Ehud, An, and Avshalom 2010), among other interventions. Though such programs may have benefits well beyond test scores, researchers could investigate whether they are associated with changes in biological stress reactions to high-stakes testing. 204 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . f / / e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d f / . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer Given the prevalence of high-stakes testing in U.S. education policy, much more work is needed in this area. If the patterns of test-induced stress that we find in this study continue to hold up, it might suggest that high-stakes testing results should be used and interpreted differently than the way they are currently implemented in edu- cation policy and practice. ACKNOWLEDGMENTS We thank the anonymous school district and its staff for their invaluable cooperation, as well as Kaho Arakawa, Chernjen Lee, Royette Tavernier, members of the COAST Lab at Northwestern University, and seminar participants at Northwestern University and the AEFP, APPAM, and Western Economic Association meetings. Laura Scaramella at the University of New Orleans provided access to laboratory space. We are grateful for funding from the Spencer Foundation (grant no. 2015000117) and the Institute for Policy Research at Northwestern University. REFERENCES Adam, Emma K. 2012. Emotion-cortisol transactions occur over multiple time scales in develop- ment: Implications for research on emotion and the development of emotional disorders. Mono- graphs of the Society for Research in Child Development 77(2): 17–27. Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81(2): 608– 650. Blair, Clancy, Douglas Granger, and Rachel Peters Razza. 2005. Cortisol reactivity is positively related to executive function in preschool children attending Head Start. Child Development 76(3): 554–567. Bradbury, Bruce, Miles Corak, Jane Waldfogel, and Elizabeth Washbrook. 2015. Too many chil- dren left behind: The U.S. achievement gap in comparative perspective. New York: Russell Sage Foundation. Chicago Public Schools. 2020. Chicago Public Schools student assessments. Available https://www. cps.edu/academics/student-assessments/. Accessed 13 October 2020. Clow, Angela, Frank Hucklebridge, Tobias Stalder, Phil Evans, and Lisa Thorn. 2010. The cortisol awakening response: More than a measure of HPA axis function. Neuroscience & Biobehavioral Reviews 35(1): 97–103. Del Giudice, Marco, Bruce J. Ellis, and Elizabeth A. Shirtcliff. 2011. The adaptive calibration model of stress responsivity. Neuroscience & Biobehavioral Reviews 35(7): 1562–1592. Dickerson, Sally S., and Margaret E. Kemeny. 2004. Acute stressors and cortisol responses: A theoretical integration and synthesis of laboratory research. Psychological Bulletin 130(3): 355–391. Doane, Leah D., and Emma K. Adam. 2010. Loneliness and cortisol: Momentary, day-to-day, and trait associations. Psychoneuroendocrinology 35(3): 430–441. Dressendörfer, R. A., C. Kirschbaum, W. Rohde, F. Stahl, and C. J. Strasburger. 1992. Synthesis of a cortisol-biotin conjugate and evaluation as a tracer in an immunoassay for salivary cortisol measurement. Journal of Steroid Biochemistry and Molecular Biology 43(7): 683–692. Ehud, Miron, Bar-Dov An, and Strulov Avshalom. 2010. Here and now: Yoga in Israeli schools. International Journal of Yoga 3(2): 42–47. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d . f / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 205 Testing, Stress, and Performance Engert, Veronika, Simona I. Efanov, Annie Duchesne, Susanne Vogel, Vincent Corbo, and Jens C. Pruessner. 2013. Differentiating anticipatory from reactive cortisol responses to psychosocial stress. Psychoneuroendocrinology 38(8): 1328–1337. Fazel, Mina, Kimberly Hoagwood, Sharon Stephan, and Tamsin Ford. 2014. Mental health inter- ventions in schools in high-income countries. Lancet Psychiatry 1(5): 377–387. Gunnar, Megan R., and Karina Quevedo. 2007. The neurobiology of stress and development. Annual Review of Psychology 58(1): 145–173. Hatch, Stephani L., and Bruce P. Dohrenwend. 2007. Distribution of traumatic and other stress- ful life events by race/ethnicity, gender, SES and age: A review of the research. American Journal of Community Psychology 40(3–4): 313–332. Heiser, Paul, Gayle Simidian, David Albert, John Garruto, Dawn Catucci, Peter Faustino, Kara McCarten May, and Kelly Caci. 2015. Anxious for success: High anxiety in New York’s schools. Available www.nyssba.org/clientuploads/nyssba_pdf/Test_Anxiety_Report.pdf . Accessed 8 Oc- tober 2020. Heissel, Jennifer A., Dorainne J. Levy, and Emma K. Adam. 2017. Stress, sleep, and perfor- mance on standardized tests: Understudied pathways to the achievement gap. AERA Open 3(3): 1– 17. Heissel, Jennifer A., Patrick T. Sharkey, Gerard Torrats-Espinosa, Kathryn Grant, and Emma K. Adam. 2018. Violence and vigilance: The acute effects of community violent crime on sleep and cortisol. Child Development 89(4): e323–e331. Het, Serkan, G. Ramlow, and Oliver T. Wolf. 2005. A meta-analytic review of the effects of acute cortisol administration on human memory. Psychoneuroendocrinology 30(8): 771–784. Hoyt, Lindsay T., Katharine H. Zeiders, Katherine B. Ehrlich, and Emma K. Adam. 2016. Positive upshots of cortisol in everyday life. Emotion 16(4): 431–435. King, Jean A., Milagros C. Rosal, Yunsheng Ma, George Reed, Terri-Ann Kelly, and Ira S. Ock- ene. 2000. Sequence and seasonal effects of salivary cortisol. Behavioral Medicine 26(2): 67– 73. Lazarín, Melissa. 2014. Testing overload in America’s schools. Available https://cdn.american progress.org/wp-content/uploads/2014/10/LazarinOvertestingReport.pdf . Accessed 8 October 2020. Lindahl, Mats, Töres Theorell, and Frank Lindblad. 2005. Test performance and self-esteem in relation to experienced stress in Swedish sixth and ninth graders—Saliva cortisol levels and psy- chological reactions to demands. Acta Pædiatrica 94(4): 489–495. Litten, Kevin. 2016. New Orleans poverty rates fall in 2015, still higher than state average. Available https://www.nola.com/news/politics/article_8b5169be-4ab4-5609-b65c-d8fd215e0808.html. Accessed 13 October 2020. Lupien, Sonia J., Charles W. Wilkinson, Sophie Brière, Catherine Ménard, N. M. K. Ng Ying Kin, and N. P. V. Nair. 2002. The modulatory effects of corticosteroids on cognition: Studies in young human populations. Psychoneuroendocrinology 27(3): 401–416. Malarkey, William B., Dennis K. Pearl, Laurence M. Demers, Janice K. Kiecolt-Glaser, and Ronald Glaser. 1995. Influence of academic stress and season on 24-hour mean concentrations of ACTH, cortisol, and β-endorphin. Psychoneuroendocrinology 20(5): 499–508. 206 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d f / . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Jennifer A. Heissel, Emma K. Adam, Jennifer L. Doleac, David N. Figlio, and Jonathan Meer Mattarella-Micke, Andrew, Jill Mateo, Megan N. Kozak, Katherine Foster, and Sian L. Beilock. 2011. Choke or thrive? The relation between salivary cortisol and math performance depends on individual differences in working memory and math-anxiety. Emotion 11(4): 1000–1005. McEwen, Bruce S. 1998. Stress, adaptation and disease: Allostasis and allostatic load. Annals of the New York Academy of Sciences 840: 34–44. McEwen, Bruce S., and Peter J. Gianaros. 2010. Central role of the brain in stress and adaptation: Links to socioeconomic status, health, and disease. Annals of the New York Academy of Sciences 1186: 190–222. Miller, Gregory E., Edith Chen, and Eric S. Zhou. 2007. If it goes up, must it come down? Chronic stress and the hypothalamic-pituitary-adrenocortical axis in humans. Psychological Bulletin 133(1): 25–45. Proctor, Bernadette D., Jessica L. Semega, and Melissa A. Kollar. 2016. Income and poverty in the United States: 2015. Available https://www.census.gov/library/publications/2016/demo/ p60-256.html. Accessed 8 October 2020. Reardon, Sean F. 2011. The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In Whither opportunity? Rising inequality, schools, and children’s life chances, edited by Greg J. Duncan and Richard J. Murnane, pp. 91–116. New York: Russell Sage Foundation. Salvador, A., F. Suay, E. González-Bono, and M. A. Serrano. 2003. Anticipatory cortisol, testos- terone and psychological responses to judo competition in young men. Psychoneuroendocrinology 28(3): 364–375. Sapolsky, Robert M., L. Michael Romero, and Allan U. Munck. 2000. How do glucocorticoids influence stress responses? Integrating permissive, suppressive, stimulatory, and preparative ac- tions. Endocrine Reviews 21(1): 55–89. Sauro, Marie D., Randall S. Jorgensen, and Teal Pedlow. 2003. Stress, glucocorticoids, and mem- ory: A meta-analytic review. Stress 6(4): 235–245. Schilling, Thomas M., Monika Kölsch, Mauro F. Larra, Carina M. Zech, Terry D. Blumenthal, Christian Frings, and Hartmut Schächinger. 2013. For whom the bell (curve) tolls: Cortisol rapidly affects memory retrieval by an inverted U-shaped dose–response relationship. Psychoneuroen- docrinology 38(9): 1565–1572. Schlotz, Wolff, Peter Schulz, Juliane Hellhammer, Arthur A. Stone, and Dirk H. Hellhammer. 2006. Trait anxiety moderates the impact of performance pressure on salivary cortisol in everyday life. Psychoneuroendocrinology 31(4): 459–472. Segool, Natasha K., John S. Carlson, Anisa N. Goforth, Nathan von der Embse, and Justin A. Bar- terian. 2013. Heightened test anxiety among young children: Elementary school students’ anxious responses to high-stakes testing. Psychology in the Schools 50(5): 489–499. Shirtcliff, Elizabeth A., Jeremy C. Peres, Andrew R. Dismukes, Yoojin Lee, and Jenny M. Phan. 2014. Hormones: Commentary. Riding the physiological roller coaster: Adaptive significance of cortisol stress reactivity to social contexts. Journal of Personality Disorders 28(1): 40–51. Stalder, Tobias, Clemens Kirschbaum, Brigitte M. Kudielka, Emma K. Adam, Jens C. Pruessner, Stefan Wüst, Samantha Dockray, Nina Smyth, Phil Evans, Dirk H. Hellhammer, Robert Miller, Mark A. Wetherell, Sonia J. Lupien, and Angela Clow. 2016. Assessment of the cortisol awakening response: Expert consensus guidelines. Psychoneuroendocrinology 63:414–432. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . f / / e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d f . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 207 Testing, Stress, and Performance Stroud, Laura R., Peter Salovey, and Elissa S. Epel. 2002. Sex differences in stress responses: Social rejection versus achievement stress. Biological Psychiatry 52(4): 318–327. Weekes, Nicole, Richard Lewis, Falgooni Patel, Jared Garrison-Jakel, Dale E. Berger, and Sonia J. Lupien. 2006. Examination stress as an ecological inducer of cortisol and psychological re- sponses to stress in undergraduate students. Stress 9(4): 199–206. Zenner, Charlotte, Solveig Herrnleben-Kurz, and Harald Walach. 2014. Mindfulness-based in- terventions in schools—A systematic review and meta-analysis. Frontiers in Psychology 5: Article 603. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / / f e d u e d p a r t i c e - p d l f / / / / 1 6 2 1 8 3 1 9 1 0 6 6 4 e d p _ a _ 0 0 3 0 6 p d f / . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 208TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image
TESTING, STRESS, AND PERFORMANCE: HOW image

Descargar PDF