Journal of Interdisciplinary History, XLVII:1 (Summer, 2016), 85–92.

Barry Edmonston

The Statistical Analysis of Longitudinal Data

Lives in Transition: Longitudinal Analysis from Historical Sources.
Edited by Peter Baskerville and Kris Inwood (Montréal, McGill-Queen’s
University Press, 2015) 381 pp. $110.00 cloth $34.95 paper
Lives in Transition offers an innovative and useful discussion of data
and methods for quantitative longitudinal historical research. After
a helpful introduction, it includes three chapters about interna-
tional migration, four about mobility in rural areas, two about mo-
bility in urban areas, and three about ethnic groups during World
War I. The chapters focus on four countries. New Zealand and
Australia each receive a chapter; the United States receives two
and Canada seven. Most of the analysis pertains to the second half
of the 1800s, although two of the chapters concentrate on the ﬁrst
half of the 1800s and three deal with the early 1900s.

The book presents an excellent opportunity to discuss issues
related to data collection and statistical analysis. This review essay
ﬁrst surveys various types of quantitative life-course data and how
the chapters in this volume exemplify the collection, linkage, and
analysis of such data before exploring the statistical analysis of
quantitative longitudinal data.

QUANTITATIVE LIFE-COURSE DATA Quantitative life-course analysis
has beneﬁted greatly from the recent expansion of suitable data
sources and the development of appropriate statistical methods.
Most life-course analysis is based on four types of census, survey,
or administrative data. One type, prospective data, derives from a
particular group’s responses to a regular series of surveys over time.
In some cases, the prospective data include information from such
administrative sources as employment or tax records. These data
are expensive to collect and take a long time to accumulate for

Barry Edmonston is Research Professor, Department of Sociology, and Associate Director,
Population Research Group, University of Victoria. He is the editor of the Special Issue “Life-course
Perspectives on Immigration,” Canadian Studies in Population, XL (2013), 1–102; with
Eric Fong, Canada’s Population Situation (Montréal, 2011).

© 2016 by the Massachusetts Institute of Technology and The Journal of Interdisciplinary
History, Inc., doi:10.1162/JINH_a_00942

analysis. This approach, however, has the advantage of collecting
contemporary information as conditions change and events occur.
Another type, retrospective data, derives from surveys about
events and conditions in the past. It is less expensive than the pro-
spective type and provides immediate data for analysis, involving
the duration between major events, such as the time between
marriage and the first birth of a child. The disadvantages of retrospective
data are unknown selection (for example, immigrants not
always living or residing in their destination country) and recall
bias (respondents not always remembering key events or their
dates).

The third way for life-course information to be collected is by
linking several censuses or surveys over time or linking these data
with administrative records, primarily concerning prior or later
immigration, military service, income, employment, birth, death, or
marriage. This approach to collecting information about changes
over time is relatively inexpensive, but the information available
from census, survey, and administrative records is often limited.

The fourth data type is based on synthetic cohorts in censuses
and surveys. This method usually relies on birth or immigration
cohorts for analysis, rather than on individuals. Studies usually
follow either a birth cohort (people born during, say, the Great
Depression) or an immigrant cohort (a group of immigrants arriv-
ing during the same period) through time, using several censuses
or surveys to compare its experiences in marriage or labor-force
participation with that of other residents. It provides inexpensive
data for long periods of time but conﬁnes analysis to cohort com-
parisons with potential selection biases.

Prospective and retrospective data are rare in historical research,
unless researchers are familiar with a historical dataset that
prospectively or retrospectively contains evidence for a group of
individuals over time. There are, however, unique prospective
datasets available for historical research. The Oakland Growth
Study, under the leadership of Glen Elder, interviewed 167 adolescents
(born in 1920/1) in 1932 and four times thereafter to study
the effects of the Great Depression on the American family. The
last of the five interviews occurred in 1980/1, when the original
cohort was nearing retirement. A list of other potential prospective
or retrospective data sets, their data content, and availability for
historical research would be useful to compile.

Most longitudinal data for historical research involves tracking
individuals along several censuses or surveys, or linking those data
to administrative sources with dates either before or after those of
the census or survey data. Criminal records could be joined with
later census data to investigate the relationship between different
criminal statuses and subsequent employment, or death certiﬁcates
could be linked to earlier census data to ascertain how occupations
relate to causes of death.
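
The kind of linkage described here can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used by any contributor to the volume: the records, the field names (`name`, `birth_year`, `occupation`), and the one-year tolerance on birth year are hypothetical assumptions chosen for brevity. The sketch links a person across two censuses only when exactly one candidate matches, a deliberately conservative rule.

```python
from typing import Optional

# Hypothetical census extracts; all names and values are invented for illustration.
census_1871 = [
    {"name": "john mcdonald", "birth_year": 1840, "occupation": "laborer"},
    {"name": "mary smith", "birth_year": 1852, "occupation": "servant"},
    {"name": "james brown", "birth_year": 1835, "occupation": "farmer"},
]
census_1881 = [
    {"name": "john mcdonald", "birth_year": 1841, "occupation": "farmer"},
    {"name": "james brown", "birth_year": 1835, "occupation": "farmer"},
    {"name": "james brown", "birth_year": 1836, "occupation": "merchant"},
]


def find_unique_match(person: dict, later: list, tolerance: int = 1) -> Optional[dict]:
    """Return the single later record with the same name and a birth year
    within `tolerance` years, or None if there are zero or several candidates."""
    candidates = [
        rec for rec in later
        if rec["name"] == person["name"]
        and abs(rec["birth_year"] - person["birth_year"]) <= tolerance
    ]
    return candidates[0] if len(candidates) == 1 else None


def link_censuses(earlier: list, later: list) -> list:
    """Pair each earlier record with its unique later match, skipping ambiguous cases."""
    links = []
    for person in earlier:
        match = find_unique_match(person, later)
        if match is not None:
            links.append((person, match))
    return links


linked = link_censuses(census_1871, census_1881)
for before, after in linked:
    print(before["name"], ":", before["occupation"], "->", after["occupation"])
```

In this invented example only one of the three earlier records is linked: one person has no candidate in the later census and another has two. Even a toy case shows how unlinked records can introduce selection into a linked sample.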

The twelve chapters in this volume illustrate applications of
several of these four types of data. Five of them make important
use of the linkage between two or more censuses or surveys.
Gordon Darroch’s contribution links 1861 and 1871 microdata
records for a study of agricultural occupational mobility in Ontario.
Luiza Antonie, Baskerville, Inwood, and J. Andrew Ross analyze
1871 and 1881 census records to determine Canadian work patterns.
Baskerville connects data for Perth County, Ontario, from the 1871
Canada census to either the 1881 Canada census or the 1880 U.S.
census to trace migration paths. Although each of these five chapters
includes a helpful discussion of the merits and limitations of linked
census records, Sherry Olson’s deserves special mention for its excellent
advice regarding the fine points of how to link records and
recover certain kinds of data, such as addresses. Evan Roberts connects
a unique 1924/5 survey of 477 Chicago families to the 1920
and 1930 U.S. censuses, revealing the high degree of occupational
and spatial mobility that existed even within a five-year period. His
plan to compare the survey data with the forthcoming release of
1940 census data is keenly anticipated for the light that it could shed
on family strategies for coping with the Great Depression.

Two chapters analyze data obtained by a linkage between
census microdata and administrative records. John Cranﬁeld and
Inwood work with microdata from Canada’s 1901 census and
Canadian Expeditionary Force military records from 1914 to
1918 to examine differences in height between British and French
soldiers. Based on similar linked data, Allegra Fryxell, Inwood, and
Aaron van Tassel explore the participation of Australian Aboriginal
soldiers in World War I, a topic that has received limited attention.
Four chapters delve into sets of administrative records to extract
novel data with exceptional interest. Rebecca Kippen and Janet
McCalman consult the criminal records of prisoners who arrived
in Tasmania from 1826 to 1838 and follow up with administrative
records after their arrival, including useful data about a special group
of working-class rioters transported to Tasmania during this period.
Their efforts result in an interesting comparison group for under-
standing the selection bias of data about convicts and their outcomes
in Tasmania. Hamish Maxwell-Stewart and Kippen deal with three
administrative data sets—medical records for individual convicts
transported to Australia, contextual data about the 289 convict vessels
sailing from British or Irish ports, and various records collected after
convicts’ arrival. The upshot is intriguing evidence about the relationship
between the condition of a vessel and male/female mortality,
among other things. Rebecca Lenihan uses a genealogical register
in New Zealand to study Scottish immigrants—a self-selected data set
based on records already examined rather than a sample of all possible
immigrant arrivals (single men who did not remain long and did not
have descendants are likely missing from such genealogical registers).
Linked longitudinal data does not apply to individuals only. Kenneth
M. Sylvester and Susan Hautaniemi Leonard use agricultural censuses
from Kansas to link data about twenty-five farming communities
from 1875 to 1940 to test several ideas about the availability of land,
labor mobility, and the evolution of family farms.

Kandace Bogaert, Jane van Koeverden, and D. Ann Herring
follow a cohort of Polish-American military personnel in 1917—
from their recruitment in the United States through their training
at Camp Kosciuszko in Niagara-on-the-Lake, Ontario—to discover
their mortality experience during the deadly 1918 influenza
epidemic. Although the most common use of the term cohort is in
relation to birth, such as a group of babies born during the 1930s,
the demographic deﬁnition of cohort relates to a collection of peo-
ple experiencing a common event during a common period of
time, like these Polish-American soldiers.

These twelve chapters show how the analysis of quantitative
longitudinal data can reveal new areas for social science and historical
research. They provide helpful instruction regarding the creation of
longitudinal data for the study of traditional topics as well as new
areas of investigation.

STATISTICAL ANALYSIS OF LIFE-COURSE DATA Current life-course
analysis relies on several multivariate statistical tools. Statisticians
proposed new methods for the analysis of event changes observed
in longitudinal data several decades ago; these initial models have
evolved into generalized linear mixed varieties (also known as
multilevel or hierarchical models) that are suitable for analysis of
continuous, binary, or counted data over time. Generalized linear
mixed models, as implemented in several statistical packages, are
described in applied statistics textbooks.1 Other statistical methods,
such as the double-cohort approach, have been developed in re-
cent years for analysis of longitudinal immigration data.2

There are several speciﬁc issues in statistical analysis that the
chapters in this volume bring to light. First, sample size should
always be stated in tables and figures. In cases when analysis deals
with a total sample, readers can correctly infer sample size, but in
cases when analysis is limited to a selected group, sample size needs
to be noted. Second, several chapters in this volume make appropriate
use of logistic regression analysis, but they would have done
well to cite several overall tests, including the chi-square fit and its
statistical significance, as well as the adjusted R-squared. Third, when
giving rates, researchers should show the number of observations
for the denominator so that readers can compare the sample size
used for the different rates. Finally, figures that are perfectly comprehensible
in color may need to be designed for presentation in
black and white. The use of different types of shading or markers
can help readers to identify different categories in the figures.
In addition, black-and-white figures can be improved by showing
numbers or percentages if comparisons are not clear.

Several chapters in this volume use logistic regression models
to indicate factors affecting a binary-outcome variable, such as
labor-force participation. Because the binary logit model is non-
linear, it is a challenge to interpret the relationship between an
explanatory variable and the outcome. One common approach—
taken by some of the contributors to this volume—is to take the
exponent of the logit regression coefﬁcient, which indicates the
expected change in the odds of the outcome. But, for most analysts,
interpreting changes in the odds of an outcome is difficult because the
implied change in probability depends on the base probability.
Consider two examples: (1) If the
odds of the outcome are 1/100, the corresponding probability is

1 See Sophia Rabe-Hesketh and Anders Skrondal, Multilevel and Longitudinal Modeling Using
Stata (College Station, 2012; orig. pub. 2005); Judith D. Singer and John B. Willett, Applied
Longitudinal Data Analysis: Modeling Change and Event Occurrence (New York, 2003).
2 See, for example, Edmonston and Sharon M. Lee, “Immigrants’ Transition to Homeownership,
1991 to 2006,” Canadian Studies in Population, XL (2013), 57–74.

about 0.01. If the effect of an explanatory variable doubles the odds,
the odds increase to 2/100, and the probability doubles to about
0.02. In this instance, the interpretation is fairly clear because a doubling
of the odds increases the outcome probability by a factor of
about two. (2) If the odds of the outcome are 1/1, the corresponding
probability is 0.50. If the odds double to 2/1, however, the corresponding
probability increases to 0.67. In this instance, there is no
clear intuitive interpretation for the relationship of a change in the
odds and the change in the corresponding probability.
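
The two examples can be checked with a short calculation. The Python sketch below is only an illustration of the arithmetic; the function names are introduced here and do not come from the book or from any statistical package. It converts odds to a probability with p = odds / (1 + odds) and compares the probabilities before and after the odds are doubled.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds (p / (1 - p)) to the probability p."""
    return odds / (1.0 + odds)


def probability_after_odds_change(base_odds: float, odds_ratio: float) -> float:
    """Probability implied by multiplying the base odds by an odds ratio."""
    return odds_to_probability(base_odds * odds_ratio)


# The two base odds used in the text: 1/100 and 1/1, each doubled.
for base_odds in (1 / 100, 1 / 1):
    before = odds_to_probability(base_odds)
    after = probability_after_odds_change(base_odds, 2.0)
    print(f"odds {base_odds:.2f}: p = {before:.3f} -> {after:.3f} "
          f"(ratio of probabilities {after / before:.2f})")
```

With base odds of 1/100, doubling the odds multiplies the probability by about 1.98. With base odds of 1/1, the same doubling multiplies the probability by only about 1.33.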

The key point is that a constant change in the odds ratio does
not correspond to a constant change in the outcome probability.
The change in the outcome probability is affected by the base
probability. In most analyses, when the base probability is not close
to zero, it is preferable to interpret binary logit coefﬁcients in terms
of changes in the outcome probabilities. In medical research, the
outcome probability is often close to zero; hence, health researchers
routinely cite the odds ratio for interpreting logistic regression
coefficients. If, for example, the coefficient for cigarette smoking
is 2.1, and the probability for a non-smoker to contract lung
cancer is 0.0001, then cigarette smokers are about 8 times (the exponent
of 2.1 is about 8) more likely to get lung cancer, with a probability
of about 0.0008. In this case, the interpretation of the odds ratio is clear.
However, the interpretation is not apparent when the outcome
probability is not close to zero, which is the case for many out-
comes in historical and social-science research.
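
The contrast between rare and common outcomes can be made concrete with the smoking figures above. In the Python sketch below, the coefficient of 2.1 and the base probability of 0.0001 come from the text; the base probability of 0.5 for a "common" outcome is an assumption added for comparison, and the function name is hypothetical.

```python
import math


def apply_logit_coefficient(base_probability: float, coefficient: float) -> float:
    """Shift a base probability by a logit coefficient and return the new probability."""
    base_odds = base_probability / (1.0 - base_probability)
    new_odds = base_odds * math.exp(coefficient)
    return new_odds / (1.0 + new_odds)


coefficient = 2.1                    # value quoted in the text
odds_ratio = math.exp(coefficient)   # a little over 8

rare = apply_logit_coefficient(0.0001, coefficient)   # base probability from the text
common = apply_logit_coefficient(0.5, coefficient)    # assumed base probability

print(f"odds ratio: {odds_ratio:.2f}")
print(f"rare outcome:   0.0001 -> {rare:.6f} (ratio of probabilities {rare / 0.0001:.2f})")
print(f"common outcome: 0.5    -> {common:.3f} (ratio of probabilities {common / 0.5:.2f})")
```

For the rare outcome, the ratio of probabilities is nearly identical to the odds ratio, so "about 8 times more likely" is a fair reading. For the assumed base probability of 0.5, the same coefficient raises the probability to roughly 0.89, a ratio of less than 2.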

A variety of approaches for interpreting the relationship be-
tween explanatory and outcome variables for binary logit models
is possible when the outcome probability is not close to zero.3 A
useful one is to compute predicted values or probabilities of the out-
come for speciﬁed values of the explanatory variables. Such predicted
probabilities for the outcome are also called predictive margins, or
adjusted predictions, in the statistical literature. Predicted probabili-
ties have a straightforward interpretation because they indicate the
outcome probability for a speciﬁc value of an explanatory variable,
holding constant all other explanatory variables.

Consider a logit regression model that predicts labor-force
participation (a binary variable coded 0 if not in the labor force

3 J. Scott Long and Jeremy Freese, Regression Models for Categorical Dependent Variables Using
Stata, Second Edition (College Station, 2006; orig. pub. 2001).

and 1 if in the labor force) based on sex and education. The predicted
probability for females is the average probability of labor-force
participation if everyone in the data is treated as female while
all other variables keep their observed values and are weighted by
their estimated regression coefficients. The predicted probabilities
for females and males, in this example, give the expected probability
of labor-force participation by sex, holding education constant.

To make this example specific, consider the following estimated
logit regression equation: Y = 0.5 + 0.1 Sex + 0.2 Education,
where Y is the log-odds of a binary outcome variable (0 for not in the labor
force and 1 for in the labor force), Sex is a dummy variable (0 for
female and 1 for male), and Education is a dummy variable (0 for
less than high school and 1 for high school). Further suppose a
data set of 100 adults (50 females and 50 males). If the constant term
is 0.5, the effect of sex (being male) is 0.1, and the effect of education
(being a high-school graduate) is 0.2, then the odds of being in the
labor force are 1.11 times (1.11 is the exponentiation of 0.1) higher
for males than for females, holding education constant.
But, in this case, there is no clear intuitive interpretation for an odds
ratio of 1.11. How can the use of predicted probabilities assist the
interpretation?

We calculate the predicted probability of labor-force participation
as follows. For females, we assume that each of the
100 persons is female but retains his or her observed education.
Each person therefore has 0.0 for sex, and we take into account
the estimated effect of each person’s actual education
(0.0 for less than high school and 0.2 for high school). The resulting
value for each person is converted to a probability, and the sum
of these probabilities for all 100 persons is divided by the
observed sample size, which is 100 in this case, to yield a predicted
probability. In this hypothetical case with an equal number of
males and females and an equal distribution of education groups
for each sex, the predicted probability would be 0.65 for female
labor-force participation, holding education constant. The similar
calculation for males produces a predicted probability of 0.70 for
male labor-force participation, holding education constant. These
two values—65 percent for females and 70 percent for males—offer
a useful interpretation of the logit regression coefﬁcients for sex in
terms of predicted probabilities, holding all other explanatory vari-
ables constant.
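
The averaging procedure can be written out directly. The Python sketch below is an illustration under stated assumptions, not the author's own computation: it uses the coefficients from the example equation, treats the linear predictor as log-odds, and assumes that "an equal distribution of education groups for each sex" means 25 persons in each sex-by-education cell. The function names are hypothetical.

```python
import math

# Coefficients from the example equation in the text.
INTERCEPT, B_SEX, B_EDUCATION = 0.5, 0.1, 0.2


def predicted_probability(sex: int, education: int) -> float:
    """Inverse-logit of the linear predictor for one person."""
    linear = INTERCEPT + B_SEX * sex + B_EDUCATION * education
    return 1.0 / (1.0 + math.exp(-linear))


# Assumed data set: 100 adults, 25 in each sex-by-education cell.
people = [(sex, edu) for sex in (0, 1) for edu in (0, 1) for _ in range(25)]


def predictive_margin(sex_value: int) -> float:
    """Average predicted probability when everyone is assigned `sex_value`
    but keeps his or her observed education."""
    total = sum(predicted_probability(sex_value, edu) for _, edu in people)
    return total / len(people)


margin_female = predictive_margin(0)
margin_male = predictive_margin(1)
print(f"female: {margin_female:.3f}, male: {margin_male:.3f}, "
      f"difference: {margin_male - margin_female:.3f}")
```

Under these particular assumptions the averages come to roughly 0.645 for females and 0.668 for males, so the 0.65 and 0.70 quoted in the text are best read as rounded, illustrative values. The logic of the comparison is the same either way: the two margins differ only because sex is switched while education is left as observed.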

In this simple example, the coefficient of 0.1 for sex means
that 65 percent of females are predicted to be in the labor force
compared to 70 percent of males, holding their observed education
constant. In other words, males are predicted to have labor-force
participation rates that are 5 percentage points higher than
females. Usually, there are more explanatory variables and more
categories in some variables. In addition, logit regression equations
often include continuous variables, which are evaluated using
the observed variable value multiplied by the estimated regression
coefficient. Fortunately, the calculation of predicted probabilities
for binary logit models is easy to achieve. With Stata software,
for example, the margins command performs this calculation after
estimating a logistic regression model.4

Lives in Transition provides a valuable sampling of empirical re-
search by historians using longitudinal data. It improves upon
the current understanding of a variety of historical issues by focus-
ing on different aspects and stages of individual life courses and
identifying questions for further study. It demonstrates the impor-
tant empirical challenges and presents a variety of substantive
topics for longitudinal analysis. Overall, the chapters show that
analysis of longitudinal data illuminates historical studies in new
ways, providing insights about the factors affecting changes in
individual lives.

4 Stata, Stata Base Reference Manual: Release 14 (College Station, Texas, 2014).
