SOCIAL NETWORKS AND RESEARCH OUTPUT

Lorenzo Ductor, Marcel Fafchamps, Sanjeev Goyal, and Marco J. van der Leij*

Abstract—We study how knowledge about the social network of an individ-
ual researcher, as embodied in his coauthor relations, helps us in developing
a more accurate prediction of his or her future productivity. We ﬁnd
that incorporating information about coauthor networks leads to a mod-
est improvement in the accuracy of forecasts on individual output, over and
above what we can predict based on the knowledge of past individual
output. Second, we find that the informativeness of networks dissipates over
the lifetime of a researcher’s career. This suggests that the signaling content
of the network is quantitatively more important than the ﬂow of ideas.

I. Introduction

Good recruitment requires an accurate prediction of a
candidate’s potential future performance. Sports clubs,
academic departments, and business firms routinely use past
performance as a guide to predict the potential of applicants
and to forecast their future performance. In this paper, the
focus is on researchers.

Social interaction is an important aspect of research
activity: researchers discuss and comment on each other’s work,
they assess the work of others for publication and for
prizes, and they join together to coauthor publications. Sci-
entiﬁc collaboration involves the exchange of opinions and
ideas and facilitates the generation of new ideas. Access
to new and original ideas in turn may help researchers be
more productive. It follows that other things being equal,
individuals who are better connected and more central in
their professional network may be more productive in the
future.

Network connectedness and centrality arise out of links
created by individuals and thus reflect their individual
characteristics: ability, sociability, and ambition, for example. Since
the ability of a researcher is imperfectly known, the existence
of such ties may be informative.

These considerations suggest that someone’s collaboration
network is related to his or her research output in two ways:
the network serves as a conduit of ideas and signals individual
quality. The first channel suggests a causal relationship from

Received for publication August 27, 2011. Revision accepted for
publication May 3, 2013. Editor: Philippe Aghion.

* Ductor: Massey University; Fafchamps: University of Oxford and Mans-
ﬁeld College; Goyal: University of Cambridge and Christ’s College; van der
Leij: CeNDEF, University of Amsterdam; De Nederlandsche Bank; and Tin-
bergen Institute.

We thank the editor and two anonymous referees for a number of help-
ful comments. We are also grateful to Maria Dolores Collado, Markus
Mobius, and conference participants at SAEe (Vigo), Bristol, Stern (NYU),
Microsoft Research, Cambridge, MIT, Alicante, Oxford, Tinbergen Institute,
Stockholm University, and City University London for useful comments.
L.D. gratefully acknowledges financial support from the Spanish
Ministry of Education (Programa de Formacion del Profesorado Univer-
sitario). S.G. thanks the Keynes Fellowship for ﬁnancial support. M.L.
thanks the Spanish Ministry of Science and Innovation (project SEJ2007-
62656) and the NWO Complexity program for ﬁnancial support. The views
expressed are our own and do not necessarily reﬂect ofﬁcial positions of De
Nederlandsche Bank.

A supplemental appendix is available online at
http://www.mitpressjournals.org/doi/suppl/10.1162/REST_a_00430.

network to research output, whereas the second does not.
Determining causality would clarify the importance of the
two channels. Unfortunately, as is known in the literature on
social interactions (Manski, 1993; Moffitt, 2001), identifying
network effects in a causal sense is difﬁcult in the absence of
randomized experiments.

In this paper, we take an alternative route: we focus on
the predictive power of social networks in terms of future
research output. That is, we investigate how much current
and past information on collaboration networks contributes
to forecasting future research output. Causality in the sense of
prediction informativeness is known as Granger causality and
is commonly analyzed in the macroeconometrics literature;
for example, Stock and Watson (1999) investigate the predictive
power of the unemployment rate and other macroeconomic
variables in forecasting inflation.1

Finding that network variables Granger-cause future out-
put does not constitute conclusive evidence of causal network
effects in the traditional sense. Nevertheless, it implies that
knowledge of a researcher’s network can potentially be
used by an academic department in making recruitment
decisions.

We apply this methodology to evaluate the predictive
power of collaboration networks on future research out-
put, measured in terms of future publications in economics.
We ﬁrst ask whether social network measures help predict
future research output beyond the information contained
in individual past performance. We then investigate which
speciﬁc network variables are informative and how their
informativeness varies over a researcher’s career.

Our ﬁrst set of ﬁndings is about the information value of
networks. We find that including information about coauthor
networks leads to an improvement in the accuracy of forecasts
about individual output over and above what we can predict
based on past individual output. The effect is signiﬁcant but
modest; the root mean squared error in predicting future
productivity falls from 0.773 to 0.758 and the R2 increases from
0.395 to 0.417. We also observe that several network variables,
such as productivity of coauthors, closeness centrality,
and the number of coauthors, have predictive power. Of those,
the productivity of coauthors is the most informative network
statistic among those we examine.

Second, the predictive power of network information
varies over a researcher’s career: it is more powerful for
young researchers but declines systematically with career
time. By contrast, information on recent past output remains
a strong predictor of future output over an author’s entire
career. As a result, fourteen years after the onset of a

1 A few examples of applications that have determined the appropriateness
of a model based on its ability to predict are Swanson and White (1997),
Sullivan, Timmermann, and White (1999), Lettau and Ludvigson (2001),
Rapach and Wohar (2002) and Hong and Lee (2003).

The Review of Economics and Statistics, December 2014, 96(5): 936–948
© 2014 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. Published under a Creative Commons Attribution 3.0
Unported (CC BY 3.0) license.
doi:10.1162/REST_a_00430


researcher’s publishing career, networks do not have any
predictive value on future research output over and above
what can be predicted using recent and past output alone.

Our third set of ﬁndings is about the relation between
author ability and the predictive value of networks. We
partition individual authors in terms of past productivity
and examine the extent to which network variables pre-
dict their future productivity. We ﬁnd that the predictive
value of network variables is nonmonotonic with respect
to past productivity. Network variables do not predict the
future productivity of individuals with below-average initial
productivity. They are somewhat informative for individu-
als in the highest past-productivity tier group. But they are
most informative about individuals in between. In fact, for
these individuals, networks contain more information about
their future productivity than recent research output. Taken
together, these results predict that academic recruiters would
beneﬁt from gathering and analyzing information about the
coauthor network of young researchers, especially for those
who are relatively productive.

This paper is a contribution to the empirical study of
social interactions. Traditionally, economists have studied the
question of how social interactions affect behavior across
well-deﬁned groups, paying special attention to the difﬁ-
culty of empirically identifying social interaction effects.
(For an overview of this work, see, e.g., Moffitt (2001) and
Glaeser and Scheinkman (2003).) In recent years, interest
has shifted to the ways by which the architecture of social
networks inﬂuences behavior and outcomes.2 Recent empir-
ical papers on network effects include Bramoullé, Djebbari
and Fortin (2009), Calvó-Armengol, Patacchini, and Zenou
(2009), Conley and Udry (2010), and Fafchamps, Goyal, and
van der Leij (2010).

This paper is also related to a more specialized litera-
ture on research productivity. Two recent papers, Azoulay,
Zivin, and Wang (2010) and Waldinger (2010), both use
the unanticipated removal of individuals as a natural experi-
ment to measure network effects on researchers’ productivity.
Azoulay et al. (2010) study the effects of the unexpected death
of superstar life scientists. Their main finding is that coauthors
of these superstars experience a 5% to 8% decline in
their publication rate. Waldinger (2010) studies the dismissal
of Jewish professors from Nazi Germany in 1933 to 1934. His
main ﬁnding is that a fall in the quality of a faculty has sig-
niﬁcant and long-lasting effects on the outcomes of research
students. Our paper quantiﬁes the predictive power of net-
work information over and above the information contained
in past output.

The rest of the paper is organized as follows. Section II
lays out the empirical framework. Section III describes the
data and deﬁnes the variables. Section IV presents our ﬁnd-
ings. Section V checks the robustness of our main ﬁndings.
Section VI concludes.

2 For a survey of the theoretical work on social networks see Goyal (2007),
Jackson (2008), and Vega-Redondo (2007).

II. Empirical Framework

It is standard practice in most organizations to look at
the past performance of job candidates as a guide to their
future output. This is certainly true for the recruitment and
promotion of researchers, possibly because research output—
journal articles and books—is publicly observable.

The practice of looking at past performance appears to
rest on two ideas. The ﬁrst is that a researcher’s output
largely depends on ability and effort. The second is that
individuals are aware of the relationship between perfor-
mance and reward and consequently exert effort consistent
with their career goals and ambition. This potentially creates
a stable relationship between ability and ambition, on
the one hand, and individual performance, on the other hand.
Given this relationship, it is possible to (imperfectly) predict
future output on the basis of past output. In this paper, we
start by asking how well past performance predicts future
output.

We then ask if future output can be better predicted if we
include information about an individual’s research network.
Social interaction among researchers takes a variety of forms,
some of it more tangible than others. Our focus is on social
interaction, reﬂected in the coauthorship of a published paper,
a concrete and quantiﬁable form of interaction. Coauthorship
of academic articles in economics rarely involves more than
four authors, so it is likely that coauthorship entails personal
interaction. Moreover, given the length of papers and the
duration of the review process in economics, it is reasonable
to suppose that collaboration entails communication over
an extended period of time. These considerations—personal
interaction and sustained communication—in turn suggest
several ways by which someone’s coauthorship network can
reveal valuable information on their future productivity. We
focus on two: research networks as a conduit of ideas and
coauthorship as a signal about unobserved ability and career
objectives.

Consider ﬁrst the role of research networks as a conduit for
ideas. Communication in the course of research collaboration
involves the exchange of ideas, so we expect that a researcher
who is collaborating with highly creative and productive peo-
ple has access to more new ideas. This in turn suggests that a
researcher who is close to more productive researchers may
have early access to new ideas. As early publication is a key
element in the research process, early access to new ideas can
lead to greater productivity. These considerations lead us to
expect that other things being equal, an individual who is in
close proximity to highly productive authors will on average
have greater future productivity.

Proximity need not be immediate, however: if A coauthors
with B and B coauthors with C, then ideas may ﬂow from A to
C through their common collaborator B. The same argument
can be extended to larger network neighborhoods. It follows
that authors who are more central in the research network
are expected to have earlier and better access to new research
ideas.


As a ﬁrst step, we look at how the productivity of an
individual, say i, varies with the productivity of his or her
coauthors. We then examine whether i’s future productivity
depends on the past productivity of the coauthors of his or
her coauthors. Finally, we generalize this idea to i’s centrality
in the network in terms of how close a researcher is to all
other researchers (closeness) or how critical a researcher is
to connections among other researchers (betweenness)—the
idea being that centrality gives privileged access to ideas that
can help a researcher’s productivity.

Access to new ideas may open valuable opportunities, but
it takes ability and effort to turn a valuable idea into a publication
in an academic journal. It is reasonable to suppose that the
usefulness of new ideas varies with ability and effort. In
particular, a more able researcher is probably better able than a
less able researcher to turn the ideas accessed through the net-
work into publications. Since ability and industriousness are
reﬂected in past performance, we expect the value of a social
network to vary with past performance. To investigate this
possibility, we partition researchers into different-tier groups
based on their past performance and examine whether the
predictive power of having productive coauthors and other
related network variables varies systematically across tier
groups.

The second way by which network information may help
predict future output is that the quantity and quality of one’s
coauthors is correlated with, and thus can serve as a signal
for, an individual’s hidden ability and ambition. Given the
commitment of time and effort involved in a research col-
laboration, it is reasonable to assume that researchers do not
casually engage in a collaborative research venture. Thus
when a highly productive researcher forms and maintains a
collaboration with another, possibly more junior, researcher
i, this link reveals positive attributes of i that could not be
inferred from other observable data. Over time, however,
evidence on i’s performance accumulates, and residual uncertainty
about i’s ability and industriousness decreases. We
therefore expect the signal value of network characteristics
to be higher at the beginning of a researcher’s career and to
fall afterward.

Our empirical strategy is based on these ideas. Since our
focus is on predictive power, we worry that overﬁtting may
bias inference. To avoid this, we divide the sample into two
halves—one used to obtain parameter estimates and the other
to assess the out-of-sample predictive power of these esti-
mates. We thus begin by randomly dividing the authors into
two equal-size groups. The ﬁrst half of the authors is used
to estimate a regression model of researcher output. We then
use the estimated coefﬁcients obtained from the model ﬁtted
on the ﬁrst half of the authors to predict researcher output for
the authors in the second half of the data. We compare these
predictions with actual output.
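
The split-sample procedure described above can be illustrated with a short sketch. This is not the authors' code: the function name `split_sample_rmse`, the use of plain least squares, and the random seed are assumptions made for illustration. The split is done by author, so that all observations of one author fall in the same half.

```python
import numpy as np

def split_sample_rmse(author_ids, X, y, seed=0):
    """Fit OLS on a random half of the authors; return RMSE on the other half.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    authors = np.unique(author_ids)
    shuffled = rng.permutation(authors)
    train_authors = shuffled[: len(shuffled) // 2]
    in_train = np.isin(author_ids, train_authors)

    # Add an intercept column and estimate coefficients on the first half only.
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc[in_train], y[in_train], rcond=None)

    # Predict the held-out authors and compare with their actual output.
    resid = y[~in_train] - Xc[~in_train] @ beta
    return float(np.sqrt(np.mean(resid ** 2)))
```

Comparing the value returned for two different regressor sets (for example, with and without network variables) on the same split mirrors the comparison of out-of-sample accuracy across models.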

The purpose of this procedure is to assess the out-of-sample
prediction performance of the model. The reason for using
out-of-sample predictions is that in-sample errors are likely
to understate forecasting errors. As Fildes and Makridakis
(1995) stated, “The performance of a model on data outside
that used in its construction remains the touchstone for
its utility in all applications” regarding predictions. Another
drawback of in-sample tests is that they tend to reject the null
hypothesis of no predictability. In other words, in-sample tests
of predictability may spuriously indicate predictability when
there is none.3

The rest of this section develops some terminology and
presents the regressions more formally. We begin by describ-
ing the ﬁrst step of our procedure and then explain how we
assess prediction performance. The dependent variable of
interest is a measure yit of the future output of author i at
time t, defined in more detail in section III. This measure takes
into account the number of articles published, the length of
each article, and the ranking of the journal where the article
appears.

We ﬁrst study predictions of yit based on past output and a
set of controls xit. Control variables include cumulative output
since the start of i’s career until t − 5; career time dummies;
year dummies; and the number of years since i’s last publi-
cation. Career time dummies are included to capture career
cycle effects—that researchers publish less as they approach
retirement. We then examine by how much recent research
output and network characteristics improve the prediction.
We also compare the accuracy of the prediction when we use
only past output and when we combine it with recent network
characteristics.

The order of the regression models we estimate is as fol-
lows. We start with benchmark model 0, which examines the
predictive power of the control variables xit:

Model 0

$$y_{i,t+1} = x_{it}\beta + \varepsilon_{it}.$$

We then include recent individual output $y_{it}$ as an additional
regressor. This yields model 1:

Model 1

$$y_{i,t+1} = x_{it}\beta + y_{it}\gamma_1 + \varepsilon_{it}.$$

In model 2 we investigate the predictive power of network
variables $z_{it}$:

Model 2

$$y_{i,t+1} = x_{it}\beta + z_{it}\gamma_2 + \varepsilon_{it}.$$

Network variables include the number of i’s coauthors up
to time t, the productivity of these coauthors, and different
network centrality measures detailed in the data section. We
estimate model 2 first with one network variable at a time,
then include network variables simultaneously.

Finally, in model 3 we ask if network variables $z_{it}$
improve the prediction of future output over and above
3 Arguments in favor of using out-of-sample predictions can be found
in Ashley, Granger, and Schmalensee (1980) who state that “a sound and
natural approach” to testing predictability “must rely primarily on the out-
of-sample forecasting performance of models relating the original series of
interest” (p. 1149). Along with Fair and Shiller (1990), they also conjecture
that out-of-sample inference is more robust to model selection biases and
to overﬁtting or data mining.


the prediction obtained from model 1, that is, from past
productivity:

Model 3

$$y_{i,t+1} = x_{it}\beta + y_{it}\gamma_1 + z_{it}\gamma_2 + \varepsilon_{it}.$$

Here too we first consider one network variable at a time to
ascertain which network characteristics have more predictive
power. We also estimate model 3 with several network variables
together to evaluate the overall information contained
in the network.

Models 0, 1, and 2 are nested in model 3. A comparison
of models 1 and 2 allows us to investigate the relative
information content of recent individual output and recent
social network. A comparison of models 1 and 3 examines
whether social network variables have explanatory power
over and above the information contained in recent individual
output.

For models 2 and 3, we consider both regressions with a
single network variable and regressions with multiple network
variables. In the latter case, since our ultimate purpose
is to predict research output, we need a criterion to select a
parsimonious set of regressors, so as to avoid overfitting. To
select among social network regressors, we use the Bayesian
information criterion (BIC). We find that in our case, the
lowest values of the BIC are obtained when all the network
variables are included, which is why our final specification
of the multivariate model includes them all.

The previous models are called restricted models because
we are imposing the constraint that the lagged productivity
variables since the start of i’s career until t − 5 have
the same effect on future productivity. Moreover, in these
models, we consider only five-year network variables: each
network variable is computed assuming that a link between
author i and her coauthor has a predictive effect that lasts
for five years. These restricted models are simple to estimate
and allow us to compare the predictive power of
network variables and recent output. But we may be able to
improve the predictions of the restricted models by relaxing
the constraint that productivity lags have the same coefficient.
Similarly, the predictive power of the network variables
might increase if we include several lags of the network
variables.

To see whether this is the case, we also estimate versions
of models 1, 2, and 3 that include several lags of the
productivity and network variables. The number of lags of
the productivity and network variables is selected using the
BIC. We call these the unrestricted models. The benchmark
unrestricted model, model 1’, contains thirteen lags of the
productivity variable and a new set of control variables $x_{it}$:
career dummies, time dummies, and years since the last
publication. This model examines the predictive power of past
output:

Model 1’

$$y_{i,t+1} = x_{it}\beta + \sum_{s=0}^{12} y_{i,t-s}\gamma_s + \varepsilon_{it}.$$

We also consider an unrestricted model with only network
information, model 2’:

Model 2’

$$y_{i,t+1} = x_{it}\beta + \sum_{s=0}^{T} z_{i,t-s}\theta_s + \varepsilon_{it},$$

where T is the maximum lag length of the network variable
selected using the BIC. For example, if T = 14, we
include lags from $z_{i,t-14}$ to $z_{it}$, where $z_{i,t-14}$ is the network
variable obtained by combining all joint publications from
t − 14 to t and $z_{it}$ is the network variable computed using the
joint publications at period t. A comparison of models 1’ and 2’
provides insights about the importance of past networks,
relative to past output.

The unrestricted model 3, model 3’, combines all past
output and past network information:

Model 3’

$$y_{i,t+1} = x_{it}\beta + \sum_{s=0}^{12} y_{i,t-s}\gamma_s + \sum_{s=0}^{T} z_{i,t-s}\theta_s + \varepsilon_{it}.$$

We also estimate models 2’ and 3’ with multiple network
variables. A comparison of models 1’ and 3’ allows us to
examine the explanatory power of network variables over
and above knowledge of past output.

This describes the first step of our analysis. In the second
step, we evaluate the predictive accuracy of the different
models. To this effect, we compare, in the second half of the
data, the actual research output $y_{i,t+1}$ to the predictions $\hat y_{i,t+1}$
obtained by applying to authors in the second half of the data
the regression coefficients of restricted models 0 to 3 and
unrestricted models 1’ to 3’ obtained from the first half of the
data. To evaluate the prediction accuracy of $\hat y_{i,t+1}$, we report
the root-mean-squared errors (RMSE) defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i,t} \left(y_{i,t+1} - \hat y_{i,t+1}\right)^2}.$$

If the introduction of an explanatory variable in $\hat y_{i,t+1}$
decreases the out-of-sample RMSE, this variable contains
useful information that helps predict researchers’ future
productivity.

In order to assess whether forecasts from two models are
significantly different, we use a test described by Diebold and
Mariano (1995). This test is based on the loss differential of
forecasting the future output of an individual i, $d_{i,t}$. As we
measure the accuracy of each forecast by a squared error loss
function (RMSE), we apply the Diebold-Mariano test to a
squared loss differential, that is,

$$d_{i,t} = \varepsilon^2_{Ai,t} - \varepsilon^2_{Bi,t},$$

where A is a competing model and B is the benchmark model.
To determine if one model predicts better, we test the
null hypothesis, $H_0 : E[d_{i,t}] = 0$, against the alternative,

$H_1 : E[d_{i,t}] \neq 0$. Under the null hypothesis, the
Diebold-Mariano test is

$$\frac{\bar d}{\sqrt{\hat V(\bar d)/n}} \sim N(0, 1),$$

where $\bar d = n^{-1} \sum_{i,t} d_{i,t}$ is the average loss differential and
$\hat V(\bar d)$ is a consistent estimate of the asymptotic (long-run)
variance of $\sqrt{n}\,\bar d$. We adjust for serial correlation by using a
Newey-West type estimator of $\hat V(\bar d)$.4
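
As a rough illustration of the test statistic, the sketch below computes a Diebold-Mariano statistic for squared-error loss with a Bartlett-weighted long-run variance. It is a simplification under stated assumptions: the forecast errors are pooled into a single ordered series rather than treated as an author-year panel, the truncation lag `m` is supplied by the caller rather than selected by the BIC, and the function name `diebold_mariano` is hypothetical.

```python
import math
import numpy as np

def diebold_mariano(err_a, err_b, m):
    """DM statistic for squared-error loss; negative values favor model A.

    err_a, err_b: forecast errors of the competing and benchmark models.
    m: truncation lag; Bartlett weights are 1 - tau / m for tau < m.
    """
    a = np.asarray(err_a, dtype=float)
    b = np.asarray(err_b, dtype=float)
    d = a ** 2 - b ** 2                      # squared loss differential
    n = len(d)
    d_bar = d.mean()
    dc = d - d_bar

    # Long-run variance: gamma_0 plus twice the weighted autocovariances.
    lrv = np.dot(dc, dc) / n
    for tau in range(1, m):
        gamma = np.dot(dc[tau:], dc[:-tau]) / n
        lrv += 2.0 * (1.0 - tau / m) * gamma

    stat = d_bar / math.sqrt(lrv / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided, N(0, 1)
    return stat, p_value
```

With `m = 1` no autocovariance terms enter and the statistic reduces to an ordinary t-ratio on the mean loss differential.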

III. Data

The data used for this paper are drawn from the EconLit
database, a bibliography of journals in economics compiled
by the editors of the Journal of Economic Literature. From
this database, we use information on all articles published
between 1970 and 1999. These data are the same as those
analyzed by Goyal, van der Leij, and Moraga-González (2006),
Fafchamps et al. (2010), van der Leij and Goyal (2011), and
Ductor (2014).

A. Definition of variables

The output $q_{it}$ of author i in year t is defined as

$$q_{it} = \sum_{j \in S_{it}} \text{journal quality}_j, \qquad (1)$$

where Sit is the set of articles j of individual i published in
year t. When available, the journal quality variable is taken
from the work of Kodrzycki and Yu (2006, hereafter KY).5
Unfortunately, KY do not include in their analysis all the
journals in the EconLit database. To avoid losing information
and to minimize measurement error in research output, we
construct a prediction of the KY quality index of journals not
included in their list.6 The actual KY journal quality index is
used whenever available.

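4 Formally, $\hat V(\bar d) = \sum_i \left(\hat\gamma_0 + 2 \sum_{\tau=1}^{T-t} w_{m(T)} \hat\gamma_\tau\right)$, and $\hat\gamma_\tau = \widehat{\mathrm{Cov}}(d_{i,t}, d_{i,t-\tau})$,
where $w_{m(T)}$ is the Bartlett kernel function:

$$w_{m(T)} = \begin{cases} 1 - \dfrac{\tau}{m(T)} & \text{if } 0 \le \dfrac{\tau}{m(T)} \le 1, \\ 0 & \text{otherwise,} \end{cases}$$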

We are interested in predicting future output. In economics,
the annual number of papers per author is small and affected
by erratic publication lags. We therefore need a reasonable
time window over which to aggregate output. The results
presented here are based on a three-year window, but our
findings are insensitive to the use of alternative window lengths
(e.g., five years).7 Our dependent variable of interest is thus
the output of author i in years t + 1, t + 2, t + 3:

$$q^f_{it} = q_{i,t+1} + q_{i,t+2} + q_{i,t+3}. \qquad (2)$$

Unsurprisingly, $q^f_{it}$ has a long upper tail. To prevent our
results from being entirely driven by a handful of highly
productive individuals, we log the dependent variable as
follows:8

$$y_{i,t+1} = \ln\left(1 + q^f_{it}\right).$$

The analysis presented in the rest of the paper uses $y_{i,t+1}$ as
the dependent variable.
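
The construction of the dependent variable can be sketched in a few lines. The input format (tuples of author, year, and journal-quality weight) and both function names are assumptions made for illustration; they are not taken from the paper or its data files.

```python
import math
from collections import defaultdict

def yearly_output(articles):
    """Sum journal-quality weights per (author, year).

    articles: iterable of (author, year, quality) tuples (assumed format).
    """
    q = defaultdict(float)
    for author, year, quality in articles:
        q[(author, year)] += quality
    return q

def future_log_output(q, author, t, window=3):
    """Return ln(1 + q_{i,t+1} + ... + q_{i,t+window}) for one author."""
    total = sum(q.get((author, t + k), 0.0) for k in range(1, window + 1))
    return math.log1p(total)
```

Setting `window=5` gives the five-year alternative mentioned in the text; years without publications simply contribute zero.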

We expect recent productivity to better predict output over
the next three years than older output. To capture this idea,
we divide past output into two parts in the restricted models:
cumulative output until period t − 5, which captures i’s his-
torical production and is used as control variable, and output
from t − 4 until t, which represents i’s recent productivity
and is expected to be a strong predictor of future output. We
define recent output $q^r_{it}$ from $t-4$ to $t$ as

$$q^r_{it} = q_{it} + q_{i,t-1} + q_{i,t-2} + q_{i,t-3} + q_{i,t-4}.$$

Control variables in the restricted models $x_{it}$ include
cumulative output $q^c_{it}$ from the start $t_{i0}$ of i’s career until $t-5$:

$$q^c_{it} = q_{i,t_{i0}} + \cdots + q_{i,t-6} + q_{i,t-5},$$

where $t_{i0}$ is the year in which individual i obtained his or
her first publication. We use $\ln(1 + q^c_{i,t})$ and $\ln(1 + q^r_{i,t})$ as
regressors, since the distribution of both variables presents
fat tails. We also include the number of years $r_{it}$ with no
published article since i’s last article was published:

$$r_{it} = \begin{cases} 0 & \text{if } q_{it} > 0, \\ r_{i,t-1} + 1 & \text{otherwise,} \end{cases} \qquad r_{i,t_{i0}} = 0.$$
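
A small sketch may help make the three past-output controls concrete. The function name `restricted_controls` and the dictionary input are hypothetical; the logic follows the definitions of recent output, cumulative output, and the years-idle counter given above.

```python
def restricted_controls(q_by_year, first_year, t):
    """Return (recent, cumulative, years_idle) for one author at year t.

    q_by_year maps year -> output; missing years count as zero output.
    first_year is the year of the author's first publication.
    """
    def q(year):
        return q_by_year.get(year, 0.0)

    recent = sum(q(s) for s in range(t - 4, t + 1))           # years t-4 .. t
    cumulative = sum(q(s) for s in range(first_year, t - 4))  # first_year .. t-5

    years_idle = 0                       # the counter starts at 0 in first_year
    for s in range(first_year + 1, t + 1):
        years_idle = 0 if q(s) > 0 else years_idle + 1
    return recent, cumulative, years_idle
```

The regressors used in the restricted models would then be the logs of one plus `recent` and one plus `cumulative`, together with `years_idle`.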


and m(T ), also known as the truncation lag, is a number growing with T ,
the number of periods in the panel. The truncation lag has been chosen by
the BIC.

5 We do not consider citations because they often materialize long after a
paper has been published. This means that authors at the beginning of their
career often have a small citation record, so, for them at least, citations have
little predictive power.

6 To do so, we regress the KY index on commonly available information
of each journal listed in EconLit, such as the number of published articles per
year, the impact factor, the immediacy index, the Tinbergen Institute Index,
an economics dummy, interaction terms between the economics dummy
and the impact factor, and various citation measures. Estimated coefﬁcients
from this regression are then used to obtain a predicted KY journal quality
index for journals not in their list. Since most of the journals that KY omitted
are not highly ranked, their predicted quality index is quite small.

Variable $r_{it}$ is used as proxy for leave or retirement from academia:
the longer someone has not published, the more likely
he or she has retired or left research. Other controls include
career time dummies $c_{it}$ and year dummies $t$. To summarize,
$x_{it} = \{q^c_{it}, r_{it}, c_{it}, t\}$.

7 The predictive power of network variables is slightly higher under a
five-year window. Results are available in the online appendix.

8 We have considered alternative nonlinear models in which the dependent
variable does not have to be transformed, such as Poisson, negative
binomial, and zero-inflated negative binomial models. In terms of out-of-sample
RMSE, the specification that provides the best forecast is ln(X + 1),
which is the one we report here. See the online appendix for more details.


In the unrestricted models 1’ and 3’, we relax the constraint
imposed in $q^r_{it}$ and $q^c_{it}$. In these models, we consider thirteen
lags of the productivity variable:

$$y_{i,t-s} = \ln\left(1 + q_{i,t-s} + q_{i,t-s-1} + q_{i,t-s-2}\right) \quad \forall s = 0, \ldots, 12.$$

Within the giant component, we consider the following
two global proximity measures:9

• Closeness centrality $C^c_{i,t}$ is the inverse of the average
distance of a node to other nodes within the giant
component and is defined as

Control variables in the unrestricted models are the same
as in the restricted models but excluding past output.

Next we turn to the network variables. Given that we wish
to investigate whether network characteristics have predictive
power over and above that of recent productivity, network
variables must be constructed in such a way that they do
not contain information outside the time window of $q^r_{it}$. We
therefore define the five-year coauthorship network $G_{t,5}$ at
time t over the same time window as $q^r_{it}$ for the restricted
models, that is, using all joint publications from year $t-4$
to $t$. At time t, two authors i and j are said to have a link ($g_{ij,t} = 1$)
in $G_{t,5}$ if they have published jointly in an EconLit journal in years
$t-4$ to $t$. Otherwise, $g_{ij,t} = 0$.

For unrestricted models 2’ and 3’, we introduce different
coauthorship networks, Gt,s, where s determines the number
of years that a link between author i and her coauthor j lasts.
Par exemple, in network Gt,10, we assume that the effects
from a collaboration last during ten years, from t − 9 to t.

The set of network statistics that we construct from Gt,s is
motivated by the theoretical discussion of section II. Some
of the network statistics we include in our analysis are, sur
a priori grounds, more correlated with access to new scien-
tiﬁc ideas; others are included because they are thought to
have a high signaling potential. Measures of network topol-
ogy such as centrality and degree reﬂect network proximity
and thus belong primarily to the ﬁrst category, while other
measures, such as the productivity of coauthors, are likely to
have greater signaling potential.

Based on these observations, the list of network variables
that we use in the analysis is as follows. We say that there is
a path between i and j in Gt,s if gij,t = 1 at some period from
t − (s − 1) to t or there exists a set of distinct nodes j1, . . . , jm,
such that gij1,t = gj1j2,t = . . . = gjmj,t = 1. The length of such
a path is m + 1. The distance d(je, j; Gt,s) is the length of the
shortest path between i and j in Gt,s. We use the following
standard deﬁnitions:

• (First-order) degree is the number of coauthors that i
has in period t − (s − 1) to t, n1i,t = |Ni(Gt,s)|, où
Ni(Gt,s) = {j : gij,t = 1}.

• (Second-order) degree is the number of nodes at distance
je (Gt,s)|, où

2 from i in period t−(s−1) to t, n2i,t = |N 2
je (Gt,s) = {k : d(je, k; Gt,s) = 2}.
N 2

• Giant component: The giant component in Gt,s is the
largest subset of nodes such that there exists a path
between each pair of nodes in the giant component and
no path to a node outside. We create a dummy vari-
able that takes value 1 if an author belongs to the giant
component and 0 otherwise.

nt − 1
d(je, j; Gt,s)

Cc
je,t

(cid:6)

j(cid:2)=i

where nt is the size of the giant component in year t in
the coauthorship network Gt,s. Because Cc
je,t has fat tails,
we use ln(1 + Cc
je,t) as a regressor instead.

• Betweenness centrality $C^b_{i,t}$ is the frequency of the shortest paths passing through node $i$ and is calculated as

$$C^b_{i,t} = \sum_{j \neq k:\, j,k \neq i} \frac{\tau^i_{j,k}(G_{t,s})}{\tau_{j,k}(G_{t,s})},$$

where $\tau^i_{j,k}(G_{t,s})$ is the number of shortest paths between $j$ and $k$ in $G_{t,s}$ that pass through node $i$, and $\tau_{j,k}(G_{t,s})$ is the total number of shortest paths between $j$ and $k$ in $G_{t,s}$. In the regression analysis, we similarly use $\ln(1 + C^b_{i,t})$ as regressor.
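The topology measures defined above can be sketched on a toy coauthorship network as follows. This is only an illustrative Python implementation under stated assumptions, not the code used for the paper: the paper list, the author labels, and all function names are hypothetical; authors outside the giant component are assigned a closeness of 0; and the betweenness sum runs over unordered pairs $\{j, k\}$.

```python
from collections import deque
from itertools import combinations
import math


def build_network(papers, t, s=5):
    """Adjacency sets of G_{t,s}: i and j are linked if they share a paper in years t-(s-1)..t."""
    adj = {}
    for year, authors in papers:
        if t - (s - 1) <= year <= t:
            for a in authors:
                adj.setdefault(a, set())
            for a, b in combinations(authors, 2):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def path_counts(adj, source):
    """Breadth-first search: distance and number of shortest paths from source to each reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def second_order_degree(adj, i):
    """Number of nodes at distance exactly 2 from i."""
    dist, _ = path_counts(adj, i)
    return sum(1 for d in dist.values() if d == 2)


def giant_component(adj):
    """Largest connected set of nodes."""
    seen, best = set(), set()
    for node in adj:
        if node not in seen:
            component = set(path_counts(adj, node)[0])
            seen |= component
            if len(component) > len(best):
                best = component
    return best


def closeness(adj, i, giant):
    """(n_t - 1) divided by the sum of distances from i to the other giant-component nodes."""
    if i not in giant or len(giant) < 2:
        return 0.0  # assumption: 0 outside the giant component
    dist, _ = path_counts(adj, i)
    return (len(giant) - 1) / sum(dist.values())


def betweenness(adj, i):
    """Sum over pairs {j, k}, both different from i, of the share of shortest j-k paths through i."""
    dist_i, sigma_i = path_counts(adj, i)
    others = [n for n in dist_i if n != i]
    counts = {j: path_counts(adj, j) for j in others}
    total = 0.0
    for j, k in combinations(others, 2):
        dist_j, sigma_j = counts[j]
        if dist_j[i] + dist_i[k] == dist_j[k]:  # i lies on a shortest j-k path
            total += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return total


# Toy data: (year, authors). The 1990 paper falls outside the window for t = 2004.
papers = [(2001, ["A", "B"]), (2002, ["B", "C"]), (2003, ["C", "D"]),
          (1990, ["A", "D"]), (2004, ["F", "G"])]
adj = build_network(papers, t=2004)
giant = giant_component(adj)
degree_B = len(adj["B"])
closeness_B = closeness(adj, "B", giant)
betweenness_B = betweenness(adj, "B")
regressors_B = (math.log1p(closeness_B), math.log1p(betweenness_B))
```

In this toy network the window 2000 to 2004 yields the chain A, B, C, D plus the separate pair F, G, so the giant component has four members. The last line applies the ln(1 + X) transformation used for the regressors.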

Next, we define regressors that capture the productivity of coauthors and that of coauthors of coauthors. We apply the ln(X + 1) transformation to them as well:

• Productivity of coauthors is defined as the output of coauthors of author $i$ from $t-(s-1)$ to $t$,

$$q^1_{it} = \sum_{j \in N_i(G_{t,s})} q^r_{jt},$$

where $q^r_{jt}$ is the output of $j$ from period $t-(s-1)$ to period $t$ (excluding papers that are coauthored with $i$).

• Productivity of coauthors of coauthors is the output of coauthors of coauthors of author $i$ from $t-(s-1)$ to $t$,

$$q^2_{it} = \sum_{k \in N^2_i(G_{t,s})} q^r_{kt},$$

where $q^r_{kt}$ is the output of $k$ from $t-(s-1)$ to $t$, excluding papers that are coauthored with the neighbors of $i$, $N_i(G_{t,s})$.
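The two productivity regressors can be illustrated with a short Python sketch. The record layout `(year, authors, quality)`, the function name, and the toy values are assumptions for illustration only; in particular, each author is credited with the full quality of a paper, which is an assumption of this sketch rather than a statement about the output measure used in the estimation.

```python
import math


def coauthor_productivity(papers, i, t, s=5):
    """Return (q1, q2) for author i in G_{t,s}.

    papers: list of (year, authors, quality) records (hypothetical layout).
    """
    window = [(set(a), q) for (y, a, q) in papers if t - (s - 1) <= y <= t]

    # N_i: coauthors of i within the window.
    neighbors = set()
    for authors, _ in window:
        if i in authors:
            neighbors |= authors - {i}

    # N^2_i: coauthors of i's coauthors who are neither i nor a direct coauthor.
    second = set()
    for authors, _ in window:
        if authors & neighbors:
            second |= authors
    second -= neighbors | {i}

    # q1: output of each coauthor j, excluding papers written with i.
    q1 = sum(q for authors, q in window
             for j in neighbors if j in authors and i not in authors)
    # q2: output of each k in N^2_i, excluding papers written with i's coauthors.
    q2 = sum(q for authors, q in window
             for k in second if k in authors and not (authors & neighbors))
    return q1, q2


# Toy records (made-up values).
papers = [
    (2001, ["i", "a"], 1.0),
    (2002, ["a", "b"], 2.0),
    (2003, ["a"], 0.5),
    (2003, ["b"], 3.0),
    (2004, ["b", "c"], 1.5),
]
q1, q2 = coauthor_productivity(papers, "i", t=2004)
regressors = (math.log1p(q1), math.log1p(q2))
```

In the toy data, author `a` is the only coauthor of `i`, and `b` is the only author at distance 2; `c` is at distance 3 and therefore does not enter either sum.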

We also include a dummy variable that takes the value 1 for author $i$ if one of $i$’s coauthors in $G_{t,s}$ has an output $q^r_{jt}$ in the top 1% of the distribution of $q^r_{it}$.

In the restricted models, all the network variables are obtained using $G_{t,5}$, that is, combining all joint publications

9 For a careful discussion on the interpretation of centrality measures, see Wasserman and Faust (1994).


THE REVIEW OF ECONOMICS AND STATISTICS

Table 1.—Summary Statistics

                                    Mean    Std. Dev.   Correlation with
                                                        Future Productivity
Output
  Future productivity               .41     .99         1
  Past stock output                 1.62    1.44        .44
  Recent past output                .62     1.20        .69
Network variables
  Degree                            .58     1.21        .55
  Degree of order 2                 .90     3.12        .46
  Giant component                   .10     .30         .47
  Closeness centrality              .01     .02         .48
  Betweenness centrality            .50     2.29        .48
  Coauthors’ productivity           .59     1.40        .58
  Coauthors of coauthors’ prod.     .58     1.58        .54
  Working with top 1%               .01     .11         .34
Number of observations              1,697,415
Number of authors                   75,109

Network variables are computed assuming that a link between two authors lasts during five years (five-year network variables). The number of observations used to obtain the statistics for future output is 1,335,428, for recent past output it is 1,230,335, and for past stock output it is 1,132,248. All the correlation coefficients are obtained using the same number of observations, 872,344.

Figure 1.—A Scatter Plot of Future Output and Recent Past Output

from $t-4$ to $t$. In contrast, in the unrestricted models, we include network variables obtained using different periods of the coauthorship networks, from $G_{t,1}$ to $G_{t,15}$. The number of network periods is selected according to the BIC.
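A rough sketch of BIC-based selection is given below. It assumes the common Gaussian linear-model form $\mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k \ln n$ (up to an additive constant), with all candidates fitted on the same $n$ observations; the function names and the residual sums of squares are made up for illustration and do not come from the estimation.

```python
import math


def bic(rss, n_obs, n_params):
    """BIC of a Gaussian linear model, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


def select_lag(rss_by_lag, n_obs, n_controls):
    """Pick the lag length with the smallest BIC.

    rss_by_lag maps a candidate number of lags s to the residual sum of
    squares of the model with s lags (hypothetical inputs).
    """
    scores = {s: bic(rss, n_obs, n_controls + s) for s, rss in rss_by_lag.items()}
    return min(scores, key=scores.get), scores


# Made-up residual sums of squares for three candidate lag lengths.
rss_by_lag = {1: 520.0, 2: 500.0, 3: 499.5}
best_lag, scores = select_lag(rss_by_lag, n_obs=1000, n_controls=5)
```

With these illustrative numbers the third lag lowers the residual sum of squares too little to offset the $\ln n$ penalty on the extra parameter, so two lags are retained.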

B. Descriptive Statistics

Table 1 provides summary statistics of the variables included in the analysis. Column 1 provides the mean value of each variable, column 2 the standard deviation, and column 3 correlations between the different variables and future productivity.

For the restricted model, we excluded observations relative to authors in the earliest stage of their career, for whom $c_{it} < 6$. The reason is that these authors have not yet established a publication record and network, so there is little information on which to form predictions of future output. This assumption is relaxed in the unrestricted models, where we consider the full sample, 1,335,428 observations, after replacing the missing lagged productivity and network variables by 0s. The rationale for doing so is that authors who have just started their career have no past output and coauthorship, hence the value of their lagged productivity and network variables is truly 0.

We draw attention to some distinctive features of the data. First, we observe that the variance in future output $q^f_{it}$ is large, with a standard deviation 2.41 times larger than the mean. There is a high, positive correlation of 0.69 between recent output $q^r_{it}$ and future output $q^f_{it}$. Figure 1 shows a scatter plot and a linear regression line with the confidence interval between $q^f_{it}$ and $q^r_{it}$ for 1,000 randomly selected observations. This visually confirms that, as anticipated, recent past output has a strong predictive power on future output.

Second, we observe a high correlation between $q^f_{it}$ and several five-year network variables such as coauthors’ output $q^1_{it}$, author degree, and closeness and betweenness centrality. The network variable most highly correlated with future productivity is the productivity of $i$’s coauthors, $q^1_{it}$, with a correlation coefficient of 0.58. Other network variables such as degree, closeness, and betweenness centrality are also highly correlated with future output $q^f_{it}$. Figure 2 shows the relationship between some five-year network variables and future output.

IV. Empirical Findings

We have seen a reasonably strong correlation between future output and recent past output, but also between future output and the characteristics of $i$’s recent coauthorship network. We now turn to a multivariate analysis and estimate the different models outlined in section II.
We start by presenting the results on the predictive power of recent past output. We then examine the relation between the productivity of an individual author and the predictive power of network variables.

A. Predicting Future Output

Table 2 presents the prediction results for model 0, the baseline model with controls $x_{it} = \{q^c_{it}, r_{it}, c_{it}, t\}$; model 1, which includes recent output $q^r_{it}$; and model 2, which includes a network variable, one per regression. Column 1 presents the $R^2$ of the regression on the in-sample data for each model. Column 2 shows the out-of-sample RMSE for each model. Column 3 compares the RMSE of model 1 or model 2 with the benchmark model, model 0. Column 4 shows the coefficient of each regressor.

Recent output $q^r_{it}$ explains slightly less than half of the variation in future output $q^f_{it}$. Half of the variation in $q^f_{it}$ (around 51% of the total variation) remains unexplained after we take $q^r_{it}$ into account. The question is whether we can improve on this using network variables.

We begin by examining the predictive power of the different network variables when one network variable is added to controls $x_{it}$. This is achieved by comparing the results from the model 2 regressions with model 0. Results, presented in table 2, show that coauthors’ productivity $q^1_{it}$, closeness centrality $C^c_{i,t}$, and the productivity $q^2_{it}$ of coauthors of coauthors are statistically significant and help predict future output. However, the predictive power is much less than that of recent output: for example, coauthors’ productivity reduces the RMSE by 9.38%, whereas recent output reduces the RMSE by 15.72%.

Figure 2.—Scatter Plots of Future Productivity on Closeness Centrality and Coauthors’ Productivity
(Two panels: future output plotted against closeness centrality and against coauthors’ productivity, each with fitted values and a 95% CI.)

Table 2.—Prediction Accuracy: Restricted Models 1 and 2

                                          R2     RMSE    RMSE Differential   Coefficients
Model 0
  Past output                             .28    .789    –                   .22***
Model 1
  Recent past output                      .49    .665    15.72%***           .49***
Model 2
  Degree                                  .38    .728    7.73%***            .29***
  Degree of order 2                       .36    .744    5.70%***            .10***
  Giant component                         .35    .748    5.20%***            1.05***
  Closeness                               .36    .743    5.83%***            22.96***
  Betweenness                             .38    .734    6.97%***            .11***
  Coauthors’ productivity                 .41    .715    9.38%***            .30***
  Coauthors of coauthors’ productivity    .39    .727    7.86%***            .24***
  Working with a top 1%                   .36    .746    5.45%***            1.75***

Significant at ***1%, **5%. Model 0 includes career time dummies, year dummies, number of years since the last publication, and cumulative productivity from the first publication until t − 5. Model 1 adds to model 0 recent output. Model 2 adds to model 0 one of the network variables. Each network variable is computed assuming that a link from a collaboration lasts during five years (five-year network variable). The number of in-sample observations is 436,440.

Table 3.—Prediction Accuracy: Restricted Models 1 and 3

                                          R2     RMSE    RMSE Differential   Coefficients
Model 0
  Past output                             .28    .789    –                   .22***
Model 1
  Recent past output                      .49    .665    15.72%***           .49***
Model 3
  Degree                                  .50    .660    16.35%***           .09***
  Degree of order 2                       .50    .660    16.35%***           .03***
  Giant component                         .50    .662    16.10%***           .27***
  Closeness                               .50    .660    16.35%***           13.89***
  Betweenness                             .50    .657    16.73%***           .06***
  Coauthors’ productivity                 .50    .660    16.35%***           .09***
  Coauthors of coauthors’ productivity    .50    .660    16.35%***           .07***
  Working with a top 1%                   .50    .660    16.35%***           .59***

Significant at ***1%, **5%. Model 0 includes career time dummies, year dummies, number of years since the last publication, and cumulative productivity from the first publication until t − 5. Model 1 adds to model 0 recent output. Model 3 adds to model 1 one of the network variables. Each network variable is computed assuming that the effects from a collaboration last during five years (five-year network variable). The number of in-sample observations is 436,440.

Table 4.—Prediction Accuracy of the Restricted Multivariate Models

                           R2      RMSE    RMSE Differential
Model 0                    .278    .789    –
Model 1                    .493    .665    15.72%***
Multivariate model 2       .433    .700    11.28%***
Multivariate model 3       .509    .654    17.11%***

Significant at ***1%. These restricted models include only five-year network variables. The number of in-sample observations is 436,440.

We then combine recent output $q^r_{it}$ and network variables in model 3. Results presented in table 3 show that the same network variables remain significant once we include $q^r_{it}$ as regressor. Being significant does not imply that network variables are very informative, however. For this, we have to examine the improvement in prediction that they represent. We compare multivariate model 3, that is, with multiple network variables in the regression, to model 1. Table 4 shows that the $R^2$ of model 3 is greater than the $R^2$ obtained under model 1. This means that network information taken in combination with recent output yields a more accurate prediction than a prediction based on past output alone. The gain in explanatory power is small, however: the $R^2$ rises from 0.49 in model 1 to 0.51 in model 3.
In line with this, the RMSE declines from 0.67 to 0.65 when we incorporate network information. This small difference is statistically significant, as shown by the Diebold-Mariano test.

Table 5.—Prediction Accuracy: Unrestricted Models 1’ and 2’

                                          Lag Length   R2     RMSE    RMSE Differential   Coefficients
Model 1’
  Recent past output                      13           .39    .773    –                   .44***
Model 2’
  Degree                                  15           .24    .861    −11.38%***          .10***
  Degree of order 2                       14           .23    .867    −12.16%***          .05***
  Giant component                         15           .23    .868    −12.29%***          .96***
  Closeness                               15           .24    .862    −11.51%***          1.42
  Betweenness                             15           .26    .849    −9.83%***           .07**
  Coauthors’ productivity                 12           .29    .833    −7.76%***           .11***
  Coauthors of coauthors’ productivity    15           .27    .847    −9.57%***           .09***
  Working with a top 1%                   14           .24    .862    −11.51%***          .45***

Significant at ***1%, **5%. Model 1’ includes career time dummies, year dummies, number of years since the last publication, and thirteen lags of the productivity variable. Model 2’ contains career time dummies, year dummies, number of years since the last publication, and several lags of a network variable. The maximum lag length for each model is selected using the BIC. For the network variables, the maximum possible lag length considered is 15. The coefficients presented in the table correspond to the first lag of the variable. The number of in-sample observations is 667,423.

Table 6.—Prediction Accuracy: Unrestricted Models 1’ and 3’

                                          Lag Length   R2     RMSE    RMSE Differential   Coefficients
Model 1’
  Past output                             13           .39    .773    –                   .44***
Model 3’
  Degree                                  6            .40    .768    .65%***             .14***
  Degree of order 2                       5            .40    .768    .65%***             .06***
  Giant component                         8            .40    .768    .65%***             .58***
  Closeness                               10           .40    .767    .78%***             2.35**
  Betweenness                             9            .40    .767    .78%***             .02
  Coauthors’ productivity                 12           .41    .761    1.55%***            .09***
  Coauthors of coauthors’ productivity    11           .41    .764    1.16%***            .07***
  Working with a top 1%                   13           .40    .767    .78%***             .39***

Significant at ***1%, **5%. Model 1’ includes career time dummies, year dummies, number of years since the last publication, and thirteen lags of the productivity variable. Model 3’ adds to model 1’ several lags of a network variable. The maximum lag length is selected using the BIC. For the network variables, the maximum possible lag length considered is 15. The coefficients presented in the table correspond to the first lag of the variable. The number of in-sample observations is 667,423.

Table 7.—Prediction Accuracy of the Unrestricted Multivariate Models

                           Lags    R2       RMSE     RMSE Differential
Model 1’                   13      0.395    0.773    –
Multivariate model 2’      15      0.322    0.814    −5.30%***
Multivariate model 3’      8       0.417    0.758    1.94%***

Significant at ***1%. For multivariate model 3’, we consider eight lags for each network variable and thirteen lags of the output. The lag length is selected according to the BIC; for the multivariate models, we considered as candidate models only those where each network variable has the same number of lags. The number of in-sample observations is 667,423.

Table 5 presents the prediction results for the benchmark unrestricted model 1’ and model 2’. Model 1’ contains thirteen lags of the productivity variable and the same control variables as in the restricted models except past output. Model 2’ includes the control variables without past output and several lags of a network variable. Column 1 presents the lag length of each variable; the rest of the columns are analogous to table 2. The predictions obtained from the unrestricted models are consistent with their restricted versions. The network variable with the highest predictive power is coauthors’ productivity, with an RMSE 7.76% greater than the past output model, model 1’.
Similar results obtain on the effects of networks when we compare models 1’ and 3’, as reported in table 6. As shown in table 7, the predictive power of network variables over and above information on past output is slightly higher when we consider the unrestricted version, that is, when we include several lags of the network variables. In the restricted multivariate models, the RMSE is reduced by 1.65% when we add network variables to past and recent output, while in the unrestricted version, the reduction is around 1.94%.

From this we conclude that network variables contain predictive information over and above what can be predicted on the basis of past output, but this information gain is modest.

B. Networks and Career Cycle

Next we estimate the predictive power of network variables for different career time $c_{it}$. The RMSE of restricted models 0, 1 and multivariate models 2 and 3 (with multiple network variables included in the regression) as well as the RMSE of unrestricted model 1’ and multivariate models 2’ and 3’ are plotted in figures 3 and 5, respectively. Career age $c_{it}$ is on the horizontal axis, while RMSE is measured on the vertical axis. Unsurprisingly, the figures show that the predictive accuracy of all the models improves (reflected in the decline in RMSE) with career time. This is primarily because the control variables $x_{it}$, particularly cumulative output $q^c_{it}$, reveal more information about individual ability and preferences over time.

Figure 3.—RMSE Out-of-Sample Across Career Time: Restricted Models
(RMSE of models M0, M1, MV2, and MV3 against career time t, from 5 to 20.)

Figure 4.—RMSE % Difference across Career Time: Restricted Models
(RMSE % difference of MV2 and of MV3 relative to M1, against career time t. According to the Diebold-Mariano test, the difference between the RMSE of multivariate model 3 and model 1 is statistically significant for every career time year.)

Figure 5.—RMSE Out-of-Sample across Career Time: Unrestricted Models
(RMSE of models M1’, MV2’, and MV3’ against career time t, from 5 to 20.)

Figure 6.—RMSE % Difference across Career Time: Unrestricted Models
(RMSE % difference of MV2’ and of MV3’ relative to M1’, against career time t. According to the Diebold-Mariano test, the differences between the RMSE of multivariate model 3’ and model 1’ are insignificant for t = 12 and from t = 14 to t = 20.)

To examine whether the relative predictive gain of network variables varies with career time, we report in figures 4 and 6 the difference in RMSE between multivariate models 2 and 3 versus model 1 and the difference in RMSE between their unrestricted versions, respectively. We note a marked decline in the difference between models 1’ and 3’ over the course of a researcher’s career. After time t = 14, the prediction accuracy of models with or without network variables becomes virtually indistinguishable. The Diebold-Mariano test shows that the differences between multivariate model 3’ and model 1’ are not statistically significant from t = 14 to t = 20. In the restricted models, figure 4, the decline in the predictive power of network variables is not observed until t = 15.10 This indicates that for senior researchers, network variables contain little information over and above the information contained in past and recent output.

What does this pattern in the data suggest about the relative importance of the two potential ways in which networks may matter: flow of ideas and signaling? As time passes, the publication record of a researcher builds up. Since ability, research ambition, and other personality traits are relatively stable over time, this accumulating evidence ought to provide a more accurate estimate of the type of the person. Hence, it should become easier to judge his or her ability and research ambition on the basis of the publication record alone. Based on this, we would expect that the signaling value of networks decreases over time, and hence that network variables have less and less additional predictive power.

Research networks can, however, be important conduits of valuable research ideas as well. Unlike the signaling value of networks, access to new research ideas remains important throughout a researcher’s career. Thus, if network variables help predict future output because they capture access to new ideas, their predictive value should remain relatively unchanged over a researcher’s career. This is not what we observe, leaving signaling as a stronger contender as the possible channel by which network variables help predict future productivity.

10 The fact that the predictive power of networks is still significant for mature authors in the restricted model analysis might be a consequence of including inactive authors in the sample, that is, those who do not publish regularly. As an inactive author matures, future output and network variables are both more likely to be 0 due to the reduction of output prior to retirement, so the predictive power of networks does not dissipate. Indeed, we find that if we restrict the analysis to active authors (authors with positive recent output), the predictive power of networks in the restricted model is negligible after the authors have more than fifteen years of experience.

C. Network Information across Productivity Categories

In this section we examine whether the predictive power of network information varies systematically with recent output $q^r_{it}$. This analysis is predicated on the idea that it takes talent and dedication to transform the new ideas conveyed by the research network into publishable output. Consequently, we expect the predictive power of network variables to increase with ability, and hence with $q^r_{it}$, at least over a certain range.

To investigate this possibility, we divided the observations into five tier groups on the basis of their recent output $q^r_{it}$. The top category includes authors in the top 1% in terms of $q^r_{it}$. The second category includes authors in the 95–98 percentiles of $q^r_{it}$. The third category covers authors in the 90–94 percentiles, the fourth includes authors in the 80–89 percentiles, and the last category is for authors in the 50–79 percentiles.11

Figure 7 shows the RMSE % difference between models 1 and 2 versus model 0 across the different categories. The RMSE % differences are always positive because the restricted benchmark model, model 0, is nested in models 2 and 1; thus, it is very likely that models 2 and 1 have a predictive power greater than model 0. For the most productive authors, those above the 99th percentile, network variables have predictive power in explaining future research output but much less than recent output. For the next category of researchers, those in the 95–98 percentile range, network information has greater predictive power. Even more striking, for researchers in the third category, the 90–94 percentile range, network variables are better at predicting future research output than $q^r_{it}$! All the models have statistically significant predictive power across the different tiers.

By contrast, network information has small but significant predictive power for low-productivity individuals (those in the 50–79 percentile range). This suggests that for researchers with low ability or research ambition, having published with high-quality coauthors has little informative content regarding their future output, perhaps because they are unable to take advantage of the access to information and research ideas that good coauthors provide. Similar patterns are observed when we compare the RMSE of unrestricted model 2’ versus model 1’.

V. Robustness

We have conducted an extensive investigation into the robustness of our results to various assumptions made in constructing the variables used in the estimation. The results of this analysis are summarized here; the details, not shown here to save space, are available in the online appendix.

In the analysis so far, we have used accumulated productivity from t + 1 to t + 3 as the variable $q^f_{it}$ we seek to predict (see equation [2]). The rationale for doing so is that the distant future is presumably harder to predict than the immediate future, and we want to give the model a fair chance. Yet in economics, there are long lags between the submission and publication of a paper and wide variation in these lags across papers and journals. Publication lags thus introduce additional variation in the variable we are trying to predict and may thus lead us to underestimate the predictive power of network information. To check whether this is affecting our results, we repeat the analysis using future productivity accumulated over a five-year window instead of three years: $q^f_{it} = q_{i,t+1} + q_{i,t+2} + q_{i,t+3} + q_{i,t+4} + q_{i,t+5}$, and, as before, we use $\ln(1 + q^f_{it})$ as the variable we seek to predict. Results are similar to those reported here except that the predictive power of network variables is larger using a five-year window. In particular, network variables are even more useful than past output to forecast the future performance of a researcher, that is, multivariate model 2’ outperforms model 1’.

Next we investigate whether results are sensitive to our definition of output $q_{it}$.
We examine whether different results obtain if we correct for article length and number of coauthors. Results show that the predictive power of network variables is unaffected.12

Finally, the main specification used so far is a linear model estimated by OLS in which the dependent variable is a logarithmic transformation of future research output, $\ln(q^f_{it} + 1)$. We are concerned that the model might be misspecified by restricting ourselves to OLS applied to this particular functional form. We therefore repeat the analysis with nonlinear regression models frequently used to study research output or citations, such as the Poisson model, the negative binomial model, and the zero-inflated negative binomial model. Results show that the in-sample log likelihood is higher for the (zero-inflated) negative binomial model than for the linear model applied to the ln(y + 1) transformation. However, the out-of-sample RMSE is lowest for the linear model. As the linear model is also easy to interpret and evaluate, we use it as our main specification.

We also consider panel data models. Fixed-effect models are not useful to predict the productivity of junior researchers, so we do not pursue them further.13 We also investigate the predictive power of vector autoregressive (VAR) models where past network variables affect future output and past output influences future network variables. We estimate such VAR models using a seemingly unrelated regressions (SUR) approach, allowing for correlation in the error terms across the two equations. The lag length of each equation is selected using the BIC. The SUR regressions should in principle lead to more efficient predictions as long as the two equations do not include the same set of lagged variables, a condition that is fulfilled here. Results show that the predictions generated by the unrestricted SUR model 3 using feasible generalized least squares (FGLS) hardly differ from the unrestricted model 3 estimated using simple OLS. Therefore, the SUR model does not outperform, out of sample, the simple OLS.

11 We do not consider authors below the median because the median recent output is 0.

12 See the online appendix for more details.

13 Results from panel data regressions are available in the online appendix.

Figure 7.—RMSE % Difference between Restricted Models across Productivity Tiers
(Five panels: Tier 1 (>99%), Tier 2 (95−98%), Tier 3 (90−94%), Tier 4 (80−89%), and Tier 5 (50−79%). Each panel compares model 1 with model 2 for betweenness, coauthors’ productivity, coauthors of coauthors’ productivity, degree, degree of order 2, and working with a top 1% coauthor.)


VI. Conclusion

In this paper, we have examined whether a researcher’s
coauthor network helps predict their future output. Under-
lying our study are two main ideas. The ﬁrst idea is that a
collaboration resulting in a published article reveals valuable
information about an author’s ability and research ambitions.
This is particularly true for junior researchers whose type
cannot be fully assessed from their cumulative output. The
second idea is that professional research networks provide
access to new research ideas. These ideas can subsequently
be turned into published papers provided the researcher
possesses the necessary ability and dedication.

To investigate these ideas, we examine coauthorship in
economics. Our focus is not on statistical significance or
causality but rather on predictive power. For this reason,
we adopt a methodology that eliminates data mining and
minimizes the risk of pretesting bias. To this effect, we ran-
domly divide the data into two halves. Parameter estimates
are obtained with one-half and predictions are judged by how
well they perform in the other half of the sample.
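The split-sample procedure and the RMSE differential reported in the tables can be sketched as follows. This is a stylized Python illustration on simulated data with a single regressor, not the estimation code used for the paper: the data-generating process, the seed, and the function names are assumptions, and the Diebold-Mariano test is not implemented here.

```python
import math
import random


def fit_ols(xs, ys):
    """Closed-form OLS for one regressor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def rmse_differential(rmse_benchmark, rmse_model):
    """Percentage reduction in RMSE relative to the benchmark model."""
    return 100.0 * (rmse_benchmark - rmse_model) / rmse_benchmark


# Simulated observations (x, y); purely illustrative.
rng = random.Random(0)
data = []
for _ in range(2000):
    x = 3.0 * rng.random()
    data.append((x, 0.5 * x + rng.gauss(0.0, 0.5)))

# Randomly divide the data into two halves: estimate on one, evaluate on the other.
rng.shuffle(data)
train, test = data[:1000], data[1000:]
intercept, slope = fit_ols([x for x, _ in train], [y for _, y in train])
mean_y = sum(y for _, y in train) / len(train)

actual = [y for _, y in test]
rmse_benchmark = rmse([mean_y] * len(test), actual)                 # no regressor
rmse_model = rmse([intercept + slope * x for x, _ in test], actual)  # with regressor
gain = rmse_differential(rmse_benchmark, rmse_model)
```

Applied to the figures in table 2, `rmse_differential(0.789, 0.665)` gives about 15.72, the differential reported for model 1 relative to model 0. A negative value, as in table 5, means the model predicts worse out of sample than the benchmark.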

We ﬁnd that information about someone’s coauthor net-
works leads to a modest improvement in the forecast accuracy
of their future output over and above what can be predicted
from their past output. The network variables that have the
most information content are the productivity of coauthors,
closeness centrality, and the number of past coauthors. These
results are robust to alternative speciﬁcations and variable
deﬁnitions.

We investigate whether the predictive power of network
variables is stronger for more talented researchers, as would
be the case if taking advantage of new ideas requires talent
and dedication. We ﬁnd that the predictive value of network
variables is nonmonotonic with respect to past productivity.
Network variables do not predict the future productivity
of individuals with below-average initial productivity. They
are somewhat informative for individuals in the highest
past productivity tier group. But they are most informative
about individuals in between. In fact, for these individuals,
networks contain more information about their future
productivity than recent research output.

The work presented here leaves many questions unanswered.
In particular, we do not claim to have identified a
causal effect of coauthorship or network quality on future out-
put. If anything, the signaling hypothesis is based on a reverse
causality argument, and it receives the most support from our
analysis. However, we also find evidence that network connections
are most useful to talented researchers. This result
is consistent with a causal relationship between the ﬂow of
research ideas and future output, with the caveat that talent
is needed to turn ideas into publishable papers.

