RESEARCH ARTICLE - Ricerca sull'intelligenza artificiale specializzata al MIT

RESEARCH ARTICLE

Center–periphery structure in
research communities

Eleanor Wedell1

, Minhyuk Park1

, Dmitriy Korobskiy2

Tandy Warnow1

, and George Chacko1,3

a n o p e n a c c e s s

j o u r n a l

1Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801
2NTT DATA, McLean, VA, 22102
3Office of Research, Grainger College of Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801

Citation: Wedell, E., Park, M., Korobskiy,
D., Warnow, T., & Chacko, G. (2022).
Center–periphery structure in research
communities. Quantitative Science
Studi, 3(1), 289–314. https://doi.org/10
.1162/qss_a_00184

DOI:
https://doi.org/10.1162/qss_a_00184

Supporting Information:
https://doi.org/10.1162/qss_a_00184

Received: 14 novembre 2021
Accepted: 7 Gennaio 2022

Corresponding Authors:
George Chacko
chackoge@illinois.edu
Tandy Warnow
warnow@illinois.edu

Handling Editor:
Ludo Waltman

Copyright: © 2022 Eleanor Wedell,
Minhyuk Park, Dmitriy Korobskiy,
Tandy Warnow, and George Chacko.
Pubblicato sotto Creative Commons
Attribuzione 4.0 Internazionale (CC BY 4.0)
licenza.

The MIT Press

Keywords: bibliometrics, community finding, clustering, exosome, extracellular vesicles

ABSTRACT

Clustering and community detection in networks are of broad interest and have been the
subject of extensive research that spans several fields. We are interested in the relatively
narrow question of detecting communities of scientific publications that are linked by
citations. These publication communities can be used to identify scientists with shared
interests who form communities of researchers. Building on the well-known k-core algorithm,
we have developed a modular pipeline to find publication communities with center–periphery
structure. Using a quantitative and qualitative approach, we evaluate community finding
results on a citation network consisting of over 14 million publications relevant to the field of
extracellular vesicles. We compare our approach to communities discovered by the widely
used Leiden algorithm for community finding.

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

INTRODUCTION

We are interested in the structure of research communities that form and collaborate around
research questions (Crane, 1972; Kuhn, 1970). Such research communities represent scientific
specialization (Chubin, 1976; Morris & der Veer Martens, 2009). While prior studies have esti-
mated the size of these research communities to be of the order of a few hundreds to a thou-
sand (Crane, 1972; Kuhn, 1970; Morris, 2005; Mullins, 1985; Price, 1965), this question merits
re-examination in an expanded, diversified, globalized, and electronically connected scien-
tific enterprise. Accordingly, we would like to study research communities at scale.

From the perspective of scientometrics, detecting a research community can be framed as a
community-finding problem in which communities of publications defining areas of research
are discovered from the scientific literature. Primo, the scientific literature is modeled as a net-
work with publications as nodes and citations as directed edges (Boyack & Klavans, 2019). In
such a network, an area of research is defined by a community of publications—a sufficiently
citation-dense area in the network. Because researchers can work on more than one problem
and be members of more than one research community, we begin by finding communities of
publications. Then, for each publication community, the authors of the publications in the
community represent a researcher community.

Once constructed, citation networks can be analyzed using different community finding or
clustering approaches (Ahlgren, Chen et al., 2020; Boyack & Klavans, 2019; Sciabolazza,

Center–periphery structure in research communities

Vacca et al., 2017; Sjögarde & Ahlgren, 2018, 2020; Šubelj, van Eck, & Waltman, 2016; Traag,
Waltman, & van Eck, 2019; Waltman & van Eck, 2012). Of these approaches, a recent
development is the availability of the Leiden algorithm, which offers better partitioning and
performance (Traag et al., 2019) compared to a preceding approach, the Louvain algorithm
(Blondel, Guillaume et al., 2008).

The rich literature on community finding in complex networks is relevant (Fortunato,
2010; Fortunato & Castellano, 2009; Javed, Younis et al., 2018), particularly with modularity
being proposed as a quality function (Newman, 2006). Here, community detection amounts
to identifying groups within a complex network that share some common properties. How-
ever, as observed by Coscia, Giannotti, and Pedreschi (2011), imprecision is inherent in the
definition of community-finding, given the diverse ways in which communities can be
defined and a richness of perspectives. Per esempio, a community detection approach
may focus on vertex similarity or edge density; disjoint, overlapping or hierarchical commu-
nity structure; and static versus dynamic networks. Così, the context of the question being
asked and the techniques being employed tends to determine the flavor of community detec-
tion in a study. In this study, we build upon this literature in focusing on edge density and
disjoint communities.

Beyond detection, we are especially interested in substructure (structure within communi-
ties), as it reflects roles and dynamics among members. Price and Beaver (1966), in their study
of the oxidative phosphorylation community, reported center–periphery substructure (Breiger,
2014): a small core of influential researchers and a much larger transient population. Core–
periphery or center–periphery patterns have been reported in other networks using different
techniques, such as block modeling and k-core decomposition (Borgatti & Everett, 2000;
Gallagher, Young, & Welles, 2021; Malliaros, Giatsidis et al., 2019; Mullins, Hargens et al.,
1977; Rombach, Porter et al., 2014, 2017), arguing for generality in their occurrence.

We have briefly explored substructure in an earlier study (Chandrasekharan, Zaka et al.,
2021) where we developed an ensemble technique that combined the Leiden algorithm
and the Markov Clustering (MCL) (Van Dongen, 2008) community-finding methods. IL
method was coupled to limited qualitative analysis and identified publication communities
that were discovered by both Leiden and MCL. Subsequently, we identified the author com-
munities of these publication communities in networks of biology literature. We were specif-
ically interested in whether the communities we found exhibited substructure indicating the
strong influence of a few researchers associated with the majority of publications. While we
did detect center–periphery substructure in both publication and author communities, our
findings were potentially limited by the clustering methods we used, Leiden and MCL, COME
neither is designed to detect substructure when identifying communities.

Here, we aim to investigate more carefully whether communities exhibiting center–
periphery structure exist in citation networks of the scientific literature. The central idea is that
each community contains core nodes representing the “center” of center–periphery organiza-
tion and additional noncore nodes representing the “periphery”; our approach first finds the
core nodes and then augments the cluster to include peripheral nodes.

To find the core nodes in a community, we combine two concepts represented in the clus-
tering and community detection literature: Primo, that valid communities should have a positive
modularity score (Fortunato & Barthelemy, 2007), and second that each community should be
dense, which is expressed by the ratio between the average node degree and the number of
nodes in the network. By distinguishing between core and noncore nodes, we can require that
the positive modularity and sufficient citation density requirements hold for the subclusters of

Quantitative Science Studies

290

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Center–periphery structure in research communities

core nodes but not necessarily for the clusters that combine both core and noncore nodes. Questo
distinction potentially enables us to better model real world community structure.

Although optimizing the total modularity in a clustering has drawbacks (Fortunato &
Barthelemy, 2007), we only require that the clusters individually have modularity scores above
0; this is a relatively mild criterion that seeks internal cohesion, and has been considered in the
prior literature (per esempio., Fortunato & Barthelemy, 2007; Newman & Girvan, 2004) to be evidence
of a valid community.

Così, our approach combines five ideas from the literature: the scientific enterprise orga-
nized into research communities, a center–periphery model, researcher community structure
extended to a model of publication communities, positive modularity for individual commu-
nities, and communities defined by sufficient citation density.

Our modular pipeline is centered around finding dense clusters of core nodes, building on
the ideas of Giatsidis, Thilikos, and Vazirgiannis (2011) who quantified the cohesiveness of a
community using the k-core concept from graph theory (Matula & Beck, 1983). We tested this
pipeline on a network of over 14 million publications that we constructed by harvesting citing
and cited articles from a seed set defined by the keyword exosome. This keyword captures
articles from the field of extracellular vesicles, which may be important for intercellular
communication and development of some diseases, as well as having potential for therapy
(Edgar, 2016; Kalluri & LeBleu, 2020; Raposo, van Niel, & Stahl, 2021). We chose extracellular
vesicles (Raposo et al., 2021) as the focus of this study for two reasons: Primo, it is a large
area di ricerca, and second, it is rapidly expanding—it has seen spectacular numbers of
publications each year since 2010, and therefore represents an excellent test case for
community-finding methods in the modern scientific enterprise.

Expecting that not all communities discovered in such a large network would be directly
relevant to exosomes, noi usiamo 1,218 cited references from 12 recent review articles as expert-
identified markers for specificity in the communities we discover. Any community containing
at least one marker node is considered relevant and communities with significant numbers of
markers are considered to be special interest. We report our findings in the following sections.

2. MATERIALS AND METHODS

2.1. Data

Citation networks of scientific literature can be constructed using different approaches. Direct
citations were used to build clusters of articles from a data set of over 10 million publications
(Waltman & van Eck, 2012); this methodology was also used in building citation maps from 19
E 43 million publications (Boyack & Klavans, 2014). Direct citation, bibliographic coupling,
and cocitation have been compared for their relative value in identifying research fronts
(Boyack & Klavans, 2010), with a hybrid approach involving bibliographic coupling and
textual similarity performing the best. A subsequent study conducted at a larger scale and with
improved evaluation criteria suggested that direct citation was the most promising (Klavans &
Boyack, 2017). We use direct citations in this study.

A citation network consisting of 14,695,475 nodes and 99,663,372 edges was generated
using the Dimensions bibliography (Hook, Porter, & Herzog, 2018). Briefly, a “seed” set, S,
was obtained by performing a text search for the term “exosome” with years of publication
restricted to 2010 or earlier. This constraint was applied to allow every element in the seed
set to have accumulated at least 10 years of citations. The search retrieved 11,156 publications
of type article from Dimensions. To capture publications proximal by citation to the seed set, UN

Quantitative Science Studies

291

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Center–periphery structure in research communities

network was constructed using a protocol we labeled SABPQ. First we start with the seed set S.
Set A is the set of publications that cite at least one publication from set S. Allo stesso modo, set B is the
set of publications that are cited by at least one publication from set S. Once the sets S, UN, E
B are identified, we define set P as the set of publications that cite at least one publication from
the set S ∪ A ∪ B. Allo stesso modo, set Q is the set of publications that are cited by at least one pub-
lication from the set S ∪ A ∪ B. Così, S and A are subsets of P and B is a subset of Q. IL
network contains directed edges defined by citations; if publication x cited y then we created
an edge from x to y. The SABPQ protocol was implemented using Dimensions on BigQuery in
Google Cloud Services. The data was then exported to Google Cloud Storage Bucket and sub-
sequently exported to a PostgreSQL database for further analysis.

2.1.1. Marker nodes and specificity

As marker nodes for our analysis, we used 1,218 articles cited in 12 recent reviews on extra-
cellular vesicles and exosome biology (Busatto, Morad et al., 2021; Clancy, Schmidtmann, &
D’Souza-Schorey, 2021; Ghoroghi, Mary et al., 2021; Lui, Hamby, & Jin, 2021; Kalluri &
LeBleu, 2020; Lananna & Imai, 2021; Le Lay, Rome et al., 2021; Leidal & Debnath, 2021;
Raposo et al., 2021; Schnatz, Müller et al., 2021; van Niel, D’Angelo, & Raposo, 2018; Verdi,
Bécot et al., 2021). Tutto 1,218 markers are present in our 14,695,475 nodes network. Marker
nodes were matched to clusters using identifiers in our network. Per esempio, the marker node
with title “Tumour exosome integrins determine organotropic metastasis” and DOI 10.1038/
nature15756 is identified as node 4431204 in our network. The complete list of marker nodes
is available on our Github site (Park et al., 2021).

2.2. Clustering Methods

2.2.1.

Leiden

We used version 1.1.0 of the Java implementation for the Leiden algorithm (Traag et al., 2019)
provided by the Centre for Science and Technology Studies and available in Github (Traag,
2021). We ran Leiden in default mode, which means that the quality function being optimized
was the Constant Potts Model rather than modularity. Leiden includes a parameter for the res-
olution value, which we vary in our experiments from 0.0001 A 0.95.

2.2.2. New clustering methods

Here we describe at a high level the clustering and community-finding methods we developed
in our study. The methods we developed and use in this study are freely available in Github,
and the locations of the software and exact commands we used are provided in the Supple-
mentary materials. We also provide software version numbers and commands for the existing
code that we use in the Supplementary materials. We note here that the code we developed
relies on NetworKit (Staudt, Sazonovs, & Meyerhenke, 2016), which is an open-source Python
module designed for scalable network analysis.

In our approach, the objective was to produce a set of clusters, each of which has core
nodes and peripheral nodes consistent with the “center–periphery” structure described earlier.
These clusters are considered to be “publication communities,” with two types of members:
core members that are densely connected to each other and peripheral (noncore) members
connected to the core members but with fewer edges within a cluster.

Our approach requires values for k and p, where k specifies a minimum connectivity
between the core nodes, and p indicates a minimum connectivity between each noncore

Quantitative Science Studies

292

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Center–periphery structure in research communities

(“periphery”) node and the core nodes. These parameter values for k and p are provided by the
user, and different choices for these parameters will produce different clusterings.

The required minimum connectivity k between core nodes is related to the k-core concept
in graph theory, which we now describe. A k-core of a network N is a largest connected sub-
network A of N such that every node in A is adjacent to at least k other nodes in A (Matula &
Beck, 1983; Pittel, Spencer, & Wormald, 1996; Seidman, 1983). The k-cores can be calculated
in polynomial time (Matula & Beck, 1983), and our new clustering methods build on these
algorithms.

We are also interested in the modularity scores of the clusters that are produced by each

method, as given in Definition 1:

Definition 1. The modularity of a single cluster s within a network N, denoted by mod(S), È
given by

mod sð Þ ¼ ls
l

(cid:1) (cid:3)2

− ds
2l

where ls is the number of internal edges in cluster s, ds is the sum of the degrees of the nodes
inside s, and L is the number of total edges in the network N (Fortunato & Barthelemy, 2007).
The total modularity of a clustering is the sum of the modularity scores of its clusters.

Rather than aiming to maximize the total modularity of the clustering, we will only require
that each cluster have positive modularity; as noted in Fortunato and Barthelemy (2007), Questo
approach aims to detect valid communities. We now define some additional terms that we will
use in designing new clustering methods:

Definition 2. Given a network N, a clustering C, and a cluster C drawn from C where C is
partitioned into core nodes and noncore nodes, we will say

(cid:129) C is k-valid if and only if each core node in C is adjacent to at least k other core nodes in C.
(cid:129) C is m-valid if and only if the subcluster induced by the core nodes is connected and

has a positive modularity score.

(cid:129) C is p-valid if and only if each noncore node is adjacent to at least p core nodes in C.
(cid:129) C is kmp-valid if and only if it is k-valid, m-valid, and p-valid.
(cid:129) The clustering C is kmp-valid if and only if every cluster C 2 C is kmp-valid.

Note that if a cluster does not contain noncore nodes then it is vacuously p-valid.

The clustering methods that we develop seek to produce kmp-valid clusters, so that we can
interpret these clusters as communities with center–periphery structure. Inoltre, we are
interested in clusterings that produce a large number of kmp-valid clusters, as well as those
that include as many nodes as possible in the kmp-valid clusters (which by definition must be
nonsingleton when k > 1). Hence we explore different techniques that seek to optimize these
two opposing criteria.

We also require positive modularity in the core node subclusters for each cluster. This is a
relatively mild requirement that avoids cases where the core node subcluster may be k-valid
and connected but might not reflect a preference for itself over the outside. Consider the case
where a 10-clique, a complete graph on 10 nodes, is contained in a clique with 20 nodes. Questo
10-clique would satisfy k-validity for k ≤ 9 and would be connected, but would not have pos-
itive modularity. By enforcing positive modularity, we would avoid returning such clusters.
This example illustrates the advantage of enforcing positive modularity even though the

Quantitative Science Studies

293

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Center–periphery structure in research communities

probability of it occurring in a real-world network is likely to be small. We also note that enforc-
ing positive modularity in the core node subcluster (or even in the final cluster that contains
both core and noncore nodes) is not the same as trying to maximize the sum of the modu-
larity scores of the individual clusters (total modularity score). In other words, enforcing
positive modularity does not have the same vulnerability to the resolution limit that was
established for the modularity criterion, which seeks to maximize the total modularity score
(Fortunato & Barthelemy, 2007).

2.2.3.

Four-stage kmp-clustering

We designed a four-stage pipeline that is designed to enable the user to explore different clus-
tering options and guarantee that the output is a kmp-valid clustering. The input is a network
N and values for the parameters k and p. At a high level, the first stages aim to construct the
core member components. The second stage extracts valid subclusters from those generated
by the first stage. The third stage augments these clusters with additional members, most likely
noncore members, though some might qualify as core members, and the fourth stage assigns
core or noncore status to the nodes, and retains only those clusters that are kmp-valid. Noi
begin with a description of the overall multistage structure of our new clustering methods;
note that stages 2 E 3 are optional.

(cid:129) Stage 1: Cluster the network N into disjoint clusters (core members), so that each non-

singleton cluster is k-valid and m-valid.

(cid:129) Stage 2: Attempt to break each nonsingleton cluster produced in Stage 1 into a set of

pairwise disjoint clusters, each of which is k-valid at minimum.

(cid:129) Stage 3: For each nonsingleton cluster, add unclustered nodes (nodes that are not in any
nonsingleton cluster) as noncore (peripheral) members, provided that they are adjacent
to at least p core nodes in the selected cluster. This is the augmentation stage, Quale
adds noncore nodes to the clusters produced in the earlier stage.

(cid:129) Stage 4: Process the clustering that is received so that each final cluster is partitioned into

core and noncore members, and so that the clustering is kmp-valid.

Così, Stages 1 E 2 together are directed at finding core members of clusters, with Stage 1
directed at clusters with large numbers of core nodes and Stage 2 aimed at extracting smaller
clusters within these larger clusters. At the end of Stage 1, all clusters will be k-valid and m-
valid. If the optional Stage 2 is applied, the clusters it produces will be k-valid and connected,
but may no longer have positive modularity. Neither of these stages introduces any noncore
nodes, so the output of each stage is vacuously p-valid. Stage 3 augments the clusters to
include noncore nodes; by design the clusters will be connected, k-valid, and p-valid, Ma
depending on the outcome of Stage 2, they may not have positive modularity and so may
not be m-valid. Stage 4 is designed to ensure that all final clusters are kmp-valid, and so
may modify or discard clusters found in the earlier stages. Tuttavia, after Stage 4 is run,
the output clustering is guaranteed to be kmp-valid. Inoltre, the clusters produced are
parsed into core and noncore nodes.

We now describe the techniques we have developed for each stage. For Stage 1, we present
iterative k-core (IKC) clustering, a method that is inspired by the k-core concept in graph the-
ory. For Stage 2, we also present two different techniques: recursive Graclus (RG) and iterative
Graclus (IG), both of which are based on the Graclus (Dhillon, Guan, & Kulis, 2007) clustering
method used in its default setting and applied to split a graph into two parts. Stage 3 is imple-
mented using a straightforward algorithm that we describe below. In contrast to the earlier

Quantitative Science Studies

294

D
o
w
N
o
UN
D
e
D

F
R
o
M
H

T
T

:
/
/

D
io
R
e
C
T
.

io
T
.

e
D
tu
q
S
S
/
UN
R
T
io
C
e
–
P
D

F
/

3
1
2
8
9
2
0
0
8
3
0
0
q
S
S
_
UN
_
0
0
1
8
4
P
D

B
sì
G
tu
e
S
T

o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3

Center–periphery structure in research communities

stages, Stage 4 involves multiple steps, and is described below. This multistage design provides
a flexible framework by allowing different techniques to be used at each stage.

2.2.4.

Stage 1: Iterative k-core clustering (IKC)

To motivate the IKC algorithm, we first describe the technique that computes k-cores and then
describe the iterative method.

k-core For a network N and specified positive integer value for k, a k-core of a network
2.2.4.1.
N is a maximal connected subnetwork A of N such that every node in A is adjacent to at least k
other nodes in A. Note that for every network that does not have any isolated vertices (cioè., nodes
of degee 0), each connected component is a 1-core of the network. The distribution of k-core
sizes also indicate how quickly a network shrinks as k increases (Leskovec & Horvitz, 2008).

The identification of the k-core for a maximum achieved value of k in a network has been
proposed as a quality measure for community-finding that measures cohesiveness and sug-
gests collaboration (Giatsidis et al., 2011). This idea is reiterated by Kong, Shi et al. (2019)
and Malliaros et al. (2019), who discuss applications of the k-core in biology and real-world
networks.

The k-cores can be calculated in polynomial time (Matula & Beck, 1983), come segue. Primo,
we calculate the degree of every node in the network. Then, every node of degree less than k is
deleted from the graph, and this process repeats until every node has degree at least k (cioè.,
every node is adjacent to at least k other nodes). Every connected component that remains is
called a k-core of the network. As an example, given a network that contains two connected
components, one is a clique of size 100 and the other has a node x that is adjacent to 2,000
other nodes, each of which is only adjacent to x (and so has degree 1). Note that for all k with
2 ≤ k ≤ 99, the k-core of this network is the clique of size 100.

k-core clustering The simple k-core clustering method takes as input a network N and
2.2.4.2.
a value for k, and computes the k-cores of the network. The set of the k-cores is returned as the
clustering. By construction, the simple k-core clustering method produces clusters that are k-
valid and connected. Tuttavia, it does not constrain the clusters to have positive modularity.

Iterative k-core clustering To improve on the simple k-core clustering technique for
2.2.4.3.
our purposes, we developed an iterative k-core (IKC) algorithm. The input to the IKC algorithm
is a network N and a positive integer k. IKC then operates as follows:

(cid:129) We will construct a bin B of clusters that will be returned by the IKC clustering. In this

step, we initialize B to be the empty set.

(cid:129) We run the k-core labeling algorithm, which labels every node in the network with a
nonnegative integer. We let L be the largest label found in this labeling. If L < k, then IKC exits, and returns the clusters in the bin B. Otherwise, the L-cores (i.e., the connected components that are labeled by L) of the network are evaluated as potential clusters. (cid:129) An L-core A is added to the bin B if and only if A has positive modularity. (cid:129) The L-core is then deleted from the network, and the residual network is recursively analyzed by IKC. The stopping condition is when all the nodes have been deleted from the network. This procedure produces a collection of clusters, each of which has the following proper- ties: (i) each cluster is connected and has positive modularity, and hence each cluster is Quantitative Science Studies 295 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities m-valid; and (ii) each cluster is k-valid, which means every node in the cluster is adjacent to at least k other nodes in the cluster. 2.2.5. Stage 2: Finding clusters within clusters using Graclus In Stage 2 we seek to discover k-valid clusters that exist within larger k-valid clusters. For this, we use the Graclus clustering method (Dhillon et al., 2007). Graclus has been previously used to cluster bibliometric data (Devarakonda, Korobskiy et al., 2020; Dhillon et al., 2007; Šubelj et al., 2016), but here we use it to split a given cluster into two subclusters. We use Graclus to optimize its default criterion, which is the normalized cut criterion. In this setting, we seek a partition of a given cluster C into two parts C1 and C2 so as to minimize links C1; C2 Þ ð links C1; C Þ ð þ links C1; C2 Þ ð links C2; C Þ ð ; where links(A, B) denotes the number of edges with one endpoint in A and the other endpoint in B. As is the case in other optimization methods, local search in Graclus helps the optimizer escape poor local minima; its default setting does no iterations but this can be modified by specifying the number of iterations. In this study we explore both the default mode, l = 0 (no iterations) and l = 2,000 iterations. We implement our use of Graclus in two different ways: recursively and iteratively. Thus, we use Graclus in four different ways. 2.2.5.1. Recursive Graclus The input to this method is a clustering of the network N and the value for the parameter k. We create a bin B of clusters (setting it initially to the empty set). We take a cluster C from the clustering and apply Graclus recursively using either the default mode or the local search mode. The result of this is a division of the cluster C into two nonempty sets A1 and A2. If A1 has positive modularity and is k-valid, then we add A1 to B (the bin we have created), and similarly for subset A2. If neither A1 nor A2 is added to B, add cluster C to B and delete C from the network, effectively removing it from further consideration by RG. The stopping condition for RG is that all nodes have been deleted from the network. When the stopping condition is reached, the final output of RG is the set of clusters in bin B. Though all clusters that are produced by RG are guaranteed to be k-valid and have positive modularity, they may not be m-valid, as this requires that the core node subsets be connected. Iterative Graclus To use Graclus iteratively, we follow a similar procedure as in RG 2.2.5.2. but with two key differences. The first is that the user provides a parameter for the number of iterations, so that the procedure must stop after that number of iterations, if it hasn’t already stopped. The second difference is that, in contrast to RG, the procedure is guaranteed to pro- duce clusters that are k-valid and m-valid in each iteration, as we now describe. When we apply Graclus, in either default or local search mode, to split a cluster into two subclusters, each of the created subclusters is parsed into core and noncore nodes (using a variant of the algorithm described for Stage 4; see Section 2.2.8). If the core node set is empty, the subcluster will be discarded. However, if the core is nonempty, then the core node set is by definition k-valid, and is then evaluated further. Each core node set is divided into its con- nected components, and each connected component that has positive modularity is passed Quantitative Science Studies 296 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities to the next iteration. Any cluster that does not end up producing a subcluster that is passed to the next iteration is added to the bin B as in RG and is then deleted from the network. IG stops when one of two conditions occurs: The number of allowed iterations has been completed, or all the nodes have been deleted from the network. By design, the output of IG is a set of clus- ters each of which is k-valid and m-valid (thus, every cluster is connected and has positive modularity, and every node is adjacent to at least k nodes in its cluster). 2.2.6. Stage 3: Augmentation The purpose of the augmentation step is to assemble the periphery of center–periphery struc- tures. The input to Stage 3 is a set of clusters, so that each nonsingleton cluster is k-valid and m-valid. Here we allow all nodes that are not in any nonsingleton cluster to be added to some cluster as long as it is adjacent to at least p core nodes in the (single) cluster to which it is added. In this study, we set p = 2 to ensure that we captured publications that are linked by cocitation or bibliographic coupling to core nodes in a community. If no such cluster exists such that a node can be added to it in a p-valid manner, the node remains unclustered. We add x to the cluster C that maximizes NC xð Þ j Cj where NC(x) is the number of core node neighbors of node x in cluster C and where |C | denotes the number of nodes in cluster C. In other words, we add node x to the cluster where x has proportionally the most core node neighbors. As an example, suppose C1 and C2 are clusters of core nodes and that x is not yet added as a noncore node to any cluster. Suppose x has five neighbors in cluster C1 and 10 nodes in cluster C2, where |C1| = 1000 and |C2| = 20. This procedure would add x as a noncore mem- ber to C2 because 50% of the nodes in C2 are neighbors of x while only 5/1000 = 0.5% of the nodes in C1 are neighbors of x. 2.2.7. Stage 4: Parsing clusterings to produce kmp-valid clusters Although Stage 2 is guaranteed to produce k-valid clusters, it does not always produce m-valid clusters. Furthermore, the impact of Stage 3 (the augmentation step) is to add nodes to clusters that can participate as noncore nodes, and it is possible for a node added to a cluster in Stage 3 to have sufficient neighbors in its cluster to qualify for core membership. Hence, the result of these three stages is a set of clusters that needs to be “parsed” in order to know which nodes are core members, which nodes are noncore members, and whether the clusters are kmp-valid (as defined in Definition 2). 2.2.7.1. Parsing and modifying a single cluster Here we describe how we perform this parsing on a given cluster C (taken from a clustering C) and values for k and p. (cid:129) Step 1: We label every node in the cluster C using the k-core labeling algorithm, applied only to the subnetwork defined by C. We let C 0 denote the subset of nodes in C, each of whose labels is at least k, and we put the remaining nodes into a bin B(C ). (cid:129) Step 2: We compute the connected components of C 0, and delete the components that do not have positive modularity; the retained components thus have positive modularity, and are referred to as C-derived clusters (to indicate their derivation from the original Quantitative Science Studies 297 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities cluster C ). The clusters produced in this step will be the core node members in the final clusters we produce at the end of Step 3. (cid:129) Step 3: We augment the C-derived clusters using the bin B(C ) (i.e., we find noncore members to add) as follows. We examine each node x in bin B(C ) to see if it has at least p neighbors in at least one C-derived cluster; if so, we select the best C-derived cluster A (using the algorithm from Stage 3) and add the node x to Anc (where “nc” refers to “noncore”). After all nodes are examined and processed, we let A0 = A ∪ Anc (for each C-derived cluster A) and output the set of all such clusters A0 as the “final clusters” derived from cluster C. Note that this process indicates the parsing of each final cluster A0 into core (A) and noncore (Anc) nodes. (cid:129) Step 4: We return all the final clusters derived from C. Theorem 1. For any clustering C of a network N, and any positive integers for k and p (with p < k), the output of kmp-processing is a clustering that is kmp-valid. Therefore, the output of the four-stage clustering method is kmp-valid for all networks N and values for k and p. Proof. Let C be an arbitrary clustering. Hence, some of its clusters may not be k-valid, may not be connected, and may not have positive modularity. We will prove that after the kmp- processing, all the clusters are kmp-valid. Specifically, we will prove that the parsing produced in Step 3 into core and noncore satisfies kmp-validity. First, note that by construction the clusters (referred to as C 0) that are produced in Step 1 have the property that every node in these clusters is adjacent to at least k other nodes in their cluster. Hence, treating each cluster as only containing core nodes, these clusters are k-valid. In Step 2, these clusters are divided into components and the components are retained only if they have positive modularity; hence the C-derived clusters that are produced are k-valid and m-valid, under the interpretation that they contain only core nodes. In Step 3, the C-derived clusters are augmented. This augmentation step maintains connectivity, so the final clusters are connected. It remains to establish that after parsing any final cluster into core and noncore nodes, the final cluster would be k-valid (i.e., every core node would be adjacent to at least k other core nodes), p-valid (i.e., every noncore node would be adjacent to at least p core nodes), and the core node subcluster would have positive modularity, and hence also be m-valid. Let A0 be some final cluster. By construction, it is formed by taking a C-derived cluster A produced in Step 2, and then augmenting it. Thus, A0 = A ∪ Anc, where Anc is the set of nodes that are added during the augmentation step. We will establish that applying Stage 4 kmp- parsing to this cluster A0 will not change its decomposition into core and noncore (i.e., A will still be the core nodes and Anc will be the noncore nodes). Note that by construction, all the nodes in A are adjacent to at least k other nodes in A. Hence, when A0 is kmp-parsed, the nodes that are identified as core nodes will contain all the nodes in A and then possibly some nodes from Anc. Independent of whether there are new core nodes, A0 will be p-valid. If there are no new core nodes, therefore, then A0 will be kmp-valid. Here we show that in fact no node in Anc will be labeled as core, so there are no new core nodes. Let x 2 Anc with A a C-derived cluster. Hence, x is drawn from bin B(C ). By Step 1, the label assigned to x during Step 1 (when the nodes in cluster C were labeled) was a value L1 that is strictly less than k. Because A is a C-derived cluster, A0 ⊆ C. Consider the label L2 assigned to x by the k-core labeling of A0. Because A0 ⊆ C, it follows that L2 ≤ L1. Because L1 < k, it follows that L2 < k. Hence, x will not be placed in the core for A0, and the final clusters output in Step 4 □ are therefore kmp-valid. Quantitative Science Studies 298 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities 2.2.8. Additional uses of kmp-parsing The Stage 4 kmp-parsing routine is also used in modified forms in other aspects of this study: (cid:129) Strict kmp-parsing: This strict parsing routine evaluates each cluster in a clustering for being kmp-valid: Those clusters that are kmp-valid are retained and all others are dis- carded. We use this strict parsing routine to evaluate Leiden. (cid:129) Using kmp-parsing to extract core node clusters: We also have a variant where use the parsing to extract only the core nodes within each cluster, and then return the compo- nents of the core node subclusters that have positive modularity score; this variant is used within IG. 2.3. Multidimensional Scaling Analysis We used Multidimensional Scaling analysis (MDS) to visualize clusters of marker nodes. For the distance between marker nodes x and y, we calculated the number of clusterings in which x and y were not in the same cluster. Using the matrix of pairwise distances, we produced a two dimensional visualization using metric MDS. See the Supplementary materials for addi- tional details. 3. RESULTS AND DISCUSSION 3.1. Properties of the Citation Network Exosomes are an area of investigation within the larger field of extracellular vesicles in biology (Kalluri & LeBleu, 2020; Raposo et al., 2021). This field has been exponentially expanding, as evidenced by a keyword search for exosome in the Dimensions bibliography yielding a count of less than 100 publications in 1990 and earlier, 11,100 publications from 1991 through 2010, and 115,300 publications from 2011–2021 (rounded). To find communities in this research area, we constructed a large citation network that captured articles concerning exosomes and extracellular vesicles and articles proximal to them through citation. The citation network we built consisted of 14,695,475 publications in 13 components, of which the largest component accounts for 99.998% of the network. The network was gener- ated through amplifying a seed set of 11,156 articles (Section 2). The degree distribution of the nodes in this network is typical of citation networks with a few nodes of high degree and many nodes of low degree (Figure 1). Roughly 68% of the nodes have degree at most five and the 90th, 95th, and 99th percentiles of degree counts are 6,113, 9,186, and 24,510. The highest degree is 256,836 and corresponds to an article describing an assay for protein measurement. The nodes in this data set were roughly distributed by year of publication as follows: 1990 or earlier (1.17 million), 1991–2010 (6.05 million), and 2011–2021 (6.99 million), suggesting not only a rapid growth of exosome publications in the post-1990 period but also a substantial increase in publications linked by citation to the seed articles. 3.2. Results of Clustering Methods We now present results using different clustering methods, including the k-core clustering method, iterative k-core clustering, our four-stage pipeline, and the Leiden algorithm. As noted earlier, we were explicitly interested in discovering citation-dense regions with center- periphery structure that reflect cohesiveness and collaboration (Breiger, 2014; Giatsidis et al., 2011). In this case, collaboration refers to the recognition of prior work by others in Quantitative Science Studies 299 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 1. work. x-axis: log degree, y-axis: log node count. Intra-network degree distribution of the 14,695,475 nodes in the exosome citation net- the community through citations. We were not interested in communities that consisted of a single heavily cited article or a single article that cited many references (Chandrasekharan et al., 2021, p. 193). 3.2.1. Results for the Leiden algorithm We first used the Leiden algorithm, which guarantees connected communities (Traag et al., 2019), as a benchmark for community finding. We consider Leiden as a reference method in the scientometrics community, and have previously used it (Chandrasekharan et al., 2021) in combination with Markov Clustering (Van Dongen, 2008) to detect communities in the immu- nology and ecology literature. In the present study, we used the Leiden software (Traag, 2021) with the Constant Potts Model as quality function to cluster the exosome citation network. The resolution factor is designed to modify the clustering, as it determines the required minimum density within communities, and we varied it from 0.0001 to 0.95 (see Table 1). For each res- olution factor value, we examined the number of clusters and their sizes, as well as the node coverage, which is defined to be the fraction of the nodes in the entire network that appear in a nonsingleton cluster. We also examined these statistics after restricting the clusters produced by Leiden to those that are kmp-valid for k = 5 or 10 and p = 2 (Section 2, Definition 2). At the smallest resolution value we examined (0.0001), Leiden produced 12,266 nonsin- gleton clusters and 1,472,233 singletons, the median cluster size was 375, the largest cluster size was 70,001, and the node coverage was 90% (see also Supplementary materials, Table 2). Thus, at the smallest resolution value, there is excellent node coverage with a wide range in cluster sizes, including one large cluster. However, when restricting to the kmp-valid clusters and setting k = 5 and p = 2, there were only 69 clusters (median size 20,522) and the node coverage was only 11.5%, indicating a very large drop in number of clusters and node cov- erage (Supplementary materials, Table 3). There was an even bigger drop in number of clusters Quantitative Science Studies 300 Center–periphery structure in research communities Cluster statistics for Leiden under different resolution values. Node coverage, expressed as percentage, refers to the the ratio of Table 1. nodes in nonsingleton clusters to the total number of nodes in the network. # of clusters refers to the number of nonsingleton clusters. The last three columns refer to the sizes of nonsingleton clusters. Results shown here are before kmp-processing. Resolution 0.0001 0.001 0.01 0.05 0.25 0.50 0.75 0.95 Node coverage 90.0% # Clusters 12,266 85.6% 74.9% 57.0% 15.6% 8.7% 8.2% 8.1% 67,304 276,050 488,285 481,780 434,973 489,937 497,757 # Singletons 1,472,233 2,367,965 3,689,286 6,323,695 12,408,446 13,412,183 13,496,111 13,499,132 Min. 2 Median 375 2 2 2 2 2 2 2 98 29 17 4 2 2 2 Max. 70,001 20,802 3,510 960 192 97 39 16 and node coverage when setting k = 10 and p = 2, where the number of clusters was 18 and node coverage was 4.5% (Supplementary materials, Table 4). Thus at this resolution value, while Leiden does find clusters that are kmp-valid and so exhibit substructure, most of the clusters it finds are not kmp-valid. Results for other resolution values showed similar trends, but increases in resolution value very quickly decrease the node coverage and median cluster size and increase the number of singleton clusters; furthermore, when restricted to kmp-valid clusters, the node coverage and median cluster size drop very quickly. As an example, using resolution value 0.10 and setting k = 5 and p = 2, Leiden produces 5,393 clusters with median size 58, more than 14 million nodes are in singleton clusters, and the overall node coverage is 2.6% (setting k = 10 produces 2,961 clusters and node coverage of 2%). Further increases in the resolution value drop the node coverage and the median and maximum cluster sizes, and increase the number of sin- gleton clusters. Thus overall, reasonable results in terms of node coverage when restricted to kmp-valid clusters are obtained only for the small resolution values (between 0.0001 and 0.01), and range from 7.3% to 13.8% for k = 5 and from 4.5% to 8.8% when k = 10. In summary, while Leiden was able to efficiently cluster our network into communities that represented between 8% and 90% of the network, depending on the resolution value employed, only a small fraction of these communities exhibited kmp-validity. This is not sur- prising, because the Leiden algorithm and optimization criterion were not designed to produce kmp-valid communities, but these trends indicate that Leiden has some limitations for our purposes. 3.2.2. Results for k-core clustering To identify citation-dense regions of the network, we ran the k-core clustering method on our network for different values of k ≥ 1 (Figure 2). The largest value of k for which a cluster is returned is 56, indicating that the degeneracy of the network is 56. The network has no iso- lated vertices, so k = 0 and k = 1 produce the same output, which is 13 clusters, each corre- sponding to a connected component in the network. When k = 2, two clusters were returned. For each 3 < k ≤ 56, only a single cluster was returned; hence only k = 0, 1, 2 produced more than one cluster, and in each of these cases, a single cluster dominated in size. Quantitative Science Studies 301 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 2. Cluster sizes using the simple k-core clustering algorithm. The input exosome citation network (Section 2) consists of 13 components summing to 14,695,475 nodes. Of these, a single component accounts for 14,695,226 articles (99.998% of the network). The k-core clustering algo- rithm was applied to the exosome citation network for multiple values of k (x-axis). The size of the k-core is shown on the ordinate (y-axis). A single component is returned in each case, with the exception of k = 1 and k = 2. Node coverage decreases as k increases, and is approximately 36% and 21% when k = 5 and k = 10 respectively. The single cluster at k = 56 has 3,630 nodes, amounting to node coverage of 0.02%. At k = 57 or greater, the entire network dissolves, thus the degeneracy of the network is 56. Clusters shown have positive modularity (mod + ve) when k = 1 or k ≥ 40 and are colored teal. By definition, the connected components of the network are the 1-cores so 13 clusters are returned for k = 1. For k = 2, only two clusters are returned. Clusters of size 100 or less (0.001% of the network) are not included in this plot and pertain only to k = 1 or k = 2. Inset: k-core sizes sampled at intervals of five are displayed using a linear scale. An interesting trend in this analysis is how k impacts the modularity scores of the clusters. By definition, the components in the graph all have positive modularity, so when k = 1 the clusters all have positive modularity. For larger values of k, the modularity scores do not become positive until k = 40, and then all subsequent values of k produce clusters with positive modularity scores. Cluster size decreased monotonically as k increased to 56. However we observe a pattern of relatively stable core sizes at lower values of k followed by a more rapid decrease as k increases above 40. Leskovec and Horvitz (2008) reported similar findings about changes in core sizes on a much larger network of instant messaging data that consisted of 180,000,000 nodes. These authors suggest that the rapid decrease in core size occurs once nodes on the fringe of the network are removed. On our network, this more rapid decrease of core sizes also coincides with the appearance of positive modularity of clusters; modularity was not reported in Leskovec and Horvitz (2008). Thus, we are mainly interested in the k-cores for large values of k: they have positive modularity, small changes in k result in large changes to their sizes, and they have been identified in the prior literature as what is left after the fringe of the network is removed. Quantitative Science Studies 302 Center–periphery structure in research communities While the simple k-core clustering method identifies citation-dense areas in the network, suggesting cohesiveness and collaboration, it has two limitations: It does not ensure positive modularity for every cluster it produces, and, on our data, for every k ≥ 3 it produced only a single large cluster. We note that for k = 0, 1, 2, it produced two or more clusters, one of which was very large. These limitations impact the ability to find multiple communities of interest in the exosome literature, especially considering the possibility that some of the communities of interest could be contained within larger clusters with nonpositive modularity. 3.2.3. Results for iterative k-core (IKC) We designed iterative k-core (IKC) to improve on the k-core clustering algorithm. The input to IKC is the network N and the parameter k. The first cluster that is found in the network is the L- core where L is the largest label assigned to any node. If L ≥ k, then the L-core is produced as a cluster and removed from the network, and otherwise the algorithm stops. If the algorithm has not stopped, it is run recursively on the reduced network. Therefore, IKC(k) will contain all the clusters in IKC(k0) for k ≤ k0. We explored IKC varying k between 5 and 50, and examined the distribution of cluster sizes generated. We also recorded the minimum degree in each cluster. Because IKC clusters satisfy k-validity, the nodes in the clusters are all “core” members, so this minimum degree is also the Minimum Core Degree (MCD) of the cluster. As expected, increasing k results in decreases in cluster size and increases in the MCD value (Figure 3). To include as much of the network as is reasonable and still have sufficient density to define community structure, we selected k = 5 and k = 10 for IKC. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Empirical statistics of the iterative k-core (IKC) clustering methods (varying k) on the exo- Figure 3. some network. The y-axis shows the cluster sizes (logarithmic scale) and the x-axis shows the Min- imum Core Degree (MCD) values, where the MCD of a cluster is the minimum degree of any node in the cluster. By design, IKC(k) contains all the clusters of IKC(k0) if k ≤ k0; thus, the panels showing IKC at lower values of k contain greater numbers of clusters. Quantitative Science Studies 303 Center–periphery structure in research communities While IKC discovered km-valid communities that were trivially p-valid, it too has limita- tions. It identifies only core nodes with modest node coverage of 7.38% and 4.22% at k = 5 and k = 10, respectively. It also generates some large clusters with lower MCD values that leave open the question of whether denser kmp-valid communities exist within them. 3.2.4. Results for the four-stage clustering pipeline To address the limitations of IKC, we designed a four-stage pipeline (Figure 4). This four-stage pipeline is guaranteed to produce a kmp-valid clustering of the network for user-provided values of k and p. In Stage 1 we use IKC with k = 5 and k = 10. Stage 2 breaks the clusters found in Stage 1 into smaller clusters; for this stage we apply either RG or IG, and for each of these we use Graclus either in default mode or in local search mode. Thus, for each setting of k we have four versions of the four-stage pipeline: Iterative and RG run in either default or local search mode. In Stage 3, the clusters produced by Stage 2 are augmented by the addition of nodes that satisfy p-validity for p = 2. In Stage 4, we parse the output of Stage 3 and retain only those clusters that are kmp-valid. We show results for the different versions of this four-stage pipeline in Table 2, where the rows correspond to different ways of setting k and running Stages 2 and 3. Using IKC alone without Stages 2 and 3 resulted in 276 and 119 nonsingleton clusters and had total node coverage of 7.38% and 4.22%, with maximum cluster sizes of 345,139 and 213,670, for k = 5 and k = 10, respectively. Skipping Stage 2 (breaking down clusters) but adding Stage 3 (augmentation) greatly increases the node coverage to at least 33% for both settings of k but l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 4. Our four-stage clustering pipeline takes as input a citation network and produces clusters based on values selected for parameters k and p. Boxes adjacent to edges indicate stages in the pipe- line; boxes with blue borders indicate tests that are performed to determine which clusters are passed to the next stage. In Stage 1, it runs the iterative k-core (IKC) algorithm for the selected value of k; clusters that are km-valid are then passed to the next stage. Stage 2 (optional) breaks the clusters from the first stage into smaller clusters, using either Recursive Graclus or Iterative Graclus (and in each case, using either the default version or a heuristic search version). Clusters that pass the required validity check (k- validity for Recursive Graclus and km-validity for Iterative Graclus) are then passed to the next stage. These clusters are then enlarged with additional nodes in Stage 3, the “Augmentation” step. All clusters produced are p-valid at this point, and are passed to Stage 4, kmp-parsing, which produces a set of kmp-valid clusters, each of which is parsed into their core subcluster and noncore subcluster. Quantitative Science Studies 304 Center–periphery structure in research communities Table 2. Cluster statistics for 12 variants of four-stage clustering. All results include Stages 1 and 4, but some pipelines do not use Stages 2 or 3; all clusterings are kmp-valid. Top: results for k = 5, Bottom: results for k = 10. Stage 2 is performed using either Iterative Graclus (IG) or Recursive Graclus (RG), which are each run with either 0 or 2,000 local search iterations. Node coverage refers to the percentage of network nodes contained in nonsingleton clusters and singletons refers to the number of nodes in singleton clusters. All other statistics refer to nonsingleton clusters, with the last three columns refering to the sizes of nonsingleton clusters. Specific noteworthy trends include: (a) All clusterings that use Stage 3 have node coverage above 18%; (b) all clusterings have at least one very large cluster; (c) Stage 2 choice impacts maximum cluster size; and (d) setting k = 5 produces more clusters with a smaller median cluster size than setting k = 10. Stage 2 k = 5 No No IG(0) IG(2000) RG(0) RG(2000) k = 10 No No IG(0) IG(2000) RG(0) RG(2000) Stage 3 Node coverage Clusters (number) Singletons (number) Min. Median Max. No Yes Yes Yes Yes Yes No Yes Yes Yes Yes Yes 7.38% 36.63% 20.01% 20.84% 26.27% 26.09% 4.22% 33.69% 18.51% 20.00% 27.26% 27.61% 276 276 13,709 9,698 2,261 3,417 119 119 4,185 3,044 359 473 13,611,485 9,312,583 11,752,582 11,632,959 10,835,596 10,861,670 14,075,787 9,744,368 11,975,796 11,757,196 10,689,730 10,637,388 8 15 6 6 10 11 12 62 33 31 67 55 28.0 345,139 265.5 118.0 106.0 145.0 105.0 856,623 19,934 32,487 579,720 578,265 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . 85.0 213,670 1638.0 964,503 427.0 488.5 27,007 60,189 1014.0 679,922 761.0 620,491 / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 also increases the maximum cluster size to 856,623 and 964,503 for k = 5 and k = 10, respec- tively. This approach also has large median cluster sizes, especially for k = 10 where the median is 1638. Thus, using Stage 3 (augmentation) but not also Stage 2 produces high node coverage and large cluster sizes. Using all four stages, and hence using Stage 2, reduces the node coverage to values that range from 18.51% to 27.61% and also reduces the median and maximum cluster sizes, but the choice of how Stage 2 is run has a significant impact. Node coverage is higher when using RG rather than IG. Using RG in default mode rather than local search mode tends to produce a smaller number of nonsingleton clusters that are also somewhat larger; for example, when k = 10, default usage of Graclus that doesn’t employ the local search strategy produces a median cluster size of 1,014 as opposed to 761 when using 2,000 local search iterations. The choice between k = 5 and k = 10 also impacts results, with k = 10 producing a much smaller number of nonsingleton clusters that are substantially larger than the results for k = 5. For example, using default RG at k = 10 produces 359 nonsingleton clusters with median size 1,014, while the same setting for k = 5 produces 2,261 nonsingleton clusters with median size 145. Quantitative Science Studies 305 Center–periphery structure in research communities 3.2.5. Comparison between different clustering methods We now compare the different clustering outputs with respect to node coverage, number of nonsingleton clusters, and the distribution of nonsingleton cluster sizes. As our discussion of the different variants of the four-stage pipeline revealed, how Stages 2 and 3 are run and the value for k impact these statistics. If node coverage is the most important criterion, then skip- ping Stage 2 but using Stage 3 is recommended. However, as these approaches produce very large clusters, including Stage 2 is more likely to be desirable. Among the techniques that use Stage 2, IG tends to produce smaller clusters, but RG produces higher node coverage; the choice between these should be made based on the specific question that is being addressed. Similarly, how k is set should depend on the features of the citation network, and picking larger values of k may be suitable under some conditions. A comparison to Leiden is also helpful: Before restricting to kmp-valid clusters, Leiden has very high node coverage (57%) for resolution value 0.05 and 90% at 0.001. After restricting to kmp-valid clusters, however, the best node coverage seen is 13.8% at resolution value 0.0005 when k = 5 and 8.8% at resolution value 0.001 when k = 10. In contrast, the four-stage pipe- line has node coverage that varies from 18.51% to 27.61%, depending on k and how Stage 2 is performed (Table 2). Interestingly, the Leiden algorithm also discovers kmp-valid communi- ties even though it is not designed to specifically identify such communities. These findings support the idea of such center–periphery communities existing in networks and being more or less efficiently discovered depending on the algorithm being used. Thus, a potential user of this approach is presented with options that can be used to address contextual needs. For example, after considering features of the source data and the purpose of community finding, a user may choose Leiden, IKC, IKC with augmentation, or the complete pipeline with its kmp-parsing requirements. 3.3. Marker Node Analysis l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / The network we constructed for this study has more than 14 million nodes. We assumed that some of the communities discovered would represent areas of investigation peripheral to exo- somes and extracellular vesicles. Consequently, we used a set of 1,218 independently selected articles from the extracellular vesicle literature, all of which are present in the network, as marker nodes (Section 2) and used them to identify clusters of interest. Any community con- taining at least one marker node was considered relevant; however, we were particularly inter- ested in communities with high numbers of marker nodes. f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 For IKC clustering at either k = 5 or k = 10 (Figure 2), two clusters contained 256 and 227 marker nodes respectively, and together accounted for approximately 40% of the 1,218 markers. The first of these clusters, with 256 marker nodes, was the 56-core of the network, and comprised 3,630 nodes. The second, with 227 marker nodes, exhibited an MCD (mini- mum core degree) value of 12 and consisted of 213,670 nodes. To visualize the distribution of the entire set of marker nodes we used a multidimensional scaling approach using the fre- quency of co-occurrence in clusters across 12 different clustering methods as the measure of similarity between publications. Each of these two sets of marker nodes is found in dense and clearly defined clusters after multidimensional scaling (Figure 5). These two clusters were similar in having a large number of marker nodes but were other- wise different from each other with respect to MCD and size; therefore, we used the two sets of 256 and 227 markers as examples for further study. Quantitative Science Studies 306 Center–periphery structure in research communities Figure 5. Multidimensional scaling (MDS) of 1,218 marker nodes based on the frequency with which they are placed in the same cluster in 12 clustering outputs. The circled beige cluster (bottom right) corresponds to the 256 markers found using either IKC(5) or IKC(10) in a single cluster of minimum cluster degree (MCD) of 56, while the circled blue cluster (top) corresponds to the 227 markers found in a different cluster with an MCD value of 12. A second criterion considered was robustness to clustering method, which we measure by the frequency with which marker nodes were found co-located in the same cluster across the 12 clusterings we studied. We began this evaluation by first determining which of these nodes are always placed in nonsingleton clusters for all 12 clustering methods. We found that 27 of 256 and 35 of 227 marker nodes respectively were always placed in nonsingleton clusters, and studied these two sets further. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / The first set (A) consisting of 27 marker nodes contained 17 articles and eight reviews that were published between 2006 and 2019, with 23 of these published in 2010 or later. Based upon inspection of titles and journal, the contents of articles in set A spanned basic cell biol- ogy, the role of exosomes in cancer, and exosome isolation methods with some variation in terms of being descriptive or mechanistic. The second set (B) of 35 markers contained 26 arti- cles and nine reviews published between 2013 and 2021, largely focused on basic and trans- lational studies of exosomes in nervous tissue but also including a few articles on exosomes in pregnancy and exosome isolation methods. f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 We then examined the publications in sets A and B for co-occurrence in the same cluster across all 12 clusterings. We found that eight articles from set A were always found in the same cluster and four articles from set B were always found in the same cluster. While the numbers of markers in these sets are small, they serve to identify a larger community, which can then be characterized further by other techniques, such as detailed scholarly examination or textual content analysis. We identified the smallest cluster across the 12 clusterings that contained the eight articles from A; the selected cluster (Cluster 1) consisted of 73 articles and was focused on extracellular vesicles in cancer. We performed a similar analysis for the four articles from B, and found a cluster (Cluster 2) of 145 articles that was focused on extracellular vesicles in the nervous system. Thus, both these clusters were relatively homogeneous with respect to article content, one focused on Quantitative Science Studies 307 Center–periphery structure in research communities Statistics on the two selected small clusters. Cluster 1 and Cluster 2 were selected based Table 3. on marker nodes: The greatest number of marker nodes that were found co-located in the same cluster across all 12 clustering methods were used to define their common clusters, and the smallest of the common clusters for each set was returned. Both clusters are kmp-valid for k = 5 and p = 2. Articles refers to the total number of nodes in the cluster, core nodes refers to the number of these nodes that are core, and markers refers to the number of nodes that are markers. Authors denotes the total number of authors of articles in the cluster, Auth_5 refers to the number of authors that have at least five publications in the cluster, and Max_Pubs refers to the largest number of publications in the cluster by any single author. Cluster 1 Articles 73 Core nodes 47 Markers 16 Authors 356 Auth_5 9 Max_Pubs 17 Cluster 2 145 129 31 742 4 7 extracellular vesicles in cancer and the other on extracellular vesicles in the nervous system. Our use of a single label for each cluster should be considered a subjective approximation; an alternate view is that Cluster 2 also includes articles on cancer and has some focus on astro- cytes. Both clusters were derived from the IKC(5)-IG branch of the pipeline. We then extracted the authors of articles in these two clusters (Table 3). Cluster 1 involved 356 authors of which nine were authors of at least five articles in the cluster and one person was an author of 17 articles in the cluster. However, 301 authors (84.6%) had contributed to only one article in the cluster. Cluster 2 involved 742 authors, of which four were authors of at least five articles in the cluster. One author had contributed seven articles and 650 authors (87.6%) had contributed only one article each. Interestingly, the two publication clusters share 14 authors. Thus, the two author communities defined by two disjoint publication clusters overlap. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 6. Coauthor clusters in two communities identified using marker nodes. Two clusters were selected for analysis because they con- tained the greatest number of marker nodes that were co-located across all 12 clustering methods, and were the smallest of such clusters (see text). Discrete coauthor groups are found in both clusters when inclusion in a coauthor groups requires at least two instances of intracluster coauthorship between two authors. (a) Cluster 1 consists of 73 articles contributed by 356 authors. This cluster contains four nonoverlapping coauthor groups, with 5, 8, 11, and 28 authors. (b) Cluster 2 consists of 145 articles authored by 356 authors. This cluster contains 17 non- overlapping groups, with four groups of two authors, three groups of three authors, three groups of four authors, two groups of five authors, four groups of seven authors, and one group of 18 authors. Quantitative Science Studies 308 Center–periphery structure in research communities These trends are strikingly similar to the observations of Price and Beaver (1966) in that the authors segregate into a small number with large numbers of papers in the cluster and many with only one paper in the cluster. Although we only examined two clusters, both exhibited the center–periphery structure described in Price and Beaver (1966); our sample is too small to draw conclusions beyond suggesting that the trends observed may be true for other clusters, and this should be evaluated. We also found discrete coauthor groups within these clusters (Figure 6). Cluster 1 featured four nonoverlapping coauthor groups where authors were linked to each other if they had coauthored at least two articles in the cluster. Cluster 2 featured 17 such discrete coauthor groups suggesting, despite its larger size, that influence within the group was more distributed considering the larger number of coauthor groups and the smaller number of articles written by individual authors. These examples are provided to illustrate the potential utility of the pipeline and the use of marker nodes. 4. CONCLUSIONS Based on historical studies of research communities, we posed corresponding properties for the graphical structure of communities in networks. We developed an analytic pipeline to ask whether communities of publications with center–periphery substructure exist in citation net- works. In designing a four-stage pipeline to find communities of this form, we were implicitly asking whether the information encoded in the graphical structure of communities can be used to make inferences on the social structure of these communities, for example, discrete coau- thor groups. We examined these questions using a citation network representative of exosomes and extracellular vesicles, a field that has rapidly expanded in recent years. In this citation network, our pipeline found many publication communities that exhibit center–periphery structure. This finding supports our hypothesis that communities of this type exist within the extracellular vesicles research community, and shows that the pipeline we used can find such communities. Whether such communities exist in other citation networks and whether our pipeline is successful at finding such communities are important questions that future work should address. We note that our work explored one very large network with respect to a particular center–periphery model. However, other center–periphery models have been proposed that can be explored and evaluated (Borgatti & Everett, 2000; Gallagher et al., 2021; Havemann, Gläser, & Heinz, 2018; Rombach et al., 2014, 2017), and future work should examine whether these provide better insight into the structure of scientific research communities. Our pipeline is designed to enable investigators to interrogate their data with different options for each stage and different settings for k and p. As we observed in our study, changes to the settings for the parameters k and p as well as how each stage was performed produced clusterings that differ from each other in terms of the the node coverage as well as the number and sizes of nonsingleton clusters, effectively providing different views of the network. Thus, the specific question of interest and the properties of the citation network are important in choosing how to set these parameters. Alternatively, the pipeline can be used to generate many different clusterings, and the investigator may assess community structure through an integrative analysis that does not depend on a single clustering method. We also saw significant differences between clusterings produced by our pipeline and those produced by the Leiden software. While there is some overlap in the range of cluster sizes generated, our pipeline tends to generate much larger clusters than the Leiden software, and all of our pipeline clusterings produced at least one very large cluster with more than Quantitative Science Studies 309 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities 19,000 nodes and, in the case of RG, the largest cluster contained more than 500,000 nodes. As we note earlier, the Leiden algorithm was not designed to identify substructure detected as kmp-valid clusters but it also detects citation-dense communities. Given our focus on the small clusters produced in our various pipelines, we did not explore large publication clusters. It is not clear to us what insights can be gained from clusters that have tens of thousands of nodes. We speculate that they may reflect the many connections in a rapidly expanding field. Alternatively, they are targets for future versions of Stage 2, in which we break large clusters into smaller valid ones, perhaps with different criteria for validity. Further work is clearly needed. This study suggests several other directions for future work. For method development, our four-stage pipeline is designed to enable substitutions to how each stage is performed; as noted above, breaking up large clusters might be more successfully executed using new approaches rather than either recursive or iterative Graclus. We also note that all the clustering methods we developed produce disjoint clusters, yet publications may be expected to be members of more than one research community. Some clustering methods exist that can produce overlap- ping clusters (Rossetti, 2020); those that cluster edges rather than nodes (Ahn, Bagrow, & Lehmann, 2010; Evans & Lambiotte, 2009; Havemann, 2021) are a next step towards nondis- joint approaches for us. Exploring overlapping clusters in the context of large citation networks is likely to provide additional insight. Combining information from multiple clustering methods could also lead to greater insight, so principled development of ensemble methods (a standard approach in machine learning) is another direction for future research. One of the most intriguing directions for future research is the life cycle of these research communities, both in terms of how ideas and questions being focused on by the research com- munity as a whole change over time and how authors move between different communities over time. This is challenging to study because the network evolves through growth each year, and community-finding in dynamic graphs is a promising direction (Rossetti, 2020). Additional clusters could be examined for in-depth examination based, for example, on specific authors, articles of particular importance, funding sources, or clusters identified by marker nodes co-located in many but not all 12 clusterings. Other insights could be obtained by examining the relationship between publication communities (e.g., citations between communities) in a single clustering or comparisons of communities obtained using different clustering methods; such investigations would help elucidate how the research ideas and communities relate to each other, and the extent to which these communities are hierarchi- cally organized. For extracellular vesicles, insight into community life cycles can be obtained by studying, over time, communities containing key studies such as those on transferrin recycling (Harding, Heuser, & Stahl, 1983; Pan & Johnstone, 1983), the observation that B-lymphocytes secrete antigen-presenting vesicles (Raposo, Nijman et al., 1996), a report of exosome-mediated trans- fer of mRNA and microRNA (Ratajczak, Miekus et al., 2006; Valadi, Ekström et al., 2007), and the biological effects of transferring exosomes between lean and obese mice (Ying, Riopel et al., 2017). Our analysis was based on a single set of marker nodes; extending the set of marker nodes and annotating each marker node with respect to content would allow finer- grained evaluation. Exploration of author communities associated to these publication communities would offer additional insights into the social structure of the research community, potentially identifying authors with high influence within a particular emerging research area, and others that are highly influential across several areas. Quantitative Science Studies 310 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities We close with comments about the high-level approach we took to understanding commu- nity structure. Our approach relied entirely on the graphical structure of the citation network. This restriction was used in order to provide a scalable approach that did not rely on any other information or expert knowledge; however, this limits which communities can be detected (McCain, 1986). Textual analysis, relationships between authors based on institutions, and other social interactions such as conference presentations and sources of funding could be used to used to supplement citation data and would likely lead to a different set of publication or author communities with potentially different properties. Understanding author role is important, and again our reliance on citations to identify influ- ential researchers is biased towards well-cited and well-funded authors. Perhaps one of the benefits, therefore, of our approach to community detection is that we can use it to find small and thematically focused publication communities and hence identify those authors who are influential within these small communities. Nevertheless, we propose that while scalable methods for community detection may generally tend to rely on purely graph-theoretic prop- erties, mixed method approaches support more a more nuanced understanding of social struc- tures and dynamics within the scientific enterprise. ACKNOWLEDGMENTS We thank Valerie King from the University of Victoria for directing us to the k-core literature. We thank Phil Stahl from Washington University in St Louis for helpful discussions and for drawing our attention to recent reviews of the extracellular vesicle literature. We thank Digital Science, Google, the Grainger Foundation, and the Thomas and Stacey Siebel Foundation. AUTHOR CONTRIBUTIONS Eleanor Wedell: Formal analysis, Investigation, Methodology, Software, Writing—Original draft, Writing—Review & editing. Minhyuk Park: Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—Original draft, Writing—Review & editing. Dimitriy Korobskiy: Data curation, Formal analysis, Software, Validation, Writing—Review & editing. Tandy Warnow: Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Writing—Original draft, Writing—Review & editing. George Chacko: Conceptualization, Data curation, Formal analysis, Funding acquisi- tion, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing—Original draft, Writing—Review & editing. COMPETING INTERESTS The authors have no competing interests. Dimensions data were made available by Digital Science through the free data access for scientometrics research projects program. Digital Science personnel did not participate in conceptualization, experimental design, review of results, or conclusions presented. DK is an employee of NTT DATA, which had no role in this study. FUNDING INFORMATION EW is a Siebel Scholar. TW receives funding from the Grainger Foundation. Research reported in this manuscript was supported by the Google Cloud Research Credits program through award GCP19980904 to GC. Quantitative Science Studies 311 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities DATA AVAILABILITY Access to the bibliographic data analyzed in this study requires access from Digital Science. Code generated for this study is freely available from our Github site (Park et al., 2021). REFERENCES Ahlgren, P., Chen, Y., Colliander, C., & van Eck, N. J. (2020). Enhancing direct citations: A comparison of relatedness mea- sures for community detection in a large set of PubMed publica- tions. Quantitative Science Studies, 1(2), 714–729. https://doi.org /10.1162/qss_a_00027 Ahn, Y.-Y., Bagrow, J. P., & Lehmann, S. (2010). Link communities reveal multiscale complexity in networks. Nature, 466(7307), 761–764. https://doi.org/10.1038/nature09182, PubMed: 20562860 Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008 Borgatti, S. P., & Everett, M. G. (2000). Models of core/periphery structures. Social Networks, 21(4), 375–395. https://doi.org/10 .1016/S0378-8733(99)00019-2 Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, biblio- graphic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404. https://doi.org/10.1002/asi.21419 Boyack, K. W., & Klavans, R. (2014). Including cited non-source items in a large-scale map of science: What difference does it make? Journal of Informetrics, 8(3), 569–580. https://doi.org/10 .1016/j.joi.2014.04.001 Boyack, K. W., & Klavans, R. (2019). Creation and analysis of large-scale bibliometric networks. In W. Glänzel, H. F. Moed, U. Schmoch, & M. Thelwall (Eds.), Springer handbook of science and technology indicators (pp. 187–212). Cham: Springer Inter- national. https://doi.org/10.1007/978-3-030-02511-3_8 Breiger, R. (2014). Explorations in structural analysis (RLE social theory): Dual and multiple networks of social interaction. New York: Routledge. https://doi.org/10.4324/9781315765488 Busatto, S., Morad, G., Guo, P., & Moses, M. A. (2021). The role of extracellular vesicles in the physiological and pathological regulation of the blood–brain barrier. FASEB Bio Advances, 3(9), 665–675. https://doi.org/10.1096/fba.2021-00045, PubMed: 34485835 Chandrasekharan, S., Zaka, M., Gallo, S., Zhao, W., Korobskiy, D., … Chacko, G. (2021). Finding scientific communities in citation graphs: Articles and authors. Quantitative Science Studies, 2(1), 184–203. https://doi.org/10.1162/qss_a_00095 Chubin, D. E. (1976). State of the field the conceptualization of sci- entific specialties. The Sociological Quarterly, 17(4), 448–476. https://doi.org/10.1111/j.1533-8525.1976.tb01715.x Clancy, J. W., Schmidtmann, M., & D’Souza-Schorey, C. (2021). The ins and outs of microvesicles. FASEB Bio Advances, 3(6), 399–406. https://doi.org/10.1096/fba.2020-00127, PubMed: 34124595 Coscia, M., Giannotti, F., & Pedreschi, D. (2011). A classification for community discovery methods in complex networks. Statisti- cal Analysis and Data Mining, 4(5), 512–546. https://doi.org/10 .1002/sam.10133 Crane, D. (1972). Invisible colleges; diffusion of knowledge in sci- entific communities. Chicago: University of Chicago Press. Devarakonda, S., Korobskiy, D., Warnow, T., & Chacko, G. (2020). Viewing computer science through citation analysis: Salton and Bergmark redux. Scientometrics, 125(1), 271–287. https://doi.org /10.1007/s11192-020-03624-0, PubMed: 33746310 Dhillon, I., Guan, Y., & Kulis, B. (2007). Weighted graph cuts with- out eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI ), 29(11), 1944–1957. https://doi.org/10.1109/ TPAMI.2007.1115, PubMed: 17848776 Edgar, J. R. (2016). Q & A: What are exosomes, exactly? BMC Biology, 14(46). https://doi.org/10.1186/s12915-016-0268-z, PubMed: 27296830 Evans, T. S., & Lambiotte, R. (2009). Line graphs, link partitions, and overlapping communities. Physical Review E, 80(1), 016105. https://doi.org/10.1103/physreve.80.016105, PubMed: 19658772 Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174. https://doi.org/10.1016/j.physrep .2009.11.002 Fortunato, S., & Barthelemy, M. (2007). Resolution limit in commu- nity detection. Proceedings of the National Academy of Sciences, 104(1), 36–41. https://doi.org/10.1073/pnas.0605965104, PubMed: 17190818 Fortunato, S., & Castellano, C. (2009). Community structure in graphs. In Computational complexity (pp. 490–512). New York: Springer. https://doi.org/10.1007/978-1-4614-1800-9_33 Gallagher, R. J., Young, J.-G., & Welles, B. F. (2021). A clarified typology of core-periphery structure in networks. Science Advances, 7(12), eabc9800. https://doi.org/10.1126/sciadv .abc9800, PubMed: 33731343 Ghoroghi, S., Mary, B., Asokan, N., Goetz, J. G., & Hyenne, V. (2021). Tumor extracellular vesicles drive metastasis (it’s a long way from home). FASEB Bio Advances, 3(11), 930–943. https:// doi.org/10.1096/fba.2021-00079, PubMed: 34761175 Giatsidis, C., Thilikos, D. M., & Vazirgiannis, M. (2011). Evaluating cooperation in communities with the k-core structure. In 2011 International Conference on Advances in Social Networks Analysis and Mining. IEEE. https://doi.org/10.1109/asonam.2011.65 Harding, C., Heuser, J., & Stahl, P. (1983). Receptor-mediated endo- cytosis of transferrin and recycling of the transferrin receptor in rat reticulocytes. Journal of Cell Biology, 97(2), 329–339. https:// doi.org/10.1083/jcb.97.2.329, PubMed: 6309857 Havemann, F. (2021). Topics as clusters of citation links to highly cited sources: The case of research on international relations. Quantitative Science Studies, 2(1), 204–223. https://doi.org/10 .1162/qss_a_00108 Havemann, F., Gläser, J., & Heinz, M. (2018). Communities as well separated subgraphs with cohesive cores: Identification of core-periphery structures in link communities. In Studies in com- putational intelligence (pp. 219–230). Springer International. https://doi.org/10.1007/978-3-030-05411-3_18 He, B., Hamby, R., & Jin, H. (2021). Plant extracellular vesicles: Trojan horses of cross-kingdom warfare. FASEB Bio Advances, 3(9), 657–664. https://doi.org/10.1096/fba.2021-00040, PubMed: 34485834 Quantitative Science Studies 312 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities Hook, D. W., Porter, S. J., & Herzog, C. (2018). Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics, 3, 23. https://doi.org/10.3389/frma.2018 .00023 Javed, M. A., Younis, M. S., Latif, S., Qadir, J., & Baig, A. (2018). Community detection in networks: A multidisciplinary review. Journal of Network and Computer Applications, 108, 87–111. https://doi.org/10.1016/j.jnca.2018.02.011 Kalluri, R., & LeBleu, V. S. (2020). The biology, function, and bio- medical applications of exosomes. Science, 367(6478). https:// doi.org/10.1126/science.aau6977, PubMed: 32029601 Klavans, R., & Boyack, K. W. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 68(4), 984–998. https://doi.org/10.1002/asi.23734 Kong, Y.-X., Shi, G.-Y., Wu, R.-J., & Zhang, Y.-C. (2019). k-core: Theories and applications. Physics Reports, 832, 1–32. https:// doi.org/10.1016/j.physrep.2019.10.004 Kuhn, T. (1970). The structure of scientific revolutions. Chicago: University of Chicago Press. Lananna, B. V., & Imai, S. (2021). Friends and foes: Extracellular vesicles in aging and rejuvenation. FASEB Bio Advances, 3(10), 787–801. https://doi.org/10.1096/fba.2021-00077, PubMed: 34632314 Le Lay, S., Rome, S., Loyer, X., & Nieto, L. (2021). Adipocyte-derived extracellular vesicles in health and diseases: Nano-packages with vast biological properties. FASEB Bio Advances, 3(6), 407–419. https://doi.org/10.1096/fba.2020-00147, PubMed: 34124596 Leidal, A. M., & Debnath, J. (2021). Emerging roles for the autoph- agy machinery in extracellular vesicle biogenesis and secretion. FASEB Bio Advances, 3(5), 377–386. https://doi.org/10.1096/fba .2020-00138, PubMed: 33977236 Leskovec, J., & Horvitz, E. (2008). Planetary-scale views on a large instant-messaging network. In Proceeding of the 17th Interna- tional Conference on World Wide Web—WWW ‘08. ACM Press. https://doi.org/10.1145/1367497.1367620 Malliaros, F. D., Giatsidis, C., Papadopoulos, A. N., & Vazirgiannis, M. (2019). The core decomposition of networks: Theory, algo- rithms and applications. VLDB Journal, 29(1), 61–92. https://doi .org/10.1007/s00778-019-00587-4 Matula, D. W., & Beck, L. L. (1983). Smallest-last ordering and clus- tering and graph coloring algorithms. Journal of the Association for Computing Machinery, 30(3), 417–427. https://doi.org/10 .1145/2402.322385 McCain, K. W. (1986). The paper trails of scholarship: Mapping the literature of genetics. Library Quarterly, 56, 258–271. https://doi .org/10.1086/601770 Morris, S. A. (2005). Manifestation of emerging specialties in jour- nal literature: A growth model of papers, references, exemplars, bibliographic coupling, cocitation, and clustering coefficient dis- tribution. Journal of the American Society for Information Science and Technology, 56(12), 1250–1273. https://doi.org/10.1002/asi .20208 Morris, S. A., & der Veer Martens, B. V. (2009). Mapping research specialties. Annual Review of Information Science and Technology, 42(1), 213–295. https://doi.org/10.1002/aris.2008.1440420113 Mullins, N. C. (1985). Invisible colleges as science elites. Sciento- metrics, 7, 357–368. https://doi.org/10.1007/BF02017154 Mullins, N. C., Hargens, L. L., Hecht, P. K., & Kick, E. L. (1977). The group structure of cocitation clusters: A comparative study. American Sociological Review, 42(4), 552–562. https://doi.org /10.2307/2094554 Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577–8582. https://doi.org/10.1073/pnas.0601602103, PubMed: 16723398 Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating com- munity structure in networks. Physical Review E, 69(2), 026113. https://doi.org/10.1103/physreve.69.026113, PubMed: 14995526 Pan, B.-T., & Johnstone, R. M. (1983). Fate of the transferrin recep- tor during maturation of sheep reticulocytes in vitro: Selective externalization of the receptor. Cell, 33(3), 967–978. https://doi .org/10.1016/0092-8674(83)90040-5, PubMed: 6307529 Park, M., Wedell, E., Korobskiy, D., Warnow, T., & Chacko, G. (2021). Community Finding and Clustering Project. Github repository, University of Illinois Urbana-Champaign. https:// github.com/chackoge/ERNIE_Plus/tree/master/Illinois/clustering. Pittel, B., Spencer, J., & Wormald, N. (1996). Sudden emergence of a giant k-core in a random graph. Journal of Combinatorial Theory, Series B, 67, 111–151. https://doi.org/10.1006/jctb.1996.0036 Price, D. d. S. (1965). Networks of scientific papers. Science, 149, 510–515. https://doi.org/10.1126/science.149.3683.510, PubMed: 14325149 Price, D. d. S., & Beaver, D. D. (1966). Collaboration in an invisible college. American Psychologist, 21(11), 1011–1018. https://doi .org/10.1037/h0024051, PubMed: 5921694 Raposo, G., Nijman, H. W., Stoorvogel, W., Liejendekker, R., Harding, C. V., … Geuze, H. J. (1996). B lymphocytes secrete antigen-presenting vesicles. Journal of Experimental Medicine, 183(3), 1161–1172. https://doi.org/10.1084/jem.183.3.1161, PubMed: 8642258 Raposo, G., van Niel, G., & Stahl, P. D. (2021). Extracellular vesi- cles and homeostasis—An emerging field in bioscience research. FASEB BioAdvances, 3(6), 456–458. https://doi.org/10.1096/fba .2021-00009, PubMed: 34124600 Ratajczak, J., Miekus, K., Kucia, M., Zhang, J., Reca, R., … Ratajczak, M. Z. (2006). Embryonic stem cell-derived microvesicles repro- gram hematopoietic progenitors: Evidence for horizontal transfer of mRNA and protein delivery. Leukemia, 20(5), 847–856. https:// doi.org/10.1038/sj.leu.2404132, PubMed: 16453000 Rombach, M. P., Porter, M. A., Fowler, J. H., & Mucha, P. J. (2014). Core-periphery structure in networks. SIAM Journal on Applied Mathematics, 74(1), 167–190. https://doi.org/10.1137/120881683 Rombach, P., Porter, M. A., Fowler, J. H., & Mucha, P. J. (2017). Core–periphery structure in networks (revisited). SIAM Review, 59(3), 619–646. https://doi.org/10.1137/17m1130046 Rossetti, G. (2020). ANGEL: Efficient, and effective, node-centric community discovery in static and dynamic networks. Applied Network Science, 5, 26. https://doi.org/10.1007/s41109-020 -00270-6 Schnatz, A., Müller, C., Brahmer, A., & Krämer-Albers, E.-M. (2021). Extracellular vesicles in neural cell interaction and CNS homeo- stasis. FASEB Bio Advances, 3(8), 577–592. https://doi.org/10 .1096/fba.2021-00035, PubMed: 34377954 Sciabolazza, V. L., Vacca, R., Okraku, T. K., & McCarty, C. (2017). Detecting and analyzing research communities in longitudinal scientific networks. PLOS ONE, 12(8), e0182516. https://doi .org/10.1371/journal.pone.0182516, PubMed: 28797047 Seidman, S. B. (1983). Network structure and minimum degree. Social Networks, 5(3), 269–287. https://doi.org/10.1016/0378 -8733(83)90028-x Sjögarde, P., & Ahlgren, P. (2018). Granularity of algorithmically constructed publication-level classifications of research publi- cations: Identification of topics. Journal of Informetrics, 12(1), 133–152. https://doi.org/10.1016/j.joi.2017.12.006 Quantitative Science Studies 313 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d / . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Center–periphery structure in research communities Sjögarde, P. & Ahlgren, P. (2020). Granularity of algorithmically constructed publication-level classifications of research publica- tions: Identification of specialties. Quantitative Science Studies, 1(1), 207–238. https://doi.org/10.1162/qss_a_00004 Staudt, C. L., Sazonovs, A., & Meyerhenke, H. (2016). NetworKit: A tool suite for large-scale complex network analysis. Network Science, 4(4), 508–530. https://doi.org/10.1017/nws.2016.20 Šubelj, L., van Eck, N. J., & Waltman, L. (2016). Clustering scien- tific publications based on citation relations: A systematic com- parison of different methods. PLOS ONE, 11(4), e0154404. https://doi.org/10.1371/journal.pone.0154404, PubMed: 27124610 Traag, V. (2021). Network analysis, GitHub page for the Leiden algorithm. https://github.com/vtraag/networkanalysis (accessed October 18, 2021). Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z, PubMed: 30914743 Valadi, H., Ekström, K., Bossios, A., Sjöstrand, M., Lee, J. J., & Lötvall, J. O. (2007). Exosome-mediated transfer of mRNAs and microRNAs is a novel mechanism of genetic exchange between cells. Nature Cell Biology, 9(6), 654–659. https://doi.org/10.1038 /ncb1596, PubMed: 17486113 Van Dongen, S. (2008). Graph clustering via a discrete uncoupling process. SIAM Journal on Matrix Analysis and Applications, 30(1), 121–141. https://doi.org/10.1137/040608635 van Niel, G., D’Angelo, G., & Raposo, G. (2018). Shedding light on the cell biology of extracellular vesicles. Nature Reviews Molec- ular Cell Biology, 19(4), 213–228. https://doi.org/10.1038/nrm .2017.125, PubMed: 29339798 Verdi, V., Bécot, A., van Niel, G., & Verweij, F. J. (2021). In vivo imaging of EVs in zebrafish: New perspectives from “the water- side.” FASEB Bio Advances, 3(11), 918–929. https://doi.org/10 .1096/fba.2021-00081, PubMed: 34761174 Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378–2392. https://doi.org/10.1002/asi.22748 Ying, W., Riopel, M., Bandyopadhyay, G., Dong, Y., Birmingham, A., … Olefsky, J. M. (2017). Adipose tissue macrophage-derived exosomal miRNAs can modulate in vivo and in vitro insulin sensitivity. Cell, 171(2), 372–384. https://doi.org/10.1016/j.cell .2017.08.035, PubMed: 28942920 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / 3 1 2 8 9 2 0 0 8 3 0 0 q s s _ a _ 0 0 1 8 4 p d . / f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Quantitative Science Studies 314 RESEARCH ARTICLE image

Scarica il pdf