RESEARCH ARTICLE
An open data set of scholars on Twitter
Philippe Mongeon1,2, Timothy D. Bowman3, and Rodrigo Costas4,5
1School of Information Management, Dalhousie University, Halifax, Nova Scotia, Canada
2Centre interuniversitaire de recherche sur la science et la technologie (CIRST), Université du Québec à Montréal,
Montréal, Québec, Canada
3School of Information Sciences, Wayne State University, Detroit, MI, USA
4Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, The Netherlands
5DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy (SciSTIP),
Stellenbosch University, Stellenbosch, South Africa
Keywords: altmetrics, bibliometrics, open data, social media metrics, Twitter
ABSTRACT
The role played by research scholars in the dissemination of scientific knowledge on social
media has always been a central topic in social media metrics (altmetrics) research. Different
approaches have been implemented to identify and characterize active scholars on social
media platforms like Twitter. Some limitations of past approaches were their complexity and,
most importantly, their reliance on licensed scientometric and altmetric data. The emergence
of new open data sources such as OpenAlex or Crossref Event Data provides opportunities to
identify scholars on social media using only open data. This paper presents a novel and simple
approach to match authors from OpenAlex with Twitter users identified in Crossref Event Data.
The matching procedure is described and validated with ORCID data. The new approach
matches nearly 500,000 scholars with their Twitter accounts with high
precision and moderate recall. The data set of matched scholars is described and made openly
available to the scientific community to empower more advanced studies of the interactions of
research scholars on Twitter.
1. INTRODUCTION
Engagement with academic research on social media has been a central research topic in
scientometrics, particularly in altmetrics and social media metrics. In the early days of social
media metrics research, the focus was primarily on investigating the relationship between
the number of mentions of research publications on social media platforms (particularly
Twitter) and citations, with most of the studies finding weak relationships between social media
metrics and citations (Costas, Zahedi, & Wouters, 2014; Sugimoto, Work et al., 2017;
Thelwall, Haustein et al., 2013). However, recent theoretical proposals have initiated a shift
in the focus of altmetric research from analyzing mentions and correlations to more interactive
perspectives. Thus, Haustein (2016) proposed that social media metrics need not be restricted
to the mentions of scholarly outputs on social media but could also include the mentions and
activities of individual scholars. More recently, Costas, Rijcke, and Marres (2021) proposed the
notion of “heterogeneous couplings” as a common framework to study the interactions
between academic and nonacademic actors as captured via online and social media platforms
(see also Williams, 2022), in which the interactions of individual scholars on Twitter are
another fundamental form of online interaction relating to how science is being
communicated to society (Brainard, 2022).
an open access journal
Citation: Mongeon, P., Bowman, T. D., & Costas, R. (2023). An open data set of scholars on Twitter. Quantitative Science Studies, 4(2), 314–324. https://doi.org/10.1162/qss_a_00250
DOI: https://doi.org/10.1162/qss_a_00250
Peer Review: https://www.webofscience.com/api/gateway/wos/peer-review/10.1162/qss_a_00250
Received: 23 August 2022
Accepted: 3 February 2023
Corresponding Author: Philippe Mongeon, pmongeon@dal.ca
Handling Editor: Vincent Larivière
Copyright: © 2023 Philippe Mongeon, Timothy D. Bowman, and Rodrigo Costas. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The MIT Press
In the quest to study scholars’ activities on Twitter, one long-lasting challenge is the
identification of social media accounts belonging to researchers. In 2020, we published a paper that
introduced a method to match Web of Science authors with their Twitter accounts (Costas,
Mongeon et al., 2020) and reported on the distribution of scholars on Twitter across countries,
disciplines, academic age, and gender. One of the main features of that data set was that it
allowed us, for the first time, to investigate the relationship between the research profiles and
activities of scholars and their profiles and activities on Twitter on a large scale (Ferreira,
Mongeon, & Costas, 2021). Past data sets did not allow for this because they were either
too small or because they provided information on whether or not a Twitter account likely
belonged to a researcher without identifying the specific researcher to whom the account
belonged.
The data set we produced with this initial work had two main limitations. First, it used
proprietary data from the Web of Science and Altmetric.com, which made it impossible to share the
author–tweeter pairs openly and complicated for others to use the data set without
access to these databases. Second, the large number of steps involved in the previously reported
process possibly contributed to its lack of implementation by other researchers and its
lack of transferability to other data sets.
New developments in Open Science scientometric and altmetric databases (namely the
OpenAlex and Crossref Event Data databases) have changed this landscape, now allowing
for the creation of matches of academic authors and their research publications with their
Twitter profiles. This data paper aims to introduce a data set of scholars’ Twitter accounts
identified with a naïve algorithm based entirely on available open data, presenting the process in
detail with accompanying R and Python scripts so that the process can be easily replicated
and/or improved upon by the research community. We hope this data set will support further
research on the interactions of scientific authors on social media and serve as a base for
developing alternative and/or complementary approaches to match Twitter users and authors.
The paper is structured as follows. First, we provide an overview of the different data
sources we used and the detailed process we used to match the Twitter accounts with
OpenAlex authors. We then report precision and recall estimates for our matching approach,
followed by an overview of the characteristics of the scholars found on Twitter.
2. DATA AND METHODS
2.1. Data Sources
2.1.1. OpenAlex
The research publications data source for this study is the OpenAlex (Priem, Piwowar, & Orr,
2022) data dump from May 20, 2022, which was downloaded and parsed into a relational
database model hosted at the Maritime Institute for Science, Technology, and Society (MISTS)
in Canada. In the OpenAlex database, authors are represented by a unique identifier
(author_id) associated with their works (see the works_authorships table of the OpenAlex
schema). The OpenAlex database we used contains 220,870,820 author_ids. Our ultimate
objective is to assign a twitter_id to these author_id values. OpenAlex also includes a link
between the author_id and ORCID. It is worth noting that a single individual can have multiple
author_ids in OpenAlex, so that the same ORCID can be associated with multiple author_ids.
It is not clear why authors with the same ORCID are not merged together in the OpenAlex
database, but it is likely due to the clustering approach used to construct the author entities. While
we could have chosen to perform this merge ourselves as part of our process, we chose to use
all our data sources as they are and we did not merge any of the OpenAlex authors. It is always
possible for the users of our data set to group OpenAlex authors together based on an author
disambiguation process of their choice, which may include combining authors with the same
ORCID.
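As an illustration of the grouping option mentioned above, the following minimal Python sketch groups author_ids that share an ORCID. The function name and the input layout (pairs of author_id and ORCID, with None when no ORCID is recorded) are assumptions made for this example and are not part of the OpenAlex schema or of the scripts accompanying the paper.

```python
from collections import defaultdict


def group_authors_by_orcid(author_orcid_pairs):
    """Group author_ids that share an ORCID.

    author_orcid_pairs: iterable of (author_id, orcid) tuples, where orcid
    is None when OpenAlex records no ORCID for that author (assumed layout).
    """
    groups = defaultdict(list)
    for author_id, orcid in author_orcid_pairs:
        # Authors without an ORCID stay in their own singleton group.
        key = orcid if orcid else f"no-orcid:{author_id}"
        groups[key].append(author_id)
    return dict(groups)


example = [("A1", "0000-0001"), ("A2", "0000-0001"), ("A3", None)]
merged = group_authors_by_orcid(example)
```

Users who prefer a different disambiguation strategy could replace the ORCID key with the output of their own clustering step.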
2.1.2. Crossref Event Data
We use a data dump of Crossref Event Data (CED) from January 2022 available at the Centre for Science
and Technology Studies (CWTS), containing over 60 million Twitter events from 5,288,867
unique Twitter accounts. Each event contains the tweet identifier and the DOI of the paper mentioned
in that tweet. The dump includes 4.7 million unique DOIs tweeted at least once and recorded in
CED. We use the Twitter API to rehydrate the profile information of the
Twitter users recorded in the CED dump. An important difference between CED and Altmetric
is that CED focuses on identifying tweets that mention DOIs, while Altmetric also identifies Twitter mentions
of preprints (e.g., from ArXiv) and other publication identifiers (e.g., PMIDs). Therefore, CED will
typically identify fewer tweets to publications than Altmetric (Ortega, 2018).
Our use of an altmetric database such as Crossref Event Data to identify researchers on Twitter
stems from the expectation, as in Costas et al. (2020), that researchers are more likely to tweet
research publications than nonresearchers (Tsou, Bowman et al., 2015) and therefore to be
recorded in this database. By considering only Twitter users that have mentioned scholarly work
in their tweets as recorded in Crossref Event Data, we presumably increase precision at the
expense of excluding all scholarly Twitter users that have never tweeted any research publications.
2.1.3. ORCID
The OpenAlex database includes the ORCID ids for approximately 2% of all the authors indexed
in the database. Because some researchers include their Twitter handle in their public ORCID
profiles, we leverage the information recorded in the ORCID Public Data File 2021 (Blackburn,
Cabral et al., 2021) to retrieve the Twitter account for those researchers who self-reported a
Twitter profile in their ORCID profile. We used the ORCID data dump (2021) hosted by the Centre
for Science and Technology Studies (CWTS) to obtain a set of 13,208 matching OpenAlex author
ids and ORCID profiles. This data set has been used as a golden set to evaluate the performance of
our matching process. It should be noted that Twitter accounts listed in ORCID profiles are not
necessarily valid. Even when these accounts are valid, the Twitter handle and user name may not
include an actual name, which makes it impossible to match the accounts using the process
described below, as it relies on matching names. These factors may artificially penalize the
recall, precision, and F-scores reported in the results section.
2.2. Matching Process
A central element in our current approach is the assumption that among the Twitter users
tweeting a given publication, there will likely be one or more of the authors of that publication.
Thus, this method is limited to identifying scholars on Twitter who tweeted (at least once) one
of their publications (recorded on Crossref Event Data). This differs from the previous approach
(Costas et al., 2020), which attempted to capture a broader range of relationships between
Twitter users and authors to identify matches. This focus on “self-tweets” is likely to increase
precision in matching authors and Twitter accounts but likely to decrease recall. However, we
still expect to correctly identify a substantial number of authors on Twitter, as self-tweeting has
been seen as an important form of researchers’ engagement on Twitter (Ferreira et al., 2021).
This focus on self-tweets also has the advantage of being less complex and computationally
intensive, thus more easily implementable and replicable by others.
2.2.1. Matching Twitter users with the authors of the tweeted papers
To identify tweeter–author pairs, we take every tweeted paper and attempt to determine whether
one of the authors’ names matches the name of the Twitter user. An important feature of our
approach to matching authors to Twitter users (similar to Costas et al. [2020]) is that researchers
must, to some degree, use a similar form of their name in their Twitter profile name. Our process
does not aim, and would not be able, to match the OpenAlex author id and the Twitter account of
a researcher who uses a substantially different name on Twitter and in their authored works. For
example, we will not match an author named Jane Smith using “squirl1” as their Twitter profile name.
However, although we require some similarity between author names and Twitter names,
they can be recorded differently in Twitter and OpenAlex (and from one OpenAlex author
record to another). Name variations can include the use of initials instead of the full first name,
the inclusion/omission of middle names in full or initial form, and the inclusion of extra
characters representing professional titles (e.g., Dr., Ph.D., M.D.). This requires normalizing the
names from both data sources to maximize the likelihood that valid matches will be identified.
Following the process used by Mongeon, Robinson-Garcia et al. (2017) to match data set
creators to Web of Science authors, we extract the last name(s), first name(s), and initial(s) of both
Twitter users and OpenAlex authors and store them in distinct table columns. In those cases
where a name contains more than two parts, we create an entry for all name combinations,
assuming that all middle parts of the name can be part of the first or the last name. For
instance, two entries would be created for the name John William Smith, one considering
William as the second given name and one considering the token William as the first part
of the last name. For each entry, we add a table column containing the initials (the first letter
of each token that forms the first name), a table column containing the first initial only, and a
table column containing the first token of the first name only.
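The name decomposition described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the script released with the paper: the function name and dictionary keys are hypothetical, names are lowercased, and non-alphabetical characters are simply replaced by spaces (professional titles are not removed).

```python
def name_variants(full_name):
    """Return one record per possible first-name/last-name split of a name."""
    # Lowercase the name and replace anything that is not a letter or a space.
    cleaned = "".join(c if c.isalpha() or c.isspace() else " " for c in full_name.lower())
    tokens = cleaned.split()
    records = []
    # Each split point assigns the leading tokens to the first name and the
    # remaining tokens to the last name; both parts must be non-empty.
    for i in range(1, len(tokens)):
        first, last = tokens[:i], tokens[i:]
        records.append({
            "first_name": " ".join(first),
            "last_name": " ".join(last),
            "initials": "".join(t[0] for t in first),  # first letter of each first-name token
            "first_initial": first[0][0],
            "first_token": first[0],
        })
    return records


variants = name_variants("John William Smith")
```

For John William Smith, the sketch yields two records: one with William as part of the last name and one with William as a second given name, mirroring the two entries described in the text.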
The temporary tables used in the matching process include the unique ID of the individual
(tweeter_id for Twitter and author_id for OpenAlex), the name, the deconstructed name
variations, and the DOI tweeted in the event. Table 1 displays an example set of tweeter records.
We repeat the same process for OpenAlex authors and obtain a table like Table 2.
We use different matching steps with different levels of expected precision and recall, ranging
from exact matches on the full Twitter profile name and author display name (highest expected
precision) to matches between the first initial and the last name (lowest expected precision). We
perform the same set of steps using the profile name and the handle name. However, because the
handle names are a single string without spaces, for those matching attempts, we concatenate the
parts of the author’s name in a single string as well. We also remove all nonalphabetical characters
(e.g., underscores, spaces, numbers, and other special characters) from the handle name prior to
the matching. Table 3 presents a list of attempted matches, including example matches.
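The handle-based steps described above could be sketched as follows. This is a minimal illustration under our own assumptions rather than the released implementation: the function names and step labels are hypothetical, and the comparison is assumed to be applied only to tweeter–author pairs that share a tweeted DOI.

```python
def normalize_handle(handle):
    """Lowercase a handle and drop every non-alphabetical character."""
    return "".join(c for c in handle.lower() if c.isalpha())


def author_keys(first_name, last_name):
    """Build the author-side comparison strings for one first/last name split."""
    first_tokens = first_name.lower().split()  # assumed to be non-empty
    last = last_name.lower()
    initials = "".join(t[0] for t in first_tokens)
    return {
        "full_name": f"{' '.join(first_tokens)} {last}",
        "last_name_initials": f"{initials} {last}",
        "last_name_first_token": f"{first_tokens[0]} {last}",
        "last_name_first_initial": f"{initials[0]} {last}",
    }


def match_handle(handle, first_name, last_name):
    """Return the steps on which the normalized handle equals an author key."""
    target = normalize_handle(handle)
    keys = author_keys(first_name, last_name)
    # Handles contain no spaces, so author keys are concatenated before comparing.
    return [step for step, key in keys.items() if key.replace(" ", "") == target]


steps_a = match_handle("jw_smith42", "John William", "Smith")
steps_b = match_handle("JohnSmith", "John William", "Smith")
```

In this sketch, the handle jw_smith42 is reduced to jwsmith and matches only on last name plus initials, while JohnSmith matches only on last name plus first token.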
Table 1. Example of a Twitter user record and extracted name components and variants

| Tweeter_id | Handle | Profile name | First name | Last name | Initials | First initial | First token |
|---|---|---|---|---|---|---|---|
| 12345678 | jwsmith | John William Smith | John | William Smith | J | J | John |
| 12345678 | jwsmith | John William Smith | John William | Smith | JW | J | John |

Table 2. Example OpenAlex record and extracted name components and variants

| Author_id | Display name | First name | Last name | Initials | First initial | First token |
|---|---|---|---|---|---|---|
| 12345678 | John William Smith | john | william smith | j | j | john |
| 12345678 | John William Smith | john william | smith | jw | j | john |

3. RESULTS
Our results are divided into two parts. First, we report on the performance of our matching
algorithm using the recall, precision, and F-score based on our golden set of authors with
an ORCID listed in OpenAlex and a Twitter handle in their ORCID account. In the second
part, we describe our data set by presenting the distribution of authors with Twitter accounts
across fields and countries.
3.1. Performance of the Matching Algorithm
We use the self-reported tweeter–author matches obtained from ORCID to evaluate the
performance of each step of the matching process. Precision is calculated
by dividing the number of true positives by the total number of matches found for the tweeters
in our golden set:

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \]

The recall is obtained by dividing the number of true positives by the total number of tweeter–
author pairs in the golden set:

\[ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \]

The F-score is a measure of a model’s accuracy on a data set that combines precision and recall
(it is their harmonic mean) and is obtained with the following formula:

\[ F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
All three indicators take a value between 0 and 1, where 1 is the best possible score.
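The three indicators defined above can be computed with the short Python sketch below. The function names are our own; the only values taken from the paper are the combined precision (0.958) and recall (0.623) reported in Table 4, which are used to recompute the corresponding F-score.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from raw counts (both denominators assumed non-zero)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Combined row of Table 4: precision 0.958 and recall 0.623.
combined_f = round(f_score(0.958, 0.623), 3)
```

The guard for the zero case is a design choice of this sketch; it covers steps that yield no matches at all, for which the harmonic mean is otherwise undefined.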
Table 3. Matching steps used to match Twitter profiles and OpenAlex author profiles

| Step | Twitter data field | Examples |
|---|---|---|
| Full name exact match | Profile name | john william smith = john william smith |
| Full name substring* | Profile name | john smith = john smith Jr. |
| Last name + initials | Profile name | jw smith = jw smith |
| Last name + first token | Profile name | john w smith = john smith |
| Last name + first initial | Profile name | jw smith = j smith |
| Full name exact match | Handle | john william smith = johnwilliamsmith |
| Last name + initials | Handle | jw smith = jwsmith |
| Last name + first token | Handle | john w smith = johnsmith |
| Last name + first initial | Handle | jw smith = jsmith |

* Note that the full name substring step is only performed with the Twitter profile name because preliminary
attempts to search for the full name as a substring of the Twitter handle performed extremely poorly (F-score < 0.1).
Table 4. Results of the matching for each criterion

| Matching criteria | Field | Distinct OpenAlex authors | Distinct Twitter accounts | Distinct pairs | Recall | Precision | F-score |
|---|---|---|---|---|---|---|---|
| Last name + first token | Handle | 24,929 | 21,755 | 24,929 | 0.041 | 0.981 | 0.078 |
| Full name exact match | Handle | 19,147 | 16,795 | 19,147 | 0.033 | 0.979 | 0.063 |
| Last name + initials | Handle | 13,577 | 11,693 | 13,579 | 0.021 | 0.977 | 0.041 |
| Last name + first initial | Handle | 8,528 | 7,247 | 8,530 | 0.012 | 0.976 | 0.024 |
| Full name exact match | Profile name | 307,270 | 272,409 | 308,880 | 0.423 | 0.971 | 0.590 |
| Last name + first token | Profile name | 419,805 | 368,832 | 422,341 | 0.553 | 0.971 | 0.705 |
| Full name substring | Profile name | 317,723 | 281,499 | 319,984 | 0.442 | 0.968 | 0.607 |
| Last name + initials | Profile name | 343,469 | 299,646 | 346,529 | 0.458 | 0.967 | 0.621 |
| Last name + first initial | Profile name | 471,763 | 406,389 | 477,383 | 0.593 | 0.961 | 0.734 |
| Combined | Combined | 492,142 | 423,924 | 498,680 | 0.623 | 0.958 | 0.755 |
Table 4 provides, for each of our matching criteria, the number of distinct matches obtained
and the OpenAlex authors and Twitter accounts that form these matches, as well as the
precision, recall, and F-scores obtained by testing each set of results against the golden set of
author–tweeter pairs from ORCID.
Because the same pairs can be obtained at different steps, we report in Table 5 a
different set of results where each step is performed hierarchically from most precise to least
precise as per Table 4, and where each step considers only new pairs that were not identified in the
previous ones. This does not change the overall result of the matching but provides a less
ambiguous indication of the contribution of each step to the recall and precision. For instance, we
can see that the lowest precision rate (0.806) is obtained when matching the last name and first
initial with the tweeter’s profile name. This is consistent with the test results presented in Table 4.
However, the difference in the precision between steps is more pronounced here because the
count of true positives is not inflated by the valid matches from previous, more precise steps.
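The hierarchical bookkeeping described above, in which each step contributes only the pairs not found by earlier steps, can be sketched as follows. The function name and the input layout (an ordered list of step names with their sets of author–tweeter pairs) are assumptions made for this example.

```python
def hierarchical_matches(step_pairs):
    """Count new and cumulative pairs for steps ordered from most to least precise.

    step_pairs: list of (step_name, set of (author_id, tweeter_id)) tuples.
    Returns a list of (step_name, new_pair_count, cumulative_pair_count).
    """
    seen = set()
    report = []
    for step, pairs in step_pairs:
        new_pairs = pairs - seen  # keep only pairs not found by earlier steps
        seen |= new_pairs
        report.append((step, len(new_pairs), len(seen)))
    return report


toy_steps = [
    ("step_1", {("A1", "T1"), ("A2", "T2")}),
    ("step_2", {("A2", "T2"), ("A3", "T3")}),
    ("step_3", {("A1", "T1")}),
]
toy_report = hierarchical_matches(toy_steps)
```

In the toy input, the third step finds no new pair, which corresponds to the situation of a step whose matches were all identified earlier in the hierarchy.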
We can observe from Table 5 that the last three matching steps have a significantly lower
precision rate than the other steps, especially the matches on the Last name + first initial between the
author’s name and the tweeter’s profile name. This result was expected given that matching on
the initials only would mean that an author named John Doe would match with Twitter users named
John Doe and Jane Doe alike. Perhaps more surprising are the still relatively high levels of precision obtained.
This is likely explained by our use of self-tweets only, which would match Jane Doe with John
Doe only if Jane tweeted one of John’s papers. Still, researchers who might use our data set or our
process are advised to exercise caution with these less precise matching steps. While the results
presented here do not include any manual data validation, we performed such a validation for the
matches obtained with the three least precise steps. The data set available on Zenodo
(https://zenodo.org/record/7013518) includes a validation column alongside the tweeter_id and the
author_id for the matches, which will allow users of the data set to use the entire set or to filter
out the matches that our team identified as likely to be false positives. This data set also includes a
column indicating which of the matching steps identified the match, so users of the data set can,
for instance, reconstruct a data set that does not include some of the steps.
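As a hedged illustration of the filtering options described above, the sketch below drops matches from selected steps and matches flagged during validation. The field names matching_step and likely_false_positive are hypothetical placeholders, as are the step labels; the actual column names and values should be taken from the Zenodo data set.

```python
def filter_matches(rows, excluded_steps=(), drop_flagged=True):
    """Keep the author-tweeter rows that pass the chosen filters.

    rows: list of dicts with hypothetical keys 'author_id', 'tweeter_id',
    'matching_step', and 'likely_false_positive'.
    """
    kept = []
    for row in rows:
        if row["matching_step"] in excluded_steps:
            continue  # step excluded by the user
        if drop_flagged and row["likely_false_positive"]:
            continue  # flagged during manual validation
        kept.append(row)
    return kept


sample_rows = [
    {"author_id": "A1", "tweeter_id": "T1", "matching_step": "full_name", "likely_false_positive": False},
    {"author_id": "A2", "tweeter_id": "T2", "matching_step": "first_initial", "likely_false_positive": True},
    {"author_id": "A3", "tweeter_id": "T3", "matching_step": "first_initial", "likely_false_positive": False},
]
strict = filter_matches(sample_rows, excluded_steps=("first_initial",))
validated = filter_matches(sample_rows)
```

The first call removes a whole step regardless of validation, while the second keeps every step but removes only the flagged rows.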
Table 5. New matches identified at each step of the hierarchical matching process

| Matching criteria | Field | Distinct OpenAlex authors | Distinct Twitter accounts | Distinct pairs | Recall | Precision | F-score |
|---|---|---|---|---|---|---|---|
| Last name + first token | Handle | 24,929 | 21,755 | 24,929 | 0.041 | 0.981 | 0.078 |
| Full name exact match* | Handle | 0 | 0 | 0 | 0 | 0 | 0 |
| Last name + initials | Handle | 13,325 | 11,513 | 13,327 | 0.020 | 0.976 | 0.040 |
| Last name + first initial | Handle | 1,976 | 1,758 | 1,976 | 0.003 | 0.966 | 0.006 |
| Full name exact match | Profile name | 283,862 | 251,936 | 285,373 | 0.380 | 0.971 | 0.546 |
| Last name + first token | Profile name | 96,587 | 87,508 | 97,088 | 0.101 | 0.964 | 0.182 |
| Full name substring | Profile name | 16,401 | 14,620 | 16,756 | 0.026 | 0.901 | 0.050 |
| Last name + initials | Profile name | 35,467 | 32,104 | 35,772 | 0.031 | 0.903 | 0.059 |
| Last name + first initial | Profile name | 22,989 | 20,815 | 23,459 | 0.017 | 0.806 | 0.033 |
| Combined | Combined | 492,142 | 423,924 | 498,680 | 0.623 | 0.958 | 0.755 |

* All matches obtained during the full name matching with the tweeter handle were also found in the previous step, meaning that this step could, in principle, be
skipped. However, we keep the step in our process for consistency and to account for the possibility that this step might yield new matches in future
implementations of the process.
Table 6 presents yet another variation of the hierarchical matching process reported in
Table 5, with rows containing the cumulative matches so we can more clearly see how each
step adds to the overall results.
Table 6. Cumulative number of matches at each step of the hierarchical matching process

| Matching criteria | Field | Distinct OpenAlex authors | Distinct Twitter accounts | Distinct pairs | Recall | Precision | F-score |
|---|---|---|---|---|---|---|---|
| Last name + first token | Handle | 24,929 | 21,755 | 24,929 | 0.041 | 0.981 | 0.078 |
| Full name exact match | Handle | 24,929 | 21,755 | 24,929 | 0.041 | 0.981 | 0.078 |
| Last name + initials | Handle | 38,248 | 33,201 | 38,256 | 0.061 | 0.979 | 0.115 |
| Last name + first initial | Handle | 40,223 | 34,841 | 40,232 | 0.064 | 0.979 | 0.120 |
| Full name exact match | Profile name | 323,935 | 286,528 | 325,605 | 0.445 | 0.972 | 0.611 |
| Last name + first token | Profile name | 420,167 | 369,257 | 422,693 | 0.548 | 0.970 | 0.700 |
| Last name + initials | Profile name | 436,170 | 382,820 | 439,449 | 0.574 | 0.967 | 0.720 |
| Full name substring | Profile name | 470,307 | 407,658 | 475,221 | 0.605 | 0.963 | 0.743 |
| Last name + first initial | Profile name | 492,142 | 423,924 | 498,680 | 0.623 | 0.958 | 0.755 |
| Combined | Combined | 492,142 | 423,924 | 498,680 | 0.623 | 0.958 | 0.755 |

3.2. Overview of the Data Set
In this section, we provide an overview of the composition of the data set by looking at the
discipline and countries of the researchers for which we were able to assign a Twitter account.
The main aim of these analyses is merely to provide a descriptive overview of the distribution
of matches across publication disciplines and author countries. These overviews are meant to
support future research and researchers interested in the data set, and to make users of the data
aware of the general disciplinary and country representation of the data set.
There is no discipline classification in OpenAlex, but works are linked to Wikidata concepts
(Priem et al., 2022), with a score ranging from 0 to 1 representing the strength of the association
between the work and the concept. The concepts are hierarchical (levels 0 to 5), with level 0
essentially representing large disciplines (e.g., environmental science, economics, engineering,
chemistry, medicine). More information about the concepts and their matching can be found on the
OpenAlex website (https://docs.openalex.org/about-the-data/concept) and in this white paper
(https://docs.google.com/document/d/1OgXSLriHO3Ekz0OYoaoP_h0sPcuvV4EqX7VgLLblKe4/).
For each unique author matched with a Twitter account, we retrieve their works and the level 0
concepts associated with these works, as well as the score. While these concepts are unlikely to provide
a highly accurate classification of works, they are still helpful, we believe, in getting a sense of the
breadth of disciplines represented in our data set. Table 7 shows the number and percentage of
authors assigned to each discipline, where each author is assigned to the discipline with the highest
score across all their publications. Because authors are associated with each discipline to some degree
(represented by the average score of concepts in their publications), we also display the average score for
each discipline as an alternate representation of the distribution of disciplines in the data set.

Table 7. Number of authors and the average score by discipline (concepts from OpenAlex)

| Discipline | Number of authors | Percentage of authors | Average score |
|---|---|---|---|
| Medicine | 138,968 | 28.3 | 0.517 |
| Biology | 75,246 | 15.3 | 0.388 |
| Psychology | 46,793 | 9.5 | 0.265 |
| Computer science | 39,675 | 8.1 | 0.201 |
| Political science | 36,531 | 7.4 | 0.229 |
| Chemistry | 30,876 | 6.3 | 0.242 |
| Materials science | 19,347 | 3.9 | 0.233 |
| Environmental science | 17,861 | 3.6 | 0.208 |
| Business | 17,485 | 3.6 | 0.158 |
| Sociology | 17,248 | 3.5 | 0.185 |
| Geography | 13,513 | 2.8 | 0.155 |
| Economics | 8,075 | 1.6 | 0.162 |
| Geology | 7,204 | 1.5 | 0.217 |
| Physics | 7,138 | 1.5 | 0.152 |
| Art | 6,963 | 1.4 | 0.133 |
| History | 3,913 | 0.8 | 0.134 |
| Philosophy | 2,415 | 0.5 | 0.097 |
| Mathematics | 1,637 | 0.3 | 0.068 |
| Engineering | 326 | 0.1 | 0.042 |
| Total | 491,214 | 100 | |

For the countries, we use the last_known_affiliation field of the OpenAlex authors table and
present the relative frequency of countries in Table 8.

Table 8. Distribution of authors’ last known affiliation recorded in OpenAlex

| Country | Number of authors | Percentage of authors |
|---|---|---|
| United States | 142,059 | 32.1 |
| Great Britain | 72,430 | 16.4 |
| Australia | 24,457 | 5.5 |
| Canada | 22,516 | 5.1 |
| Spain | 19,308 | 4.4 |
| Germany | 18,338 | 4.1 |
| France | 10,794 | 2.4 |
| Netherlands | 10,605 | 2.4 |
| India | 9,967 | 2.3 |
| Italy | 9,016 | 2.0 |
| Brazil | 7,330 | 1.7 |
| Switzerland | 6,251 | 1.4 |
| Sweden | 5,684 | 1.3 |
| Ireland | 5,382 | 1.2 |
| Belgium | 5,375 | 1.2 |
| China | 5,250 | 1.2 |
| Finland | 4,834 | 1.1 |
| Denmark | 4,543 | 1.0 |
| Japan | 4,361 | 1.0 |
| Other countries | 54,093 | 12.2 |
| Total | 442,593 | 100 |
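The discipline assignment reported in Table 7 can be illustrated with the following sketch. It reflects one possible reading of the procedure, stated here as an assumption: each author's level 0 concept scores are averaged over all of their works (a concept missing from a work counts as 0), and the author is assigned to the concept with the highest average. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def primary_discipline(works):
    """Assign an author to the level 0 concept with the highest average score.

    works: list of dicts, one per work, mapping a level 0 concept name to its
    score between 0 and 1 (assumed layout).
    """
    totals = defaultdict(float)
    for concept_scores in works:
        for concept, score in concept_scores.items():
            totals[concept] += score
    if not totals:
        return None, {}
    # Average over all works, so a concept absent from a work contributes 0.
    averages = {concept: total / len(works) for concept, total in totals.items()}
    best = max(averages, key=averages.get)
    return best, averages


author_works = [
    {"Medicine": 0.6, "Biology": 0.3},
    {"Medicine": 0.4},
]
discipline, average_scores = primary_discipline(author_works)
```

Other aggregation choices (for example, averaging only over the works in which a concept appears) would be equally easy to implement and may change the assignment for authors with few publications.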
4. DISCUSSION AND CONCLUSION
The work presented in this paper can be framed as a step forward in developing more advanced
studies of the interactions between science and society, particularly by enabling the study of the role of
scientists in disseminating scientific results on Twitter. In addition, using open data sources (OpenAlex
and Crossref Event Data) allows researchers to continue to improve and adapt this method for further
possibilities without worrying about contractual limitations or data unavailability.
Overall, the results of our matching process show a high level of precision and a moderate level
of recall, which was expected given our consideration of only self-tweets in the matching process.
This focus on self-tweets naturally makes for a more precise matching strategy but at the expense of
recall, as the method excludes from the matching all those scholars who never tweeted any of their
publications (or none of whose publications were included in the Crossref Event Data database).
The resulting matched database is more extensive than the one reported by Costas et al.
(2020), which is most likely due to the broader coverage of OpenAlex compared to the
Web of Science and improvements to the matching process. Other factors could include
increasing Twitter use by researchers over time and/or increasing paper-sharing practices on
Twitter or that more scholars are using their full names in their Twitter profiles.
It is important to emphasize the limitations of the matching approach (and resulting data set)
presented in this paper. First, the approach is limited to tweets recorded in Crossref Event Data,
which does not include tweets before 2017, and to publications and researchers recorded in
OpenAlex. Furthermore, our matching algorithm requires that names be written in the Latin
alphabet, which may exclude some Twitter users or authors who are not using the Latin alphabet.
On top of these limitations in the coverage of our data sources, our reliance on self-tweets to increase
the precision of our matches comes at the expense of some degree of recall. While we find that this
strategy does provide high precision levels, our examination of the data indicates that very
common names can still generate some false positives. This provides some support for our choice not
to cast a broader net by removing the self-tweet criterion, which would have likely flooded our
data set with false positives for low expected gains in true positives and would also have
dramatically increased the computational resources needed to perform the matching. The reliance on
self-tweets may also create a gender bias in our data set because a study by Peng, Teplitskiy et al. (2022)
found men to be more likely to self-promote on Twitter than women. Finally, we also know from past
research that there is a lower uptake of Twitter use in some regions or countries (Zahedi, 2017).
It is also worth mentioning here that our data set is limited to the matches that were created
through the process, which uses only the names of the authors and the tweeters. There exist
other sources of researcher–tweeter pairs that could be used to complement the data set. For
instance, one could consider adding the data set of ORCID accounts with associated Twitter
handles that we used as a golden set to validate our approach. OpenAlex also includes
Twitter handles for a few hundred authors, although it is not clear what data source these
pairs come from.
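As a rough illustration of how such complementary sources could be combined with the name-based matches, the Python sketch below merges researcher–tweeter pairs from several sources while recording their provenance. It is only a sketch under stated assumptions: the function names, source labels, and IDs are hypothetical and do not reflect the actual structure of the ORCID or OpenAlex data.

```python
def combine_pairs(sources):
    """Merge (author_id, tweeter_id) pairs from several sources.

    `sources` maps a source label to an iterable of pairs. The result maps
    each pair to the set of source labels that support it.
    """
    combined = {}
    for source_name, pairs in sources.items():
        for pair in pairs:
            combined.setdefault(pair, set()).add(source_name)
    return combined


def conflicting_authors(combined):
    """Return authors linked to more than one tweeter across sources."""
    tweeters_by_author = {}
    for author_id, tweeter_id in combined:
        tweeters_by_author.setdefault(author_id, set()).add(tweeter_id)
    return {a for a, tweeters in tweeters_by_author.items() if len(tweeters) > 1}


# Hypothetical source labels and IDs, for illustration only.
sources = {
    "name_matching": {("A1", "t1"), ("A2", "t2"), ("A4", "t4")},
    "orcid": {("A2", "t2"), ("A3", "t3"), ("A4", "t5")},
}
combined = combine_pairs(sources)
conflicts = conflicting_authors(combined)
print(len(combined), sorted(conflicts))
```

Keeping the provenance of each pair makes it possible to treat pairs confirmed by several sources differently from pairs found by only one, and to flag authors for whom the sources disagree.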
The characterization of the social media users and audiences that are engaging with scientific
publications is an important element in the development of more advanced studies on the
interactions between science and society. Thus, further and better curation of data around the
interactions between social media users and academic objects is a fundamental step that needs to be
considered in future altmetric research. Further developments of this work include adding
matching criteria and developing approaches (e.g., based on citation or coauthor networks) to form
tweeter–author pairs without relying only on self-tweets. Future developments could also tackle the
challenge of matching scholars with Twitter users who did not tweet any publication. Nevertheless,
we believe that the relatively simple matching process outlined in this paper and the open data set
of nearly half a million pairs of OpenAlex author IDs and tweeter IDs generated and made available
to the community are valuable contributions to the field of quantitative science studies, and more
specifically to the study of the activities of scholars on Twitter and the interaction between
social media and science.
ACKNOWLEDGMENTS
The authors would like to thank Poppy Riddle, Kydra Mayhew, and Maddie Hare, who helped
with data cleaning.
AUTHOR CONTRIBUTIONS
Philippe Mongeon: Conceptualization, Data curation, Formal analysis, Methodology,
Visualization, Writing–original draft, Writing–review and editing. Timothy D. Bowman:
Conceptualization, Data curation, Formal analysis, Methodology, Writing–original draft,
Writing–review and editing. Rodrigo Costas: Conceptualization, Data curation, Methodology,
Writing–original draft, Writing–review and editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
Rodrigo Costas is partially funded by the South African DSI-NRF Centre of Excellence in
Scientometrics and Science, Technology and Innovation Policy (SciSTIP).
DATA AVAILABILITY
The open data set of scholars on Twitter (Mongeon, Bowman, & Costas, 2022) produced with the
process reported in this paper is available at https://doi.org/10.5281/zenodo.7013518. The data
set includes a column indicating which criteria were successful in identifying the match, as well as
a column indicating whether the match was considered valid upon manual inspection.
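For readers who want to work only with manually validated matches, the following Python sketch shows one way to filter such a file. It is a minimal sketch under stated assumptions: the column names (`author_id`, `tweeter_id`, `criteria`, `valid`), the encoding of the validity flag, and the inline sample rows are all hypothetical, so the actual headers and values of the Zenodo file should be checked before adapting it.

```python
import csv
import io

# Hypothetical sample mimicking the described layout: one column for the
# matching criteria and one for the manual validation outcome. The real
# column names and value encodings may differ.
SAMPLE = """author_id,tweeter_id,criteria,valid
A1,t1,full_name,1
A2,t2,initials,0
A3,t3,full_name,1
"""


def load_valid_pairs(handle, valid_column="valid", valid_value="1"):
    """Return the rows whose validation column equals `valid_value`."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row[valid_column] == valid_value]


rows = load_valid_pairs(io.StringIO(SAMPLE))
for row in rows:
    print(row["author_id"], row["tweeter_id"], row["criteria"])
```

To use the sketch with a downloaded file, the `io.StringIO(SAMPLE)` argument can be replaced with a file handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the local copy of the data set.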
REFERENCES
Blackburn, R., Cabral, T., Cardoso, A., Cheng, E., Costa, P., … White, P. (2021). ORCID Public Data File 2021 (p. 100806462894 Bytes) [Data set]. ORCID. https://doi.org/10.23640/07243.16750535.V1
Brainard, J. (2022). Riding the Twitter wave. Science, 375(6587), 1344–1347. https://doi.org/10.1126/science.abq1541, PubMed: 35324287
Costas, R., Mongeon, P., Ferreira, M. R., van Honk, J., & Franssen, T. (2020). Large-scale identification and characterization of scholars on Twitter. Quantitative Science Studies, 1(2), 771–791. https://doi.org/10.1162/qss_a_00047
Costas, R., Rijcke, S., & Marres, N. (2021). “Heterogeneous couplings”: Operationalizing network perspectives to study science-society interactions through social media metrics. Journal of the Association for Information Science and Technology, 72(5), 595–610. https://doi.org/10.1002/asi.24427
Costas, R., Zahedi, Z., & Wouters, P. (2014). Do “altmetrics” correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective. Journal of the Association for Information Science and Technology, 66(10), 2003–2019. https://doi.org/10.1002/asi.23309
Ferreira, M. R., Mongeon, P., & Costas, R. (2021). Large-scale comparison of authorship, citations, and tweets of Web of Science authors. Journal of Altmetrics, 4(1), Article 1. https://doi.org/10.29024/joa.38
Haustein, S. (2016). Grand challenges in altmetrics: Heterogeneity, data quality and dependencies. Scientometrics, 108(1), 413–423. https://doi.org/10.1007/s11192-016-1910-9
Mongeon, P., Bowman, T., & Costas, R. (2022). Open dataset of scholars on Twitter [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7013518
Mongeon, P., Robinson-Garcia, N., Jeng, W., & Costas, R. (2017). Incorporating data sharing to the reward system of science: Linking DataCite records to authors in the Web of Science. Aslib Journal of Information Management, 69(5), 545–556. https://doi.org/10.1108/AJIM-01-2017-0024
Ortega, J. L. (2018). Reliability and accuracy of altmetric providers: A comparison among Altmetric.com, PlumX and Crossref Event Data. Scientometrics, 116, 2123–2138. https://doi.org/10.1007/s11192-018-2838-z
Peng, H., Teplitskiy, M., Romero, D., & Horvát, E.-Á. (2022). The gender gap in scholarly self-promotion on social media [Preprint]. In review. https://doi.org/10.21203/rs.3.rs-1765948/v1
Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv, arXiv:2205.01833. https://doi.org/10.48550/arXiv.2205.01833
Sugimoto, C. R., Work, S., Larivière, V., & Haustein, S. (2017). Scholarly use of social media and altmetrics: A review of the literature. Journal of the Association for Information Science and Technology, 68(9), 2037–2062. https://doi.org/10.1002/asi.23833
Thelwall, M., Haustein, S., Larivière, V., & Sugimoto, C. R. (2013). Do altmetrics work? Twitter and ten other social web services. PLOS ONE, 8(5), e64841. https://doi.org/10.1371/journal.pone.0064841, PubMed: 23724101
Tsou, A., Bowman, T. D., Ghazinejad, A., & Sugimoto, C. (2015). Who tweets about science. In Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference (pp. 95–100).
Williams, K. (2022). What counts: Making sense of metrics of research value. Science and Public Policy, 49(3), 518–531. https://doi.org/10.1093/scipol/scac004
Zahedi, Z. (2017). What explains the imbalance use of social media across different countries? A cross country analysis of presence of Twitter users tweeting scholarly publications (Version 2). figshare. https://doi.org/10.6084/m9.figshare.5454475.v2