RESEARCH ARTICLE

An open data set of scholars on Twitter

Philippe Mongeon1,2, Timothy D. Bowman3, and Rodrigo Costas4,5

1School of Information Management, Dalhousie University, Halifax, Nova Scotia, Canada
2Centre interuniversitaire de recherche sur la science et la technologie (CIRST), Université du Québec à Montréal,
Montréal, Québec, Canada
3School of Information Sciences, Wayne State University, Detroit, MI, USA
4Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, The Netherlands
5DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy (SciSTIP),
Stellenbosch University, Stellenbosch, South Africa

Keywords: altmetrics, bibliometrics, open data, social media metrics, Twitter

ABSTRACT

The role played by research scholars in the dissemination of scientific knowledge on social
media has always been a central topic in social media metrics (altmetrics) research. Different
approaches have been implemented to identify and characterize active scholars on social
media platforms like Twitter. Some limitations of past approaches were their complexity and,
most importantly, their reliance on licensed scientometric and altmetric data. The emergence
of new open data sources such as OpenAlex or Crossref Event Data provides opportunities to
identify scholars on social media using only open data. This paper presents a novel and simple
approach to match authors from OpenAlex with Twitter users identified in Crossref Event Data.
The matching procedure is described and validated with ORCID data. The new approach
matches nearly 500,000 scholars with their Twitter accounts with a high level of
precision and moderate recall. The data set of matched scholars is described and made openly
available to the scientific community to empower more advanced studies of the interactions of
research scholars on Twitter.

1. INTRODUCTION

Engagement with academic research on social media has been a central research topic in sci-
entometrics, particularly in altmetrics and social media metrics. In the early days of social
media metrics research, the focus was primarily on investigating the relationship between
the number of mentions of research publications on social media platforms (particularly Twit-
ter) and citations, with most of the studies finding weak relationships between social media
metrics and citations (Costas, Zahedi, & Wouters, 2014; Sugimoto, Work et al., 2017;
Thelwall, Haustein et al., 2013). However, recent theoretical proposals have initiated a shift
in the focus of altmetric research from analyzing mentions and correlations to more interactive
perspectives. Thus, Haustein (2016) proposed that social media metrics need not be restricted
to the mentions of scholarly outputs on social media but could also include the mentions and
activities of individual scholars. More recently, Costas, Rijcke, and Marres (2021) proposed the
notion of “heterogeneous couplings” as a common framework to study the interactions
between academic and nonacademic actors as captured via online and social media platforms
(see also Williams (2022)), in which the interactions of individual scholars on Twitter are
another fundamental form of online interaction relating to how science is being communi-
cated to society (Brainard, 2022).

An open access journal

Citation: Mongeon, P., Bowman, T. D.,
& Costas, R. (2023). An open data set of
scholars on Twitter. Quantitative
Science Studies, 4(2), 314–324.
https://doi.org/10.1162/qss_a_00250

DOI:
https://doi.org/10.1162/qss_a_00250

Peer Review:
https://www.webofscience.com/api/gateway/wos/peer-review/10.1162/qss_a_00250

Received: 23 August 2022
Accepted: 3 February 2023

Corresponding Author:
Philippe Mongeon
pmongeon@dal.ca

Handling Editor:
Vincent Larivière

Copyright: © 2023 Philippe Mongeon,
Timothy D. Bowman, and Rodrigo
Costas. Published under a Creative
Commons Attribution 4.0 International
(CC BY 4.0) license.

The MIT Press

In the quest to study scholars’ activities on Twitter, one long-lasting challenge is the identification
of social media accounts belonging to researchers. In 2020 we published a paper that
introduced a method to match Web of Science authors with their Twitter accounts (Costas,
Mongeon et al., 2020) and reported on the distribution of scholars on Twitter across countries,
disciplines, academic age, and gender. One of the main features of that data set was that it
allowed us, for the first time, to investigate the relationship between the research profiles and
activities of scholars and their profiles and activities on Twitter on a large scale (Ferreira,
Mongeon, & Costas, 2021). Past data sets did not allow for this because they were either
too small or because they provided information on whether or not a Twitter account likely
belonged to a researcher without identifying the specific researcher to whom the account
belonged.

Limitations of the data set we produced with this initial work included that it used propri-
etary data from the Web of Science and Altmetric.com, which made it impossible to share the
author–tweeter pairs openly, and it was complicated for others to use the data set without
access to these databases. The large number of steps involved in the previously reported
process also possibly contributed to its lack of implementation by other researchers and its
lack of transferability to other data sets.

New developments in Open Science scientometric and altmetric databases (namely the
OpenAlex and Crossref Event Data databases) have changed this landscape, now allowing
for the creation of matches of academic authors and their research publications with their
Twitter profiles. This data paper aims to introduce a data set of scholars’ Twitter accounts iden-
tified with a naïve algorithm based entirely on available open data, presenting the process in
detail with accompanying R and Python scripts so that the process can be easily replicated
and/or improved upon by the research community. We hope this data set will support further
research on the interactions of scientific authors on social media and serve as a base for devel-
oping alternative and/or complementary approaches to match Twitter users and authors.

The paper is structured as follows. First, we provide an overview of the different data
sources we used and the detailed process we used to match the Twitter accounts with
OpenAlex authors. We then report precision and recall estimates for our matching approach,
followed by an overview of the characteristics of the scholars found on Twitter.

2. DATA AND METHODS

2.1. Data Sources

2.1.1. OpenAlex

The research publications data source for this study is the OpenAlex (Priem, Piwowar, & Orr,
2022) data dump from May 20, 2022, which was downloaded and parsed into a relational
database model hosted at the Maritime Institute for Science, Technology, and Society (MISTS)
in Canada. In the OpenAlex database, authors are represented by a unique identifier
(author_id) associated with their works (see the works_authorships table of the OpenAlex
schema). The OpenAlex database we used contains 220,870,820 author_ids. Our ultimate
objective is to assign a twitter_id to these author_id values. OpenAlex also includes a link
between the author_id and ORCID. It is worth noting that a single individual can have multiple
author_ids in OpenAlex, so that the same ORCID can be associated with multiple author_ids.
It is not clear why authors with the same ORCID are not merged in the OpenAlex database,
but it is likely due to the clustering approach used to construct the author entities. While
we could have chosen to perform this merge ourselves as part of our process, we chose to use
all our data sources as they are and we did not merge any of the OpenAlex authors. It is always
possible for the users of our data set to group OpenAlex authors together based on an author
disambiguation process of their choice, which may include combining authors with the same
ORCID.
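
A minimal sketch of such a grouping is given below. It assumes a hypothetical list of (author_id, ORCID) pairs extracted from the OpenAlex dump; the function name and the identifiers are illustrative only and are not taken from the accompanying scripts.

```python
from collections import defaultdict


def group_author_ids_by_orcid(author_orcid_pairs):
    """Map each ORCID to the sorted list of author_ids that carry it.

    author_orcid_pairs: iterable of (author_id, orcid) tuples; orcid may be
    None for authors without an ORCID, in which case the author is skipped.
    """
    groups = defaultdict(set)
    for author_id, orcid in author_orcid_pairs:
        if orcid:  # skip authors with no ORCID on record
            groups[orcid].add(author_id)
    return {orcid: sorted(ids) for orcid, ids in groups.items()}


# Hypothetical identifiers, for illustration only.
pairs = [
    ("A1", "0000-0000-0000-0001"),
    ("A2", "0000-0000-0000-0001"),  # same person, second author_id
    ("A3", None),                   # no ORCID: left ungrouped
    ("A4", "0000-0000-0000-0002"),
]
merged = group_author_ids_by_orcid(pairs)
```

Authors without an ORCID are left out of the grouping here; a fuller disambiguation process would need additional evidence (e.g., names or affiliations) to handle them.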

2.1.2. Crossref Event Data

We use a data dump of Crossref Event Data from January 2022 available at the Centre for Science
and Technology Studies (CWTS), containing over 60 million Twitter events from 5,288,867
unique Twitter accounts, which contain the tweet identifier and the DOI of the papers mentioned
in that tweet. The dump includes 4.7 million unique DOIs tweeted at least once and recorded in
Crossref Event Data (CED). We use the Twitter API to rehydrate the profile information of the
Twitter users recorded in the CED dump. An important difference between CED and Altmetric
is that CED focuses on identifying tweets to DOIs, while Altmetric also identifies Twitter mentions
to preprints (e.g., from arXiv) and other publication identifiers (e.g., PMIDs). Therefore, CED will
typically identify fewer tweets to publications than Altmetric (Ortega, 2018).

Our use of an altmetric database such as Crossref Event Data to identify researchers on Twitter
stems from the expectation, as in Costas et al. (2020), that researchers are more likely to tweet
research publications than nonresearchers (Tsou, Bowman et al., 2015) and therefore to be
recorded in this database. By considering only Twitter users that have mentioned scholarly work
in their tweets as recorded in Crossref Event Data, we presumably increase precision at the
expense of excluding all scholarly Twitter users that have never tweeted any research publications.

2.1.3. ORCID

The OpenAlex database includes the ORCID ids for approximately 2% of all the authors indexed
in the database. Because some researchers include their Twitter handle in their public ORCID
profiles, we leverage the information recorded in the ORCID Public Data File 2021 (Blackburn,
Cabral et al., 2021) to retrieve the Twitter account for those researchers who self-reported a
Twitter profile in their ORCID profile. We used the ORCID data dump (2021) hosted by the Centre
for Science and Technology Studies (CWTS) to obtain a set of 13,208 matching OpenAlex author
ids and ORCID profiles. This data set has been used as a golden set to evaluate the performance of
our matching process. It should be noted that Twitter accounts listed in ORCID profiles are not
necessarily valid. Even when these accounts are valid, it is important to note that the Twitter han-
dle and user name may not include an actual name, which makes it impossible to match the
accounts using the process described below as it relies on matching names. These factors may
artificially penalize the recall, precision, and F-scores reported in the results section.

2.2. Matching Process

A central element in our current approach is the assumption that among the Twitter users
tweeting a given publication, there will likely be one or more of the authors of that publication.
Thus, this method is limited to identifying scholars on Twitter who tweeted (at least once) one
of their publications (recorded on Crossref Event Data). This differs from the previous approach
(Costas et al., 2020), which attempted to capture a broader range of relationships between
Twitter users and authors to identify matches. This focus on “self-tweets” is likely to increase
precision in matching authors and Twitter accounts but likely to decrease recall. However, we
still expect to correctly identify a substantial number of authors on Twitter, as self-tweeting has
been seen as an important form of researchers’ engagement on Twitter (Ferreira et al., 2021).
This focus on self-tweets also has the advantage of being less complex and computationally
intensive, thus more easily implementable and replicable by others.
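
The self-tweet assumption amounts to restricting candidate pairs to tweeters and authors that share a DOI. The sketch below illustrates this under stated assumptions: the inputs are hypothetical lists of (tweeter_id, DOI) and (author_id, DOI) tuples, and the function name is illustrative rather than part of the accompanying scripts. Name matching would then be applied only to these candidates.

```python
from collections import defaultdict


def candidate_pairs(tweet_events, authorships):
    """Pair every tweeter of a DOI with every author of that DOI.

    tweet_events: iterable of (tweeter_id, doi) tuples.
    authorships:  iterable of (author_id, doi) tuples.
    Returns a set of (tweeter_id, author_id, doi) candidates.
    """
    authors_by_doi = defaultdict(set)
    for author_id, doi in authorships:
        # DOIs are case-insensitive, so compare them in lowercase.
        authors_by_doi[doi.lower()].add(author_id)

    candidates = set()
    for tweeter_id, doi in tweet_events:
        for author_id in authors_by_doi.get(doi.lower(), ()):
            candidates.add((tweeter_id, author_id, doi.lower()))
    return candidates


# Hypothetical records, for illustration only.
tweets = [("t1", "10.1000/ABC"), ("t2", "10.1000/abc"), ("t3", "10.1000/xyz")]
authorships = [("a1", "10.1000/abc"), ("a2", "10.1000/abc"), ("a9", "10.1000/other")]
cands = candidate_pairs(tweets, authorships)
```

Because each tweeter is compared only with the authors of the papers they tweeted, the number of name comparisons stays far below that of an all-against-all comparison.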

2.2.1. Matching Twitter users with the authors of the tweeted papers

To identify tweeter–author pairs, we take every tweeted paper and attempt to determine whether
one of the authors’ names matches the name of the Twitter user. An important feature of our
approach to matching authors to Twitter users (similar to Costas et al. [2020]) is that researchers
must, to some degree, use a similar form of their name in their Twitter profile name. Our process
does not aim, and would not be able, to match the OpenAlex author id and the Twitter account of
a researcher who uses a substantially different name in their Twitter profile and their authored works. For
example, we will not match an author named Jane Smith using “squirl1” as their Twitter profile name.

However, although we require some similarity between author names and Twitter names,
they can be recorded differently in Twitter and OpenAlex (and from one OpenAlex author
record to another). Name variations can include the use of initials instead of the full first name,
the inclusion or omission of middle names in full or initial form, and the inclusion of extra
characters representing professional titles (e.g., Dr., Ph.D., M.D.). This requires normalizing the
names from both data sources to maximize the likelihood that valid matches will be identified.
Following the process used by Mongeon, Robinson-Garcia et al. (2017) to match data set
creators to Web of Science authors, we extract the last name(s), first name(s), and initial(s) of both
Twitter users and OpenAlex authors and store them in distinct table columns. In those cases
where a name contains more than two parts, we create an entry for each name combination,
assuming that any middle part of the name can belong to the first or the last name. For
instance, two entries would be created for the name John William Smith, one considering
William as the second given name and one considering William as the first part
of the last name. For each entry, we add a table column containing the initials (the first letter
of each token that forms the first name), a table column containing the first initial only, and a
table column containing the first token of the first name only.
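
The decomposition described above can be sketched as follows. This is an illustration under stated assumptions, not the accompanying script: the function name, the dictionary keys, and the short list of title tokens are hypothetical, and hyphens are treated as token separators for simplicity.

```python
import re

# Hypothetical, non-exhaustive list of title tokens to discard.
TITLES = {"dr", "phd", "md", "prof"}


def name_variants(name):
    """Return one entry per first-name/last-name split of `name`."""
    # Lowercase, drop periods, and turn punctuation, digits, and
    # underscores into spaces (accented letters are kept).
    cleaned = re.sub(r"[^\w\s]|[\d_]", " ", name.lower().replace(".", ""))
    tokens = [t for t in cleaned.split() if t not in TITLES]
    entries = []
    # Every split point: the first i tokens form the first name,
    # the remaining tokens form the last name.
    for i in range(1, len(tokens)):
        first, last = tokens[:i], tokens[i:]
        entries.append({
            "first_name": " ".join(first),
            "last_name": " ".join(last),
            "initials": "".join(t[0] for t in first),
            "first_initial": first[0][0],
            "first_token": first[0],
        })
    return entries


variants = name_variants("Dr. John William Smith")
```

For the example name, the first entry treats William as part of the last name and the second treats it as a second given name, mirroring the two entries described in the text. A single-token name yields no entry because it cannot be split into a first and a last name.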

The temporary tables used in the matching process include the unique ID of the individual
(tweeter_id for Twitter and author_id for OpenAlex), the name, the deconstructed name
variations, and the DOI tweeted in the event. Table 1 displays an example set of tweeter records.

We repeat the same process for OpenAlex authors and obtain a table like Table 2.

We use different matching steps with different levels of expected precision and recall, ranging
from exact matches on the full Twitter profile name and author display name (highest expected
precision) to matches between the first initial and the last name (lowest expected precision). We
perform the same set of steps using the profile name and the handle name. However, because the
handle names are a single string without spaces, for those matching attempts, we concatenate the
parts of the author’s name in a single string as well. We also remove all nonalphabetical characters
(e.g., underscores, spaces, numbers, and other special characters) from the handle name prior to
the matching. Table 3 presents a list of attempted matches, including example matches.
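
The handle-based steps can be illustrated with the sketch below. It assumes an author entry already decomposed into lowercase name components (the dictionary keys are hypothetical), and the order in which the steps are tried is an assumption of this sketch rather than a statement about the accompanying scripts.

```python
import re


def normalize_handle(handle):
    """Lowercase a Twitter handle and keep alphabetical characters only."""
    return re.sub(r"[^a-z]", "", handle.lower())


def handle_match_step(author, handle):
    """Return the name of the first handle-based step that matches, or None.

    author: dict with 'first_name', 'last_name', 'initials',
            'first_initial', and 'first_token' (lowercase strings).
    """
    h = normalize_handle(handle)
    # Handles contain no spaces, so the name parts are concatenated.
    last = author["last_name"].replace(" ", "")
    keys = [
        ("Full name exact match", author["first_name"].replace(" ", "") + last),
        ("Last name + initials", author["initials"] + last),
        ("Last name + first token", author["first_token"] + last),
        ("Last name + first initial", author["first_initial"] + last),
    ]
    for step, key in keys:
        if h == key:
            return step
    return None


# Hypothetical author entry, for illustration only.
author = {
    "first_name": "john william",
    "last_name": "smith",
    "initials": "jw",
    "first_initial": "j",
    "first_token": "john",
}
step = handle_match_step(author, "JW_Smith84")
```

Here the handle "JW_Smith84" is reduced to "jwsmith" and matches on last name and initials, while a handle such as "squirl1" matches none of the steps.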

Table 1. Example of a Twitter user record and extracted name components and variants

Tweeter_id  Handle   Profile name        First name    Last name      Initials  First initial  First token
12345678    jwsmith  John William Smith  John          William Smith  J.        J.             John
12345678    jwsmith  John William Smith  John William  Smith                                   John

Table 2. Example OpenAlex record and extracted name components and variants

Author_id  Display name        First name    Last name      Initials  First initial  First token
12345678   John William Smith  john          william smith  j         j              john
12345678   John William Smith  john william  smith                                   john

3. RESULTS

Our results are divided into two parts. First, we report on the performance of our matching
algorithm using the recall, precision, and F-score based on our golden set of authors with
an ORCID listed in OpenAlex and a Twitter handle in their ORCID account. In the second
part, we describe our data set by presenting the distribution of authors with Twitter accounts
across fields and countries.

3.1. Performance of the Matching Algorithm

We use the self-reported tweeter–author matches obtained from ORCID to evaluate the
performance of the matching process at each step. Precision is calculated
by dividing the number of true positives by the total number of matches found for the tweeters
in our golden set:

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \]

The recall is obtained by dividing the number of true positives by the total number of tweeter–
author pairs in the golden set:

\[ \text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \]

The F-score is a measure of a model’s accuracy on a data set that is obtained with the
following formula:

\[ F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]

All three indicators take a value between 0 and 1, where 1 is the best possible score.
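
As a worked illustration of the three formulas, the sketch below computes them from counts and recomputes the F-score from the combined precision (0.958) and recall (0.623) reported for the full matching process. The function names and the counts in the first example are hypothetical.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F-score from raw counts."""
    found = true_pos + false_pos      # all matches found
    expected = true_pos + false_neg   # all pairs in the golden set
    precision = true_pos / found if found else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall, f_score(precision, recall)


def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts: 90 correct matches, 10 wrong, 30 missed.
p, r, f = precision_recall_f(true_pos=90, false_pos=10, false_neg=30)

# Combined precision and recall reported for the whole process.
combined_f = round(f_score(0.958, 0.623), 3)
```

With the reported combined values, the formula gives 0.755, which agrees with the combined F-score listed in the results tables.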

Table 3. Matching steps used to match Twitter profiles and OpenAlex author profiles

Step                       Twitter data field  Examples
Full name exact match      Profile name        john william smith = john william smith
Full name substring*       Profile name        john smith = john smith Jr.
Last name + initials       Profile name        jw smith = jw smith
Last name + first token    Profile name        john w smith = john smith
Last name + first initial  Profile name        jw smith = j smith
Full name exact match      Handle              john william smith = johnwilliamsmith
Last name + initials       Handle              jw smith = jwsmith
Last name + first token    Handle              john w smith = johnsmith
Last name + first initial  Handle              jw smith = jsmith

* Note that the full name substring step is only performed with the Twitter profile name because preliminary
attempts to search for the full name as a substring of the Twitter handle performed extremely poorly (F-score < 0.1).

Table 4. Results of the matching for each criterion

Matching step                            Distinct matches                            Test
Criteria                   Field         OpenAlex authors  Twitter accounts  Pairs    Recall  Precision  F-score
Last name + first token    Handle        24,929            21,755            24,929   0.041   0.981      0.078
Full name exact match      Handle        19,147            16,795            19,147   0.033   0.979      0.063
Last name + initials       Handle        13,577            11,693            13,579   0.021   0.977      0.041
Last name + first initial  Handle        8,528             7,247             8,530    0.012   0.976      0.024
Full name exact match      Profile name  307,270           272,409           308,880  0.423   0.971      0.590
Last name + first token    Profile name  419,805           368,832           422,341  0.553   0.971      0.705
Full name substring        Profile name  317,723           281,499           319,984  0.442   0.968      0.607
Last name + initials       Profile name  343,469           299,646           346,529  0.458   0.967      0.621
Last name + first initial  Profile name  471,763           406,389           477,383  0.593   0.961      0.734
Combined                   Combined      492,142           423,924           498,680  0.623   0.958      0.755

Table 4 provides, for each of our matching criteria, the number of distinct matches obtained
and the OpenAlex authors and Twitter accounts that form these matches, as well as the
precision, recall, and F-scores obtained by testing each set of results against the golden set of
author–tweeter pairs from ORCID.

Because the same pairs can be obtained at different precision steps, we report in Table 5 a
different set of results where each step is performed hierarchically from most precise to least
precise as per Table 4, and where each step considers only new pairs that were not identified in
the previous ones.
This does not change the overall result of the matching but provides a more unambiguous
indication of the contribution of each step to the recall and precision. For instance, we can
see that the lowest precision rate (0.806) is obtained when matching the last name and first
initial with the tweeter’s profile name. This is consistent with the test results presented in
Table 4. However, the difference in the precision between steps is more remarkable here
because the count of true positives is not inflated by the valid matches from previous, more
precise steps.

We can observe from Table 5 that the last three matching steps have a significantly lower
precision rate than the other steps, especially the matches on the Last name + first initial
between the author’s name and the tweeter’s profile name. This result was expected given that
matching on the initials only would mean that an author named John Doe would match with
both John Doe and Jane Doe. Perhaps more surprising is the still somewhat high levels of
precision obtained. This is likely explained by our use of self-tweets only, which would match
Jane Doe with John Doe only if Jane tweeted one of John’s papers. Still, researchers who might
use our data set or our process are advised to exercise caution with these less precise matching
steps. While the results presented here do not include any manual data validation, we
performed such a validation for the matches obtained with the three least precise steps. The
data set available on Zenodo (https://zenodo.org/record/7013518) includes a validation column
alongside the tweeter_id and the author_id for the matches, which will allow users of the data
set to use the entire set or to filter out the matches that our team identified as likely to be
false positives. This data set also includes a column indicating which of the matching steps
identified the match, so users of the data set can reconstruct a data set that does not include
some of the steps, for instance.
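
The hierarchical evaluation described above can be sketched as follows, assuming each step yields a set of (tweeter_id, author_id) pairs and the golden set is a set of such pairs. The function names and the example records are hypothetical and are not taken from the accompanying scripts.

```python
def hierarchical_new_pairs(steps):
    """Keep, for each step, only the pairs not found by earlier steps.

    steps: list of (step_name, set_of_pairs), ordered from the most
           precise step to the least precise one.
    """
    seen = set()
    result = []
    for name, pairs in steps:
        new_pairs = set(pairs) - seen
        result.append((name, new_pairs))
        seen |= new_pairs
    return result


def precision_on_golden(pairs, golden_pairs):
    """Precision computed over the tweeters present in the golden set."""
    golden_tweeters = {tweeter for tweeter, _ in golden_pairs}
    tested = {p for p in pairs if p[0] in golden_tweeters}
    if not tested:
        return None  # no tested pair, precision is undefined
    return len(tested & golden_pairs) / len(tested)


# Hypothetical matches, for illustration only.
steps = [
    ("Full name exact match", {("t1", "a1"), ("t2", "a2")}),
    ("Last name + first initial", {("t1", "a1"), ("t3", "a9"), ("t4", "a4")}),
]
golden = {("t1", "a1"), ("t3", "a3")}

new_by_step = hierarchical_new_pairs(steps)
precisions = {name: precision_on_golden(p, golden) for name, p in new_by_step}
```

In this toy example the pair ("t1", "a1") counts only for the first step, so the second step is judged solely on the pairs it adds. This is why the precision of the later steps appears lower in the hierarchical results than in the step-by-step results.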
Table 5. New matches identified at each step of the hierarchical matching process

Matching step                            Distinct matches                            Test
Criteria                   Field         OpenAlex authors  Twitter accounts  Pairs    Recall  Precision  F-score
Last name + first token    Handle        24,929            21,755            24,929   0.041   0.981      0.078
Full name exact match*     Handle        0                 0                 0        0       0          0
Last name + initials       Handle        13,325            11,513            13,327   0.020   0.976      0.040
Last name + first initial  Handle        1,976             1,758             1,976    0.003   0.966      0.006
Full name exact match      Profile name  283,862           251,936           285,373  0.380   0.971      0.546
Last name + first token    Profile name  96,587            87,508            97,088   0.101   0.964      0.182
Full name substring        Profile name  16,401            14,620            16,756   0.026   0.901      0.050
Last name + initials       Profile name  35,467            32,104            35,772   0.031   0.903      0.059
Last name + first initial  Profile name  22,989            20,815            23,459   0.017   0.806      0.033
Combined                   Combined      492,142           423,924           498,680  0.623   0.958      0.755

* All matches obtained during the full name matching with the tweeter handle were also found in the previous
step, meaning that this step could, in principle, be skipped. However, we keep the step in our process for
consistency and to account for the possibility that this step might yield new matches in future
implementations of the process.

Table 6 presents yet another variation of the hierarchical matching process reported in
Table 5, with rows containing the cumulative matches so we can more clearly see how each step
adds to the overall results.

Table 6. Cumulative number of matches at each step of the hierarchical matching process

Matching step                            Distinct matches                            Test
Criteria                   Field         OpenAlex authors  Twitter accounts  Pairs    Recall  Precision  F-score
Last name + first token    Handle        24,929            21,755            24,929   0.041   0.981      0.078
Full name exact match      Handle        24,929            21,755            24,929   0.041   0.981      0.078
Last name + initials       Handle        38,248            33,201            38,256   0.061   0.979      0.115
Last name + first initial  Handle        40,223            34,841            40,232   0.064   0.979      0.120
Full name exact match      Profile name  323,935           286,528           325,605  0.445   0.972      0.611
Last name + first token    Profile name  420,167           369,257           422,693  0.548   0.970      0.700
Last name + initials       Profile name  436,170           382,820           439,449  0.574   0.967      0.720
Full name substring        Profile name  470,307           407,658           475,221  0.605   0.963      0.743
Last name + first initial  Profile name  492,142           423,924           498,680  0.623   0.958      0.755
Combined                   Combined      492,142           423,924           498,680  0.623   0.958      0.755

3.2. Overview of the Data Set

In this section, we provide an overview of the composition of the data set by looking at the
discipline and countries of the researchers for which we were able to assign a Twitter account.
The main aim of these analyses is merely to provide a descriptive overview of the distribution
of matches across publication disciplines and author countries. These overviews are meant to
support future research and researchers interested in the data set, and to make users of the
data aware of the general disciplinary and country representation of the data set.

There is no discipline classification in OpenAlex, but works are linked to Wikidata concepts
(Priem et al., 2022), with a score ranging from 0 to 1 representing the strength of the
association between the work and the concept. The concepts are hierarchical (levels 0 to 5),
with level 0 essentially representing large disciplines (e.g., environmental science, economics,
engineering, chemistry, medicine).
More information about the concepts and their matching can be found on the OpenAlex website
(https://docs.openalex.org/about-the-data/concept) and in this white paper
(https://docs.google.com/document/d/1OgXSLriHO3Ekz0OYoaoP_h0sPcuvV4EqX7VgLLblKe4/). For each
unique author matched with a Twitter account, we retrieve their works and the level 0 concepts
associated with these works, as well as the score. While these concepts are unlikely to provide
a highly accurate classification of works, they are still helpful, we believe, in getting a
sense of the breadth of disciplines represented in our data set. Table 7 shows the number and
percentage of authors assigned to each discipline based on the discipline with the highest
score based on all their publications. Because authors are associated with each discipline to
some degree (represented by the average score of concepts in their publications), we also
display the average score for each discipline as an alternate representation of the
distribution of disciplines in the data set. For the countries, we use the
last_known_affiliation field of the OpenAlex authors table and present the relative frequency
of countries in Table 8.

Table 7. Number of authors and the average score by discipline (concepts from OpenAlex)

Discipline             Number of authors  Percentage of authors  Average score
Medicine               138,968            28.3                   0.517
Biology                75,246             15.3                   0.388
Psychology             46,793             9.5                    0.265
Computer science       39,675             8.1                    0.201
Political science      36,531             7.4                    0.229
Chemistry              30,876             6.3                    0.242
Materials science      19,347             3.9                    0.233
Environmental science  17,861             3.6                    0.208
Business               17,485             3.6                    0.158
Sociology              17,248             3.5                    0.185
Geography              13,513             2.8                    0.155
Economics              8,075              1.6                    0.162
Geology                7,204              1.5                    0.217
Physics                7,138              1.5                    0.152
Art                    6,963              1.4                    0.133
History                3,913              0.8                    0.134
Philosophy             2,415              0.5                    0.097
Mathematics            1,637              0.3                    0.068
Engineering            326                0.1                    0.042
Total                  491,214            100

Table 8. Distribution of authors’ last known affiliation recorded in OpenAlex

Country          Number of authors  Percentage of authors
United States    142,059            32.1
Great Britain    72,430             16.4
Australia        24,457             5.5
Canada           22,516             5.1
Spain            19,308             4.4
Germany          18,338             4.1
France           10,794             2.4
Netherlands      10,605             2.4
India            9,967              2.3
Italy            9,016              2.0
Brazil           7,330              1.7
Switzerland      6,251              1.4
Sweden           5,684              1.3
Ireland          5,382              1.2
Belgium          5,375              1.2
China            5,250              1.2
Finland          4,834              1.1
Denmark          4,543              1.0
Japan            4,361              1.0
Other countries  54,093             12.2
Total            442,593            100

4. DISCUSSION AND CONCLUSION

The work presented in this paper can be framed as a step forward in developing more advanced
studies of the interactions between science and society, particularly by enabling the study of
the role of scientists in disseminating scientific results on Twitter. In addition, using open
data sources (OpenAlex and Crossref Event Data) allows researchers to continue to improve and
adapt this method for further possibilities without worrying about contractual limitations or
data unavailability.

Overall, the results of our matching process show a high level of precision and a moderate
level of recall, which was expected given our consideration of only self-tweets in the matching
process. This focus on self-tweets naturally makes a more precise matching strategy but at the
expense of recall, as the method excludes from the matching all those scholars who never
tweeted any of their publications (or none of their publications were included in the Crossref
Event Data database). The resulting matched database is more extensive than the one reported by
Costas et al. (2020), which is most likely due to the broader coverage of OpenAlex compared to
the Web of Science and improvements to the matching process. Other factors could include
increasing Twitter use by researchers over time and/or increasing paper-sharing practices on
Twitter or that more scholars are using their full names in their Twitter profiles.

It is important to emphasize the limitations of the matching approach (and resulting data set)
presented in this paper. First, the approach is limited to tweets recorded in Crossref Event
Data, which does not include tweets before 2017, and publications and researchers recorded in
OpenAlex. Furthermore, our matching algorithm requires that names be written in the Latin
alphabet, which may exclude some Twitter users or authors who are not using the Latin alphabet.
On top of this limitation in the coverage of our data sources, our reliance on self-tweets to
increase the precision of our matches comes at the expense of some degree of recall. While we
find that this strategy does provide high precision levels, our examination of the data
indicates that very common names can still generate some false positives. This provides some
support for our choice not to cast a broader net by removing the self-tweet criterion, which
would have likely flooded our data set with false positives for low expected gains in true
positives and would also have dramatically increased the computational resources to perform the
matching.
The reliance on self-tweets may also create a gender bias in our data set because a study by Peng, Teplitskiy et al. (2022) found men to be more likely to self-promote on Twitter than women. Finally, we also know from past research that there is a lower uptake of Twitter use in some regions or countries (Zahedi, 2017).

It is also worth mentioning here that our data set is limited to the matches that were created through the process, which uses only the names of the authors and the tweeters. There exist other sources of researcher–tweeter pairs that could be used to complement the data set. For instance, one could consider adding the data set of ORCID accounts with associated Twitter handles that we used as a golden set to validate our approach. OpenAlex also includes Twitter handles for a few hundred authors, although it is not clear what data source these pairs come from.

The characterization of the social media users and audiences that are engaging with scientific publications is an important element in the development of more advanced studies on the interactions between science and society. Thus, further and better curation of data around the interactions between social media users and academic objects is a fundamental step that needs to be considered in future altmetric research. Further developments of this work include expanding the matching to include additional matching criteria and developing approaches (e.g., citation or coauthor networks) to form tweeter–author pairs without relying only on self-tweets. Future developments could also tackle the challenge of matching scholars with Twitter users who did not tweet any publication.
Nevertheless, we believe that the relatively simple matching process outlined in this paper and the open data set of nearly half a million pairs of OpenAlex author IDs and tweeter IDs generated and made available to the community are valuable contributions to the field of quantitative science studies, and more specifically to the study of the activities of scholars on Twitter and the interaction between social media and science.

ACKNOWLEDGMENTS

The authors would like to thank Poppy Riddle, Kydra Mayhew, and Maddie Hare, who helped with data cleaning.

AUTHOR CONTRIBUTIONS

Philippe Mongeon: Conceptualization, Data curation, Formal analysis, Methodology, Visualization, Writing–original draft, Writing–review and editing. Timothy D. Bowman: Conceptualization, Data curation, Formal analysis, Methodology, Writing–original draft, Writing–review and editing. Rodrigo Costas: Conceptualization, Data curation, Methodology, Writing–original draft, Writing–review and editing.

COMPETING INTERESTS

The authors have no competing interests.

FUNDING INFORMATION

Rodrigo Costas is partially funded by the South African DSI-NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy (SciSTIP).

DATA AVAILABILITY

The open data set of scholars on Twitter (Mongeon, Bowman, & Costas, 2022) produced with the process reported in this paper is available at https://doi.org/10.5281/zenodo.7013518. The data set includes a column indicating which criteria were successful in identifying the match, as well as a column indicating whether the match was considered valid upon manual inspection.

REFERENCES

Blackburn, R., Cabral, T., Cardoso, A., Cheng, E., Costa, P., … White, P. (2021).
ORCID Public Data File 2021 (p. 100806462894 Bytes) [Data set]. ORCID. https://doi.org/10.23640/07243.16750535.V1

Brainard, J. (2022). Riding the Twitter wave. Science, 375(6587), 1344–1347. https://doi.org/10.1126/science.abq1541, PubMed: 35324287

Costas, R., Mongeon, P., Ferreira, M. R., van Honk, J., & Franssen, T. (2020). Large-scale identification and characterization of scholars on Twitter. Quantitative Science Studies, 1(2), 771–791. https://doi.org/10.1162/qss_a_00047

Costas, R., Rijcke, S., & Marres, N. (2021). “Heterogeneous couplings”: Operationalizing network perspectives to study science-society interactions through social media metrics. Journal of the Association for Information Science and Technology, 72(5), 595–610. https://doi.org/10.1002/asi.24427

Costas, R., Zahedi, Z., & Wouters, P. (2014). Do “altmetrics” correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective. Journal of the Association for Information Science and Technology, 66(10), 2003–2019. https://doi.org/10.1002/asi.23309

Ferreira, M. R., Mongeon, P., & Costas, R. (2021). Large-scale comparison of authorship, citations, and tweets of Web of Science authors. Journal of Altmetrics, 4(1), Article 1. https://doi.org/10.29024/joa.38

Haustein, S. (2016). Grand challenges in altmetrics: Heterogeneity, data quality and dependencies. Scientometrics, 108(1), 413–423. https://doi.org/10.1007/s11192-016-1910-9

Mongeon, P., Bowman, T., & Costas, R. (2022). Open dataset of scholars on Twitter [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7013518

Mongeon, P., Robinson-Garcia, N., Jeng, W., & Costas, R. (2017). Incorporating data sharing to the reward system of science: Linking DataCite records to authors in the Web of Science. Aslib Journal of Information Management, 69(5), 545–556. https://doi.org/10.1108/AJIM-01-2017-0024

Ortega, J. L. (2018).
Reliability and accuracy of altmetric providers: A comparison among Altmetric.com, PlumX and Crossref Event Data. Scientometrics, 116, 2123–2138. https://doi.org/10.1007/s11192-018-2838-z

Peng, H., Teplitskiy, M., Romero, D., & Horvát, E.-Á. (2022). The gender gap in scholarly self-promotion on social media [Preprint]. In review. https://doi.org/10.21203/rs.3.rs-1765948/v1

Priem, J., Piwowar, H., & Orr, R. (2022). OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv, arXiv:2205.01833. https://doi.org/10.48550/arXiv.2205.01833

Sugimoto, C. R., Work, S., Larivière, V., & Haustein, S. (2017). Scholarly use of social media and altmetrics: A review of the literature. Journal of the Association for Information Science and Technology, 68(9), 2037–2062. https://doi.org/10.1002/asi.23833

Thelwall, M., Haustein, S., Lariviere, V., & Sugimoto, C. R. (2013). Do altmetrics work? Twitter and ten other social web services. PLOS ONE, 8(5), e64841. https://doi.org/10.1371/journal.pone.0064841, PubMed: 23724101

Tsou, A., Bowman, T. D., Ghazinejad, A., & Sugimoto, C. (2015). Who tweets about science. In Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference (pp. 95–100).

Williams, K. (2022). What counts: Making sense of metrics of research value. Science and Public Policy, 49(3), 518–531. https://doi.org/10.1093/scipol/scac004

Zahedi, Z. (2017). What explains the imbalance use of social media across different countries? A cross country analysis of presence of Twitter users tweeting scholarly publications (Version 2). figshare. https://doi.org/10.6084/m9.figshare.5454475.v2
