CARTA
A tipping point for open citation data
B. Ian Hutchins
un acceso abierto
diario
Citación: Hutchins, B. I. (2021). A
tipping point for open citation data.
Estudios de ciencias cuantitativas, 2(2),
433–437. https://doi.org/10.1162/qss_c
_00138
DOI:
https://doi.org/10.1162/qss_c_00138
ABSTRACTO
Open citation data can improve the transparency and robustness of scientific portfolio analysis,
improve science policy decision-making, stimulate downstream commercial activity, y
increase the discoverability of scientific articles. Once sparsely populated, public-domain
citation databases crossed a threshold of one billion citations in February 2021. Shortly
thereafter, the threshold of one billion public domain citations from the Crossref database alone
was crossed. As the relative advantage of withholding data in closed databases has diminished
with the flood of public domain data, this likely constitutes an irreversible change in the citation
data ecosystem. The successes of this movement can guide future open data efforts.
1. BACKGROUND
Science builds on the knowledge of past discoveries to advance the frontier of research. El
spread of new knowledge, and the provenance of that information, is documented in the
scientific citation graph. Por esta razón, the citations are often used in studies of knowledge
flow, scientific attribution, and research assessment. Access to this information can improve
discoverability, stimulate entrepreneurship, and advance basic and applied research in infor-
mation networks (Dugan, Fenner et al., 2017). Sin embargo, for decades, the main sources of
citation data were confined by licensing restrictions from commercial providers.
The landscape has recently shifted toward open citation data. Databases such as Semantic
Scholar, The Lens, Microsoft Academic, and Digital Science’s Dimensions have indexed cita-
tions from the literature and released them at no cost on web services or in bulk under rela-
tively permissive licenses. Sin embargo, the most permissive license is a public domain license.
En tono rimbombante, two large and explicitly public domain platforms have been developed: COCI,
the OpenCitations Index of Crossref open DOI-to-DOI Citations (peroni & Shotton, 2020)
and NIH-OCC, the National Institutes of Health Open Citation Collection (Hutchins, Panadero
et al., 2019). COCI indexes public citation data from Crossref (Heibi, peroni, & Shotton,
2018) that many publishers opened up in response to the Initiative for Open Citations
(Dugan et al., 2017). The NIH-OCC, which I spearheaded as senior data scientist while work-
ing at NIH, is a merger of several citation databases, plus references parsed with machine
learning from crawled full text articles (Hutchins et al., 2019). NIH-OCC citations are distrib-
uted through the iCite web service (iCite, 2021) and as database snapshots (Hutchins &
Santangelo, 2019). These platforms have permanently released restriction-free citation data
Derechos de autor: © 2021 B. Ian Hutchins.
Publicado bajo Creative Commons
Atribución 4.0 Internacional
(CC POR 4.0) licencia.
La prensa del MIT
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
q
s
s
/
a
r
t
i
C
mi
–
pag
d
yo
F
/
/
/
/
2
2
4
3
3
1
9
3
0
7
4
7
q
s
s
_
C
_
0
0
1
3
8
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
A tipping point for open citation data
to the public, and when combined, recently hit the major milestone of over one billion cita-
tions in the public domain (Figura 1A). The next COCI update is expected to have this many
citations from Crossref alone, following ingestion of references recently opened in Crossref by
Elsevier (Plume, 2020) and the American Chemical Society (Clinton, 2021).
What fraction of citations does this represent? One estimate of the upper bound puts
Elsevier’s share of Crossref references at approximately 30%. Other data sources accounted
para 60% of the total already, so public domain citation coverage for articles with Crossref
Digital Object Identifiers may exceed 90% of the total (Dugan et al., 2017; Molteni, 2017;
waltman, 2020). Checking the Crossref application programming interface at the time of writ-
ing indicates that 88% reference-containing documents are open. Notablemente, the fraction when
looking at journal articles only is higher, con 92% having opened their references.
Sin embargo, this accounting overestimates the fraction of all citations that are publicly avail-
capaz, as many print-only articles precede widespread indexing in Crossref, or publishers have
not yet supplied reference information for the digital articles that are included. Aquí, a compar-
ison of the biomedical literature may be instructive, as PubMed broadly indexes both digital and
print-only articles. The PubMed Knowledge Graph is another source of citation information that
uses Web of Science as a source of biomedical references (Xu, Kim y cols., 2020), and contains
references for many such historical print-only articles. Combining the references from the NIH
Open Citation Collection and the PubMed Knowledge Graph shows that 86% of this union is
explicitly public domain. While this estimate applies specifically to the biomedical literature, él
can anchor expectations for coverage in other fields as well. De este modo, the reality of citation coverage
aligns closely with expectations even when accounting for print-only articles.
En tono rimbombante, en años recientes, encima 95% of references are explicitly public domain (Figura 1B),
demonstrating nearly complete public domain coverage of recent papers. This result reinforces
estimates that over 85% of Crossref citations are entering into the public domain.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
q
s
s
/
a
r
t
i
C
mi
–
pag
d
yo
F
/
/
/
/
2
2
4
3
3
1
9
3
0
7
4
7
q
s
s
_
C
_
0
0
1
3
8
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 1.
(A) Unique and overlapping citations in the public domain NIH-OCC and COCI databases. Unique public domain citations now
total 1.2 billion, 46% of which are biomedical PubMed-to-PubMed citations. Black line, 1 billion citations. The final data point (2021-04) es
estimated from a cache of open references from the Crossref API (the corresponding COCI data set is still being generated at the time of
writing). (B) Fraction of biomedical citations that are in the public domain, by publication year.
Estudios de ciencias cuantitativas
434
A tipping point for open citation data
2. AN IRREVERSIBLE STATE OF AFFAIRS
When citation indexing first began, it was a laborious affair. Information extraction and
processing for the Science Citation Index took teams in excess of 100 operators to generate
regular database updates (garfield, 1979). A lack of public investment necessitated that
databases be paywalled to finance operation. Closed-access data became the norm and this
persisted for decades.
Despite an explosion in the number of scientific articles published each year, the burden of
indexing citations has fallen dramatically. This situation makes a prediction about the future of
open citation data: that this open data ecosystem likely constitutes an irreversible change. En
otras palabras, a tipping point has been reached. It is now unlikely that the situation will revert to
a closed data ecosystem, for three related reasons.
Primero, the increased availability of data as journals open up access to their articles (Cervecero,
Principal, & Orr, 2019), combined with these advances in hardware, open-source software, y
improved algorithms, have made large-scale citation processing tractable even to small teams
(Hutchins et al., 2019; Jefferson, Jaffe et al., 2018). Segundo, there are multiple, overlapping
public domain citation data providers (NIH, Crossref, and OpenCitations). These account
for over 85% of citation data going forward, and ensure that the data ecosystem is resilient
to recissions by any one provider. Por ejemplo, Microsoft announced recently that its
Academic Graph will be retired at the end of 2021 (Microsoft Academic, 2021).
Overlapping portfolios in this context is a feature, not a bug. Finalmente, public domain licensing
guarantees that citation data are permanently available without restriction, meaning that new
providers can redistribute the existing base of public domain citations even if the previous
providers shut down.
The competitive benefits of closing access to citation data diminish with each new citation
released to the public domain, but the benefits of open data remain. Going forward, citation
data is almost completely public domain (Cifra 1). In this environment, the benefits of open
data overwhelm the rapidly decaying costs, even to publishers that previously monetized ac-
cess to their citation data (Plume, 2020).
3. ADDITIONAL BENEFITS
The scientometrics community called on publishers and funding agencies to open citation
datos, noting that this would improve the transparency and reproducibility of scientometric
analiza (Singh Chawla, 2019; Sugimoto, Waltman et al., 2017). As such analyses are used
for science policy, this will improve decision-making as well. Opening citation data also
improves discoverability of research articles, as readers can follow citations to more recent
relevant work.
Public domain citation data excel in certain applications compared to other permissive li-
censes. The first is in downstream entrepreneurship and stimulating commercial activity.
Public domain citation data do not even need to be reviewed for licensing suitability, y
can be immediately federated into commercial applications (Dimensions, 2019). Other li-
censes must be reviewed carefully, as they may contain no-commercial-use restrictions.
Even some licenses that permit commercial use might exclude monetizing the redistribution
of data, which could prohibit commercial curation or data cleaning. Public domain data, en
contrast, are well suited for stimulating all kinds of downstream economic activity.
A second area in which public domain licenses excel is in improving science policy.
Because my professional experience lies with the U.S. federal government, I will focus on that
Estudios de ciencias cuantitativas
435
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
q
s
s
/
a
r
t
i
C
mi
–
pag
d
yo
F
/
/
/
/
2
2
4
3
3
1
9
3
0
7
4
7
q
s
s
_
C
_
0
0
1
3
8
pag
d
/
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
A tipping point for open citation data
perspectiva, although other countries may face similar constraints. Permissive licenses targeted
toward academics and research institutions may contain indemnification clauses, cual
render such data sets unusable in many circumstances by the U.S. federal government. Este
is in part due to restrictions in the Antideficiency Act (O’Connell, 2014); such clauses often
function as a no-government-use license. Perversely, this can drive U.S. federal agencies to
source data from closed providers, reducing the transparency and robustness of the analyses
used in science policy decision-making. Public domain data are a powerful remedy for this
situation.
4. CONCLUSIÓN
The broad coverage of public domain citation data means that these resources are well suited
for use in scientometric analyses that traditionally used proprietary data. This community has
uniquely contributed to the development of public domain citation data, and I exhort its mem-
bers to transition from using proprietary citation data to using these comprehensive public do-
main data sources in their projects as much as possible. This will reduce dependence on
closed data sources, increase the number of researchers using and contributing to open cita-
tion efforts, and improve the reproducibility of these projects. Además, I would encourage
members of the community to contribute their support to other open science movements, semejante
as the Initiative for Open Abstracts (Duce, Hendricks et al., 2020). The utility of open citation
data will be multiplied when article text, author metadata, institutional linkages, and article
classifications are also made open.
The successful movement toward public domain citation data can offer guidance for other
open access efforts. Primero, identify the important gatekeepers of the current data sources.
Explore win-win value propositions such as discoverability and reproducibility, and clearly artic-
ulate the benefits of open access. Segundo, build a network of open access providers. A NOSOTROS. ciencia
funding agencies aspire to data transparency and can be powerful allies in this respect. Datos
acquired by U.S. agencies under the “Unlimited Rights in Data” clauses of the Federal
Acquisition Regulations and disseminated to the public can ensure their permanent open re-
lease. Finalmente, work toward a comprehensive release of open data. Incentives for participants
in the data ecosystem will lean in favor of contributing to open data initiatives as more data
are opened up and the remaining competitive value of closed data diminishes. If structured well,
opening data is irreversible, so the advantages to withholding data will give way while the
benefits of open data remain strong.
EXPRESIONES DE GRATITUD
I thank Nees Jan van Eck for providing the Crossref bulk data for estimating the number of
citations available in the 2021-05 COCI release. I also thank the NIH Office of Portfolio
Analysis for providing regular iCite database snapshots containing the NIH-OCC.
CONFLICTO DE INTERESES
The author spearheaded the NIH-OCC as senior data scientist while working at NIH.
INFORMACIÓN DE FINANCIACIÓN
Support for this work was provided by the Office of the Vice Chancellor for Research and
Graduate Education at the University of Wisconsin-Madison with funding from the
Wisconsin Alumni Research Foundation.
Estudios de ciencias cuantitativas
436
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
q
s
s
/
a
r
t
i
C
mi
–
pag
d
yo
F
/
/
/
/
2
2
4
3
3
1
9
3
0
7
4
7
q
s
s
_
C
_
0
0
1
3
8
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
A tipping point for open citation data
DISPONIBILIDAD DE DATOS
iCite NIH Open Citation Collection data are available at Figshare (https://doi.org/10.35092
/yhjc.c.4586573). The raw data supporting the analysis in Figure 1B can be found at
Figshare (https://doi.org/10.6084/m9.figshare.14544405). The Crossref data set used for the
2021-04 data point is available at Zenodo (https://doi.org/10.5281/zenodo.4748336). El
PubMed Knowledge Graph data can be accessed at the University of Texas (https://er.tacc
.utexas.edu/datasets/ped).
REFERENCIAS
Clinton, A. (2021). ACS Publications signs DORA. ACS Axial.
https://axial.acs.org/2021/02/18/acs-publications-signs-dora/
Dimensions. (2019). Where are citations in Dimensions coming
de? Digital Science. https://dimensions.freshdesk.com/support
/solutions/articles/23000018849-where-are-citations-in
-dimensions-coming-from
Duce, h., Hendricks, GRAMO., Kramer, B., Larivière, v., Maccallum, C. J.,
… Waltman, l. (2020). Initiative for Open Abstracts. https://i4oa
.org/
Dugan, J., Fenner, METRO., Gerlach, J., MacCallum, C., Mietchen, D., …
Taraborelli, D. (2017). Iniciativa para citas abiertas. https://i4oc
.org
garfield, mi. (1979). The design and production of a citation index.
In Citation indexing – Its theory and application in science, tech-
nology and humanities. Chichester: John Wiley & Sons.
Heibi, I., peroni, S., & Shotton, D. (2018). COCI CSV dataset of all the
citation data. Figshare. https://doi.org/10.6084/m9.figshare.6741422
Hutchins, B. I., Panadero, k. l., davis, METRO. T., Diwersy, METRO. A., Haque, MI.,
… Santangelo, GRAMO. METRO. (2019). The NIH Open Citation Collection: A
public access, broad coverage resource. PLOS Biology, 17(10),
e3000385. https://doi.org/10.1371/journal.pbio.3000385,
PubMed: 31600197
Hutchins, B. I., & Santangelo, GRAMO. METRO. (2019). iCite Database
Snapshots (NIH Open Citation Collection). Figshare. https://doi
.org/10.35092/yhjc.c.4586573
iCite. (2021). iCite: A powerful web application that provides a
panel of bibliometric information for scientific publications.
https://icite.od.nih.gov/
Jefferson, oh. A., Jaffe, A., Ashton, D., Warren, B., Koellhofer, D., …
Jefferson, R. A. (2018). Mapping the global influence of published
research on industry and innovation. Naturaleza Biotecnología, 36(1),
31–39. https://doi.org/10.1038/nbt.4049, PubMed: 29319684
Microsoft Academic. (2021). Next steps for Microsoft Academic –
Expanding into new horizons. https://www.microsoft.com/en-us
/research/project/academic/articles/microsoft-academic-to
-expand-horizons-with-community-driven-approach/
Molteni, METRO. (2017). Tearing down science’s citation paywall, uno
link at a time. Wired Magazine. https://www.wired.com/2017/04
/tearing-sciences-citation-paywall-one-link-time/
O’Connell, S. METRO. (2014). Contractual indemnification by the
Federal Government. Law School Student Scholarship, 279.
peroni, S., & Shotton, D. (2020). OpenCitations, an infrastructure
organization for open scholarship. Quantitative Science
Estudios, 1(1), 428–444. https://doi.org/10.1162/qss_a_00023
Cervecero, h., Principal, J., & Orr, R. (2019). The future of OA: Un gran-
scale analysis projecting open access publication and reader-
barco. bioRxiv, 795310. https://doi.org/10.1101/795310
Plume, A. (2020). Advancing responsible research assessment.
https://www.elsevier.com/connect/advancing-responsible
-research-assessment
Singh Chawla, D. (2019). Open-access row prompts editorial board
of Elsevier journal to resign. Naturaleza. https://doi.org/10.1038
/d41586-019-00135-8
Sugimoto, C. r., waltman, l., Larivière, v., van Eck, norte. J., Boyack,
k. w., … de Rijcke, S. (2017). Open citations: A letter from the
scientometric community to scholarly publishers. https://www
.issi-society.org/open-citations-letter/
waltman, l. (2020). q&A about Elsevier’s decision to open its cita-
ciones. https:// leidenmadtrics.nl/articles/q-a-about-elseviers
-decision-to-open-its-citation
Xu, J., kim, S., Song, METRO., jeong, METRO., kim, D., Kang, J., … Ding, Y.
(2020). Building a PubMed knowledge graph. Scientific Data, 7(1),
205. https://doi.org/10.1038/s41597-020-0543-2, PubMed:
32591513
Estudios de ciencias cuantitativas
437
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
q
s
s
/
a
r
t
i
C
mi
–
pag
d
yo
F
/
/
/
/
2
2
4
3
3
1
9
3
0
7
4
7
q
s
s
_
C
_
0
0
1
3
8
pag
d
/
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3