RESEARCH ARTICLE
Identifying scientific publications countrywide
and measuring their open access: The case of
the French Open Science Barometer (BSO)
An open access journal
Lauranne Chaignon and Daniel Egret
Université PSL, Paris, France
Keywords: databases, open science, publications
Citation: Chaignon, L., & Egret, D. (2022). Identifying scientific publications countrywide and measuring their open access: The case of the French Open Science Barometer (BSO). Quantitative Science Studies, 3(1), 18–36. https://doi.org/10.1162/qss_a_00179
DOI: https://doi.org/10.1162/qss_a_00179
Peer Review: https://publons.com/publon/10.1162/qss_a_00179
Received: 16 July 2021
Accepted: 17 January 2022
Corresponding Author: Daniel Egret, daniel.egret@psl.eu
Handling Editor: Ludo Waltman
ABSTRACT
We use several sources to collect and evaluate academic scientific publications on a countrywide scale, and we apply this approach to the case of France for the years 2015–2020, while presenting a more detailed analysis focused on the reference year 2019. These sources are diverse: databases available by subscription (Scopus, Web of Science) or open to the scientific community (Microsoft Academic Graph), the national open archive HAL, and databases serving thematic communities (ADS and PubMed). We show the contribution of the different sources to the final corpus. These results are then compared to those obtained with another approach, that of the French Open Science Barometer for monitoring open access at the national level. We show that both approaches provide a convergent estimate of the open access rate. We also present and discuss the definitions of the concepts used, and list the main difficulties encountered in processing the data. The results of this study contribute to a better understanding of the respective contributions of the main databases and their complementarity in the broad framework of a countrywide corpus. They also shed light on the calculation of open access rates and thus contribute to a better understanding of current developments in the field of open science.
1. INTRODUCTION
Open access to publications (e.g., Laakso & Björk, 2012; Piwowar, Priem et al., 2018) within the general framework of Open Science is now an issue shared by many institutions, universities and research organizations, and funders. France is no exception: Two national plans for Open Science have been successively launched, in 2018 and 2021, by the Ministry of Higher Education, Research and Innovation (MESRI). Generalizing open access to publications is the first axis of these two plans, with a goal of 100% of French scientific publications in open access by 2030¹, either through native open access publication or through deposit in an open archive. This national plan is in line with the European Plan S².
To support the policies thus deployed, a good knowledge of the state of publications and
their open access rate seems necessary, and many measurement tools have been developed
for this purpose, in different contexts, such as the European Open Science Monitor (OSM), the
1 National Plan for Open Science: https://www.ouvrirlascience.fr/national-plan-for-open-science-4th-july-2018/; https://www.ouvrirlascience.fr/second-national-plan-for-open-science/.
2 Plan S: https://www.coalition-s.org/.
Copyright: © 2022 Lauranne Chaignon
and Daniel Egret. Published under a
Creative Commons Attribution 4.0
International (CC BY 4.0) license.
The MIT Press
German Open Access Monitor (OAM), the Danish Open Access Indicator, or the COKI Open
Access Dashboard. Other countries have also adopted national strategies for monitoring open
access (Carvalho, Laranjeira et al., 2017).
In its guide to assisting research organizations and funders in setting up a tool for monitoring open access publications (Philipp, Botz et al., 2021), the organization Science Europe considers the constitution of the corpus of publications to be analyzed as one of the key stages in the process. We could add that it is even one of the major challenges of this exercise. Indeed, no database provides an easy and complete answer to this question. The large databases, such as the Web of Science (WoS) and Scopus, have the advantage of systematically listing a large part of the millions of scientific publications published each year in the world. The metadata are standardized and allow for efficient searching. However, these databases favor the coverage of science, technology, and medicine (STM) and of English-language publications in international journals, while other disciplinary fields, other languages of publication, and other sources or document types are less fully surveyed (Mongeon & Paul-Hus, 2016; Van Leeuwen, Moed et al., 2001; Vera-Baceta, Thelwall, & Kousha, 2019). Moreover, these databases are accessible only by subscription, so their data are not open or reusable. Thematic databases such as PubMed or NASA/ADS offer metadata that are both high quality and open, but each covers a very specific disciplinary field: An exhaustive census of publications in a multidisciplinary context will therefore require multiple sources.
As for open archives, while they have the advantage of listing types of publications, languages, and sources that are often absent from large databases, they offer insufficiently standardized metadata, which complicates their collection and processing. Thus, no single database offers comprehensiveness, standardized metadata, and openness. As Huang, Neylon et al. (2020) conclude in a recent article: “Any institutional evaluation framework that is serious about coverage should consider incorporating multiple bibliographic sources.”
Current Research Information Systems (CRIS) can be a way around this difficulty, provided that they are not fed solely by the large commercial databases mentioned above. They are increasingly being used in universities to help manage, understand, and evaluate research activities. However, most CRIS are, today, still used only at an institutional level (Sivertsen, 2019). Although their aggregation at the country level to constitute a national base is progressing, it is still most often tied to the implementation of a public funding policy based on scientific publication performance, as is the case in Denmark, Finland, Hungary, Italy, Norway, and Poland (Puuska, Nikkanen et al., 2020). Even if the motivation is primarily financial, a national database is also an opportunity to set up effective monitoring of open access policies at the country level, as Finland has experimented with (Pölönen, Laakso et al., 2020).
For countries that do not have such a pool of data, the implementation of a monitoring tool on this scale implies selecting from among the existing databases, whether commercial or not, those that will best meet the objective set. The German Ministry of Education and Research has thus chosen to use the Dimensions and WoS databases to establish its corpus³. Universities UK, the association of 140 UK universities, has chosen to use Scopus to produce its latest report on the effects of new policies to promote open access⁴.
In the case of France, the objective of the MESRI was to set up a tool that would enable the
steering of the national policy on open science, by measuring, on an annual basis, the level of
3 https://jugit.fz-juelich.de/synoa/oam-dokumentation/-/wikis/Quelldatenbanken/Quelldatenbanken.
4 https://www.universitiesuk.ac.uk/sites/default/files/field/downloads/2021-09/monitoring-transition-open-access-2017-annexe-1-methodology.pdf.
Quantitative Science Studies
open access of all publications with at least one French affiliation. This request was accompanied by a very specific requirement: “a transparent methodology and reproducible results.” It is with this in mind that the French Open Science Barometer (BSO) was carried out⁵, as described by Eric Jeangirard (2019). For the BSO, the constitutive choice is to use only open sources. The methodology consists in scanning all the papers referenced in Unpaywall and in the national open archive HAL (see below) to identify either French authors or a mention of France in the affiliation. The publications thus identified were then enriched with information on their scientific discipline, using natural language processing (NLP), also based on open source code, to determine from the title the discipline to which a document belongs. Finally, the open access status was determined using the Unpaywall database. The corpus obtained by this strategy is available in open access from the MESRI OpenData portal⁶. In accordance with the recommendations made at the European level (Open Access Monitoring: Philipp et al., 2021), the French National Open Science Barometer is published on an annual basis.
About 150,000 publications are thus identified each year by the BSO. The purpose of this study is to consider an alternative approach, this time based on the use of the main bibliographic databases, whether open or not, and to analyze the extent to which this new corpus differs from that of the BSO. Our approach is based on the use of six complementary sources, namely WoS, Scopus, Microsoft Academic Graph, PubMed, NASA/ADS, and the HAL open archive, to identify and assess academic scientific publication at the scale of a country, in this case France, for publications released during the 6 years 2015–2020. As a single year seemed to us a more relevant scale to characterize scientific production, we chose to highlight, in the context of this article, the data related to the year 2019⁷. We then compare the corpus obtained with that of the BSO, and we show to what extent the diversity of the sources used makes it possible to refine the identification and characterization of French scientific production, as well as the estimation of the open access rate.
While there is an abundant literature on the comparison between Scopus, WoS, and other generalist databases (see, for example, in a national production context, Archambault, Campbell et al. [2009], Bartol, Budimir et al. [2014], and Moed, Markusova, and Akoev [2018], or, for a statistical comparison of large reference databases, Mongeon and Paul-Hus [2016], Pranckutė [2021], and Visser, van Eck, and Waltman [2021]), our study provides a detailed quantitative view in the specific context of French research. Far from identifying a source that would be optimal, our study shows the importance of diversifying the sources used to provide complementary views on a country’s publications.
2. CONSTITUTION OF THE FRANCE 2015–2020 CORPUS: DATA AND METHODS
2.1. Definitions
Before describing in detail the methodology used to establish our corpus, we present and discuss here the main concepts used.
2.1.1. Digital Object Identifier (DOI)
The DOI⁸ is a persistent identifier that can be assigned to any type of content, be it text, software, data sets, etc. (Simmonds, 1999). It is used as the common identifier throughout this study.
5 https://bso.esr.gouv.fr.
6 https://data.enseignementsup-recherche.gouv.fr/explore/dataset/open-access-monitor-france/.
7 The counts for each of the 6 years are available in the supplementary data file.
8 DOIs are managed by the nonprofit association CrossRef (Hendricks, Tkaczyk et al., 2020).
2.1.2. Scientific publications
We consider here scientific publications indexed in databases (private or public) and accessible in open archives. All types of documents are taken into account. This primarily concerns articles, generally published in international peer-reviewed journals, but also conference proceedings, book chapters, or any other publication, provided that it has a DOI. Restricting the corpus to documents with a DOI is, however, a significant limitation, which we must explain here.
To facilitate the aggregation of results, and to avoid duplication, we have chosen, as does the BSO (French Open Access Monitoring), to restrict the cross-referencing of data to publications identified by a DOI. This step is necessary to allow the efficient cross-referencing of documents identified in each database by their DOI, which is common to all databases. In addition, the Unpaywall database, which will inform us about open access in the next step, only lists publications with a DOI.
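As a minimal sketch of this DOI-based cross-referencing, the following Python snippet normalizes DOIs and merges records from several sources, keeping track of which sources list each DOI. The normalization rules (lowercasing, stripping a resolver prefix), the function names, and the sample DOIs are illustrative assumptions, not a description of the processing actually applied in this study.

```python
from collections import defaultdict

# Prefixes under which a DOI may be exported (assumed, non-exhaustive list).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lowercased DOI without resolver prefix, or None if empty."""
    if raw is None:
        return None
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def merge_by_doi(sources):
    """Map each normalized DOI to the set of source names that list it.

    `sources` is a dict: source name -> iterable of raw DOI strings.
    Records without a DOI are skipped, as in a DOI-restricted corpus.
    """
    merged = defaultdict(set)
    for name, dois in sources.items():
        for raw in dois:
            doi = normalize_doi(raw)
            if doi is not None:
                merged[doi].add(name)
    return dict(merged)


# Hypothetical sample records (made-up DOIs).
sample = {
    "Scopus": ["10.1000/ABC.1", "10.1000/abc.2"],
    "HAL": ["https://doi.org/10.1000/abc.1", None, "10.1000/abc.3"],
}
corpus = merge_by_doi(sample)
print(len(corpus))                      # number of distinct DOIs
print(sorted(corpus["10.1000/abc.1"]))  # sources listing this DOI
```

Keeping the set of contributing sources per DOI, rather than a bare list of DOIs, makes it straightforward to report afterwards how much each source adds to the merged corpus.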
Let us note that the requirement of the presence of a DOI immediately rules out a certain
number of journals that do not adhere to this very general technology of persistent identifiers
(Gorraiz, Melero-Fuentes et al., 2016); some of these journals may be, as Wang, Shen et al.
(2020) point out, key journals in their discipline, with the example, for the field of Artificial
Intelligence, of the Journal of Machine Learning Research.
Moreover, grey literature, under which we can group preprints, reports, theses, and in some cases conference proceedings (Schöpfel & Prost, 2019), is often ignored by open access measurement tools, mainly for two reasons: The first corresponds to a concern to discard literature whose scientific relevance cannot be sufficiently controlled (lack of peer review); the second is rather related to technical considerations, in particular a difficulty in identifying these publications in the absence of complete and standardized metadata, especially persistent identifiers. In practice, this leads to ignoring a large proportion of the work published in certain disciplines where the thematic field, the regional vocation, or the applicative nature of the publications takes precedence over international referencing.
Our methodology, based on the use of the DOI, therefore effectively excludes some of the documents that might be of interest to us. This is why we will come back to publications without DOIs at the end of our study, by proposing an estimate of the share of grey literature in French national production (Section 5.2).
Finally, it should be noted that the publications taken into account to establish our corpus are exclusively those that have a digital version: It is this digital version for which we will try to measure the degree of accessibility. Thus, peer-reviewed research published in books or monographs is only covered when it is in digital format and has a DOI. For this reason, nonacademic publishing generally falls outside the scope of our study.
2.1.3. Open access
A scientific article that is only available on payment of a subscription or a fee (price per article) is considered closed. In contrast, a scientific article that is freely available, either on a publisher’s website or after the deposit of the full text (in its final layout or not) on an open archive, is deemed open.
Our source of information for the open access status of an article will be the Unpaywall database (Piwowar et al., 2018), specifically the data in the “is_oa” field. If the value returned for a given publication is equal to “True,” the publication will be considered open. If this value is “False,” the publication will be considered closed. The so-called “bronze” status is considered open.
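The rule above can be sketched in Python as follows. The records are plain dictionaries that mimic the “is_oa” and “oa_status” fields of Unpaywall; the sample values are invented for illustration, and the helper names are hypothetical.

```python
def is_open(record):
    """Open if the Unpaywall `is_oa` flag is true; otherwise closed."""
    return bool(record.get("is_oa", False))


def open_access_rate(records):
    """Share of open publications among the records, as a fraction in [0, 1]."""
    records = list(records)
    if not records:
        raise ValueError("empty corpus")
    return sum(is_open(r) for r in records) / len(records)


# Invented sample: the bronze record has is_oa == True, so it counts as open.
sample = [
    {"doi": "10.1000/a", "is_oa": True, "oa_status": "gold"},
    {"doi": "10.1000/b", "is_oa": True, "oa_status": "bronze"},
    {"doi": "10.1000/c", "is_oa": False, "oa_status": "closed"},
    {"doi": "10.1000/d", "is_oa": True, "oa_status": "green"},
]
print(f"{open_access_rate(sample):.0%}")
```

Relying on the single boolean flag, rather than on the detailed status, keeps the open/closed split unambiguous; the detailed status remains available if a breakdown by access route is needed.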
Note that the open access status may vary over time, because a closed publication may have its embargo lifted or be subsequently deposited in an open archive. In our study, we therefore use the status observed in February 2021, as recorded in the Unpaywall database snapshot for that date.
Let us recall that for France, the Law for a Digital Republic of 7 October 2016⁹ establishes the possibility of depositing in an open archive the postprint of any scientific article resulting from research funded at least 50% by the state or public authorities, at the expiration of a period of 6 or 12 months depending on the scientific field (STM or Humanities & Social Sciences, respectively).
2.2. Sources Used to Constitute the FR-2015-2020 Corpus
The collection of metadata related to a large set of publications is facilitated by the use of
databases that systematically, if not exhaustively, collect a large part of the millions of scientific
publications published each year worldwide.
In this article, we have favored databases that provide a search capability for the country mentioned in the affiliation, and we have collected the publications whose affiliation mentions the country considered in our study, France, using the corresponding query modes of six databases that, to our knowledge, effectively cover French scientific production.
We did not use the Dimensions database, as it is not considered to be a reliable source for
establishing a corpus on a country scale (Guerrero-Bote, Chinchilla-Rodríguez et al., 2021).
We use the following databases in our study:
• Scopus (Baas, Schotten et al., 2020) references more than 25,000 journals and is considered one of the most comprehensive databases for international peer-reviewed journals. Query by country is possible. Metadata extraction is limited to batches of 20,000 documents. This database is available by subscription from Elsevier.
• WoS (Birkle, Pendlebury et al., 2020) has been the reference database for scientometrics since the pioneering work of Garfield (1964). The query by country is provided in the advanced query mode. This database is available by subscription from Clarivate Analytics. In this study, we use all the indexes (including ESCI: Emerging Sources) except for the Book Citation Index, which was not available to us.
• The HAL open archive¹⁰ (Charnay & Michau, 2007) is a national multidisciplinary open archive intended for the deposit and dissemination of research-level scientific articles (published or not), theses, and other objects emanating from French or foreign teaching and research establishments, and public or private laboratories. Created in 2001 with ArXiv as a model, this platform has gradually become one of the main tools for reporting French research. A partnership agreement in favor of this archive was signed in 2013 by the Conference of University Presidents (CPU) and 22 institutions. In July 2021, the MESRI also committed to supporting the development of this archive, in terms of both technical aspects and governance, as part of its second national plan for open science 2021–2024. French researchers are invited to deposit on this platform the products of their research, whether they are publications (article in a journal, communication in a conference, chapter
9 Law for a Digital Republic; see in particular its article 30: https://www.legifrance.gouv.fr/dossierlegislatif/JORFDOLE000031589829/.
10 https://hal.archives-ouvertes.fr/.
of a book, book, poster, file, patent), unpublished documents (prepublication, working document, report), academic works (thesis, HDR, course), or research data (image, video, software, map, or sound). The recorded documents are either in the form of a notice only or accompanied by the full text of the article. This production can be grouped within different collections or portals relating to a theme (SHS for example), a medium (images and videos), or a research structure (university, laboratory, or research team), but it remains possible to carry out queries covering all portals and collections. After 20 years of use (Berthaud, Charnay, & Fargier, 2021), more than 2,700,000 works are now recorded in this archive.
HAL data can be queried using an advanced query or the API. The latter, which is available free of charge, allows the identification of the country of affiliation.
• The NASA/ADS database (Kurtz, Eichhorn et al., 2000) is one of the most recognized examples of a bibliographic database covering a research field: astrophysics and physics. Its query mode allows querying by country. Access is free.
• The PubMed database is one of the preferred and free access points for metadata related to biomedical science research. A query by affiliation is possible (Ibarra, Ferreira et al., 2018).
• The Microsoft Academic Graph (MAG) database (Herrmannova & Knoth, 2016; Wang et al., 2019), one of the three products of the Microsoft Research project, is one of the largest open publication and citation data sets. It is populated automatically, using bibliographic data from web pages crawled by the Bing search engine, also a Microsoft product. The data can be accessed using the Academic Knowledge API. It should be noted that MAG does not contain structured data on affiliation country. French outputs were identified (by the Curtin Open Knowledge Initiative team) by applying a query to the affiliation string (OriginalAffiliation data element from the MAG PaperAuthorAffiliations table, linked via the PaperID to the DOI) that sought to determine whether the affiliation string ended with “France” (or one of a small set of non-English names). The resulting number may not match that in the online COKI country dashboard, which maps affiliation country from GRIDs in MAG to the country of organization in the GRID database¹¹.
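For illustration, the affiliation test used in the MAG query (see the regular expression in the sample query of Table 1) can be transcribed in Python as below. This is only a sketch: the original query ran in SQL against the MAG tables, the function names are hypothetical, and the sample affiliation strings are invented.

```python
import re

# Same pattern as in the Table 1 query: "France", "Frankreich", or "Francia"
# followed by a non-word character, whitespace, or the end of the string.
FRANCE_RE = re.compile(r"Fran(ce|kreich|cia)(?:\W|\s+|$)")


def mentions_france(affiliation):
    """True if the raw affiliation string matches the pattern."""
    return FRANCE_RE.search(affiliation) is not None


def is_french_output(affiliations):
    """A paper is retained if at least one author affiliation matches."""
    return any(mentions_france(a) for a in affiliations)


# Invented affiliation strings.
examples = [
    "Université PSL, Paris, France",
    "Institut Pasteur, Paris, Frankreich",
    "University of California, San Francisco, USA",
    "Franceville, Gabon",
]
for text in examples:
    print(mentions_france(text), "-", text)
```

Because the pattern is case sensitive and works on free text, a string such as “PARIS, FRANCE” would not match in this transcription; this kind of string matching is one plausible reason why counts obtained this way can differ from those based on structured organization identifiers.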
Some of the characteristics of these databases as well as the number of documents obtained
for 1 year (the year 2019), in the framework of the query “France 2015–2020” carried out in
October 2021 are presented in Table 1.
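As an illustration of how such a query can be scripted, the sketch below builds paged request URLs for the HAL sample query listed in Table 1. The endpoint https://api.archives-ouvertes.fr/search/, the Solr-style parameters (q, fq, fl, rows, start), the doiId_s field, and the way the two criteria are combined as filters are assumptions to be checked against the HAL API documentation; no request is actually sent here.

```python
from urllib.parse import urlencode

HAL_SEARCH = "https://api.archives-ouvertes.fr/search/"  # assumed endpoint
BATCH = 10_000  # export batch size reported for HAL in Table 1


def hal_page_urls(year, country, total):
    """Yield one URL per batch needed to page through `total` records."""
    for start in range(0, total, BATCH):
        params = {
            "q": "*:*",  # match everything, then filter (assumed usage)
            "fq": [f"producedDateY_i:{year}", f"structCountry_s:{country}"],
            "fl": "doiId_s",  # assumed name of the DOI field
            "rows": BATCH,
            "start": start,
        }
        yield HAL_SEARCH + "?" + urlencode(params, doseq=True)


# 158,937 is the number of HAL records reported for France, 2019.
urls = list(hal_page_urls(2019, "fr", 158_937))
print(len(urls))
print(urls[0])
```

Generating the URLs separately from the download step makes the batch limits listed in Table 1 explicit and lets each batch be retried independently.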
2.3. Aggregation of Results for Publications Identified by a DOI
As mentioned above, to facilitate the aggregation of results and to avoid duplication, we have
chosen, as does the BSO (French Open Access Monitoring), to restrict the cross-matching of
data to publications identified by a DOI.
Table 2 shows the counts obtained for the year 2019: DOIs are available for 94% of the documents indexed in Scopus and 85% of those in WoS. Note, in addition, that most of the documents without a DOI are conference communications (for France and the year 2019: 54% of the documents without a DOI in Scopus are communications; 78% in WoS). For ADS, the documents without a DOI are mainly conference abstracts, while documents without a DOI represent only 1% of PubMed records.
For the HAL archive, the DOI is not systematically filled in because it is not a compulsory metadata field at the time of deposit. While only 2–3% of the
11 https://openknowledge.community/dashboards/coki-open-access-dashboard/.
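Table 1. Sources used: queries, number of records returned for the year 2019

Scopus
  Sample query (France, year 2019): AFFILCOUNTRY (france) and PUBYEAR = 2019
  Number of documents, France 2019: 123,181
  Types of documents: All
  Domains: All areas
  Practical limitations: Export in batches of 20,000

Web of Science
  Sample query (France, year 2019): CU = FRANCE AND PY = 2019
  Number of documents, France 2019: 124,790
  Types of documents: All
  Domains: All areas
  Practical limitations: Export in batches of 5,000

HAL (Open Archive, France)
  Sample query (France, year 2019): Via API: producedDateY_i:2019 structCountry_s:fr
  Number of documents, France 2019: 158,937
  Types of documents: Open archive of French laboratories
  Domains: All areas
  Practical limitations: Export in batches of 10,000

NASA/ADS
  Sample query (France, year 2019): aff: "France" AND year:2019-2019
  Number of documents, France 2019: 19,997
  Types of documents: All
  Domains: Physics and Astrophysics
  Practical limitations: Export in batches of 500

PubMed
  Sample query (France, year 2019): (France[Affiliation]) AND ("2019"[Date – Publication])
  Number of documents, France 2019: 56,038
  Types of documents: All
  Domains: Medicine, Biology, Health
  Practical limitations: Export in batches of 10,000

MAG
  Sample query (France, year 2019): mag.Year = 2019 AND ((SELECT COUNT(1) FROM UNNEST(mag.authors) as auth WHERE REGEXP_EXTRACT(auth.OriginalAffiliation, r'Fran(ce|kreich|cia)(?:\W|\s+|$)') is not null) > 0
  Number of documents, France 2019: 101,885
  Types of documents: All (with DOI)
  Domains: All areas
  Practical limitations: (COKI, private communication)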
documents characterized as articles in WoS or Scopus do not have a DOI recorded, this proportion rises to 22% for documents characterized as articles in HAL. In addition, the open archive contains many unpublished documents, preprints, reports, or theses that do not have a DOI (or do not have one yet): Together with book chapters, these documents represent half of the publications without a DOI, which will not be considered in the rest of the study.
However, we will return to HAL in Section 5 for a discussion of grey literature.
Note that for MAG, we had direct access to the DOI lists through the COKI team, whom we
thank for their help.
Table 2. DOI counts in the six sources for the year 2019. The last column shows the numbers of documents without DOIs in the Article category alone.

Query France 2019   Number of documents   Documents with DOI   % DOI   Category: Articles with no DOI
Scopus              123,181               115,273              94      1,709
WoS                 124,790               101,377              85      2,763
HAL                 158,937               66,836               42      16,992
ADS                 19,997                15,731               79      56
PubMed              56,038                55,516               99      522
MAG                                       101,885              –       –
Table 3. Unpaywall cross-reference: DOI and year of publication

Source                  Total with DOI 2019   DOI confirmed Unpaywall 2019
Scopus                  115,273               111,422
WoS                     101,377               96,712
HAL                     66,836                63,413
ADS                     15,731                15,410
PubMed                  55,516                48,047
MAG                     101,885               102,338
Total Corpus FR-2019                          139,514
2.4. Open Access and External Validation: Using Unpaywall
One of the objectives of this study is the measurement of the share of open access to publications. For this we use the Unpaywall database¹², which is the leading database in this field (Holly, 2018; Piwowar et al., 2018).
This database offers a simplified access mode (by batches of 1,000 DOIs), which allows us to easily obtain the status of a publication (open or closed access, with the publisher and/or in an open archive) at the time of the query. It is also possible to download a complete version of the database (called a snapshot). For this study, we used the version dated February 2021. For the year 2019, this version lists more than 6 million publications.
Querying the Unpaywall database also allows us to validate the DOIs identified in the previous step: We consider that DOIs not found in Unpaywall generally correspond to identifiers that have not been confirmed by Crossref, the agency that certifies their quality and continuity.
Moreover, it is not uncommon to find differences in the date of publication from one database to another, often due to the time lag between the version published online (early access) and the “final” publication. We have chosen to use the year of publication provided in the Unpaywall database as the reference year (see Table 3), whether or not it is consistent with the year of publication mentioned in the source database. This choice is also the one adopted by the BSO (French Open Access Monitoring).
Table 3 presents the results of the cross-matching between the six sources and their validation with Unpaywall.
The first column recalls the number of DOIs obtained from each source, already presented
in Table 2. The second column presents the numbers of DOIs found in Unpaywall and
recorded in this database as published in 2019.
Note that to obtain the counts in Table 3, we cross-referenced the results of the queries covering the whole period 2015–2020 for the six sources with the year 2019 from Unpaywall. Discrepancies in publication dates affect about 8% of the documents. Because of the reassignment of publication dates, the number of DOIs confirmed for a given year (second column of Table 3) may be larger than the original number of DOIs for that year (as is the case for MAG), despite a small loss of unidentified DOIs.
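The validation and year reassignment described above can be sketched as follows. The data structures (a set of DOIs from the multi-year source query and a dictionary giving the Unpaywall publication year per DOI) and the sample values are assumptions made for illustration.

```python
def corpus_for_year(source_dois, unpaywall_year, year):
    """Keep the DOIs found in Unpaywall whose Unpaywall year equals `year`.

    `source_dois`: DOIs returned by the multi-year source query.
    `unpaywall_year`: dict mapping DOI -> publication year per Unpaywall.
    DOIs absent from `unpaywall_year` are treated as unconfirmed and dropped.
    """
    return {doi for doi in source_dois if unpaywall_year.get(doi) == year}


# Invented sample: year recorded by the source vs. year recorded by Unpaywall.
source = {
    "10.1000/a": 2019,  # same year in both
    "10.1000/b": 2018,  # source says 2018, Unpaywall says 2019: gained
    "10.1000/c": 2019,  # source says 2019, Unpaywall says 2020: lost
    "10.1000/d": 2019,  # unknown to Unpaywall: unconfirmed, lost
}
unpaywall = {"10.1000/a": 2019, "10.1000/b": 2019, "10.1000/c": 2020}

kept = corpus_for_year(set(source), unpaywall, 2019)
print(sorted(kept))
```

Because the selection is driven only by the Unpaywall year, a document dated 2018 in its source can enter the 2019 corpus, which is how a confirmed count can exceed the original count for a given source.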
12 Unpaywall: https://www.unpaywall.org.
Table 4. Cross-referencing of FR-2019 sources with BSO data (source BSO: Jeangirard, 2019)

                                                   France 2019 (# DOI)   Contribution to the global corpus
Corpus FR-2019                                     139,514               83%
BSO 2019                                           153,705               92%
In common                                          125,807               75%
BSO only                                           27,898                17%
FR-2019 only                                       13,707                8%
Global corpus FR-2019 + BSO (without duplicates)   167,412               100%
In the following section, the 139,514 records of the FR-2019 corpus (second column of Table 3) are cross-referenced with the BSO.
3. COMPARISON OF THE FR-2019 AND BSO DATA SETS
3.1. Overlap of the Two Sets
The corpus thus constituted (FR-2019) can now be compared with that of the French Open Science Barometer (BSO), which also aims to cover all French production, for several years including 2019¹³.
Because the BSO data are also restricted to publications with a DOI and have benefited from the Unpaywall query, it is easy to cross-reference the two sets of DOIs. The result is summarized in Table 4.
Table 4 shows that, if we restrict ourselves to the data validated after querying Unpaywall,
8% of the total data set (i.e., 13,707 DOIs) are not identified in the BSO, while conversely 17%
of the documents (i.e., 27,898 DOIs) had not been identified in our FR-2019 corpus.
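The comparison behind Table 4 amounts to set operations on two lists of DOIs. The sketch below shows the principle on invented DOIs and then rechecks the shares using the counts published in Table 4; the function names are hypothetical.

```python
def compare_corpora(a, b):
    """Return overlap counts for two DOI collections and their union."""
    a, b = set(a), set(b)
    return {
        "common": len(a & b),
        "a_only": len(a - b),
        "b_only": len(b - a),
        "union": len(a | b),
    }


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


# Principle, on invented DOIs.
fr = {"10.1000/a", "10.1000/b", "10.1000/c"}
bso = {"10.1000/b", "10.1000/c", "10.1000/d", "10.1000/e"}
print(compare_corpora(fr, bso))

# Consistency check with the counts of Table 4.
common, bso_only, fr_only = 125_807, 27_898, 13_707
union = common + bso_only + fr_only
print(union)
print(share(common + fr_only, union), share(common + bso_only, union))
print(share(common, union), share(bso_only, union), share(fr_only, union))
```

The three disjoint subsets (in common, BSO only, FR-2019 only) add up to the global corpus, so each percentage in Table 4 can be recomputed from these three counts alone.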
3.2. Data from Our FR-2019 Corpus That Are Not Part of the BSO Corpus
The data from our sources not included in the BSO corpus seem to correspond mainly to a failure to identify the France affiliation in the algorithm developed by Jeangirard (2019). This was expected and corresponds to what Jeangirard calls false negatives, which he says he cannot estimate and which we estimate here at 9% of the BSO corpus.
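One way to arrive at this order of magnitude, assuming that the estimate is the ratio of the FR-2019-only DOIs to the size of the BSO corpus (both counts taken from Table 4), is the following:

```latex
\[
\frac{N_{\text{FR-2019 only}}}{N_{\text{BSO}}}
  = \frac{13\,707}{153\,705}
  \approx 0.089
  \approx 9\%
\]
```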
In our study, the main sources contributing to this subset not identified by the BSO are Scopus (63%), WoS (41%), and MAG (23%); these shares overlap because a document may be listed in several sources. We believe that these documents come from the less represented publishers, for which it is likely that specific algorithms for extracting the country of affiliation have not been developed for the BSO.
3.3. Data from the BSO Corpus Absent from the FR-2019 Corpus
The data from the BSO corpus not included in our sources come mainly from humanities and
social sciences journals (44%), biomedical journals (24%), and basic biology journals (12%).
13 The BSO data were produced in December 2020 and are made available on the Open Data portal of the Ministry of Higher Education (MESRI): https://data.enseignementsup-recherche.gouv.fr/explore/dataset/open-access-monitor-france/.
Table 5. Search in Scopus for false positives of BSO
| Search in Scopus | Number | Comment |
|---|---|---|
| BSO only | 27,898 | |
| Not found | 23,706 | Journals not indexed by Scopus |
| Found in other years | 576 | Year assignment discrepancy |
| Found same year | 3,616 | Probable false positives from the BSO |
We note a significantly higher proportion of articles in French in this BSO-only subset: 31%
compared to the average of 15% for the global corpus (the language analysis methodology will
be presented in Section 4.4).
These are mainly journals or resources not covered by the databases we have used, in
particular, documentary resources and journals with a national scope in French or English. For
example, the most represented sources in this set are the following:
• Case Medical Research: international database of clinical trials
• Faculty Opinions: postpublication peer review of the biomedical literature
• SSRN Electronic Journal: database of social science preprints.
This set of documents also includes the “false positives” reported by Jeangirard (2019) (i.e.,
documents that his algorithm wrongly identified as publications from the France set). These are
publications for which none of the authors has an affiliation in France but which the BSO
algorithm nevertheless retained. Jeangirard estimates the false positive rate at 4% (which
would correspond to about 6,000 publications for the year 2019).
We can try to estimate this share of false positives more precisely: The search in Scopus of
DOIs corresponding to publications collected for the BSO but not confirmed by our other
sources sheds light on this subject (Table 5).
This search allows us to identify 3,616 probable false positives: The Scopus database
recognizes the DOI, the year is indeed 2019, but the article does not include, according to
Scopus, an affiliation in France. This corresponds to 3.5% of the DOIs common to BSO and
Scopus, which thus seems compatible with the 4% estimated by Jeangirard (2019). Let us note
once again that the cross-referencing of the different sources highlights divergent assessments of
the publication date of the articles.
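The three-way breakdown of Table 5 and the 3.5% figure can be reproduced with a short sketch. The `scopus_index` lookup (a mapping from DOI to a record holding a publication year) and the sample DOIs are hypothetical stand-ins for a real Scopus query; only the counts copied from Tables 5 and 7 come from the article.

```python
from collections import Counter


def classify_bso_only(doi, scopus_index, target_year=2019):
    """Assign a BSO-only DOI to one of the three rows of Table 5."""
    record = scopus_index.get(doi)
    if record is None:
        return "not found"                 # journal not indexed by Scopus
    if record["year"] != target_year:
        return "found in other years"      # year assignment discrepancy
    return "found same year"               # probable false positive


# Hypothetical Scopus lookup and BSO-only DOIs (invented for illustration).
scopus_index = {
    "10.1000/x1": {"year": 2019},
    "10.1000/x2": {"year": 2018},
}
bso_only = ["10.1000/x1", "10.1000/x2", "10.1000/x3"]
tally = Counter(classify_bso_only(d, scopus_index) for d in bso_only)

# Counts reported in Table 5, and the BSO/Scopus overlap from Table 7.
table5 = {"not found": 23_706, "found in other years": 576, "found same year": 3_616}
common_bso_scopus = 102_736
false_positive_rate = table5["found same year"] / common_bso_scopus
print(dict(tally), f"{false_positive_rate:.1%}")
```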
3.4. Contribution of the Different Sources to the Overall Aggregated Corpus
Table 6 presents the contributions of each source to the overall corpus (aggregating the two
approaches: our FR-2019 corpus and the one collected for the BSO).
Table 6. Share of each source in the overall aggregated corpus (FR-2019 + BSO). The second line
gives the number of documents found in only one source (year 2019).
| | Scopus | WoS | HAL | ADS | PubMed | MAG | BSO |
|---|---|---|---|---|---|---|---|
| Share of total | 67% | 58% | 38% | 9% | 29% | 61% | 92% |
| In one source | 7,211 | 4,009 | 6,335 | 155 | 230 | 11,665 | 27,898 |
Table 7. Cross contributions from each source to the overall France 2019 corpus
| | Scopus | WoS | HAL | ADS | PubMed | MAG | BSO |
|---|---|---|---|---|---|---|---|
| Scopus | 111,422 | 88,327 | 54,611 | 14,851 | 46,503 | 85,873 | 102,736 |
| WoS | 88,327 | 96,712 | 49,664 | 14,507 | 44,493 | 76,286 | 91,159 |
| HAL | 54,611 | 49,664 | 63,413 | 10,521 | 22,934 | 45,608 | 61,440 |
| ADS | 14,851 | 14,507 | 10,521 | 15,410 | 3,243 | 11,270 | 14,780 |
| PubMed | 46,503 | 44,493 | 22,934 | 3,243 | 48,047 | 44,071 | 47,696 |
| MAG | 85,873 | 76,286 | 45,608 | 11,270 | 44,071 | 102,338 | 98,604 |
| BSO | 102,736 | 91,159 | 61,440 | 14,780 | 47,696 | 98,604 | 153,705 |
Table 7 presents the cross-referenced contributions of the sources to the overall corpus. It
should be noted that the fact that a publication is identified in database A and is not identified
in database B as being part of the corpus does not necessarily mean that it is absent from
database B: It may be present in database B, but with a DOI that is missing or incorrect, or
without an identified French affiliation.
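A cross-contribution matrix of this kind, together with the "in one source" counts of Table 6, can be derived from per-source DOI sets. The sketch below is only an illustration under stated assumptions: the function names and the toy DOI sets are invented, and real inputs would first need the DOI clean-up discussed in the text.

```python
from collections import Counter


def overlap_matrix(sources):
    """Pairwise intersection sizes; the diagonal is the size of each source."""
    names = list(sources)
    return {a: {b: len(sources[a] & sources[b]) for b in names} for a in names}


def unique_to_one_source(sources):
    """Number of DOIs that appear in exactly one source, per source."""
    seen = Counter(doi for dois in sources.values() for doi in dois)
    return {name: sum(1 for doi in dois if seen[doi] == 1)
            for name, dois in sources.items()}


# Toy DOI sets (invented identifiers, for illustration only).
toy = {
    "Scopus": {"d1", "d2", "d3"},
    "WoS": {"d2", "d3", "d4"},
    "HAL": {"d3", "d5"},
}
matrix = overlap_matrix(toy)
singles = unique_to_one_source(toy)

for name, row in matrix.items():
    print(name, row)
print(singles)
```

The matrix is symmetric by construction, which is a convenient sanity check when transcribing a table such as Table 7.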
4. ESTIMATED RATE OF OPEN ACCESS PUBLICATIONS
4.1. Unpaywall Results: Share of Open Access Publications (Year 2019)
Table 8 presents the main results of the open access (OA) rate estimate observed in February
2021, based on Unpaywall.org, for each of the sources.
Note that we do not use here the original BSO open access observations, which were made
at a different date, and thus could not be directly compared to ours. We have chosen to refer
all the calculations to the same observation date: that of the production of the Unpaywall
snapshot in February 2021.
Table 8. Share of open access for each source (OA calculation: Unpaywall). For all sources,
including the BSO: Open access as of February 2021.
| Publications France 2019 | # DOI | Total OA | % OA | OA articles | % OA articles |
|---|---|---|---|---|---|
| Scopus | 111,422 | 61,854 | 56% | 56,538 | 59% |
| WoS | 96,712 | 56,975 | 59% | 54,473 | 60% |
| HAL | 63,413 | 42,316 | 67% | 38,513 | 69% |
| ADS | 15,410 | 11,981 | 78% | 11,608 | 80% |
| PubMed | 48,047 | 29,907 | 62% | 29,818 | 63% |
| MAG | 102,338 | 53,392 | 52% | 48,647 | 55% |
| FR-2019 | 128,344 | 75,070 | 54% | 67,285 | 57% |
| BSO | 153,953 | 82,267 | 54% | 70,197 | 57% |
| FR-2019 + BSO | 167,412 | 88,365 | 53% | 75,413 | 56% |
Table 8 illustrates the results obtained, depending on the sources used, to determine the
open access rate (%OA) observed in February 2021: Overall we find 54% both for the BSO
corpus and for our FR-2019 corpus. The aggregation of the two results gives a slightly lower
overall rate of 53% for all 167,412 publications.
The reader is referred to Aliakbar and Stahlschmidt (2019) for a discussion of the merits and
limitations of these rate calculations. In their conclusions the authors recommend the use of
multiple sources to reduce errors and gaps, and this is clearly a view we share. Cross-matching
all these data sets allowed us to correct, at least in part, the problem of false negatives and to
obtain a refined estimate of the open access rate.
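The rate calculation itself is a simple proportion once each DOI carries an open access flag. The sketch below assumes records shaped like Unpaywall entries, with an `is_oa` boolean and a `genre` string (field names as we understand the Unpaywall data format); the records are invented, and the function is an illustration rather than the procedure used in this study.

```python
from collections import defaultdict


def oa_rates(records):
    """Open access rate for the whole set ("all") and for each document type."""
    totals = defaultdict(int)
    open_counts = defaultdict(int)
    for rec in records:
        for key in ("all", rec["genre"]):
            totals[key] += 1
            if rec["is_oa"]:
                open_counts[key] += 1
    return {key: open_counts[key] / totals[key] for key in totals}


# Toy records (invented DOIs and flags, for illustration only).
records = [
    {"doi": "10.1000/a", "genre": "journal-article", "is_oa": True},
    {"doi": "10.1000/b", "genre": "journal-article", "is_oa": True},
    {"doi": "10.1000/c", "genre": "journal-article", "is_oa": False},
    {"doi": "10.1000/d", "genre": "journal-article", "is_oa": True},
    {"doi": "10.1000/e", "genre": "book-chapter", "is_oa": False},
    {"doi": "10.1000/f", "genre": "book-chapter", "is_oa": True},
]
rates = oa_rates(records)
print({key: f"{value:.0%}" for key, value in rates.items()})

# Cross-check with the aggregated row of Table 8 (counts copied from the table).
aggregated_rate = 88_365 / 167_412
print(f"{aggregated_rate:.1%}")
```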
4.2. Variation in Open Access Rate by Document Type
The calculation for the articles alone, using the journal-article nomenclature proposed by
Unpaywall, shows, as expected, a higher open access rate: 57% for the BSO corpus
and for our corpus, and 56% for the corpus resulting from the aggregation of the two sets.
This category is interesting insofar as the national policy enacted by Article 30 of the 2016
law mentioned above concerns a “scientific writing […] published in a periodical appearing at
least once a year” (i.e., in our terminology, a scientific journal article).
In this context, it is worth mentioning that the approaches presented here do not distinguish
between publicly funded research articles and other articles from private and industrial
research, for which the open science commitments do not apply.
The details of the types of documents identified for both approaches are given in Table 9.
The percentages observed are very similar in the two data sets (FR-2019 and BSO) for articles
and conference proceedings. The differences are more noticeable for book chapters and can
be explained by a significantly wider coverage in the case of the BSO. The “other” category
covers too many different situations for the differences in the observed rate to be significant.
4.3. Observation of Annual Trends (2015–2020)
To test our ability to measure annual changes, we extracted the data for each of the years
2015 to 2020, following the same methodology as outlined for 2019; the annual counts are
presented in Table 10. For 2019 the counts are identical to those in Tables 3 and 7. Table 11
extends the comparison of Table 4 to the years 2015 to 2019 (the BSO does not cover the year 2020).
For the observation of open access, the reference remains Unpaywall (snapshot of February
2021). The results are shown in Table 12. As expected, they show a steady increase in the
open access rate from 2015 to 2019.
Table 9. Share of open access by document type (overall data set FR-2019 + BSO)
| Type of document | Number of DOIs | Share | % OA | % OA FR2019 | % OA BSO |
|---|---|---|---|---|---|
| journal-article | 133,638 | 80% | 56 | 57 | 57 |
| book-chapter | 13,268 | 8% | 25 | 24 | 27 |
| proceedings-article | 12,987 | 8% | 40 | 40 | 41 |
| other | 7,519 | 4% | 60 | 54 | 64 |
| Total FR-2019 + BSO | 167,412 | 100% | 53 | | |
Table 10. Counts obtained for publications from France for the years 2015 to 2020, using the
same methodology
| Year | HAL | PubMed | ADS | WoS | Scopus | MAG |
|---|---|---|---|---|---|---|
| 2015 | 51,734 | 41,287 | 15,387 | 91,028 | 108,195 | 92,722 |
| 2016 | 57,851 | 44,785 | 16,396 | 96,186 | 112,486 | 96,850 |
| 2017 | 59,451 | 46,057 | 16,806 | 95,731 | 113,077 | 95,808 |
| 2018 | 61,997 | 46,490 | 16,254 | 97,012 | 114,069 | 99,356 |
| 2019 | 63,413 | 48,047 | 15,410 | 96,712 | 111,422 | 102,338 |
| 2020 | 59,796 | 55,293 | 16,077 | 94,237 | 104,533 | 100,608 |
Table 11. Results of the two approaches for the years 2015 to 2019; see Table 4
| Year | FR-2015-20 | BSO | Global corpus | % FR15-20 | % BSO |
|---|---|---|---|---|---|
| 2015 | 133,817 | 140,493 | 157,053 | 85 | 89 |
| 2016 | 138,885 | 148,476 | 164,772 | 84 | 90 |
| 2017 | 138,845 | 146,179 | 162,179 | 86 | 90 |
| 2018 | 141,059 | 159,380 | 171,987 | 82 | 93 |
| 2019 | 139,514 | 153,705 | 167,412 | 83 | 92 |
The year 2020, observed in February 2021, has a different character, as the observation is
made before the 6-month, 1-year, or in some cases longer embargoes have expired.
In Table 13, we give examples of observations of the open access status (Gold, Green, etc.)
as provided by Unpaywall for 2 distinct years. These few examples allow us to affirm the
absence of significant bias between the 2 data sets: The two strategies lead to quite similar
estimates.
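The comparison of open access status between two corpora can be sketched as follows. The `oa_status` field and its five values follow our understanding of the Unpaywall data format; the helper names and toy records are assumptions, while the percentage dictionaries are copied from Table 13.

```python
from collections import Counter

STATUSES = ("gold", "hybrid", "bronze", "green", "closed")


def status_shares(records):
    """Share of each open access status among Unpaywall-like records."""
    counts = Counter(rec["oa_status"] for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {status: 0.0 for status in STATUSES}
    return {status: counts.get(status, 0) / total for status in STATUSES}


def max_gap(shares_a, shares_b):
    """Largest difference between two corpora over all statuses."""
    return max(abs(shares_a[s] - shares_b[s]) for s in STATUSES)


# Toy records (invented, for illustration only).
toy = [{"oa_status": s} for s in ("gold", "gold", "green", "closed", "closed")]
toy_shares = status_shares(toy)

# Percentages copied from Table 13 (in points).
fr_2015 = {"gold": 12, "hybrid": 12, "bronze": 4, "green": 18, "closed": 55}
bso_2015 = {"gold": 13, "hybrid": 12, "bronze": 4, "green": 16, "closed": 54}
fr_2019 = {"gold": 18, "hybrid": 9, "bronze": 7, "green": 20, "closed": 46}
bso_2019 = {"gold": 18, "hybrid": 10, "bronze": 6, "green": 20, "closed": 46}
gap_2015 = max_gap(fr_2015, bso_2015)
gap_2019 = max_gap(fr_2019, bso_2019)
print(toy_shares, gap_2015, gap_2019)
```

With the rounded values of Table 13, the two corpora never differ by more than 2 percentage points for any status, which is one simple way to quantify the similarity of the two estimates.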
A comparison of the rates obtained for the French corpus with those obtained on an
international scale would go beyond the limits of this article: The interested reader may refer to the
recent study by Robinson-Garcia, Costas, and van Leeuwen (2020), which also presents a
discussion of the different modes of open access mentioned here (Gold, Bronze, Hybrid, Green).
Table 12. Change in open access rate, observed in February 2021 for publications dated from
2015 to 2020 (the global corpus is the aggregation of the two data sets FR-2015-20 and BSO)
| Year of publication | Open access rate: FR2015-2020 | Open access rate: BSO | Open access rate: Global corpus |
|---|---|---|---|
| 2015 | 45.4% | 45.5% | 44.5% |
| 2016 | 47.8% | 47.6% | 46.6% |
| 2017 | 50.0% | 50.0% | 48.9% |
| 2018 | 51.7% | 50.6% | 49.9% |
| 2019 | 53.8% | 53.5% | 52.8% |
| 2020 | 52.6% | – | 52.6% |
Table 13. Open access status, observed in February 2021 for publications dated 2015 and 2019
| Open access status | FR-2015 | BSO 2015 | FR-2019 | BSO 2019 |
|---|---|---|---|---|
| Gold | 12% | 13% | 18% | 18% |
| Hybrid | 12% | 12% | 9% | 10% |
| Bronze | 4% | 4% | 7% | 6% |
| Green | 18% | 16% | 20% | 20% |
| Closed | 55% | 54% | 46% | 46% |
Table 14. France 2019: Language by document type
| Type of document | % English | % French |
|---|---|---|
| journal-article | 82 | 16 |
| book-chapter | 77 | 14 |
| proceedings-article | 97 | 1 |
| other | 87 | 4 |
4.4. Are Articles in French More Often in Open Access?
It is possible to cross-reference the observations presented above with information on the
language in which the article is written: Are articles in French, for example, more often, or less
often, in open access? To examine this, as this information is not systematically provided by all
databases, we analyzed the title of the article as provided by Unpaywall by applying the
simple language detection software langdetect¹⁴. Only detections assigned a displayed
probability greater than 0.99 were retained.
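The 0.99 threshold can be illustrated independently of the detection library. In the sketch below, each title comes with a list of (language, probability) candidates; to our understanding, langdetect's `detect_langs` returns objects exposing `lang` and `prob` from which such pairs could be built, but that mapping, the function name, and the sample values are assumptions made for illustration.

```python
def confident_language(candidates, threshold=0.99):
    """Return the top language only if its probability exceeds the threshold."""
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang if prob > threshold else None


# Hypothetical detector output for four titles (invented values).
detections = {
    "title-1": [("en", 0.9999)],
    "title-2": [("fr", 0.995), ("en", 0.004)],
    "title-3": [("fr", 0.57), ("it", 0.43)],   # ambiguous: discarded
    "title-4": [],                              # nothing detected: discarded
}
retained = {title: confident_language(cands) for title, cands in detections.items()}
kept = [lang for lang in retained.values() if lang is not None]
share_french = kept.count("fr") / len(kept)
print(retained, share_french)
```

Shares such as those reported for English and French are then computed over the retained detections only, which is why they refer to "detected documents" rather than to the whole corpus.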
In the framework of our study of French national scientific production, for the year 2019,
the two main languages concerned are English (83% of the detected documents) and French
(15%), the rest of the detected languages not exceeding 3% in total (Table 14). The distribution
varies with the document type: In particular, contributions to (mostly international)
conferences (labeled proceedings-article in Unpaywall) are almost always in English.
Table 15 shows that the rates of open access observed vary greatly according to the
discipline (extracted here from the BSO). As a general rule, documents detected as being written in
French are much less frequently in open access.
14 Langdetect (https://pypi.org/project/langdetect/) is a Python port of Nakatani Shuyo’s language-detection
library (https://github.com/shuyo/language-detection). When published (in 2010), it claimed to reach 99%+
accuracy on 49 supported languages.
Table 15. France 2019: Open access rate by language and discipline. Calculations are restricted
to documents for which the language can be determined and whose discipline is assessed in the BSO.
| Discipline | Number of documents | % documents in French | % OA documents in English | % OA documents in French |
|---|---|---|---|---|
| Total with language and discipline detected | 153,272 | 15 | 58 | 26 |
| Chemistry | 7,050 | 5 | 53 | 50 |
| Computer and information sciences | 10,225 | 8 | 55 | 37 |
| Mathematics | 3,914 | 11 | 73 | 55 |
| Medical research | 48,191 | 24 | 57 | 8 |
| Biology (fond.) | 21,535 | 12 | 69 | 57 |
| Social sciences | 8,020 | 69 | 40 | 37 |
| Physical sciences, Astronomy | 15,701 | 7 | 64 | 73 |
| Earth, Ecology, Energy and applied biology | 12,222 | 16 | 59 | 42 |
| Engineering | 4,402 | 24 | 40 | 40 |
| Humanities | 9,388 | 66 | 41 | 43 |
Most of the French-language material without open access comes from three areas: medical
research (including journals for practitioners), the humanities, and the social sciences.
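The ranking of disciplines by volume of closed French-language documents can be approximated from the rounded values of Table 15. This is a rough sketch, not a result of the study: because the percentages are rounded, the computed counts are only orders of magnitude, and the variable names are illustrative.

```python
# Per discipline, from Table 15:
# (number of documents, % documents in French, % OA among documents in French)
table15 = {
    "Chemistry": (7_050, 5, 50),
    "Computer and information sciences": (10_225, 8, 37),
    "Mathematics": (3_914, 11, 55),
    "Medical research": (48_191, 24, 8),
    "Biology (fond.)": (21_535, 12, 57),
    "Social sciences": (8_020, 69, 37),
    "Physical sciences, Astronomy": (15_701, 7, 73),
    "Earth, Ecology, Energy and applied biology": (12_222, 16, 42),
    "Engineering": (4_402, 24, 40),
    "Humanities": (9_388, 66, 43),
}

# Estimated closed French-language documents: N x %French x (100 - %OA French).
closed_french = {
    name: n_docs * pct_french * (100 - pct_oa_french) / 10_000
    for name, (n_docs, pct_french, pct_oa_french) in table15.items()
}
ranking = sorted(closed_french, key=closed_french.get, reverse=True)
top_three = ranking[:3]
for name in top_three:
    print(f"{name}: about {closed_french[name]:,.0f} documents")
```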
5. RESULTS AND DISCUSSION
5.1. Discussion of the Sources Used
The six sources we have chosen to use actually provide three different insights:
• Scopus and WoS provide extensive coverage of the literature in peer-reviewed journals
and international conference proceedings; while Scopus has a slightly wider coverage,
the use of the two databases together provides a 10–20% improvement over what
would be obtained with a single database. The MAG database, which will soon be
discontinued, brings, as a complement, a set of documents not indexed by WoS and
Scopus, contributing to a further increase of about 10% of the corpus identified in our
study.
• The HAL open archive is filled at the initiative of the authors, who deposit the
bibliographic record (metadata) and, if applicable, the full text in its preprint or publisher
version. Part of the archive contains grey literature (Schöpfel, Prost, & Ndiaye, 2019);
moreover, the DOI is not systematically filled in. The metadata and DOI do not
seem to be thoroughly quality controlled: For this reason, this source should be
considered with caution for bibliometric studies. However, it is a reference source for French
research and a cornerstone of the national open science policy.
• The ADS and PubMed databases are thematic databases and are therefore only
intended to cover parts of the research field. On the other hand, both databases are
deep in their field and cover grey literature and sources not indexed by the large
generalist databases.
This study sheds new light on the coverage of French scientific production by the various
databases. While WoS and Scopus deliberately restrict themselves to
peer-reviewed publications appearing in referenced journals or books (Baas et al., 2020;
Birkle et al., 2020), the use of complementary databases, whether thematic or not, gives
a more complete view of the share of literature that is not referenced or poorly referenced,
and that may be less general in scope geographically, linguistically, or thematically. We
observe that the strategy adopted by the BSO allows for the systematic collection of data on
a significant quantity of these publications, which are often neglected in bibliometric studies.
Far from identifying an optimal source, our study shows the importance of diversifying the
sources used to provide complementary views on a country’s publications.
5.2. Characteristics of Excluded National Production Without DOI
Publications without a DOI form a heterogeneous group of peer-reviewed and grey literature.
The share of unreferenced grey literature can be gauged in particular through the HAL
open archive, by considering documents without a DOI, which were not taken into account
in our study. However, it is advisable to make sure beforehand that the absence of a DOI is not
due to incomplete metadata, but corresponds to articles from journals that do not use this
identification mode. Because the open archive relies mainly on author deposits and is therefore
neither complete nor systematic, this approach can only be qualitative.
We note, first of all, without surprise, a very strong disciplinary variation: Only 15% of the
documents in the field of humanities and social sciences (SSH) deposited in HAL have a DOI,
while the proportion is 70% in chemistry or physics, the global average being 42% for the year
2019 considered here (see Table 2). This rate reaches 50% in the field of computer science.
Among the records without a DOI the share of records from the SSH fields is 52%, compared
to an SSH share of 12% of publications with a DOI.
We also note that the full text is deposited significantly less frequently for documents with-
out a DOI: 39%, whereas the average is 44%.
We can also note, for HAL (year 2019), a strong differentiation according to the language
(we use here the language recorded in the archive):
• Among the documents without a DOI, the proportion of articles in French is 57% (49%
for articles in English), while for articles with a DOI it is only 8%.
• 91% of the documents in French have no DOI (or no DOI indicated).
We found nearly 90,000 records without a DOI in HAL (Table 2). If we restrict ourselves to
documents classified as articles, book chapters or conference papers, nearly 56,000 records
without a DOI (or without a DOI indicated) listed in HAL had to be excluded from this
study.
For journal articles (category ART in HAL) we tried to estimate the proportion of records for
which a DOI exists but was not entered: If we consider the articles without a DOI
published in a journal for which other articles have a DOI, we note that this concerns 31% of the
articles without a DOI (in HAL in 2019). We therefore estimate that at least 30% of the records
without a DOI in HAL actually have a DOI that was simply not filled in. Most of this 30% can
be expected to be covered by the other sources. If this assumption is correct, it would mean
that out of the 56,000 records without a DOI entered in HAL, around 40,000 are articles or
communications that genuinely lack a DOI and were therefore not taken into
account. This point will be the subject of further study.
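The arithmetic behind the figure of around 40,000 documents can be made explicit. The sketch below uses only numbers quoted in this article; the final coverage ratio is one possible way to arrive at a share of roughly 80% for the DOI-based corpus (as stated in the conclusions), and relating it to the aggregated corpus of 167,412 DOIs is our assumption, not a documented step of the study.

```python
records_without_doi = 56_000      # HAL articles, chapters, conference papers without a DOI
pct_doi_not_filled_in = 30        # lower bound estimated from journals known to assign DOIs

# Records that genuinely lack a DOI (integer arithmetic to avoid rounding noise).
truly_without_doi = records_without_doi * (100 - pct_doi_not_filled_in) // 100
print(truly_without_doi)          # close to the "around 40,000" quoted in the text

# Assumption: compare the rounded 40,000 with the aggregated DOI corpus.
aggregated_corpus = 167_412
rounded_missing = 40_000
doi_coverage = aggregated_corpus / (aggregated_corpus + rounded_missing)
print(f"{doi_coverage:.1%}")
```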
5.3. Validation of the Open Strategy Used for the BSO
The comparison between the result obtained with our sources and the open strategy of the
BSO validates the use of the latter. In a few words, this strategy consists
in scanning all the DOIs available from Unpaywall, and also from HAL, to identify either
French authors or a mention of France in the address.
We observe that this strategy makes it possible to identify more than 20,000 records (if we
exclude the false positives) not found by our approach (i.e., about 17% of the total): These
come mainly from journals that are not indexed in the major international databases,
particularly in the biomedical and social science fields.
Our approach also identified 13,707 DOIs not included in the BSO, which allows us to
estimate the false negative rate of the BSO strategy to be close to 9% (see Table 4).
Recurrent sources of error include conflicting approaches to publication date (with the
usual confusions between the first online publication and the final date of the reference;
see for example Liu, 2021).
6. CONCLUSIONS
The main results of our study are as follows.
• Our study validates a strategy of determining a collection of scientific publications with
an affiliation in France for a given year. This corpus is deliberately restricted by the use of
DOIs. We present the details of the counts for the year 2019. We estimate that the corpus
of outputs with a DOI covers around 80% of French national scholarly production in
2019, with an additional set of 40,000 articles or communications without a DOI not
taken into account here.
• Our determination of cross-coverage by the various databases provides useful insight for
users of these databases. We believe that these counts can help users of these databases
to identify overlaps and complementarities, in a context comparable to that of our study.
• The use of multiple sources ensures validation at a sufficiently fine level to shed light on
the geographical, thematic, linguistic, etc. disparities that affect bibliometric studies. Our
study confirms the relevance of adopting a multisource approach.
• The open-source strategy used by the BSO effectively identifies the vast majority of
publications with a persistent identifier (DOI) for Open Science monitoring.
• The determination of the open access rate has been refined. It should be remembered
that this rate depends on the date of observation and may differ depending on the type of
documents we wish to consider. Our objective is not to comment here on the 54% or
53% rate reached for the opening of publications in 2019 (observed in February 2021),
but to note the convergence of two different methodologies that allow us to accurately
draw the shifting landscape of open science at the country level.
The question of the place of the national open archive HAL, and of other open archives, in
the strategy of Open Science deserves a dedicated treatment, which should be the subject of a
further study. The objective of such a study would be to examine how to reconcile, on the one
hand, the specific challenges of open archives, which must allow easy depositing by
authors, and, on the other hand, the requirements of a referencing and query environment
that should not only provide open access to the scientific knowledge
produced by French research, but also support the most diverse possible readership in
consulting it.
ACKNOWLEDGMENTS
We thank the two reviewers for their stimulating comments, which we believe have
significantly helped to improve our work.
AUTHOR CONTRIBUTIONS
Lauranne Chaignon: Validation, Writing—review & editing. Daniel Egret: Data curation,
Supervision, Writing—original draft, Writing—review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
No specific funding has been received for this research.
DATA AVAILABILITY
Data tables providing the detailed number of records for each year, as well as a notebook
describing the whole procedure, are available as supplementary data files on HAL Open
Archive: https://hal.archives-ouvertes.fr/hal-03537679. Subscriptions to Scopus and WoS are
required to replicate the research, with the methods described above.
REFERENCES
Aliakbar, A., & Stahlschmidt, S. (2019). Merits and limits: Applying open data to monitor open access publications in bibliometric databases. SocArXiv. https://doi.org/10.31235/osf.io/npj4h
Archambault, É., Campbell, D., Gingras, Y., & Larivière, V. (2009). Comparing bibliometric statistics obtained from the Web of Science and Scopus. Journal of the American Society for Information Science and Technology, 60, 1320–1326. https://doi.org/10.1002/asi.21062
Baas, J., Schotten, M., Plume, A., Côté, G., & Karimi, R. (2020). Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies. Quantitative Science Studies, 1(1), 377–386. https://doi.org/10.1162/qss_a_00019
Bartol, T., Budimir, G., Dekleva-Smrekar, D., Pusnik, M., & Juznic, P. (2014). Assessment of research fields in Scopus and Web of Science in the view of national research evaluation in Slovenia. Scientometrics, 98(2), 1491–1504. https://doi.org/10.1007/s11192-013-1148-8
Berthaud, C., Charnay, D., & Fargier, N. (2021). Diffuser et pérenniser le savoir scientifique: 20 ans d’histoire de HAL. Histoire de la Recherche Contemporaine, 10(2). https://doi.org/10.4000/hrc.6330
Birkle, C., Pendlebury, D. A., Schnell, J., & Adams, J. (2020). Web of Science as a data source for research on scientific and scholarly activity. Quantitative Science Studies, 1(1), 363–376. https://doi.org/10.1162/qss_a_00018
Carvalho, J., Laranjeira, C., Vaz, V., & Mendes Moreira, J. (2017). Monitoring a national open access funder mandate. Procedia Computer Science, 106, 283–290. https://doi.org/10.1016/j.procs.2017.03.027
Charnay, D., & Michau, C. (2007). L’archive ouverte HAL. JRES 2007. Strasbourg, France.
Garfield, E. (1964). Science Citation Index—A new dimension in indexing science. Science, 144(3619), 649–654. https://doi.org/10.1126/science.144.3619.649, PubMed: 17806988
Gorraiz, J., Melero-Fuentes, D., Gumpenberger, C., & Valderrama-Zurián, J.-C. (2016). Availability of digital object identifiers (DOIs) in Web of Science and Scopus. Journal of Informetrics, 10(1), 98–109. https://doi.org/10.1016/j.joi.2015.11.008
Guerrero-Bote, V. P., Chinchilla-Rodríguez, Z., Mendoza, A., & de Moya-Anegón, F. (2021). Comparative analysis of the bibliographic data sources Dimensions and Scopus: An approach at the country and institutional levels. Frontiers in Research Metrics and Analytics, 5, 593494. https://doi.org/10.3389/frma.2020.593494, PubMed: 33870055
Hendricks, G., Tkaczyk, D., Lin, J., & Feeney, P. (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1), 414–427. https://doi.org/10.1162/qss_a_00022
Herrmannova, D., & Knoth, P. (2016). An analysis of the Microsoft Academic Graph. D-Lib Magazine, 22(7), 9–10. https://doi.org/10.1045/september2016-herrmannova
Holly, E. (2018). The rise and rise of Unpaywall. Nature, 560(7718), 290–291. https://doi.org/10.1038/d41586-018-05968-3
Huang, C.-K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., … Ozaygen, A. (2020). Comparison of bibliographic data sources: Implications for the robustness of university rankings. Quantitative Science Studies, 1(2), 445–478. https://doi.org/10.1162/qss_a_00031
Ibarra, M. E., Ferreira, J. P., Torrents, M., Hamui, M., Torres, F., … Ferrero, F. (2018). Changes in PubMed affiliation indexing improved publication identification by country. Scientometrics, 115, 1365–1370. https://doi.org/10.1007/s11192-018-2714-x
Jeangirard, E. (2019). Monitoring Open Access at a national level: French case study. 23rd International Conference on Electronic Publishing, ELPUB 2019, Marseille, France. https://doi.org/10.4000/proceedings.elpub.2019.20
Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C. S., Murray, S. S., & Watson, J. M. (2000). The NASA Astrophysics Data System: Overview. Astronomy and Astrophysics Supplement Series, 143, 41. https://doi.org/10.1051/aas:2000170
Laakso, M., & Björk, B. C. (2012). Anatomy of open access publishing: A study of longitudinal development and internal structure. BMC Medicine, 10, 124. https://doi.org/10.1186/1741-7015-10-124, PubMed: 23088823
Liu, W. (2021). A matter of time: Publication dates in Web of Science Core Collection. Scientometrics, 126, 849–857. https://doi.org/10.1007/s11192-020-03697-x
Moed, H. F., Markusova, V., & Akoev, M. (2018). Trends in Russian research output indexed in Scopus and Web of Science. Scientometrics, 116(2), 1153–1180. https://doi.org/10.1007/s11192-018-2769-8
Mongeon, P., & Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: A comparative analysis. Scientometrics, 106(1), 213–228. https://doi.org/10.1007/s11192-015-1765-5
Philipp, T., Botz, G., Kita, J.-C., Richards, P., Sänger, A., & Reumaux, M. (2021). Open access monitoring: Guidelines and recommendations for research organisations and funders. Science Europe, Briefing Paper, May. https://doi.org/10.5281/zenodo.4905553
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., … Haustein, S. (2018). The state of OA: A large-scale analysis of the prevalence and impact of open access articles. PeerJ, 6, e4375. https://doi.org/10.7717/peerj.4375, PubMed: 29456894
Pölönen, J., Laakso, M., Guns, R., Kulczycki, E., & Sivertsen, G. (2020). Open access at the national level: A comprehensive analysis of publications by Finnish researchers. Quantitative Science Studies, 1(4), 1396–1428. https://doi.org/10.1162/qss_a_00084
Pranckutė, R. (2021). Web of Science (WoS) and Scopus: The titans of bibliographic information in today’s academic world. Publications, 9(1), 12. https://doi.org/10.3390/publications9010012
Puuska, H.-M., Nikkanen, J., Engels, T., Guns, R., Ivanović, D., & Pölönen, J. (2020). Integration of national publication databases—Towards a high-quality and comprehensive information base on scholarly publications in Europe. ITM Web Conference 33, 02001. https://doi.org/10.1051/itmconf/20203302001
Robinson-Garcia, N., Costas, R., & van Leeuwen, T. N. (2020). Open access uptake by universities worldwide. PeerJ, 8, e9410. https://doi.org/10.7717/peerj.9410, PubMed: 32714658
Schöpfel, J., & Prost, H. (2019). The scope of open science monitoring and grey literature. 12th Conference on Grey Literature and Repositories, National Library of Technology (NTK), Prague, Czech Republic.
Schöpfel, J., Prost, H., & Ndiaye, E. (2019). Going green. Publishing academic grey literature in laboratory collections on HAL. GL21 International Conference on Grey Literature, 22–23 October 2019, Hannover, Germany.
Simmonds, A. W. (1999). The Digital Object Identifier (DOI). Publishing Research Quarterly, 15, 10–13. https://doi.org/10.1007/s12109-999-0022-2
Sivertsen, G. (2019). Developing current research information systems as data sources for studies of research. In W. Glänzel, H. F. Moed, U. Schmoch, & M. Thelwall (Eds.), Springer Handbook of Science and Technology Indicators (pp. 667–683). Cham: Springer. https://doi.org/10.1007/978-3-030-02511-3_25
Van Leeuwen, T. N., Moed, H. F., Tijssen, R. J. W., Visser, M. S., & Van Raan, A. F. J. (2001). Language biases in the coverage of the Science Citation Index and its consequences for international comparisons of national research performance. Scientometrics, 51, 335–346. https://doi.org/10.1023/A:1010549719484
Vera-Baceta, M. A., Thelwall, M., & Kousha, K. (2019). Web of Science and Scopus language coverage. Scientometrics, 121, 1803–1813. https://doi.org/10.1007/s11192-019-03264-z
Visser, M., van Eck, N. J., & Waltman, L. (2021). Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic. Quantitative Science Studies, 2(1), 20–41. https://doi.org/10.1162/qss_a_00112
Wang, K., Shen, Z., Huang, C., Wu, C.-H., Eide, D., … Rogahn, R. (2019). A review of Microsoft Academic Services for science of science studies. Frontiers in Big Data, 2, 45. https://doi.org/10.3389/fdata.2019.00045, PubMed: 33693368
Wang, K., Shen, Z., Huang, C., Wu, C.-H., Dong, Y., & Kanakia, A. (2020). Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1), 396–413. https://doi.org/10.1162/qss_a_00021