RESEARCH ARTICLE

Identifying scientific publications countrywide
and measuring their open access: The case of
the French Open Science Barometer (BSO)

an open access journal

Université PSL, Paris, France

Lauranne Chaignon

and Daniel Egret

Keywords: databases, open science, publications

Citation: Chaignon, L., & Egret, D.
(2022). Identifying scientific
publications countrywide and
measuring their open access: The case
of the French Open Science Barometer
(BSO). Quantitative Science Studies,
3(1), 18–36. https://doi.org/10.1162/qss_a_00179

DOI:
https://doi.org/10.1162/qss_a_00179

Peer Review:
https://publons.com/publon/10.1162/qss_a_00179

Received: 16 July 2021
Accepted: 17 January 2022

Corresponding Author:
Daniel Egret
daniel.egret@psl.eu

Handling Editor:
Ludo Waltman

ABSTRACT

We use several sources to collect and evaluate academic scientific publication on a countrywide
scale, and we apply this approach to the case of France for the years 2015–2020, while presenting a
more detailed analysis focused on the reference year 2019. These sources are diverse:
databases available by subscription (Scopus, Web of Science) or open to the scientific
community (Microsoft Academic Graph), the national open archive HAL, and databases
serving thematic communities (ADS and PubMed). We show the contribution of the different
sources to the final corpus. These results are then compared to those obtained with another
approach, that of the French Open Science Barometer for monitoring open access at the
national level. We show that both approaches provide a convergent estimate of the open
access rate. We also present and discuss the definitions of the concepts used, and list the main
difficulties encountered in processing the data. The results of this study contribute to a better
understanding of the respective contributions of the main databases and their complementarity
in the broad framework of a countrywide corpus. They also shed light on the calculation of
open access rates and thus contribute to a better understanding of current developments in the
field of open science.

1.

INTRODUCTION

Open access to publications (e.g., Laakso & Björk, 2012; Piwowar, Priem et al., 2018) within
the general framework of Open Science is now an issue shared by many institutions, universities
and research organizations, and funders. France is no exception: Two national plans for
Open Science have been successively launched, in 2018 and 2021, by the Ministry of Higher
Education, Research and Innovation (MESRI). Generalizing open access to publications is the
first axis of these two plans, with a goal of 100% of French scientific publications in open
access by 2030¹, either through a publication natively in open access or through a deposit
in an open archive. This national plan is in line with the European Plan S².

To support the policies thus deployed, a good knowledge of the state of publications and
their open access rate seems necessary, and many measurement tools have been developed
for this purpose, in different contexts, such as the European Open Science Monitor (OSM), the

1 National Plan for Open Science: https://www.ouvrirlascience.fr/national-plan-for-open-science-4th-july-2018/;
https://www.ouvrirlascience.fr/second-national-plan-for-open-science/.

2 Plan S: https://www.coalition-s.org/.

Copyright: © 2022 Lauranne Chaignon
and Daniel Egret. Published under a
Creative Commons Attribution 4.0
International (CC BY 4.0) license.

The MIT Press

German Open Access Monitor (OAM), the Danish Open Access Indicator, or the COKI Open
Access Dashboard. Other countries have also adopted national strategies for monitoring open
access (Carvalho, Laranjeira et al., 2017).

In its guide to assisting research organizations and funders in setting up a tool for monitoring
open access publications (Philipp, Botz et al., 2021), the organization Science Europe considers
the constitution of the corpus of publications to be analyzed as one of the key stages in
the process. We could add that it is even one of the major challenges of this exercise. Indeed,
no database provides an easy and complete answer to this question. The large databases, such
as the Web of Science (WoS) and Scopus, have the advantage of systematically listing a large
part of the millions of scientific publications published each year in the world. The metadata
are standardized and allow for efficient searching. However, the coverage of science, technology,
and medicine (STM) and of English-language publications in international journals is
favored, while other disciplinary fields, other languages of publication, and other sources or
document types are less fully surveyed (Mongeon & Paul-Hus, 2016; Van Leeuwen, Moed
et al., 2001; Vera-Baceta, Thelwall, & Kousha, 2019). Moreover, these databases are accessible
only by subscription, so their data are not open or reusable. If we consider thematic databases
such as PubMed or NASA/ADS, their metadata are both high quality and open. On the
other hand, they cover a very specific disciplinary field: An exhaustive census of publications
in a multidisciplinary context will therefore require multiple sources.

As for open archives, while they have the advantage of listing types of publications,
languages, and sources that are often absent from large databases, they offer insufficiently
standardized metadata, which complicates their collection and processing. Thus, no single
database offers comprehensiveness, standardized metadata, and openness. As Huang, Neylon
et al. (2020) conclude in a recent article: “Any institutional evaluation framework that is serious
about coverage should consider incorporating multiple bibliographic sources.”

Current Research Information Systems (CRIS) can be a way around this difficulty, provided
that they are not fed solely by the large commercial databases mentioned above. They are
increasingly being used in universities to help manage, understand, and evaluate research
activities. However, most CRIS are, today, still used only at an institutional level (Sivertsen,
2019). Although their aggregation at the country level to constitute a national base is progressing,
it is still most often correlated with the implementation of a public funding policy based
on scientific publication performance, as is the case in Denmark, Finland, Hungary, Italy,
Norway, and Poland (Puuska, Nikkanen et al., 2020). Even if the motivation is primarily financial,
a national database is also an opportunity to set up an effective monitoring of open access policies
at the country level, as Finland has experimented with (Pölönen, Laakso et al., 2020).

For countries that do not have such a pool of data, the implementation of a monitoring tool
on this scale implies selecting from among the existing databases, whether commercial or not,
those that will best meet the objective set. The German Ministry of Education and Research
has thus chosen to use the Dimensions and WoS databases to establish its corpus³. Universities
UK, the association of 140 UK universities, has chosen to use Scopus to produce its latest
report on the effects of new policies to promote open access⁴.

In the case of France, the objective of the MESRI was to set up a tool that would enable the
steering of the national policy on open science, by measuring, on an annual basis, the level of

3 https://jugit.fz-juelich.de/synoa/oam-dokumentation/-/wikis/Quelldatenbanken/Quelldatenbanken.
4 https://www.universitiesuk.ac.uk/sites/default/files/field/downloads/2021-09/monitoring-transition-open-access-2017-annexe-1-methodology.pdf.

open access of all publications with at least one French affiliation. This request was accompanied
by a very specific requirement: “a transparent methodology and reproducible results.”
It is with this in mind that the French Open Science Barometer (BSO) was carried out⁵, as
described by Eric Jeangirard (2019). For the BSO, the constitutive choice is to use only open
sources. The methodology used consists in scanning all the papers referenced in Unpaywall
and in the national open archive HAL (see below) to identify either the French authors or the
presence of the mention of France in the affiliation. The publications thus identified were then
enriched with information on their scientific discipline, using natural language processing
(NLP), also based on open source code, to determine, from the title, the discipline to which
a document belongs. Finally, the open access status was determined using the Unpaywall
database. The corpus obtained by this strategy is available in open access from the MESRI
OpenData portal⁶. In accordance with the recommendations made at the European level
(Open Access Monitoring: Philipp et al., 2021), the French National Open Science Barometer
is published on an annual basis.

About 150,000 publications are thus identified each year by the BSO. The purpose of this
study is to consider an alternative approach, this time based on the use of the main open or
nonopen bibliographic databases, and to analyze the extent to which this new corpus differs
from that of the BSO. Our approach is based on the use of six complementary sources, namely
WoS, Scopus, Microsoft Academic Graph, PubMed, NASA/ADS, and the HAL open archive, to
identify and assess academic scientific publication at the scale of a country, in this case
France, for publications released during the 6 years 2015–2020. As the year scale seemed
to us more relevant to characterize scientific production, we chose to highlight, in the context
of this article, the data related to the year 2019⁷. We then compare the corpus obtained with
that of the BSO, and we show to what extent the diversity of the sources used makes it possible
to refine the identification and characterization of French scientific production, as well as the
estimation of the open access rate.

While there is an abundant literature on the comparison between Scopus, WoS, and other
generalist databases (see, for example, in a national production context Archambault,
Campbell et al. [2009], Bartol, Budimir et al. [2014], and Moed, Markusova, and Akoev
[2018], or for a statistical comparison of large reference databases Mongeon and Paul-Hus
[2016], Pranckutė [2021], and Visser, Van Eck, and Waltman [2021]), our study provides a
detailed quantitative view in the specific context of French research. Far from identifying a
source that would be optimal, our study shows the importance of diversifying the sources used
to provide complementary views on a country’s publication.

2. CONSTITUTION OF THE FRANCE 2015–2020 CORPUS: DATA AND METHODS

2.1. Definitions

Before describing in detail the methodology used to establish our corpus, we present and
discuss here the main concepts used.

2.1.1. Digital Object Identifier (DOI)

The DOI⁸ is a persistent identifier that can be assigned to any type of content, be it text, software,
data sets, etc. (Simmonds, 1999). It will be used as a common metadata element for the entire study.

5 https://bso.esr.gouv.fr.
6 https://data.enseignementsup-recherche.gouv.fr/explore/dataset/open-access-monitor-france/.
7 The counts for each of the 6 years are available in the supplementary data file.
8 DOIs are managed by the nonprofit association CrossRef (Hendricks, Tkaczyk et al., 2020).

2.1.2. Scientific publications

We consider here scientific publications indexed in databases (private or public) and accessible
in open archives. All types of documents are taken into account. This primarily concerns articles,
generally published in international peer-reviewed journals, but also conference proceedings,
book chapters, or any other publication, provided that it has a DOI. However, restricting the
corpus to documents with a DOI is an important limitation, which we must explain here.

To facilitate the aggregation of results, and to avoid duplication, we have chosen, as does
the BSO (French Open Access Monitoring), to restrict the cross-referencing of data to publications
identified by a DOI. This step is necessary to allow the efficient cross-referencing of
documents identified in each database by their DOI identifier, common to all databases. In
addition, the Unpaywall database, which will inform us about open access in the next step,
only lists publications with a DOI.

Let us note that the requirement of the presence of a DOI immediately rules out a certain
number of journals that do not adhere to this very general technology of persistent identifiers
(Gorraiz, Melero-Fuentes et al., 2016); some of these journals may be, as Wang, Shen et al.
(2020) point out, key journals in their discipline, with the example, for the field of Artificial
Intelligence, of the Journal of Machine Learning Research.

Moreover, grey literature, under which we can group preprints, reports, theses, and in some
cases conference proceedings (Schöpfel & Prost, 2019), is often ignored by open access measurement
tools, mainly for two reasons: The first corresponds to a concern to discard literature
whose scientific relevance cannot be sufficiently controlled (lack of peer review); the second
is rather related to technical considerations, in particular a difficulty in identifying these publications
in the absence of complete and standardized metadata, especially persistent identifiers.
In practice, this leads to ignoring a large proportion of the work published in certain
disciplines where the thematic field, the regional vocation, or the applicative nature of the
publications takes precedence over international referencing.

Our methodology, based on the use of the DOI, therefore effectively excludes some of the
documents that might be of interest to us. This is why we will come back to publications without
DOIs at the end of our study, by proposing an estimate of the share of grey literature in
French national production (Section 5.2).

Finally, it should be noted that the publications taken into account to establish our corpus
are exclusively those that have a digital version: It is this digital version for which we will try to
measure the degree of accessibility. Thus, peer-reviewed research published in books or
monographs is only covered when it is in digital format and has a DOI. For this reason,
nonacademic publishing generally falls outside the scope of our study.

2.1.3. Open access

A scientific article that is only available on payment of a subscription or a fee (price per article)
is considered closed. In contrast, a scientific article that is freely available, either on a
publisher’s website or after the deposit of the full text (in its final layout or not) on an open archive,
is deemed open.

Our source of information for the open access status of an article will be the Unpaywall
database (Piwowar et al., 2018), specifically the data in the “is_oa” field. If the value returned
for a given publication is equal to “True,” the publication will be considered open. If this value
is “False,” the publication will be considered closed. The so-called “bronze” status is
considered open.
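The classification rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: only the field name “is_oa” comes from the text, the records are invented, and the helper names `is_open` and `open_access_rate` are hypothetical.

```python
from typing import Iterable, Mapping


def is_open(record: Mapping[str, object]) -> bool:
    """Return True when the Unpaywall-style record has is_oa set to True."""
    # Assumption of this sketch: anything other than the boolean True
    # (including a missing field) is treated as closed.
    return record.get("is_oa") is True


def open_access_rate(records: Iterable[Mapping[str, object]]) -> float:
    """Share of records considered open, as a number between 0 and 1."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_open(r) for r in records) / len(records)


# Invented example records; the DOIs are placeholders, not real identifiers.
sample = [
    {"doi": "10.0000/example.1", "is_oa": True},
    {"doi": "10.0000/example.2", "is_oa": False},
    {"doi": "10.0000/example.3", "is_oa": True},
    {"doi": "10.0000/example.4", "is_oa": True},
]
rate = open_access_rate(sample)  # three open records out of four
```

Because the rule relies on the single boolean field, no separate handling of the individual open access categories is needed in this sketch.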

Note that the open access status may vary over time, because a closed publication may
have its embargo lifted or be subsequently deposited in an open archive. Thus, in our study,
it will be the status observed in February 2021, as recorded in the Unpaywall database snapshot
for that date.

Let us recall that for France, the Law for a Digital Republic of 7 October 2016⁹ establishes
the possibility of deposit in an open archive of the postprint of any scientific article resulting
from research funded at least 50% by the state or public authorities, at the expiration of a
period of 6–12 months depending on the scientific field (respectively, STM or Humanities &
Social Sciences).

2.2. Sources Used to Constitute the FR-2015-2020 Corpus

The collection of metadata related to a large set of publications is facilitated by the use of
databases that systematically, if not exhaustively, collect a large part of the millions of scientific
publications published each year worldwide.

In this article, we have favored the databases providing a search capability for the mention
of the country in the affiliation, and we have collected the publications whose affiliation
mentions the country considered in our study, France, using the corresponding query modes of
six databases that, to our knowledge, effectively cover French scientific production.

We did not use the Dimensions database, as it is not considered to be a reliable source for
establishing a corpus on a country scale (Guerrero-Bote, Chinchilla-Rodríguez et al., 2021).

We use the following databases in our study:

• Scopus (Baas, Schotten et al., 2020) references more than 25,000 journals and is considered
one of the most comprehensive databases for international peer-reviewed journals.
Query by country is possible. Metadata extraction is limited to batches of 20,000 documents.
This database is available by subscription from Elsevier.

• WoS (Birkle, Pendlebury et al., 2020) has been the reference database for scientometrics
since the pioneering work of Garfield (1964). The query by country is provided in the
advanced query mode. This database is available by subscription from Clarivate Analytics.
In this study, we use all the indexes (including ESCI: Emerging Sources) except for the
Book Citation Index, which was not available to us.

• The HAL open archive¹⁰ (Charnay & Michau, 2007) is a national multidisciplinary open
archive intended for the deposit and dissemination of research-level scientific articles
(published or not), theses, and other objects emanating from French or foreign teaching
and research establishments, and public or private laboratories. Created in 2001 with
arXiv as a model, this platform has gradually become one of the main tools for reporting
French research. A partnership agreement in favor of this archive was signed in 2013 by
the Conference of University Presidents (CPU) and 22 institutions. In July 2021, the MESRI
also committed to supporting the development of this archive, in terms of both technical
aspects and governance, as part of its second national plan for open science 2021–2024.

French researchers are invited to deposit on this platform the products of their research,
whether they are publications (article in a journal, communication in a conference, chapter

9 Law for a Digital Republic; see in particular its article 30: https://www.legifrance.gouv.fr/dossierlegislatif/JORFDOLE000031589829/.
10 https://hal.archives-ouvertes.fr/.

of a book, book, poster, file, patent), unpublished documents (preprint, working
document, report), academic works (thesis, HDR, course), or research data (image, video,
software, map, or sound). The recorded documents are either in the form of a record only or
accompanied by the full text of the article. This production can be grouped within different
collections or portals relating to a theme (SHS for example), a medium (images and videos), or
a research structure (university, laboratory, or research team), but it remains possible to carry
out queries covering all portals and collections. After 20 years of use (Berthaud, Charnay, &
Fargier, 2021), more than 2,700,000 works are now recorded in this archive.

HAL data can be queried using an advanced query or the API. The latter, which is available
free of charge, allows the identification of the country of affiliation.

• The NASA/ADS database (Kurtz, Eichhorn et al., 2000) is one of the most recognized
examples of a bibliographic database covering a research field: astrophysics and physics.
Its query mode allows querying by country. Access is free.

• The PubMed database is one of the preferred and free access points for metadata related to
biomedical science research. A query by affiliation is possible (Ibarra, Ferreira et al., 2018).

• The Microsoft Academic Graph (MAG) database (Herrmannova & Knoth, 2016; Wang
et al., 2019), one of the three products of the Microsoft Research project, is one of the
largest open publication and citation data sets. It is populated automatically, using bibliographic
data from web pages crawled by the Bing search engine, also a Microsoft product.
The data can be accessed using the Academic Knowledge API. It should be noted that MAG
does not contain structured data on affiliation country. Identification of French outputs
(provided by the Curtin Open Knowledge Initiative team) was performed by applying a query to the
affiliation string (OriginalAffiliation data element from the MAG PaperAuthorAffiliations
table, linked via the PaperID to the DOI) that sought to determine whether the affiliation
string ended with “France” (or one of a small set of non-English names). This number
may not match that in the online COKI country dashboard, which maps affiliation country
from GRIDs in MAG to the country of organization in the GRID database¹¹.
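The string-based identification described for MAG can be illustrated with a short Python sketch. The pattern is taken from the sample MAG query reported in Table 1, but everything else is an assumption made for illustration: the original query ran in a SQL environment, whereas Python's `re` module is used here, and the function name `mentions_france` and the affiliation strings are invented.

```python
import re

# Pattern copied from the sample MAG query in Table 1. Python's `re` syntax
# is used here; the original query was written for a SQL environment.
FRANCE_PATTERN = re.compile(r"Fran(ce|kreich|cia)(?:\W|\s+|$)")


def mentions_france(affiliation: str) -> bool:
    """Return True when the affiliation string contains a 'France' variant."""
    return FRANCE_PATTERN.search(affiliation) is not None


# Invented affiliation strings for illustration.
examples = [
    "Université PSL, Paris, France",
    "Institut für Physik, Frankreich",
    "Universidad de Sevilla, España",
    "Franconia Research Center, Germany",
]
flags = [mentions_france(a) for a in examples]
```

As written in this sketch, the pattern also matches a country name that is followed by punctuation inside the string, not only one at the very end, which is one reason why counts obtained this way may differ from those based on structured affiliation data.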

Some of the characteristics of these databases, as well as the number of documents obtained
for 1 year (the year 2019) in the framework of the query “France 2015–2020” carried out in
October 2021, are presented in Table 1.

2.3. Aggregation of Results for Publications Identified by a DOI

As mentioned above, to facilitate the aggregation of results and to avoid duplication, we have
chosen, as does the BSO (French Open Access Monitoring), to restrict the cross-matching of
data to publications identified by a DOI.
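The DOI-based aggregation can be sketched as follows. This is a hedged illustration rather than the procedure actually applied in the study: the normalization step (lowercasing, stripping a resolver prefix) is an assumption of the sketch, the function names `normalize_doi` and `aggregate` are hypothetical, and the DOIs are placeholders.

```python
from typing import Dict, Iterable, Set

# Prefixes that sometimes precede a DOI in exported metadata (assumed list).
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip a resolver prefix, if present."""
    doi = raw.strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def aggregate(sources: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each normalized DOI to the set of sources that list it."""
    seen: Dict[str, Set[str]] = {}
    for name, dois in sources.items():
        for raw in dois:
            if raw and raw.strip():  # records without a DOI are skipped
                seen.setdefault(normalize_doi(raw), set()).add(name)
    return seen


# Placeholder DOIs; the same document appears under different spellings.
sources = {
    "Scopus": ["10.0000/A.1", "10.0000/a.2"],
    "WoS": ["https://doi.org/10.0000/a.1", "10.0000/a.3"],
    "HAL": ["doi:10.0000/a.2", "", "10.0000/a.4"],
}
corpus = aggregate(sources)  # four distinct DOIs after deduplication
```

Keeping the set of contributing sources for each DOI makes it straightforward to report, afterwards, how much each database adds to the merged corpus.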

Table 2 shows the counts obtained for the year 2019: DOIs are available for 94% of the
documents indexed in Scopus and 85% of those in WoS. Note, in addition, that a majority of
the documents without a DOI correspond to conference communications (for France and
the year 2019: 54% of the documents without a DOI in Scopus are communications; 78% in
WoS). For ADS, the documents without a DOI are mainly conference abstracts, while documents
without a DOI represent only 1% of PubMed.

For the HAL archive, the DOI is not systematically filled in
because it is not a compulsory metadata field at the time of deposit. While only 2–3% of the
11 https://openknowledge.community/dashboards/coki-open-access-dashboard/.

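Table 1. Sources used: queries, number of records returned for the year 2019

Scopus
  Sample query (France, year 2019): AFFILCOUNTRY (france) AND PUBYEAR = 2019
  Number of documents (France 2019): 123,181
  Types of documents: All
  Domains: All areas
  Practical limitations: Export in batches of 20,000

Web of Science
  Sample query (France, year 2019): CU = FRANCE AND PY = 2019
  Number of documents (France 2019): 124,790
  Types of documents: All
  Domains: All areas
  Practical limitations: Export in batches of 5,000

HAL (Open Archive, France)
  Sample query (France, year 2019): Via API: producedDateY_i:2019 structCountry_s:fr
  Number of documents (France 2019): 158,937
  Types of documents: Open archive of French laboratories
  Domains: All areas
  Practical limitations: Export in batches of 10,000

NASA/ADS
  Sample query (France, year 2019): aff: “France” AND year:2019-2019
  Number of documents (France 2019): 19,997
  Types of documents: All
  Domains: Physics and Astrophysics
  Practical limitations: Export in batches of 500

PubMed
  Sample query (France, year 2019): (France[Affiliation]) AND (“2019”[DatePublication])
  Number of documents (France 2019): 56,038
  Types of documents: All
  Domains: Medicine, Biology, Health
  Practical limitations: Export in batches of 10,000

MAG
  Sample query (France, year 2019): mag.Year = 2019 AND ((SELECT COUNT(1) FROM UNNEST(mag.authors) as auth WHERE REGEXP_EXTRACT(auth.OriginalAffiliation, r'Fran(ce|kreich|cia)(?:\W|\s+|$)') is not null) > 0
  Number of documents (France 2019): 101,885
  Types of documents: All (with DOI)
  Domains: All areas
  Note: (COKI, private communication)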

documents characterized as articles in WoS or Scopus do not have a DOI recorded, this proportion
rises to 22% for documents characterized as articles in HAL. In addition, the open
archive contains many unpublished documents, preprints, reports, or theses that do not have
(or do not yet have) a DOI: Together with book chapters, these documents represent half of the
publications without a DOI, which will not be considered for the rest of the study.

However, we will return to HAL in Section 5 for a discussion of grey literature.

Note that for MAG, we had direct access to the DOI lists through the COKI team, whom we
thank for their help.

Table 2. DOI counts in the six sources for the year 2019. The last column shows the numbers of
documents without DOIs in the Article category alone.

Query France 2019   Number of documents   Documents with DOI   % DOI   Category: Articles with no DOI
Scopus              123,181               115,273              94      1,709
WoS                 124,790               101,377              85      2,763
HAL                 158,937               66,836               42      16,992
ADS                 19,997                15,731               79      56
PubMed              56,038                55,516               99      522
MAG                 -                     101,885              -       -

Table 3. Unpaywall cross-reference: DOI and year of publication

Source                 Total with DOI 2019   DOI confirmed Unpaywall 2019
Scopus                 115,273               111,422
WoS                    101,377               96,712
HAL                    66,836                63,413
ADS                    15,731                15,410
PubMed                 55,516                48,047
MAG                    101,885               102,338
Total Corpus FR-2019   -                     139,514

2.4. Open Access and External Validation: Using Unpaywall

One of the objectives of this study is the measurement of the share of open access to publications.
For this we use the Unpaywall database¹², which is the leading database in this field
(Holly, 2018; Piwowar et al., 2018).

This database offers a simplified access mode (by batches of 1,000 DOIs) which allows us
to easily obtain the status of a publication (open or closed access, with the publisher and/or in
an open archive) at the time of the query. It is also possible to download a complete version of
the database (called a Snapshot). For this study, we used the version dated February 2021. For
the year 2019, this version lists more than 6 million publications.
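Working with a batch-limited access mode usually means splitting the DOI list into chunks before submitting it. The following Python sketch shows one way to do this, assuming a plain list of DOI strings; the function name `batches` is hypothetical, the DOIs are placeholders, and no call to any Unpaywall service is made here.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 2,500 placeholder DOIs split into batches of at most 1,000.
dois = [f"10.0000/example.{i}" for i in range(2500)]
sizes = [len(batch) for batch in batches(dois)]
```

The same helper would apply, with a different `size`, to the export limits of the bibliographic databases themselves.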

Querying the Unpaywall database also allows us to validate the DOIs identified in the previous
step: We consider that DOIs not found in Unpaywall generally correspond to identifiers
that have not been confirmed by Crossref, the agency that certifies their quality and continuity.

Moreover, it is not uncommon to find differences in the date of publication from one database
to another (often due to the time lag between the version published online (early access)
and the “final” publication). We have chosen to use the year of publication provided in the
Unpaywall database as the reference year (see Table 3), whether or not it is consistent with the
year of publication mentioned in the source database. This choice is also the one adopted by
the BSO (French Open Access Monitoring).
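The two rules just described (discard DOIs not found in Unpaywall, and take the publication year from Unpaywall rather than from the source) can be sketched as a simple filter. The data below are invented, and the function name `confirmed_for_year` is hypothetical; the sketch only illustrates the logic, not the actual processing chain.

```python
from typing import Dict, Iterable, Set


def confirmed_for_year(
    source_dois: Iterable[str],
    unpaywall_year: Dict[str, int],
    year: int,
) -> Set[str]:
    """Keep DOIs that Unpaywall knows and dates to the requested year."""
    # A DOI missing from the Unpaywall mapping returns None and is dropped.
    return {doi for doi in source_dois if unpaywall_year.get(doi) == year}


# Invented Unpaywall years, keyed by placeholder DOI.
unpaywall_year = {
    "10.0000/x.1": 2019,
    "10.0000/x.2": 2020,
    "10.0000/x.3": 2019,
}

# Invented source records with the year reported by the source database.
source = {
    "10.0000/x.1": 2019,  # same year in both: kept
    "10.0000/x.2": 2019,  # reassigned to 2020 by Unpaywall: dropped
    "10.0000/x.3": 2018,  # reassigned to 2019 by Unpaywall: kept
    "10.0000/x.4": 2019,  # unknown to Unpaywall: dropped
}

kept = confirmed_for_year(source, unpaywall_year, 2019)
```

The third record shows why the filter has to be applied to the whole multi-year query rather than to the records a source labels with the target year: a document dated differently by the source can still belong to the reference year.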

Table 3 presents the results of the cross-matching between the six sources and their validation
with Unpaywall.

The first column recalls the number of DOIs obtained from each source, already presented
in Table 2. The second column presents the numbers of DOIs found in Unpaywall and
recorded in this database as published in 2019.

Note that to obtain the counts in Table 3 we cross-referenced the results of queries covering
for the six sources the whole of the years 2015–2020 with the year 2019 from Unpaywall.
Discrepancies in publication dates affect about 8% of the documents. Because of the reassignment
of publication dates, the number of DOIs with confirmed output (second column of
Table 3) for a given year may be larger than the original number of DOIs for this year (case
of MAG), despite a small loss of unidentified DOIs.

12 Unpaywall: https://www.unpaywall.org.

Table 4. Cross-referencing of FR-2019 sources with BSO data (source BSO: Jeangirard, 2019)

                                                   France 2019 (# DOI)   Contribution to the global corpus
Corpus FR-2019                                     139,514               83%
BSO 2019                                           153,705               92%
In common                                          125,807               75%
BSO only                                           27,898                17%
FR-2019 only                                       13,707                8%
Global corpus FR-2019 + BSO (without duplicates)   167,412               100%

In the following section, the 139,514 records described in column 2 will be
cross-referenced with the BSO.

3. COMPARISON OF THE FR-2019 AND BSO DATA SETS

3.1. Overlap of the Two Sets

The corpus thus constituted (FR-2019) can now be compared with that of the French Open
Science Barometer (BSO), which also aims to cover all French production, for several years
including 2019¹³.

Because the BSO data are also restricted to publications with a DOI and have benefited
from the Unpaywall query, it is easy to cross-reference the two sets of DOIs. The result is
summarized in Table 4.

Table 4 shows that, if we restrict ourselves to the data validated after querying Unpaywall,
8% of the total data set (i.e., 13,707 DOIs) are not identified in the BSO, while conversely 17%
of the documents (i.e., 27,898 DOIs) had not been identified in our FR-2019 corpus.
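The overlap figures can be reproduced with elementary set operations, and the percentages of Table 4 follow from the three disjoint counts. The Python sketch below first shows the set logic on invented identifiers (the function name `overlap` is hypothetical), then recomputes the shares from the counts reported in Table 4.

```python
def overlap(a: set, b: set) -> dict:
    """Sizes of the intersection, the two differences, and the union."""
    return {
        "in_common": len(a & b),
        "a_only": len(a - b),
        "b_only": len(b - a),
        "union": len(a | b),
    }


# Toy example with invented identifiers.
fr_toy = {"d1", "d2", "d3"}
bso_toy = {"d2", "d3", "d4", "d5"}
toy = overlap(fr_toy, bso_toy)

# Counts reported in Table 4.
in_common, bso_only, fr_only = 125_807, 27_898, 13_707
global_corpus = in_common + bso_only + fr_only

# Shares of the global corpus, rounded to the nearest percent.
shares = {
    "Corpus FR-2019": round(100 * (in_common + fr_only) / global_corpus),
    "BSO 2019": round(100 * (in_common + bso_only) / global_corpus),
    "In common": round(100 * in_common / global_corpus),
    "BSO only": round(100 * bso_only / global_corpus),
    "FR-2019 only": round(100 * fr_only / global_corpus),
}
```

With the Table 4 counts, the three disjoint subsets sum to the 167,412 DOIs of the global corpus, and the rounded shares match the percentages given in the table.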

3.2. Data from Our FR-2019 Corpus That Are Not Part of the BSO Corpus

The data from our sources not included in the BSO corpus seem to correspond mainly to a
failure to identify the France affiliation in the algorithm developed by Jeangirard (2019). This
was expected and corresponds to what Jeangirard calls false negatives, which he says he cannot
estimate and which we estimate here at 9% of the BSO corpus.

In our study, the main sources contributing to this subset not identified by the BSO are Scopus
(63%), WoS (41%), and MAG (23%). We believe that these documents come from the less
represented publishers, for which it is likely that specific algorithms for extracting the country
of affiliation have not been developed for BSO.

3.3. Data from the BSO Corpus Absent from the FR-2019 Corpus

The data from the BSO corpus not included in our sources come mainly from humanities and
social sciences journals (44%), biomedical journals (24%), and basic biology journals (12%).

13 The BSO data were produced in December 2020 and are made available on the Open Data portal of the Ministry of Higher Education (MESRI): https://data.enseignementsup-recherche.gouv.fr/explore/dataset/open-access-monitor-france/.


Table 5. Search in Scopus for false positives of BSO

| Search in Scopus | Number | Comment |
|---|---|---|
| BSO only | 27,898 | |
| Not found | 23,706 | Journals not indexed by Scopus |
| Found in other years | 576 | Year assignment discrepancy |
| Found same year | 3,616 | Probable false positives from the BSO |

We note a significantly higher proportion of articles in French in this BSO-only subset: 31%
compared to the average of 15% for the global corpus (the language analysis methodology will
be presented in Section 4.4).

These are mainly journals or resources not covered by the databases we have used, in particular documentary resources and journals with a national scope in French or English. For example, the most represented sources in this set are the following:

• Case Medical Research: international database of clinical trials
• Faculty Opinions: postpublication peer review of the biomedical literature
• SSRN Electronic Journal: database of social science preprints.

This set of documents also includes the “false positives” reported by Jeangirard (2019) (i.e., documents that his algorithm wrongly identified as publications from the France set). These are publications for which none of the authors has an affiliation in France but which the BSO algorithm nevertheless retained. Jeangirard estimates the false positive rate at 4% (which would correspond to about 6,000 publications for the year 2019).

We can try to estimate this share of false positives more precisely: The search in Scopus of DOIs corresponding to publications collected for the BSO but not confirmed by our other sources sheds light on this subject (Table 5).

This search allows us to identify 3,616 probable false positives: The Scopus database recognizes the DOI, the year is indeed 2019, but the article does not include, according to Scopus, an affiliation in France. This corresponds to 3.5% of the DOIs common to BSO and Scopus, which thus seems compatible with the 4% estimated by Jeangirard (2019). Let us note once again that the cross-referencing of the different sources highlights divergent assessments of the publication date of the articles.
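The three-way breakdown of Table 5 can be sketched as a simple classification of each BSO-only DOI against a Scopus lookup. This is an illustrative sketch only: `scopus_years` is a hypothetical dictionary standing in for the result of a DOI search in Scopus, and the function names are ours, not part of any Scopus interface.

```python
from collections import Counter


def classify(doi, target_year, scopus_years):
    """Classify one BSO-only DOI from a Scopus lookup.

    scopus_years is a hypothetical dict mapping each DOI known to Scopus
    to the publication year Scopus records for it.
    """
    year = scopus_years.get(doi)
    if year is None:
        return "not found"
    if year != target_year:
        return "found in other years"
    # Known to Scopus for the target year, yet absent from the France
    # extraction: a probable false positive of the BSO.
    return "found same year"


def breakdown(bso_only_dois, target_year, scopus_years):
    """Count BSO-only DOIs in each of the three categories."""
    return Counter(classify(d, target_year, scopus_years) for d in bso_only_dois)


def false_positive_share(n_probable, n_common_with_scopus):
    """Probable false positives as a percentage of DOIs common to BSO and Scopus."""
    return round(100 * n_probable / n_common_with_scopus, 1)


# Figures reported in Tables 5 and 7: prints 3.5
print(false_positive_share(3616, 102736))
```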

3.4. Contribution of the Different Sources to the Overall Aggregated Corpus

Table 6 presents the contributions of each source to the overall corpus (aggregating the two approaches: our FR-2019 corpus and the one collected for the BSO).

Table 6. Share of each source in the overall aggregated corpus (FR-2019 + BSO). The second line gives the number of documents found in only one source (year 2019).

| | Scopus | WoS | HAL | ADS | PubMed | MAG | BSO |
|---|---|---|---|---|---|---|---|
| Share of total | 67% | 58% | 38% | 9% | 29% | 61% | 92% |
| In one source | 7,211 | 4,009 | 6,335 | 155 | 230 | 11,665 | 27,898 |


Table 7. Cross contributions from each source to the overall France 2019 corpus

| | Scopus | WoS | HAL | ADS | PubMed | MAG | BSO |
|---|---|---|---|---|---|---|---|
| Scopus | 111,422 | 88,327 | 54,611 | 14,851 | 46,503 | 85,873 | 102,736 |
| WoS | 88,327 | 96,712 | 49,664 | 14,507 | 44,493 | 76,286 | 91,159 |
| HAL | 54,611 | 49,664 | 63,413 | 10,521 | 22,934 | 45,608 | 61,440 |
| ADS | 14,851 | 14,507 | 10,521 | 15,410 | 3,243 | 11,270 | 14,780 |
| PubMed | 46,503 | 44,493 | 22,934 | 3,243 | 48,047 | 44,071 | 47,696 |
| MAG | 85,873 | 76,286 | 45,608 | 11,270 | 44,071 | 102,338 | 98,604 |
| BSO | 102,736 | 91,159 | 61,440 | 14,780 | 47,696 | 98,604 | 153,705 |

Table 7 presents the cross-referenced contributions of the sources to the overall corpus. It should be noted that the fact that a publication is identified in database A and is not identified in database B as being part of the corpus does not necessarily mean that it is absent from database B: It may be present in database B, but with a DOI that has not been filled in or is incorrect, or with a failure to identify the country (no affiliation with France).
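A symmetric cross-contribution matrix of this kind, together with the count of DOIs found in a single source, can be derived from one set of DOIs per source. The sketch below assumes such sets are available; the function names and the toy DOIs are illustrative and do not come from the study.

```python
from collections import Counter


def overlap_matrix(sources):
    """sources maps a source name to its set of DOIs; returns pairwise overlap counts."""
    return {a: {b: len(sources[a] & sources[b]) for b in sources} for a in sources}


def in_one_source_only(sources):
    """Count, per source, the DOIs that appear in no other source."""
    seen = Counter(doi for dois in sources.values() for doi in dois)
    return {name: sum(1 for d in dois if seen[d] == 1) for name, dois in sources.items()}


# Toy data with invented DOIs.
sources = {
    "Scopus": {"d1", "d2", "d3"},
    "WoS": {"d2", "d3"},
    "HAL": {"d3", "d4"},
}
matrix = overlap_matrix(sources)
print(matrix["Scopus"]["WoS"])
print(in_one_source_only(sources))
```

The diagonal of the matrix is simply the size of each source, which is why the diagonal of Table 7 reproduces the per-source counts.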

4. ESTIMATED RATE OF OPEN ACCESS PUBLICATIONS

4.1. Unpaywall Results: Share of Open Access Publications (Year 2019)

Table 8 presents the main results of the open access (OA) rate estimate observed in February 2021, based on Unpaywall.org, for each of the sources.

Note that we do not use here the original BSO open access observations, which were made at a different date and thus could not be directly compared to ours. We have chosen to refer all the calculations to the same observation date: that of the production of the Unpaywall snapshot in February 2021.

Table 8. Share of open access for each source (OA calculation: Unpaywall). For all sources, including the BSO: Open access as of February 2021.

| Publications France 2019 | # DOI | Total OA | % OA | OA articles | % OA articles |
|---|---|---|---|---|---|
| Scopus | 111,422 | 61,854 | 56% | 56,538 | 59% |
| WoS | 96,712 | 56,975 | 59% | 54,473 | 60% |
| HAL | 63,413 | 42,316 | 67% | 38,513 | 69% |
| ADS | 15,410 | 11,981 | 78% | 11,608 | 80% |
| PubMed | 48,047 | 29,907 | 62% | 29,818 | 63% |
| MAG | 102,338 | 53,392 | 52% | 48,647 | 55% |
| FR-2019 | 128,344 | 75,070 | 54% | 67,285 | 57% |
| BSO | 153,953 | 82,267 | 54% | 70,197 | 57% |
| FR-2019 + BSO | 167,412 | 88,365 | 53% | 75,413 | 56% |


Table 8 illustrates the results obtained, depending on the sources used, to determine the open access rate (% OA) observed in February 2021: Overall we find 54% both for the BSO corpus and for our FR-2019 corpus. The aggregation of the two results gives a slightly lower overall rate of 53% for all 167,412 publications.

The reader is referred to Aliakbar and Stahlschmidt (2019) for a discussion of the merits and
limitations of these rate calculations. In their conclusions the authors recommend the use of
multiple sources to reduce errors and gaps, and this is clearly a view we share. Cross-matching
all these data sets allowed us to correct, at least in part, the problem of false negatives and to
obtain a refined estimate of the open access rate.

4.2. Variation in Open Access Rate by Document Type

The calculation for the articles alone, using the journal-article nomenclature proposed by Unpaywall, shows, as expected, a significantly higher open access rate: 57% for the BSO corpus and for our corpus, and 56% for the corpus resulting from the aggregation of the two sets.
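The rate calculation itself is a simple proportion over Unpaywall records. As a hedged sketch, assuming each record is a dictionary carrying the Unpaywall fields `is_oa` and `genre` (the function name `oa_rate` and the sample records are invented for illustration):

```python
def oa_rate(records, genre=None):
    """Percentage of records with is_oa true, optionally for a single genre."""
    selected = [r for r in records if genre is None or r.get("genre") == genre]
    if not selected:
        return None  # no record of that genre, so the rate is undefined
    n_open = sum(1 for r in selected if r.get("is_oa"))
    return 100 * n_open / len(selected)


# Invented records mimicking the two Unpaywall fields used here.
records = [
    {"doi": "10.1000/x1", "genre": "journal-article", "is_oa": True},
    {"doi": "10.1000/x2", "genre": "journal-article", "is_oa": False},
    {"doi": "10.1000/x3", "genre": "book-chapter", "is_oa": False},
    {"doi": "10.1000/x4", "genre": "journal-article", "is_oa": True},
]
print(oa_rate(records))
print(oa_rate(records, genre="journal-article"))
```

In this toy sample, restricting to journal articles raises the rate, which mirrors the direction of the effect reported above for the real corpora.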

This category is interesting insofar as the national policy enacted by Article 30 of the 2016 law mentioned above concerns a “scientific writing [...] published in a periodical appearing at least once a year” (i.e., in our terminology, a scientific journal article).

In this context, it is worth mentioning that the approaches presented here do not distinguish between publicly funded research articles and other articles from private and industrial research, for which the open science commitments do not apply.

The details of the types of documents identified for both approaches are given in Table 9.
The percentages observed are very similar in the two data sets (FR-2019 and BSO) for articles
and conference proceedings. The differences are more noticeable for book chapters and can
be explained by a significantly wider coverage in the case of the BSO. The “other” category
covers too many different situations for the differences in the observed rate to be significant.

4.3. Observation of Annual Trends (2015–2020)

To test the ability to measure annual changes, we extracted the data (and present the annual counts in Table 10) for each of the years 2015 to 2020, following the same methodology as outlined for 2019. For 2019 the counts are identical to those in Tables 3 and 7. Table 11 provides the data from Table 4 for the years 2015 to 2019 (the BSO does not cover the year 2020).

For the observation of open access, the reference remains Unpaywall (snapshot of February 2021). The results are shown in Table 12. As expected, they show a steady increase in the open access rate from 2015 to 2019.

Table 9. Share of open access by document type (overall data set FR-2019 + BSO)

| Type of document | Number of DOIs | Share | % OA | % OA FR2019 | % OA BSO |
|---|---|---|---|---|---|
| journal-article | 133,638 | 80% | 56 | 57 | 57 |
| book-chapter | 13,268 | 8% | 25 | 24 | 27 |
| proceedings-article | 12,987 | 8% | 40 | 40 | 41 |
| other | 7,519 | 4% | 60 | 54 | 64 |
| Total FR-2019 + BSO | 167,412 | 100% | 53 | | |


Table 10. Counts obtained for publications from France for the years 2015 to 2020, using the same methodology

| Year | HAL | PubMed | ADS | WoS | Scopus | MAG |
|---|---|---|---|---|---|---|
| 2015 | 51,734 | 41,287 | 15,387 | 91,028 | 108,195 | 92,722 |
| 2016 | 57,851 | 44,785 | 16,396 | 96,186 | 112,486 | 96,850 |
| 2017 | 59,451 | 46,057 | 16,806 | 95,731 | 113,077 | 95,808 |
| 2018 | 61,997 | 46,490 | 16,254 | 97,012 | 114,069 | 99,356 |
| 2019 | 63,413 | 48,047 | 15,410 | 96,712 | 111,422 | 102,338 |
| 2020 | 59,796 | 55,293 | 16,077 | 94,237 | 104,533 | 100,608 |

Table 11. Results of the two approaches for the years 2015 to 2019; see Table 4

| Year | FR-2015-20 | BSO | Global corpus | % FR15-20 | % BSO |
|---|---|---|---|---|---|
| 2015 | 133,817 | 140,493 | 157,053 | 85 | 89 |
| 2016 | 138,885 | 148,476 | 164,772 | 84 | 90 |
| 2017 | 138,845 | 146,179 | 162,179 | 86 | 90 |
| 2018 | 141,059 | 159,380 | 171,987 | 82 | 93 |
| 2019 | 139,514 | 153,705 | 167,412 | 83 | 92 |

The year 2020, observed in February 2021, has a different character, as the observation is made before the 6-month, 1-year, or in some cases longer embargoes have expired.

In Table 13, we give examples of observations of the open access status (Gold, Green, etc.) as provided by Unpaywall for two distinct years. These few examples allow us to affirm the absence of significant bias between the two data sets: The two strategies lead to quite similar estimates.
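One way to make such a comparison concrete is to compute the distribution of the Unpaywall `oa_status` field in each data set and look at the largest gap between them. The sketch below is illustrative only: the function names and the sample records are assumptions, and the study does not state that it used this particular measure.

```python
from collections import Counter

# The five status values reported by Unpaywall.
STATUSES = ("gold", "hybrid", "bronze", "green", "closed")


def status_shares(records):
    """Percentage of records per Unpaywall oa_status value."""
    counts = Counter(r["oa_status"] for r in records)
    total = sum(counts.values())
    return {s: 100 * counts[s] / total for s in STATUSES} if total else {}


def max_gap(shares_a, shares_b):
    """Largest absolute difference, in percentage points, between two distributions."""
    return max(abs(shares_a[s] - shares_b[s]) for s in STATUSES)


# Invented samples standing in for two corpora.
corpus_a = [{"oa_status": s} for s in ["gold", "green", "closed", "closed"]]
corpus_b = [{"oa_status": s} for s in ["gold", "hybrid", "closed", "closed"]]
print(status_shares(corpus_a))
print(max_gap(status_shares(corpus_a), status_shares(corpus_b)))
```

A small maximum gap between the two distributions is what one would expect if neither collection strategy is biased toward a particular access mode.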

A comparison of the rates obtained for the French corpus with those obtained on an international scale would go beyond the limits of this article: The interested reader may refer to the recent study by Robinson-Garcia, Costas, and van Leeuwen (2020), which also presents a discussion of the different modes of open access mentioned here (Gold, Bronze, Hybrid, Green).

Table 12. Change in open access rate, observed in February 2021 for publications dated from 2015 to 2020 (the global corpus is the aggregation of the two data sets FR-2015-20 and BSO)

| Year of publication | Open access rate: FR2015-2020 | Open access rate: BSO | Open access rate: Global corpus |
|---|---|---|---|
| 2015 | 45.4% | 45.5% | 44.5% |
| 2016 | 47.8% | 47.6% | 46.6% |
| 2017 | 50.0% | 50.0% | 48.9% |
| 2018 | 51.7% | 50.6% | 49.9% |
| 2019 | 53.8% | 53.5% | 52.8% |
| 2020 | 52.6% | | 52.6% |

Table 13. Open access status, observed in February 2021 for publications dated 2015 and 2019

| Open access status | FR-2015 | BSO 2015 | FR-2019 | BSO 2019 |
|---|---|---|---|---|
| Gold | 12% | 13% | 18% | 18% |
| Hybrid | 12% | 12% | 9% | 10% |
| Bronze | 4% | 4% | 7% | 6% |
| Green | 18% | 16% | 20% | 20% |
| Closed | 55% | 54% | 46% | 46% |

Table 14. France 2019: Language by document type

| Type of document | % English | % French |
|---|---|---|
| journal-article | 82 | 16 |
| book-chapter | 77 | 14 |
| proceedings-article | 97 | 1 |
| other | 87 | 4 |

4.4. Are Articles in French More Often in Open Access?

It is possible to cross-reference the observations presented above with information on the language in which the article is written: Are articles in French, for example, more often, or less often, in open access? To examine this, as this information is not systematically provided by all databases, we analyzed the title of the article as provided by Unpaywall by applying the simple language detection software langdetect (see footnote 14). Only detections assigned a displayed probability greater than 0.99 were retained.
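The thresholding step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the study: `confident_language` and `langdetect_detector` are names chosen here, and the detector is passed in as a callable so that the thresholding logic can be exercised without the third-party library installed. The adapter relies on langdetect's `detect_langs`, which returns candidate languages with a probability.

```python
def langdetect_detector(title):
    """Adapter around langdetect; imported lazily so the rest runs without it."""
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    try:
        return [(d.lang, d.prob) for d in detect_langs(title)]
    except LangDetectException:
        return []  # e.g. empty titles or titles without letters


def confident_language(title, detector=langdetect_detector, threshold=0.99):
    """Return the top language code only if its probability exceeds the threshold."""
    candidates = detector(title)
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda c: c[1])
    return lang if prob > threshold else None


# Stub detector used for demonstration; real use would keep the default.
def stub(title):
    return [("fr", 0.995), ("en", 0.005)]


print(confident_language("Un titre quelconque", detector=stub))
```

Titles for which no language clears the threshold yield `None` and would simply be left out of the language analysis.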

In the framework of our study of French national scientific production, for the year 2019, the two main languages concerned are English (83% of the detected documents) and French (15%), the rest of the detected languages not exceeding 3% in total (Table 14). The distribution is not identical according to the document type; in particular, the communications to (mostly international) conferences (labeled proceedings-article in Unpaywall) are almost always in English.

Table 15 shows that the rates of open access observed vary greatly according to the discipline (extracted here from the BSO). As a general rule, documents detected as being written in French are much less frequently in open access.
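A breakdown of this kind is a grouped proportion. The following sketch assumes each record carries a detected `language`, a `discipline`, and the Unpaywall `is_oa` flag; the record layout and the function name `oa_rate_by_group` are hypothetical, and documents lacking either key are skipped, in line with the restriction stated for Table 15.

```python
from collections import defaultdict


def oa_rate_by_group(records, keys=("discipline", "language")):
    """Percentage of open access records for each combination of the given keys."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_open, n_total]
    for r in records:
        if any(r.get(k) is None for k in keys):
            continue  # skip documents lacking a detected language or discipline
        group = tuple(r[k] for k in keys)
        totals[group][0] += bool(r.get("is_oa"))
        totals[group][1] += 1
    return {g: 100 * n_open / n for g, (n_open, n) in totals.items()}


# Invented records for illustration.
records = [
    {"discipline": "Mathematics", "language": "en", "is_oa": True},
    {"discipline": "Mathematics", "language": "en", "is_oa": False},
    {"discipline": "Mathematics", "language": "fr", "is_oa": False},
    {"discipline": "Humanities", "language": None, "is_oa": True},
]
print(oa_rate_by_group(records))
```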

14 Langdetect (https://pypi.org/project/langdetect/) is a Python port of Nakatani Shuyo’s language-detection library (https://github.com/shuyo/language-detection). When published (in 2010), it claimed to reach 99%+ accuracy on 49 supported languages.


Table 15. France 2019: Open access rate by language and discipline. Calculations are restricted to documents for which the language can be determined and whose discipline is assessed in the BSO.

| Discipline | Number of documents | % documents in French | % OA documents in English | % OA documents in French |
|---|---|---|---|---|
| Total with language and discipline detected | 153,272 | 15 | 58 | 26 |
| Chemistry | 7,050 | 5 | 53 | 50 |
| Computer and information sciences | 10,225 | 8 | 55 | 37 |
| Mathematics | 3,914 | 11 | 73 | 55 |
| Medical research | 48,191 | 24 | 57 | 8 |
| Biology (fundamental) | 21,535 | 12 | 69 | 57 |
| Social sciences | 8,020 | 69 | 40 | 37 |
| Physical sciences, Astronomy | 15,701 | 7 | 64 | 73 |
| Earth, Ecology, Energy and applied biology | 12,222 | 16 | 59 | 42 |
| Engineering | 4,402 | 24 | 40 | 40 |
| Humanities | 9,388 | 66 | 41 | 43 |

Most of the French language material without open access comes from three areas: medical research, including journals for practitioners, and the humanities and social sciences.

5. RESULTS AND DISCUSSION

5.1. Discussion of the Sources Used

The six sources we have chosen to use actually provide three different insights:

• Scopus and WoS provide extensive coverage of the literature in peer-reviewed journals and international conference proceedings; while Scopus has a slightly wider coverage, the use of the two databases together provides a 10–20% improvement over what would be obtained with a single database. The MAG database, which will soon be discontinued, brings, as a complement, a set of documents not indexed by WoS and Scopus, contributing to a further increase of about 10% of the corpus identified in our study.

• The HAL open archive is filled at the initiative of the authors, who deposit the bibliographic record (metadata) and, if applicable, the full text in its preprint or publisher version. Part of the archive contains grey literature (Schöpfel, Prost, & Ndiaye, 2019); moreover, the DOI is filled in irregularly and not systematically. The metadata and DOI do not seem to be thoroughly quality controlled: For this reason, this source should be considered with caution for bibliometric studies. However, it is a reference source for French research and a cornerstone of the national open science policy.

• The ADS and PubMed databases are thematic databases and are therefore only intended to cover parts of the research field. On the other hand, both databases are deep in their field and cover grey literature and sources not indexed by the large generalist databases.


This study sheds new light on the coverage of French scientific production by the various databases. While WoS and Scopus deliberately restrict themselves to the perimeter of peer-reviewed publications appearing in referenced journals or books (Baas et al., 2020; Birkle et al., 2020), the use of complementary databases, whether thematic or not, allows us to have a more complete view of the share of literature that is not or poorly referenced, and that may be less general in scope geographically, linguistically, or thematically. We observe that the strategy adopted by the BSO allows for the systematic collection of data on a significant quantity of these publications, which are often neglected in bibliometric studies. Far from identifying an optimal source, our study shows the importance of diversifying the sources used to provide complementary views on a country’s publications.

5.2. Characteristics of Excluded National Production Without DOI

Publications without a DOI form a heterogeneous group of peer-reviewed and grey literature. The share of unreferenced grey literature can be approached in particular through the HAL open archive, by considering documents without a DOI, which were not taken into account in our study. However, it is advisable to make sure beforehand that the absence of a DOI is not due to a lack of information, but corresponds to articles from journals that do not use this identification mode. As the open archive, which is mainly fed by author deposits, is not fed in a complete and systematic way, this approach can only be qualitative.

We note, first of all, without surprise, a very strong disciplinary variation: Only 15% of the
documents in the field of humanities and social sciences (SSH) deposited in HAL have a DOI,
while the proportion is 70% in chemistry or physics, the global average being 42% for the year
2019 considered here (see Table 2). This rate reaches 50% in the field of computer science.
Among the records without a DOI the share of records from the SSH fields is 52%, compared
to an SSH share of 12% of publications with a DOI.

We also note that the full text is deposited significantly less frequently for documents without a DOI: 39%, whereas the average is 44%.

We can also note, for HAL (year 2019), a strong differentiation according to language (we use here the language recorded in the archive):

• Among the documents without a DOI, the proportion of articles in French is 57% (49% for articles in English), while for articles with a DOI it is only 8%.
• 91% of the documents in French have no DOI (or no DOI indicated).

We found nearly 90,000 records without a DOI in HAL (Table 2). If we restrict ourselves to documents classified as articles, book chapters, or conference papers, almost 56,000 records without a DOI (or without a DOI indicated) listed in HAL had to be excluded from this study.

For journal articles (category ART in HAL) we tried to estimate the proportion of cases in which a DOI exists but was simply not filled in: If we consider the articles without a DOI published in a journal for which other articles have a DOI, we note that this concerns 31% of the articles without a DOI (in HAL in 2019). We therefore estimate that at least 30% of DOIs are missing in HAL due to DOIs that are not filled in. Most of this 30% can be expected to be covered by the other sources. If this assumption is correct, it would mean that out of the 56,000 records without a DOI entered in HAL, we can estimate that there are around 40,000 articles or communications without a DOI, which were therefore not taken into account. This point will be the subject of further study.
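The heuristic and the resulting estimate can be written down in a few lines. This is a sketch under assumptions: the record layout (a `journal` and an optional `doi` per article) and both function names are hypothetical, and the real HAL metadata may be organized differently.

```python
def share_probably_missing_doi(articles):
    """Fraction of DOI-less articles whose journal has at least one article with a DOI.

    articles is a list of dicts with a 'journal' key and an optional 'doi' key.
    """
    journals_with_doi = {a["journal"] for a in articles if a.get("doi")}
    without = [a for a in articles if not a.get("doi")]
    if not without:
        return 0.0
    hits = sum(1 for a in without if a["journal"] in journals_with_doi)
    return hits / len(without)


def estimate_truly_without_doi(n_records_without_doi, missing_share):
    """Records remaining once those with an unfilled DOI are set aside."""
    return round(n_records_without_doi * (1 - missing_share))


# Invented records for illustration.
articles = [
    {"journal": "J1", "doi": "10.1000/j1.1"},
    {"journal": "J1", "doi": None},
    {"journal": "J2", "doi": None},
    {"journal": "J2", "doi": ""},
]
print(share_probably_missing_doi(articles))

# Orders of magnitude quoted in the text: 56,000 records, 30% unfilled DOIs.
print(estimate_truly_without_doi(56000, 0.30))
```

With the figures quoted above, 56,000 × 0.70 gives 39,200, which is the basis for the rounded estimate of around 40,000.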


5.3. Validation of the Open Strategy Used for the BSO

The comparison between the result obtained with our sources and the open strategy of the
BSO validates the use of the latter: This strategy, if we summarize it in a few words, consists
in scanning all the DOIs available from Unpaywall, and also from HAL, to identify either the
French authors or the presence of a mention of France in the address.

We observe that this strategy makes it possible to identify more than 20,000 records (if we exclude the false positives) not found by our approach (i.e., about 17% of the total): These are mainly from journals that are not indexed in the major international databases, and more particularly in the biomedical and social science fields.

Our approach also identified approximately 13,000 DOIs not included in the BSO and thus estimated the false negative rate in the BSO strategy to be close to 9% (see Table 4).

Recurrent sources of error include conflicting approaches to publication date (with the usual confusions between the first online publication and the final date of the reference; see for example Liu, 2021).

6. CONCLUSIONS

The main results of our study are as follows.

• Our study validates a strategy of determining a collection of scientific publications with an affiliation in France for a given year. This corpus is deliberately restricted by the use of DOIs. We present the details of the counts for the year 2019. We estimate that the corpus of outputs with a DOI covers around 80% of French national scholarly production in 2019, with an additional set of 40,000 articles or communications without a DOI not taken into account here.

• Our determination of cross-coverage by the various databases provides useful insight for users of these databases. We believe that these counts can help users of these databases to identify overlaps and complementarities, in a context comparable to that of our study.

• The use of multiple sources ensures validation at a sufficiently fine level to shed light on the geographical, thematic, linguistic, and other disparities that affect bibliometric studies. Our study confirms the relevance of adopting a multisource approach.

• The open-source strategy used by the BSO effectively identifies the vast majority of publications with a persistent identifier (DOI) for Open Science monitoring.

• The determination of the open access rate has been refined. It should be remembered that this rate depends on the date of observation and may differ depending on the type of documents we wish to consider. Our objective is not to comment here on the 54% or 53% rate reached for the opening of publications in 2019 (observed in February 2021), but to note the convergence of two different methodologies that allow us to accurately draw the shifting landscape of open science at the country level.

The question of the place of the national open archive HAL, and of other open archives, in the strategy of Open Science deserves a specific development, which should be the subject of a further study. The objective of such a study would be to examine the possibilities of convergence between, on the one hand, the specific challenges of open archives, allowing for easy depositing at the disposal of the authors, and on the other hand, the requirements of a referencing and query environment that should not only provide open access to scientific knowledge produced by French research, but also support the most diverse possible readership in their consultation process.


ACKNOWLEDGMENTS

We thank the two reviewers for their stimulating comments, which we believe have significantly helped to improve our work.

AUTHOR CONTRIBUTIONS

Lauranne Chaignon: Validation, Writing – review & editing. Daniel Egret: Data curation, Supervision, Writing – original draft, Writing – review & editing.

COMPETING INTERESTS

The authors have no competing interests.

FUNDING INFORMATION

No specific funding has been received for this research.

DATA AVAILABILITY

Data tables providing the detailed number of records for each year, as well as a notebook
describing the whole procedure, are available as supplementary data files on HAL Open
Archive: https://hal.archives-ouvertes.fr/hal-03537679. Subscriptions to Scopus and WoS are
required to replicate the research, with the methods described above.

REFERENCES

Aliakbar, A., & Stahlschmidt, S. (2019). Merits and limits: Applying open data to monitor open access publications in bibliometric databases. SocArXiv. https://doi.org/10.31235/osf.io/npj4h

Archambault, É., Campbell, D., Gingras, Y., & Larivière, V. (2009).
Comparing bibliometric statistics obtained from the Web of Sci-
ence and Scopus. Journal of the American Society for Information
Science and Technology, 60, 1320–1326. https://doi.org/10.1002
/asi.21062

Baas, J., Schotten, M., Plume, A., Côté, G., & Karimi, R. (2020). Scopus as a curated, high-quality bibliometric data source for academic research in quantitative science studies. Quantitative Science Studies, 1(1), 377–386. https://doi.org/10.1162/qss_a_00019

Bartol, T., Budimir, G., Dekleva-Smrekar, D., Pusnik, M., & Juznic, P. (2014). Assessment of research fields in Scopus and Web of Science in the view of national research evaluation in Slovenia. Scientometrics, 98(2), 1491–1504. https://doi.org/10.1007/s11192-013-1148-8

Berthaud, C., Charnay, D., & Fargier, N. (2021). Diffuser et péren-
niser le savoir scientifique: 20 ans d’histoire de HAL. Histoire de
la Recherche Contemporaine, 10(2). https://doi.org/10.4000/hrc
.6330

Birkle, C., Pendlebury, D. A., Schnell, J., & Adams, J. (2020). Web of Science as a data source for research on scientific and scholarly activity. Quantitative Science Studies, 1(1), 363–376. https://doi.org/10.1162/qss_a_00018

Carvalho, J., Laranjeira, C., Vaz, V., & Mendes Moreira, J. (2017). Monitoring a national open access funder mandate. Procedia Computer Science, 106, 283–290. https://doi.org/10.1016/j.procs.2017.03.027

Charnay, D., & Michau, C. (2007). L’archive ouverte HAL. JRES

2007. Strasbourg, France.

Garfield, E. (1964). Science Citation Index—A new dimension in
indexing science. Science, 144(361), 649–654. https://doi.org
/10.1126/science.144.3619.649, PubMed: 17806988

Gorraiz, J., Melero-Fuentes, D., Gumpenberger, C., & Valderrama-
Zurián, J.-C. (2016). Availability of digital object identifiers
(DOIs) in Web of Science and Scopus. Journal of Informetrics,
10(1), 98–109. https://doi.org/10.1016/j.joi.2015.11.008

Guerrero-Bote, V. P., Chinchilla-Rodríguez, Z., Mendoza, A., & de Moya-Anegón, F. (2021). Comparative analysis of the bibliographic data sources Dimensions and Scopus: An approach at the country and institutional levels. Frontiers in Research Metrics and Analytics, 5, 593494. https://doi.org/10.3389/frma.2020.593494, PubMed: 33870055

Hendricks, G., Tkaczyk, D., Lin, J., & Feeney, P. (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1), 414–427. https://doi.org/10.1162/qss_a_00022

Herrmannova, D., & Knoth, P. (2016). An analysis of the Microsoft Academic Graph. D-Lib Magazine, 22(7), 9-10. https://doi.org/10.1045/september2016-herrmannova

Holly, E. (2018). The rise and rise of Unpaywall. Nature, 560(7718),

290–291. https://doi.org/10.1038/d41586-018-05968-3

Huang, C.-K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., … Ozaygen, A. (2020). Comparison of bibliographic data sources: Implications for the robustness of university rankings. Quantitative Science Studies, 1(2), 445–478. https://doi.org/10.1162/qss_a_00031

Ibarra, M. E., Ferreira, J. P., Torrents, M., Hamui, M., Torres, F., Ferrero, F. (2018). Changes in PubMed affiliation indexing improved publication identification by country. Scientometrics, 115, 1365–1370. https://doi.org/10.1007/s11192-018-2714-x

Jeangirard, E. (2019). Monitoring Open Access at a national level: French case study. 23rd International Conference on Electronic Publishing, ELPUB 2019, Marseille, France. https://doi.org/10.4000/proceedings.elpub.2019.20

Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C. S., Murray, S. S., & Watson, J. M. (2000). The NASA Astrophysics Data System: Overview. Astronomy and Astrophysics Supplement Series, 143, 41. https://doi.org/10.1051/aas:2000170

Laakso, M., & Björk, B. C. (2012). Anatomy of open access publish-
ing: A study of longitudinal development and internal structure.
BMC Medicine, 10, 124. https://doi.org/10.1186/1741-7015-10
-124, PubMed: 23088823

Liu, W. (2021). A matter of time: Publication dates in Web of Science Core Collection. Scientometrics, 126, 849–857. https://doi.org/10.1007/s11192-020-03697-x

Moed, H. F., Markusova, V., & Akoev, M. (2018). Trends in Russian research output indexed in Scopus and Web of Science. Scientometrics, 116(2), 1153–1180. https://doi.org/10.1007/s11192-018-2769-8

Mongeon, P., & Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: A comparative analysis. Scientometrics, 106(1), 213–228. https://doi.org/10.1007/s11192-015-1765-5

Philipp, T., Botz, G., Kita, J.-C., Richards, P., Sänger, A., & Reumaux, M. (2021). Open access monitoring: Guidelines and recommendations for research organisations and funders. Science Europe, Briefing Paper, May. https://doi.org/10.5281/zenodo.4905553

Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., …
Haustein, S. (2018). The state of OA: A large-scale analysis of
the prevalence and impact of open access articles. PeerJ, 6,
e4375. https://doi.org/10.7717/peerj.4375, PubMed: 29456894
Pölönen, J., Laakso, M., Guns, R., Kulczycki, E., & Sivertsen, G.
(2020). Open access at the national level: A comprehensive
analysis of publications by Finnish researchers. Quantitative
Science Studies, 1(4), 1396–1428. https://doi.org/10.1162/qss_a
_00084

Pranckutė, R. (2021). Web of Science (WoS) and Scopus: The titans
of bibliographic information in today’s academic world.
Publications, 9(1), 12. https://doi.org/10.3390/publications9010012

Puuska, H.-M., Nikkanen, J., Engels, T., Guns, R., Ivanović, D., &
Pölönen, J. (2020). Integration of national publication
databases—Towards a high-quality and comprehensive informa-
tion base on scholarly publications in Europe. ITM Web Confer-
ence 33, 02001. https://doi.org/10.1051/itmconf/20203302001
Robinson-Garcia, N., Costas, R., & van Leeuwen, T. N. (2020).
Open access uptake by universities worldwide. PeerJ, 8, e9410.
https://doi.org/10.7717/peerj.9410, PubMed: 32714658

Schöpfel, J., & Prost, H. (2019). The scope of open science moni-
toring and grey literature. 12th Conference on Grey Literature and
Repositories, National Library of Technology (NTK), Prague,
Czech Republic.

Schöpfel, J., Prost, H., & Ndiaye, E. (2019). Going green. Publishing
academic grey literature in laboratory collections on HAL. GL21
International Conference on Grey Literature, 22–23 October
2019, Hannover, Germany.

Simmonds, A. W. (1999). The Digital Object Identifier (DOI).
Publishing Research Quarterly, 15, 10–13. https://doi.org/10.1007
/s12109-999-0022-2

Sivertsen, G. (2019). Developing current research information sys-
tems as data sources for studies of research. In W. Glänzel, H. F.
Moed, U. Schmoch, & M. Thelwall (Eds.), Springer Handbook of
Science and Technology Indicators (pp. 667–683). Cham:
Springer. https://doi.org/10.1007/978-3-030-02511-3_25

Van Leeuwen, T. N., Moed, H. F., Tijssen, R. J. W., Visser, M. S., &
Van Raan, A. F. J. (2001). Language biases in the coverage of the
Science Citation Index and its consequences for international
comparisons of national research performance. Scientometrics,
51, 335–346. https://doi.org/10.1023/A:1010549719484

Vera-Baceta, M. A., Thelwall, M., & Kousha, K. (2019). Web of
Science and Scopus language coverage. Scientometrics, 121,
1803–1813. https://doi.org/10.1007/s11192-019-03264-z

Visser, M., Van Eck, N. J., & Waltman, L. (2021). Large-scale
comparison of bibliographic data sources: Scopus, Web of Science,
Dimensions, Crossref, and Microsoft Academic. Quantitative
Science Studies, 2(1), 20–41. https://doi.org/10.1162/qss_a
_00112

Wang, K., Shen, Z., Huang, C., Wu, C.-H., Eide, D., … Rogahn, R.
(2019). A review of Microsoft Academic Services for science of
science studies. Frontiers in Big Data, 2, 45. https://doi.org/10.3389/fdata.2019.00045, PubMed: 33693368

Wang, K., Shen, Z., Huang, C., Wu, C.-H., Dong, Y., & Kanakia, A.
(2020). Microsoft Academic Graph: When experts are not
enough. Quantitative Science Studies, 1(1), 396–413. https://doi.org/10.1162/qss_a_00021

