RESEARCH ARTICLE
Fine-grained classification of social science
journal articles using textual data:
A comparison of supervised machine
learning approaches
an open access
journal
Joshua Eykens
, Raf Guns
, and Tim C. E. Engels
Centre for R&D Monitoring (ECOOM), Faculty of Social Sciences, University of Antwerp,
Middelheimlaan 1, 2020 Antwerp, Belgium
Citation: Eykens, J., Guns, R., &
Engels, T. C. E. (2021). Fine-grained
classification of social science journal
articles using textual data: A
comparison of supervised machine
learning approaches. Quantitative
Science Studies, 2(1), 89–110. https://
doi.org/10.1162/qss_a_00106
DOI:
https://doi.org/10.1162/qss_a_00106
Received: 12 May 2020
Accepted: 14 December 2020
Corresponding Author:
Joshua Eykens
joshua.eykens@uantwerpen.be
Handling Editor:
Ludo Waltman
Copyright: © 2021 Joshua Eykens, Raf
Guns, and Tim C. E. Engels. Published
under a Creative Commons Attribution
4.0 International (CC BY 4.0) license.
The MIT Press
Keywords: disciplinary classification, granularity, multilabel classification, social sciences, supervised
machine learning, textual data
ABSTRACT
We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and
Gradient Boosting—to classify social science articles using textual data. The high level of
granularity of the classification scheme used and the possibility that multiple categories are
assigned to a document make this task challenging. To collect the training data, we query three
discipline-specific thesauri to retrieve articles corresponding to specialties in the classification.
The resulting data set consists of 113,909 records and covers 245 specialties, aggregated into
31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based
classification. The resulting multilabel data set is used to train the machine learning algorithms in
different configurations. We deploy a multilabel classifier chaining model, allowing for an
arbitrary number of categories to be assigned to each document. The best results are obtained
with Gradient Boosting. The approach does not rely on citation data. It can be applied in settings
where such information is not available. We conclude that fine-grained text-based classification
of social sciences publications at a subdisciplinary level is a hard task, for humans and machines
alike. A combination of human expertise and machine learning is suggested as a way forward to
improve the classification of social sciences documents.
1.
INTRODUCTION
Disciplines have long been considered as the fundamental units of division within the sciences
(Stichweh, 2003). These units are knowledge production and communication systems, and can
as such serve important classificatory functions (Hammarfelt, 2018; Stichweh, 1992, 2003;
Sugimoto & Weingart, 2015; van den Besselaar & Heimeriks, 2006). The subjects of interest
for scientometricians (i.e., scientific documents) are classified according to disciplines to facilitate
research into knowledge production and dissemination. Over the past few decades, however, we
have faced continuous growth of the number of new disciplines and specialties (i.e., internal
differentiation), resulting in increasing dynamism and “intensification of the interactions between
[…] disciplines” (Stichweh, 2003, p. 85).
General classification systems such as the Web of Science (WoS) Subject Categories (SC) or
the OECD’s Fields of Science are too broad to adequately capture the more complex, fine-
grained cognitive reality. Several concerns have been raised in this regard—here we mention
two central ones. First, Glänzel, Schubert, and Czerwon (1999) point out that the SC approach
on the journal level works well for classifying publications in highly specialized journals, but
that it is problematic for those appearing in multidisciplinary or general journals. Second,
research such as Waltman and van Eck’s (2012) large-scale clustering study, grouping publica-
tions based on their citation relations, indicated the feasibility of more fine-grained classification
schemes. The authors cluster documents on three different levels, the most detailed of which
can be conceived of as “small subfields” and consists of 21,412 clusters. While most bibliometric
studies still make use of more general classification schemes for publications, these
are limited in scope, only indicating broad scientific fields or general disciplines. Empirical studies
such as the one conducted by Waltman and van Eck (2012), as well as theoretical arguments
raised by sociologists of science, amplify the need for fine-grained classification schemes. More
recently, Sjögårde and Ahlgren (2018, 2019) have shown that fine-grained specialized communi-
ties can be determined based on citation relations, and these communities in their turn might pos-
sibly exhibit specific citation and publication practices.
In Flanders, the Dutch speaking region of Belgium, the Flemish Research Discipline Standard
(“Vlaamse Onderzoeksdiscipline Standaard” or VODS) has been introduced to facilitate a
detailed classification of research, including research output (Vancauwenbergh & Poelmans,
2019a, 2019b). The VODS builds upon the OECD Fields of Science (2007), adding two more
fine-grained levels. While the introduction of the VODS will open up new possibilities for
understanding knowledge production and dissemination on a more detailed level, it also poses
important challenges, the classification of publications in the social sciences being one of them.
Current bibliometric approaches to classification of publications are not entirely fit for the
social sciences. This mainly has to do with lack of coverage in major citation databases
(Ossenblok, Engels, & Sivertsen, 2012) and differences in publication and citation practices
within the fields (Kulczycki, Engels et al., 2018; Nederhof, 2006). One possible way to address
these concerns is including nonsource items in citation-based bibliometric maps (Boyack &
Klavans, 2014). An alternative solution is making use of text-based methods.
1.1. Using Textual Data and Machine Learning to Cluster or Classify (Social Science) Publications
Compared to classification approaches making use of reference/citation data (or other metadata),
the usage of purely textual information (titles, abstracts, full-texts, etc.) has thus far received less
attention. Nevertheless, the theoretical relevance of an article’s textual content for this task has
already been emphasized since the seminal work by Rip and Courtial (1984). Michel Callon and
colleagues have further developed this long tradition of co-word analysis research, which aims to
map and describe scientific interaction and the formation of specialist communities (Callon,
Courtial et al., 1983; Callon, Courtial, & Laville, 1991). More recently there has been a resurgence
of interest in textual data, mainly due to increased computing resources and availability of poten-
tial data sources.
Machine learning methods currently spearhead a lot of research that is based on textual data.
We can distinguish between supervised and unsupervised approaches. In unsupervised learning,
no predefined classes or categories are available to learn from. Supervised learning, on the other
hand, starts from a set of predefined categories, each of which has a number of instances or
records assigned to it. An algorithm is then trained on these labeled instances, from which it tries
to deduce the common characteristics of instances in each category to predict to which category
a new, unseen instance might belong. The present article uses such a supervised approach.
In scientometric studies, unsupervised clustering of documents is common. Hybrid approaches
to document clustering in which citation information and textual data are used have shown that
adding textual information can ameliorate the outcomes of document clustering (see for example
Janssens, Zhang et al., 2009; Yau, Porter et al., 2014). Unsupervised clustering of documents
based only on textual similarity (Boyack et al., 2011) has gained traction in the bibliometric
community as well. Arguably, supervised ML has been less popular, presumably because in
most scientometric clustering studies a granular ground truth classification at the article level is
lacking.
An exploration of supervised ML algorithms combined with basic NLP techniques has been
described by Read (2010), who used supervised learning to classify documents in, among others,
the Ohsumed data set, part of MedLINE. The author reports F1 scores for different multilabel clas-
sification techniques, ranging from 0.1 up to 0.43. Classifier Chains (CCs) are proposed by Read
(2010) as a possible solution to the task of multilabel, multiclass classification problems. The latter
are tasks in which a document can be assigned to multiple categories at the same time. This kind of
learning task is considerably more challenging than the single label classification problem.
Recent supervised ML algorithms with neural networks and word embeddings or BERT
(Bidirectional Encoder Representations from Transformers) models, respectively, have also been
used to vectorize and classify scientific documents. While these recent studies do not deal with
multilabel, multiclass classification, they are relevant in that they apply these relatively new NLP
techniques to vectorize scientific publications. Kandimalla, Rohatgi et al. (2020) report on a
large-scale classification study in which they categorize papers according to WoS Subject
Categories by making use of neural networks and word embedding models. The authors show
that such classification systems work well, achieving an average F-score of 0.76. For the individual
SC, the scores range from 0.5 to 0.95. In this study, however, the subcategories with
too few records are merged or omitted from the analysis, as they “decrease the performance
of the model.” Documents that are labeled with more than one category are also dropped.
The authors conclude that their experiment shows that the supervised learning approach
scales better than citation clustering-based methods. Dunham, Melot, and Murdick (2020)
train SciBERT classifiers on arXiv metadata and subject labels. This model is then used to identify
AI-relevant publications in WoS, Digital Science Dimensions and Microsoft Academic. The
authors report F1 scores ranging from 0.59 to 0.86 for the four categories within the field
of AI.
Annif, an automated subject indexing tool currently being tested and implemented at the
National Library of Finland, is also comparable to our approach (Suominen, 2019). Annif anno-
tates terms from different subject vocabularies and thesauri to documents based on textual in-
formation, such as abstracts and/or titles. The ML module consists of an ensemble of
classification algorithms. Annif annotates documents on a granular level, as the tested module
was able to assign up to five indexing terms to documents. The module was evaluated on
four corpora, including both academic and nonacademic texts, yielding F1 scores ranging from
0.14 to 0.46.
The present paper is an extension of work presented at ISSI 2019, where we applied supervised
ML to classify sociology publications into subdisciplinary categories (Eykens, Guns, & Engels,
2019), reaching 81% accuracy. Note, though, that that paper only worked with publications
assigned to one specialty. In this article, we study the use of textual data to classify publications
from three social science disciplines into one or more subdisciplines. Much like Read (2010) and
Kandimalla et al. (2020), we thus primarily aim to exploit textual characteristics of (social science)
documents to categorize them into predefined disciplinary categories. As we will describe in
more detail further on, we aim to categorize these social science abstracts into granular subcat-
egories. Multiple categories can be assigned to one document at the same time. The novelty of
this study resides in the fact that we have used a procedure to validate the data collected for our
ML experiment, and that multiple granular subdisciplinary categories can be assigned to one
single document.
1.2. Outline
In Section 2 we describe the classification scheme used in detail. Section 3 describes the data
sources used (Sociological Abstracts, ERIC, EconLit), and the collection and processing proce-
dure. We have developed a structured way of collecting and validating textual data based on
well-established disciplinary thesauri in tandem with a validation round by experts from
the respective fields. This validation procedure is discussed in Section 3.3. Next, Section 4
further details the supervised ML algorithms and feature extraction techniques that we compare.
Section 5 describes the results of the comparison, where we evaluate performance on two
dimensions: the individual labels and the instances. Finally, we discuss our ML setup and
contrast our approach to existing automatic classification techniques. We conclude with some
reflections and pathways for future research, and briefly discuss practical applications.
2. THE FLEMISH RESEARCH DISCIPLINE STANDARD (VODS)
We make use of the Flemish Research Discipline Standard (“Vlaamse Onderzoeksdiscipline
Standaard” in Dutch, abbreviated VODS), which is available at https://researchportal.be/en/disciplines
and has been described in the literature (Vancauwenbergh & Poelmans, 2019a,
2019b). The VODS was introduced in the Flemish Research Information Space (FRIS, see
https://researchportal.be/en), an aggregation platform of publicly funded research in Flanders,
in 2019. In the future, all scientific output produced by scholars in Flanders may be classified
according to the VODS. The VODS is structured as a hierarchical tree with four levels. To allow
for international comparison, the first level overlaps with the seven broad fields of science present
at the highest level of the OECD Fields of Science (OECD, 2007) coding scheme (hereafter
referred to as OECD FOS). For the case of sociology, for example, at the top level of the OECD
FOS we find category 5 “social sciences” and subcategory “5.4 sociology and anthropology”
(Figure 1). This category is present in the VODS as well.
The VODS adds two more granular layers representing further subdivisions of the second
layer of the OECD FOS. The third layer of the VODS might be interpreted as containing subdisciplinary
categories, while items on the fourth level can be considered research specialties. To
construct and define this scheme, experts from the corresponding fields were consulted by
the creators of the VODS. In total, on the most granular level the VODS contains 2,493 codes.
For further technical details on this classification scheme, we refer interested readers to
Vancauwenbergh and Poelmans (2019a, 2019b).
Figure 1. Excerpt of tree structure: OECD FOS (2007) coding scheme and VODS (2019). The
VODS classification scheme can be accessed at https://researchportal.be/en/disciplines.
Our objective is to automatically classify articles (based on abstracts and titles) into categories
at level 3 of the coding scheme (e.g., 050402 Applied sociology and/or 050405 Social Change
and/or …) for three fields within the social sciences, namely (0502) economics & business
(10 classes at the third level), (0503) pedagogical & educational sciences (nine classes at the
third level), and (0504) sociology & anthropology (12 classes at the third level). At level 3, we
have 31 subdisciplinary categories for the three disciplines together. Section 3.3 will further
detail the reasons why our approach operates at level 3 rather than level 4. In the following part
we introduce the data sources used to collect the titles and abstracts for the three disciplines.
3. DATA SOURCES: SOCIOLOGICAL ABSTRACTS, ERIC AND ECONLIT
The data used for our study were downloaded from ProQuest (https://search.proquest.com).
ProQuest provides good journal coverage of the social science literature compared to, for exam-
ple, Scopus or WoS (Norris & Oppenheim, 2007). ProQuest offers access to a range of existing
abstracting services and disciplinary databases. For the purpose of our analyses, we have used
Sociological Abstracts to download bibliographic records from sociology & anthropology,
EconLit for records from business & economics, and ERIC for records from the pedagogical &
educational sciences.
3.1. Combinations of Indexing Terms as Proxies for Subject Specialties
A clear advantage of all three databases is that they make use of controlled vocabularies (or
thesauri) for the records which are indexed. The Thesaurus of Sociological Indexing Terms is a
well-developed and highly regarded indexing system used by Sociological Abstracts’ service.
Within EconLit, the Journal of Economic Literature (JEL) classification, also known as the
American Economic Association Classification System, is used. Within ERIC, the Thesaurus of
ERIC Descriptors is used. In addition, ProQuest’s search engine allows us to filter on publication
types and publication years. We selected all journal articles published between 2000 and 2018.
These controlled vocabularies allow us to query ProQuest’s command line search page for
abstracts at a very fine level of granularity. Figure 2 shows an example of the query we used
for the category “law & economics,” within business & economics. The full set of queries for
all categories is available online (Eykens & Guns, 2020).
To design the queries, we manually coupled the indexing terms of these different data sets
to the fourth level categories in the VODS and downloaded the abstracts found for each query.
The first author went through the list of VODS categories and per category collected all rele-
vant indexing terms from the thesaurus at hand.
The VODS provides semantic definitions of each category, which were formulated together
with field experts. We used this information to manually retrieve the relevant indexing terms.
In many cases, this was straightforward, because there was a perfect overlap with the indexing
terms (for EconLit, this was the case for nearly all categories). In other cases, some additional
indexing terms were found to be relevant (see for example Figure 2).
Figure 2. Example of command line query designed for Business & Economics, category 05020109
Law & economics.
The indexing terms were then used to query ProQuest. The records retrieved for each of the
214 level 4 categories were subsequently downloaded (with an upper limit set to 1,000 records
per VODS level four category) and saved in a separate folder, which was labeled with the corresponding
VODS category. The data collection was carried out between December 2018 and
February 2019. After collecting and processing, the merged sets (all files for the three fields
together) resulted in a raw set consisting of 148,341 records (see Table 1).
3.2. Data Cleaning and Processing
To clean the raw data set, we followed a protocol consisting of four steps. At step 1 we removed
all records that were missing an abstract or title. Although we limited our search to records
published between 2000 and 2018 (step 2), there were still some in our data set that were
published before 2000 or after 2018. These were omitted as well. Lower and upper boundaries
were set for the word count of the abstracts (step 3): minimum 50 and maximum 1,000, respectively.
These limits were found to adequately weed out cases where the abstract field replicated
either the title or the entire full text.
Whereas we expected only journal articles resulting from our queries, other publication
types were present as well. The reason for this might have to do with the fact that all three data
sets have been designed by different organizations, which results in a diverse range of variable
names to describe the different publication types used within the data sets. At step 4, for each
data set, we compiled a list of unique variable names present in the collected records and
filtered out those describing publication types that we did not want to take into consideration
(e.g., book reviews, interviews, editorial material, instructional material). A list of the remaining,
relevant publication types was used to restrict our data set to research articles published in
journals.
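As an illustration, the four cleaning steps could be sketched in Python as follows. This is a minimal sketch and not the code used in the study: the `Record` layout, the field names, the `KEEP_TYPES` label, and the inclusive treatment of the year and word-count bounds are all assumptions made for the example.

```python
from dataclasses import dataclass


# Hypothetical record layout; the field names are illustrative assumptions.
@dataclass
class Record:
    title: str
    abstract: str
    year: int
    pub_type: str


# Assumed label for the publication type to keep. The three databases use
# different variable names, which would have to be harmonized beforehand.
KEEP_TYPES = {"journal article"}


def is_clean(rec: Record, min_words: int = 50, max_words: int = 1000) -> bool:
    """Apply the four cleaning steps to a single record."""
    # Step 1: drop records that lack a title or an abstract.
    if not rec.title.strip() or not rec.abstract.strip():
        return False
    # Step 2: keep only publication years 2000 through 2018.
    if not 2000 <= rec.year <= 2018:
        return False
    # Step 3: bound the abstract length in words (bounds assumed inclusive).
    n_words = len(rec.abstract.split())
    if not min_words <= n_words <= max_words:
        return False
    # Step 4: keep only the relevant publication types.
    return rec.pub_type.lower() in KEEP_TYPES


def clean(records):
    """Return the records that pass all four steps."""
    return [r for r in records if is_clean(r)]
```

Applying the checks in this order means that the cheapest tests (missing fields, year) run before the word count is computed.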
Table 1 provides an overview of the number of records in each set before and after cleaning.
For Sociological Abstracts and EconLit, our initial collection of records was reduced by a little
over 20%. For ERIC, the total number of records was reduced by almost 40%. The large intergroup
difference observed is mainly due to a large number of records classified as “instructional
material” in ERIC. The large intragroup difference is due to the smaller number of subcategories
present in the VODS. For business & economics we queried 84 categories, for pedagogical &
educational sciences 53 categories, and for sociology & anthropology 77 categories.
As discussed above, we have designed queries for each level 4 category in the VODS and
collected records from the respective databases. Some records appeared multiple times; that
is, some records were retrieved with different queries. After deduplication and relabeling, the
data set contains 113,909 multilabeled abstracts, with an average of 1.1 labels per abstract
(min. = 1, max. = 6, SD = 0.36).
Table 1. Number of records collected from each database: Before and after cleaning
| VODS category | Indexing service consulted | Initial number of records | Number of records after cleaning |
|---|---|---|---|
| 0502 Business & economics | EconLit | 63,407 | 50,577 |
| 0503 Pedagogical & educational sciences | ERIC | 23,521 | 14,527 |
| 0504 Sociology & anthropology | Sociological Abstracts | 61,413 | 48,805 |
| Total | | 148,341 | 113,909 |
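The deduplication and relabeling step can be pictured as collapsing all (record, category) pairs into one label set per record. The sketch below is illustrative only. It assumes that a level 3 code is the six-digit prefix of a level 4 code (as the examples 050402 and 05020109 suggest), and the record identifiers and most of the codes are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def to_level3(level4_code: str) -> str:
    # Assumption: a level 3 code is the first six digits of a level 4 code
    # (e.g., "05020109" -> "050201"), following the hierarchical numbering.
    return level4_code[:6]


def merge_labels(retrieved):
    """Collapse (record_id, level4_code) pairs into one label set per record."""
    labels = defaultdict(set)
    for record_id, level4_code in retrieved:
        labels[record_id].add(to_level3(level4_code))
    return dict(labels)


# Toy input with hypothetical record identifiers and codes.
retrieved = [
    ("rec-1", "05020109"),
    ("rec-1", "05020203"),  # same record, retrieved by a second query
    ("rec-2", "05040201"),
    ("rec-2", "05040207"),  # two specialties within one level 3 category
]
labels = merge_labels(retrieved)
avg_labels = mean(len(s) for s in labels.values())
```

In this toy input, `rec-1` ends up with two level 3 labels and `rec-2` with one, because both of its specialties fall under the same level 3 category.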
3.3. Expert Validation: Interindexer Consistency and F1 Scores
To validate the reliance on controlled vocabularies described above, a domain expert from each
of the three disciplines was contacted. The three experts were each given a random sample of 45 abstracts
and titles, which they were asked to classify according to the VODS level 4 categories
corresponding to their field of expertise (i.e., sociology & anthropology, business & economics,
and pedagogical & educational sciences). Each expert was presented a set of abstracts and titles
from their own discipline. The expert working in the field of business & economics, for example,
was given abstracts and titles originating from EconLit (business & economics) only. No limitations
were set on the number of categories the indexers were allowed to assign.
Next, the classification by the experts on the one hand and the classification based on the controlled
vocabularies of each database on the other were compared. To do so, the interindexer
consistency (IIC) was calculated for each record in the sample. For every sample, we calculated
the average IIC using the method described by Rollin (Leininger, 2000). First, a percentage of
consistency between two indexers (here: the expert and the controlled vocabulary) is calculated
for each document d:
$$\mathrm{IIC}_d = \frac{2A}{B + C} \tag{1}$$
Here, A denotes the number of categories on which both indexers agree, B is the number of
categories assigned by indexer 1 (expert) and C the number of categories assigned by indexer 2
(controlled vocabulary). The IIC at document level is the Dice coefficient of the two sets of
categories assigned by the indexers. The average IIC for the whole sample is calculated by dividing
the sum of the IICs for all individual documents by the total number of documents N (in our case,
equal to 45). In addition, we calculated F1 scores for each disciplinary sample. We have calculated
these scores for levels 3 and 4 of the VODS.
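The document-level IIC and its sample average are simple to compute once each indexer's categories are stored as a set. The following is a minimal sketch under that assumption. The function names, the toy category codes, and the handling of two empty assignments are illustrative choices, not part of the study.

```python
def iic(expert: set, vocabulary: set) -> float:
    """Dice coefficient between two sets of assigned categories (Equation 1)."""
    a = len(expert & vocabulary)         # A: categories both indexers agree on
    b, c = len(expert), len(vocabulary)  # B and C: categories per indexer
    if b + c == 0:
        return 0.0  # assumption: two empty assignments count as no agreement
    return 2 * a / (b + c)


def average_iic(pairs) -> float:
    """Mean IIC over (expert, vocabulary) pairs, one pair per document."""
    scores = [iic(e, v) for e, v in pairs]
    return sum(scores) / len(scores)


# Toy sample of three documents with hypothetical category assignments.
sample = [
    ({"050402"}, {"050402"}),            # full agreement: 2*1 / (1+1)
    ({"050402", "050405"}, {"050402"}),  # partial agreement: 2*1 / (2+1)
    ({"050401"}, {"050403"}),            # no agreement: 0
]
```

For this toy sample the three document-level scores are 1, 2/3, and 0, so `average_iic(sample)` is their mean.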
Table 2 displays the results of the IIC and F1 calculations. The F-scores are also included for
the assessment of the performance of the ML models. On level three of the VODS, the IIC varies
between 45.2% for the sample from Sociological Abstracts and 62.2% for EconLit. On level four,
the IIC scores are considerably lower, with a minimum of 23.7% for EconLit and a maximum of
39.7% for ERIC. Previous research into IIC in the case of the PsycINFO database shows similar
results to those obtained for level 3 of the VODS. Leininger (2000) evaluates IIC for a similar classification
scheme, based on research areas within psychology. Using Rollin’s method, he finds an
average IIC of 45% (Rollin, 1981, as cited in Leininger, 2000, p. 6). Sievert and Andrews (1991)
study the IIC for a subset from Information Science Abstracts. The authors report average consistency
scores of about 50%. Funk and Reid (1983) study the IIC for MEDLINE. They report a consistency
score of 61.1% for the MeSH terms assigned to documents. While our scenario is
somewhat different (i.e., the first author “reclassified” publications according to the indexing
terms and experts were consulted to validate this reclassification), it seems that these low scores
rather indicate the difficulty of the problem at hand. Therefore, we conclude that the level 3 classification
is sufficiently robust to be used for our ML experiment, and hence we limit
ourselves to the classification of journal articles at this level. For matters of interpretation of
the differences in scores, at level 3 we have 31 subdisciplinary categories in total, compared
to 214 research specialties at level 4.
Table 2. Rollin (1981) interindexer consistency (IIC) and weighted F1 scores for the three data sets at two classification levels
| Sample from | IIC level 3 | F1 level 3 | IIC level 4 | F1 level 4 |
|---|---|---|---|---|
| Pedagogical & educational sciences (ERIC) | 52.9% | 0.59 | 39.7% | 0.42 |
| Business & economics (EconLit) | 62.2% | 0.57 | 23.7% | 0.51 |
| Sociology & anthropology (Sociological Abstracts) | 45.2% | 0.67 | 26.7% | 0.48 |
4. METHODS
For 30 years, the dominant paradigm of text classification (TC) has consisted of ML approaches.
ML algorithms are deployed such that “a general inductive process automatically builds an
automatic text classifier by learning, from a set of preclassified documents, the characteristics
of the categories [or labels] of interest” (Sebastiani, 2002, p. 2). ML approaches have already been
applied to classify abstracts or full texts of journal articles. Langlois, Nie et al. (2018) classify
papers into two broad domains: empirical and nonempirical. Our approach is different from
such studies as the level of granularity of categories into which we classify texts is far greater.
Consequently, articles from two different level 3 subdisciplinary categories are overall much
more similar than what is encountered in most other classification tasks.
The classification problem discussed in this paper belongs to the domain of multiclass
multilabel classification. Multiclass classification refers to assigning one of more than two classes
to an instance. Multiclass multilabel classification is an extension of this problem where we assign
one or more of multiple classes to an instance (Read, Pfahringer et al., 2009). Some abstracts were
thus assigned to multiple classes (up to a maximum of six).
A popular strategy is to transform the multilabel problem into different single-label classifica-
tion tasks. This can be done making use of binary relevance. As a baseline classifier, we make use
of Multinomial Naïve Bayes (MNB). We optimize this classifier to explore the best feature engineering
techniques as described below. Next, we compare the results obtained with MNB to
those obtained by a Gradient Boosting (GB) model. After discussing the feature engineering steps
in the following part, we will present a short description of the algorithms and the metrics that
were used to evaluate performance on different aspects.
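To make the binary relevance transformation concrete, the sketch below turns a list of label sets into one 0/1 target vector per class, so that each class can be handled by its own single-label (binary) classifier. The function name and the toy category codes are assumptions made for illustration.

```python
def binary_relevance_targets(label_sets, classes):
    """Return one 0/1 target vector per class (binary relevance)."""
    return {
        c: [1 if c in labels else 0 for labels in label_sets]
        for c in classes
    }


# Toy example: three abstracts, three hypothetical level 3 categories.
classes = ["050201", "050402", "050405"]
label_sets = [{"050402"}, {"050402", "050405"}, {"050201"}]
targets = binary_relevance_targets(label_sets, classes)
# targets["050402"] is the target vector for the binary task
# "does this abstract belong to category 050402?"
```

Under plain binary relevance the per-class tasks are independent. A classifier chain, as proposed by Read (2010), differs in that each binary classifier additionally receives the labels earlier in the chain as input features.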
4.1. Feature Engineering
Feature engineering for multilabel TC is done in the same way as for single-label TC. The
“features” or columns of the matrix are representations of words in the abstracts and titles of
the publications. The Bag of Words approach (BoW) is a traditional, popular, and simple yet
powerful way of vectorizing documents for TC. The BoW approach consists of slicing a text into
words or phrases (without taking word order into account). We have built customized tokenizer
functions in Python to extract four different textual features: lemma unigrams, lemma bigrams
(combined with unigrams), nouns, and noun phrases (Figure 3). Although previous research has
shown that for the BoW approach more advanced document representations such as nouns
and noun phrases are “not adequate to improve TC accuracy,” we wanted to explore this for
our specific use-case (Moschitti & Basili, 2004).
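As a simplified illustration of the BoW idea, the sketch below counts unigrams together with bigrams for a single text. It is not the tokenizer used in the study: it uses only the Python standard library and works on lowercased surface forms, whereas lemmas, nouns, and noun phrases would require a tagger or parser.

```python
import re
from collections import Counter


def tokenize(text: str):
    # Simplification: lowercase alphabetic tokens, no lemmatization or tagging.
    return re.findall(r"[a-z]+", text.lower())


def unigrams_and_bigrams(text: str) -> Counter:
    """Count single tokens and adjacent token pairs in one text."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


bow = unigrams_and_bigrams("Social change and social mobility")
# "social" occurs twice as a unigram; each bigram occurs once.
```

Each distinct key of the resulting counter corresponds to one column of the document-term matrix, which is why adding bigrams inflates the feature space so quickly.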
We made use of the natural language processing packages NLTK (Loper & Oiseau, 2002) et
SpaCy (Honnibal & Montani, 2018) to parse the texts and to perform part of speech tagging and
stemming. For stemming, we made use of NLTK’s implementation of the snowball stemming
algorithme. Scikit-learn’s count vectorizer and TF-IDF (term frequency-inverse document fre-
quency) transformer were used to process the outcomes of the different feature extraction
Études scientifiques quantitatives
96
je
D
o
w
n
o
un
d
e
d
F
r
o
m
h
t
t
p
:
/
/
d
je
r
e
c
t
.
m
je
t
.
/
e
d
toi
q
s
s
/
un
r
t
je
c
e
–
p
d
je
F
/
/
/
/
/
2
1
8
9
1
9
0
6
5
5
7
q
s
s
_
un
_
0
0
1
0
6
p
d
.
F
b
oui
g
toi
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Fine-grained classification of social science journal articles using textual data
Chiffre 3.
Four feature extraction methods: lemma unigrams, lemma bigrams, nouns, and noun phrases.
methods (Pedregosa, Varoquaux et al., 2011). With each tokenizer, we tested the performance
of both (normalized) TF and TF-IDF. This resulted in eight different feature spaces (see Figure 4).
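As a rough illustration of how such feature spaces are built, the sketch below derives unigram-plus-bigram tokens and converts them into normalized TF and TF-IDF weights using only the Python standard library. It is not the pipeline used in this study, which relied on NLTK, SpaCy, and scikit-learn. The whitespace tokenizer, the function names, and the plain log(N/df) IDF formula are simplifying assumptions made for this example.

```python
import math
from collections import Counter


def ngram_tokens(text, use_bigrams=True):
    """Lowercase, split on whitespace, and optionally append word bigrams."""
    words = text.lower().split()
    tokens = list(words)
    if use_bigrams:
        tokens += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return tokens


def term_frequencies(tokens):
    """Normalized TF: each count divided by the number of tokens in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def tf_idf(documents, use_bigrams=True):
    """Return one {term: weight} dict per document, assuming idf = log(N / df)."""
    tokenized = [ngram_tokens(doc, use_bigrams) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        tf = term_frequencies(tokens)
        weighted.append({t: w * math.log(n_docs / doc_freq[t]) for t, w in tf.items()})
    return weighted


# Toy documents (hypothetical, not taken from the data set).
docs = ["labour market policy", "education policy reform"]
tf_first = term_frequencies(ngram_tokens(docs[0]))
tfidf_first = tf_idf(docs)[0]
```

In this toy case the term "policy" occurs in both documents, so its TF-IDF weight drops to zero, whereas its plain TF weight stays equal to that of every other token in the first document.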
Feature sparseness is a common problem in TC. Transformation methods that make use of
bigrams can easily bring about feature matrices with hundreds of thousands or even millions of
columns, leading to very high dimensionality. To reduce the dimensionality, we make use of a
feature selection method based on randomized decision trees. After extracting textual features,
we fit a shallow extra trees classifier (maximum depth of 10) to the data to select the most
relevant ones.
4.2. Classification Algorithms
4.2.1. Multinomial Naïve Bayes (MNB)
MNB is one of the most popular TC algorithms used by the ML community. It is a fast, scalable
(i.e., iterates very fast over large data sets), and successful approach for many TC problems. Over
the years, it has become a popular baseline method, “the punching bag of classifiers” (Lewis,
1998, p. 2). MNB makes use of Bayes’ theorem to construct histograms based on the feature
vectors—in our case counts or probabilities of the textual features present in a document—for
every single instance. The classifier associates these histograms with the labels and estimates
likelihoods of a label and a distribution of feature counts occurring together.
If, however, a feature-class combination has zero counts, the probability will be set to zero. This
wipes out the information carried by the other probabilities, because they are multiplied by zero. For the
algorithm to be able to deal with such problems a smoothing parameter is used. Another way of
Figure 4. Overview of feature transformation steps.
dealing with this problem is transforming the feature space into a TF-IDF normalized matrix
(Rennie, Shih et al., 2003).
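The zero-count problem and the effect of the smoothing parameter can be made concrete with a small sketch. It assumes the usual additive (Lidstone) smoothing formula, P(term | class) = (count + alpha) / (total + alpha · |V|). The vocabulary, the counts, and the function names are invented for illustration and do not come from the data set used in this study.

```python
from collections import Counter


def smoothed_likelihoods(class_token_counts, vocabulary, alpha):
    """P(term | class) = (count + alpha) / (total + alpha * |V|)."""
    total = sum(class_token_counts.values())
    denom = total + alpha * len(vocabulary)
    return {term: (class_token_counts.get(term, 0) + alpha) / denom for term in vocabulary}


def document_likelihood(tokens, likelihoods):
    """Product of per-term likelihoods (the naive independence assumption)."""
    product = 1.0
    for term in tokens:
        product *= likelihoods[term]
    return product


# Hypothetical counts for one class; "school" was never seen with this class.
vocabulary = ["labour", "market", "school"]
economics_counts = Counter({"labour": 3, "market": 2})

unsmoothed = smoothed_likelihoods(economics_counts, vocabulary, alpha=0.0)
laplace = smoothed_likelihoods(economics_counts, vocabulary, alpha=1.0)

doc = ["labour", "market", "school"]
p_unsmoothed = document_likelihood(doc, unsmoothed)  # one factor is zero
p_laplace = document_likelihood(doc, laplace)        # small but positive
```

Without smoothing, the single unseen term forces the whole product to zero, regardless of how strongly the other terms point to the class. With alpha = 1 the product stays positive, so the evidence from "labour" and "market" still counts.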
4.2.2. Gradient Boosting Decision Trees (LightGBM)
Gradient Boosting Decision Trees (GBDTs) are, as the name indicates, tree-based learning algorithms.
These algorithms build ensemble models, or groups of decision trees aimed at reducing
residual errors for a split point in a decision tree. Boosting is a specific ensemble technique that
sequentially builds the models on random subsets of the features and instances. When an instance
is misclassified, its weight is increased and the next model tries to correct for this error.
In practice, this algorithm can be very time-consuming. Ke, Meng et al. (2017) have come up
with a solution to this problem by optimizing the randomness of the feature and instance selection
step. They combine Gradient-based One-Side Sampling (GOSS) with Exclusive Feature Bundling
(EFB) to speed up the training process. The GOSS procedure pays more attention to instances with
larger gradients (i.e., having more impact on the classification error of a model) “and randomly
drop those instances with small gradients” (Ke et al., 2017, p. 2). This approach is implemented
in the LightGBM software package (https://lightgbm.readthedocs.io).
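The sampling idea behind GOSS can be sketched in a few lines. The version below follows the description in Ke et al. (2017): keep the instances with the largest absolute gradients, draw a random sample from the remainder, and up-weight the sampled small-gradient instances by (1 − a)/b. It is a simplified stand-alone illustration, not LightGBM's internal code. The function name, the rates, and the gradient values are assumptions.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {index: weight} for the instances kept by a GOSS-style sampling step."""
    rng = random.Random(seed)
    n = len(gradients)
    # Rank instances from largest to smallest absolute gradient.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.sample(rest, min(n_other, len(rest)))
    weights = {i: 1.0 for i in top}
    # Up-weight the sampled small-gradient instances to compensate for those dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights.update({i: amplify for i in sampled})
    return weights


# Hypothetical gradients for ten training instances.
gradients = [0.9, -0.05, 0.02, -0.8, 0.01, 0.03, -0.04, 0.7, 0.06, -0.02]
kept = goss_sample(gradients, top_rate=0.3, other_rate=0.2)
```

Here the three instances with the largest gradients are always retained with weight 1, and two of the remaining seven are retained with weight 3.5, so only half of the data is used for the next tree.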
The EFB implementation exploits feature sparseness, which is a very common problem in TC.
It bundles sparse features together into a single feature, efficiently reducing the dimensionality. In
a previous study, we have found the LightGBM implementation of GB to be the best performing
algorithm to classify publications in sociology & anthropology, achieving accuracy scores well
over 80% (Eykens et al., 2019). Different from this previous study, in this paper we assess classifier
performance for a vastly more complex multilabel setting.
Decision tree-based models, however, come at a cost. They require tuning a wide range of
parameter settings. For LightGBM, one can set well over 100 parameters.¹ For our purposes,
we have chosen to optimize for 11 core parameters:
• the number of trees that will be built;
• the maximum depth of the trees: to limit tree growth;
• the number of leaves of the decision trees: last splits made in the model when reaching
  the optimal number of splits for a given loss function, or when reaching the predefined
  maximum depth;
• the learning rate: sets the weight of the outcomes of each tree for the final output;
• maximum bin: sets the maximum number of bins in which the feature values will
  be grouped;
• regularization alpha (L1): limits the impact of the leaves by encouraging sparsity (i.e.,
  weights to zero);
• regularization lambda (L2): limits the impact of the leaves by encouraging smaller
  weights;
• minimum child weight: the minimum sum of instance weight which is needed in a leaf
  (child);
• bagging fraction: the fraction of the data set used for each iteration;
• bagging frequency: the number of trees trained per random subsample of the data set; and
• minimum data in leaf: the minimum number of samples which should be captured in a leaf.
We will describe how we optimized the parameters in Section 4.2.4.
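To show what a randomized search over these 11 parameters might look like, the sketch below defines a search space and draws random settings from it. The keys are parameter names or aliases from the LightGBM documentation, but their mapping to the descriptions above, as well as every range and the `sample_setting` helper, are assumptions made for this example. The ranges actually used in the study are not reported here.

```python
import math
import random

# Hypothetical search ranges; only the parameter names follow LightGBM's documentation.
SEARCH_SPACE = {
    "n_estimators": ("int", 100, 1000),      # number of trees
    "max_depth": ("int", 3, 15),             # maximum depth of the trees
    "num_leaves": ("int", 8, 256),           # number of leaves
    "learning_rate": ("log", 1e-3, 0.3),     # learning rate
    "max_bin": ("int", 63, 511),             # maximum number of bins
    "reg_alpha": ("log", 1e-8, 10.0),        # L1 regularization
    "reg_lambda": ("log", 1e-8, 10.0),       # L2 regularization
    "min_child_weight": ("log", 1e-5, 10.0), # minimum child weight
    "bagging_fraction": ("float", 0.5, 1.0), # bagging fraction
    "bagging_freq": ("int", 1, 10),          # bagging frequency
    "min_data_in_leaf": ("int", 5, 100),     # minimum data in leaf
}


def sample_setting(space, rng):
    """Draw one random parameter setting from the search space."""
    setting = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            setting[name] = rng.randint(low, high)
        elif kind == "log":
            # Log-uniform: uniform on the logarithm of the value.
            setting[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            setting[name] = rng.uniform(low, high)
    return setting


rng = random.Random(42)
candidates = [sample_setting(SEARCH_SPACE, rng) for _ in range(100)]
```

Each of the 100 dictionaries in `candidates` could be passed to a gradient boosting model as keyword arguments, which keeps the number of model fits fixed no matter how large the underlying grid would be.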
1 For a complete overview of the parameters used in LightGBM, see https://lightgbm.readthedocs.io/en/latest/Parameters.html.
4.2.3. Multilabel classification: Classifier Chains (CC)
Two main approaches to multilabel classification exist: problem transformation and algorithm
adaptation. The most popular and computationally least expensive approach is problem transformation,
where a multilabel classification problem is transformed into N single-label classification
problems. An example of problem transformation is turning the multilabel task into
N binary classification problems, one per label, wherein each binary classification problem is treated
by a separate classifier. This is also known as the Binary Relevance method and has proven
successful in the domain of multilabel TC (Zhang, Li et al., 2018).
As each label is treated separately, however, the algorithm effectively ignores label dependence.
Read et al. (2009) have suggested an improvement of the Binary Relevance method by
“chaining” the results of each classifier to the input space so that the next training round takes
the results of previous classifiers into account. As different disciplinary categories might be
closer to each other in terms of concepts and topics studied, we do not expect labels to be
completely independent of each other. Thus, we opt for the CC approach. It should be noted
that other approaches exist, but these come at a cost of computational complexity as well as
intuitive understanding of the models.
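The chaining mechanism can be illustrated without committing to a particular base classifier. The sketch below shows only the data flow described by Read et al. (2009): during training, the classifier for label j sees the original features plus the true values of labels 0 to j − 1. During prediction, each classifier sees the predictions of its predecessors. The function names, the toy data, and the hand-written rules standing in for trained classifiers are assumptions.

```python
def chain_training_inputs(features, labels):
    """For label j, return the feature rows extended with the true labels 0..j-1."""
    n_labels = len(labels[0])
    per_label = []
    for j in range(n_labels):
        augmented = [list(x) + list(y[:j]) for x, y in zip(features, labels)]
        targets = [y[j] for y in labels]
        per_label.append((augmented, targets))
    return per_label


def chain_predict(x, classifiers):
    """Predict labels in chain order, feeding each prediction to the next classifier."""
    predicted = []
    for classify in classifiers:
        predicted.append(classify(list(x) + predicted))
    return predicted


# Toy data: two features, two labels per document (hypothetical).
features = [[1, 0], [0, 1], [1, 1]]
labels = [[1, 0], [0, 1], [1, 1]]
per_label = chain_training_inputs(features, labels)

# Hand-written rules standing in for trained binary classifiers.
rules = [
    lambda v: int(v[0] == 1),                # label 0 depends on feature 0 only
    lambda v: int(v[1] == 1 and v[2] == 1),  # label 1 also uses the label 0 prediction
]
```

With these rules, `chain_predict([0, 1], rules)` returns `[0, 0]`: the second classifier withholds its label because the first label was not predicted, which is exactly the kind of label dependence that Binary Relevance cannot express.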
4.2.4. Cross-validation
After vectorization, dimensionality reduction, and problem transformation with a binary relevance-
based CC algorithm, a holdout set (25% of the complete data set) was sliced from the initial data set
using an iterative stratification technique as proposed by Sechidis, Tsoumakas, and Vlahavas
(2011). This stratification method handles class imbalance for multilabel learning problems
in such a way that the distribution of instances over classes in the validation set is kept as close
to the actual distribution as possible.
Figure 5 visualizes the cross-validation procedure. The test data (0.25 of the total set) will be
used for the final evaluation of our models. For each iteration, a different subset of the remaining
75% of the data (training data) are used to evaluate different parameter settings for the feature
engineering options presented above. We make use of randomized parameter grid search and
threefold cross-validation to evaluate different parameter settings on parts or “folds” of the
training data. This means we run three new random experiments, each of which again divides
the training data into two different parts, using 66.66% of the training data to train a model with
a random parameter setting, and evaluating that setting on unseen data (the darker grey area
represented above). We make use of three different slices of training and test data to make sure
that our findings are robust.
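The proportions of this procedure can be reproduced with a short sketch. Note that it uses a plain random shuffle in place of the iterative stratification of Sechidis et al. (2011), so it does not preserve label distributions. The function name, the seed, and the record count of 1,200 are assumptions chosen to keep the arithmetic easy to follow.

```python
import random


def holdout_and_folds(n_records, test_fraction=0.25, n_folds=3, seed=0):
    """Randomly split record indices into a test set and cross-validation folds."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_test = round(test_fraction * n_records)
    test, train = indices[:n_test], indices[n_test:]
    # Deal the training indices into n_folds folds of (nearly) equal size.
    folds = [train[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        validation = folds[k]
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((fit, validation))
    return test, splits


test_idx, cv_splits = holdout_and_folds(1200)
```

For 1,200 records this yields 300 test records and three splits of 600 fitting records and 300 validation records each, i.e., two thirds of the training data are used to fit a model in every fold.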
4.3. Evaluation Metrics
Evaluating the performance of multilabel classification is not as straightforward as is the case for
single-label classification. Single-label classifiers’ predictive performance can be evaluated
Figure 5. Visualization of training set—validation set folds and test data. Lighter grey represents
training samples, and darker grey represents validation samples.
using the accuracy measure (i.e., the fraction of correctly classified instances over the total
number of instances). The Accuracy Acc is calculated as follows:

\[
\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} I\left(Y_i = \hat{Y}_i\right) \tag{2}
\]

I here is the indicator function. $Y_i$ is the set of true labels (subdisciplinary categories, in our case)
for document i, and $\hat{Y}_i$ is the set of predicted labels for document i. For multilabel classification
assessing such a score based on the full set of labels per instance would be too harsh, “since even
a single false positive or false negative label makes the example incorrect” (Read, 2010). Using
multiple metrics to capture different dimensions of the multilabel prediction is advised (Read,
2010; Zhang & Zhou, 2014). Two main dimensions can be assessed: the individual labels and
the entire training or testing label sets per instance (Zhang & Zhou, 2014, p. 1822). For full label
set evaluation, we calculate accuracy, and for label-based evaluation, we calculate precision
and recall.
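Per label $\ell$ from the set of labels $L$, we can determine the set $\mathrm{test}_\ell$ of documents to which this
label has been assigned and the set $\mathrm{pred}_\ell$ of documents for which the classifier predicts this
label. Weighted average precision P is determined as follows:

\[
P = \frac{1}{\sum_{\ell \in L} |\mathrm{test}_\ell|} \sum_{\ell \in L} |\mathrm{test}_\ell| \, \frac{|\mathrm{test}_\ell \cap \mathrm{pred}_\ell|}{|\mathrm{pred}_\ell|} \tag{3}
\]

where |·| denotes set cardinality. Similarly, weighted average recall R is:

\[
R = \frac{1}{\sum_{\ell \in L} |\mathrm{test}_\ell|} \sum_{\ell \in L} |\mathrm{test}_\ell| \, \frac{|\mathrm{test}_\ell \cap \mathrm{pred}_\ell|}{|\mathrm{test}_\ell|}
  = \frac{\sum_{\ell \in L} |\mathrm{test}_\ell \cap \mathrm{pred}_\ell|}{\sum_{\ell \in L} |\mathrm{test}_\ell|} \tag{4}
\]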
The F1 score is the harmonic mean of precision and recall. Precision and recall are first
“macroaveraged” by calculating the weighted mean of precision and recall for each label,
and these are used to calculate the final F1 scores. These measures give an indication of the
performance of our algorithm across the three different disciplinary data sets. Precision (Eq. 3) in
a multilabel setting is “the fraction of predicted relevances which are actually relevant” (Read,
2010, p. 41). In addition, Schapire and Singer (2000, as cited in Tsoumakas & Katakis, 2007)
propose Hamming Loss to take into account the fraction of labels that are predicted incorrectly.
Hamming Loss is calculated as follows (see Sorower, 2010):

\[
\text{Hamming Loss} = \frac{1}{N |L|} \sum_{i=1}^{N} \sum_{k=1}^{|L|} y_{i,k} \oplus \hat{y}_{i,k} \tag{5}
\]

Here, $\oplus$ is the exclusive-or operator, $y_{i,k}$ is 1 if document i has label k and 0 otherwise, and
similarly, $\hat{y}_{i,k}$ is 1 if document i is predicted to have label k and 0 otherwise. We average these
scores over the total number of classes |L| and predictions N. Hamming Loss thus denotes the
fraction of incorrectly predicted labels and its optimal value is 0.
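The four measures defined in Equations 2 to 5 can be computed directly from sets of labels. The sketch below is a minimal stand-alone version written for illustration. The function names and the toy label sets are assumptions, and labels that are never predicted are assumed to contribute zero precision.

```python
def set_accuracy(true_sets, pred_sets):
    """Eq. 2: share of documents whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def weighted_precision_recall(true_sets, pred_sets, labels):
    """Eqs. 3 and 4: per-label precision and recall, weighted by |test_l|."""
    support_total = 0
    precision_sum = 0.0
    hits_total = 0
    for label in labels:
        test = {i for i, t in enumerate(true_sets) if label in t}
        pred = {i for i, p in enumerate(pred_sets) if label in p}
        hits = len(test & pred)
        support_total += len(test)
        hits_total += hits
        if pred:  # assumption: a label that is never predicted adds zero precision
            precision_sum += len(test) * hits / len(pred)
    return precision_sum / support_total, hits_total / support_total


def hamming_loss(true_sets, pred_sets, labels):
    """Eq. 5: share of document-label pairs on which truth and prediction disagree."""
    errors = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * len(labels))


# Toy example with three labels and four documents (hypothetical).
labels = ["A", "B", "C"]
true_sets = [{"A"}, {"A", "B"}, {"C"}, {"B"}]
pred_sets = [{"A"}, {"A"}, {"C", "B"}, {"B"}]

acc = set_accuracy(true_sets, pred_sets)
precision, recall = weighted_precision_recall(true_sets, pred_sets, labels)
h_loss = hamming_loss(true_sets, pred_sets, labels)
```

In this example only two of the four label sets are reproduced exactly (set accuracy 0.5), although just 2 of the 12 document-label pairs are wrong (Hamming Loss of about 0.17). This shows why set accuracy is the harsher of the two measures.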
5. RESULTS
In the first part, we present the best results obtained for Multinomial Naïve Bayes. As detailed
above, we have vectorized the abstracts and titles making use of three slightly different textual
characteristics: lemmas, nouns, and noun phrases. Because of the computational requirements,
the ML steps were carried out on the High Performance Computing infrastructure of
VSC (the Flemish Supercomputer Center) at the University of Antwerp.
Table 3. Results of Multinomial Naïve Bayes classification performance for optimal feature space (lemma bigrams, no IDF normalization). Training and evaluation on hold-out set. Best results (per row) in bold

               TF                                 TF-IDF
               Lemma     Lemma    Nouns  Noun     Lemma     Lemma    Nouns  Noun
               unigrams  bigrams         phrases  unigrams  bigrams         phrases
Set accuracy   0.20      0.24     0.17   0.14     0.19      0.21     0.17   0.15
Hamming Loss   0.04      0.04     0.04   0.03     0.04      0.03     0.04   0.03
Precision      0.61      0.61     0.62   0.66     0.60      0.64     0.61   0.63
Recall         0.31      0.37     0.26   0.19     0.30      0.31     0.25   0.20
F1 score       0.36      0.42     0.31   0.27     0.35      0.38     0.31   0.29
5.1. Multinomial Naïve Bayes
For the Multinomial Naïve Bayes classifier, we aim to optimize the smoothing parameter alpha.
We randomly sample a value from a loguniform distribution, ranging from very small (i.e., $10^{-10}$)
up to 1 (i.e., add-one or Laplace smoothing). After finding the optimal value for alpha (0.13883)
by fitting the algorithm to the three folds of the training set, we make a prediction for the hold-out
test set. The results for the best feature representation method are presented in Table 3. The
optimal representation strategy turns out to be lemma bigrams without IDF normalization.
Making use of bigrams for lemmas decreased the Hamming Loss and increased the other
scores. We achieved quite similar results with TF-IDF transformed vectors. Interestingly, noun
phrases, except for the Hamming Loss evaluation metric, do not yield improved results.
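Sampling the smoothing parameter from a loguniform distribution means drawing its exponent uniformly, so that every order of magnitude between the two bounds is equally likely. A minimal sketch with the Python standard library follows. The helper name, the seed, and the number of draws are assumptions and do not reflect the actual search performed in this study.

```python
import random


def sample_alpha(rng, low_exp=-10.0, high_exp=0.0):
    """Draw alpha = 10**u with u uniform on [low_exp, high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)


rng = random.Random(0)
alphas = [sample_alpha(rng) for _ in range(1000)]

# Roughly half of the draws fall below 1e-5, the midpoint on the log scale.
share_below_1e5 = sum(a < 1e-5 for a in alphas) / len(alphas)
```

A plain uniform draw on [0, 1] would almost never propose values as small as 1e-6, whereas the log-scale draw explores very weak and very strong smoothing equally often.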
5.2. Gradient Boosting (LightGBM)
For the GB algorithm, we randomly sample values for 11 different parameters. For a more
detailed explanation of these parameters, refer to Section 4.2.2. To reduce computing time,
we limited the number of random iterations to 100. If we were to perform a full parameter grid
search, the number of model fits would be far too high. Keeping in mind that 25 fits take about
three hours, this is not desirable.
Compared to the best results achieved with MNB, the GB implementation scores better on
almost all evaluation metrics, except for precision (see Table 4). It is interesting to note that
Table 4. Scores for Gradient Boosting classification on the validation set, for each feature space. Best results (per row) in bold

               TF                                 TF-IDF
               Lemma     Lemma    Nouns  Noun     Lemma     Lemma    Nouns  Noun
               unigrams  bigrams         phrases  unigrams  bigrams         phrases
Set accuracy   0.46      0.46     0.43   0.33     0.45      0.45     0.43   0.32
Hamming Loss   0.03      0.03     0.03   0.04     0.03      0.03     0.03   0.04
Precision      0.66      0.64     0.60   0.49     0.66      0.63     0.60   0.49
Recall         0.48      0.50     0.45   0.36     0.48      0.50     0.45   0.36
F1 score       0.54      0.55     0.49   0.40     0.54      0.55     0.49   0.39
Figure 6. Box plot of F1 scores for all 31 subdisciplines for MNB and Gradient Boosting (GB).
Preprocessing: lemma bigrams, no IDF. The subdisciplines are grouped per discipline and the vertical
line segments indicate the average F1 scores per discipline.
MNB scores better for the precision metric in some scenarios. Accuracy scores, however, strongly
increase for GB, and Hamming Loss also decreases. The same feature transformation strategy
seems to work best for GB. For the lemma bigrams feature extraction method without IDF
normalization we achieve an F1 score of 0.55. Hamming Loss is lower as well, at 0.03,
meaning that about 3% of the labels are wrongly assigned. 46% of the label combinations predicted by the
algorithm were the same as those in the test set. It is noteworthy that the differences between
TF-IDF and TF feature transformations are negligible.
Figure 6 shows how the F1 scores are distributed across all 31 subdisciplines. We observe that
the scores for GB are not only higher on average but also less spread out, with the exception of
three poorly scoring subdisciplines. These three are all subdisciplines of educational & pedagogical
sciences: Informal learning, General pedagogical & educational sciences, and Parenting &
Figure 7. Relation between the number of records and F1 scores for MNB and GB for each of the
31 subdisciplines studied. Preprocessing: lemma bigrams, no IDF.
family education. Except for these subdisciplinary categories, overall no discipline performs
clearly better or worse than the others, although the number of training records seems to have
some influence: subdisciplines with fewer training records tend to get lower F1 scores (Figure 7).
While this relation is somewhat stronger for MNB, the three cases for GB with exceptionally low
F1 scores all have few (between 174 and 780) records.
6. DISCUSSION
Classifying research output into disciplinary categories is of fundamental importance for nearly
all bibliometric analyses. In the introduction to this paper, we touched upon the issue of differentiation
in the sciences, leading to an ever-increasing number of research communities and
disciplines (Stichweh, 2003). This emergence of new disciplines can lead, among other things,
to the formation of new research specialties, the organization of new conferences, the formation
of new scientific societies and the foundation of new journals (see Shneider, 2009). As the landscape
of disciplines grows more diverse, classification schemes are being updated to better
fit this dynamic reality.
The development of such an updated classification scheme is exemplified by the implementation
of the VODS in Flanders (see Vancauwenbergh & Poelmans, 2019a). Such a diverse and
fine-grained classification scheme makes it possible to study interactions between disciplines
(i.e., inter- and intradisciplinary knowledge flows) more closely, and map discrepancies between
different classification systems in more detail. Yet, it requires new ways of approaching classification
tasks as well, in particular in settings such as the classification of expertise, projects, and
outputs for which citation data are not available. In this article we take up the specific challenge
of a fine-grained classification of social sciences journal articles using the text of their abstracts
and titles.
To summarize, our study consists of three elements. First, we constructed a labeled data set.
As the VODS classification scheme is relatively new, we lack a data set of classified publications
or other documents that can readily be used for ML purposes. This led us to manually construct a
training data set consisting of data extracted from EconLit, ERIC, and Sociological Abstracts. Each
of the 31 VODS subdisciplines of economics & business, pedagogy & educational sciences, and
sociology & anthropology was translated to a thesaurus-based query for the respective databases.
Second, the query results were validated by human experts. IIC and F1 scores indicate that
categories at level 3 (subdisciplines) and 4 (specialties) of the VODS can sometimes be hard to
distinguish between. At the same time, the IIC scores for level 3 categories are comparable to
those obtained in earlier IIC studies.
Third, the labeled data set at level 3 was used to train Multinomial Naïve Bayes and GB ML
models. If we compare Figure 6 to Table 2, the configuration with the best results yields F1
scores slightly below those for the validation by human experts. This indicates that the models
might still be improved somewhat, but very high scores are probably unrealistic or indicative of
overfitting. Taken together, the results suggest that level 3 of VODS is so fine grained that some
categories are hard to discern in practice and as a result a certain degree of ambiguity becomes
unavoidable, at least for the disciplines studied here.
While some of the reported indicators, such as F-scores, are relatively low, we think it is instructive
to compare our results to those of the recent studies by Kandimalla et al. (2020) and
Dunham et al. (2020). While these authors report better accuracy, it should be highlighted that in
this paper we specifically look at the applicability of supervised learning in the context of social
sciences. As Kandimalla and colleagues note, this is not an easy task given the large overlap in
terminology and the proximity of the categories. Kandimalla et al. (2020) have for that reason
dropped or collapsed 120 out of 235 of the SC from their data set. In addition, they drop documents
assigned to multiple disciplines. It should be noted that WoS SC are less granular than the
ones used in our study (i.e., at the level of disciplines instead of subdisciplines). Dunham et al.
(2020) report good scores for their model, which classifies AI publications into subdisciplinary
categories, but their model is restricted to only four categories in AI, hence it is also less prone to
errors. Our system works with 31 subcategories, divided over three social science disciplines.
Taking these elements into account, it becomes clear that the lower scores are to a large extent a
result of the difficulty of the task at hand.
A matter of concern that can be raised in this regard is to what extent classification of documents
at a level of granularity that is finer than that of disciplines is feasible. Disciplines, and especially
subdisciplines and research specialties, are in constant flux. Whereas most publications might
belong to the knowledge base of just one discipline, their contents may be of relevance to two
or more subdisciplines and research specialties. Theoretical work such as actor-network theory
in the social sciences, for example, has been of relevance for many disciplines, subdisciplines,
and research specialties, not only in the social sciences. Interdisciplinary studies, in which an integration
of different disciplinary knowledge sources takes place to tackle a research question,
may be classified under several research specialties, subdisciplines, and disciplines. As these examples
illustrate, a multilabel approach, as applied in this paper, is needed in view of the validity of a
classification.
This framework requirement needs to be balanced with requirements in terms of the accuracy,
feasibility, and reliability of a classification scheme. As the results of our study show, the classification
of social sciences publications into subdisciplines (VODS level 3) on the basis of
abstracts and titles is a hard task for both humans and machines; classification into research
specialties (VODS level 4) probably is not all that meaningful any more (cf. the IIC and F1 scores
in Table 2). We argue that classification at the subdiscipline level should be further explored and
fine-tuned, as this level of granularity corresponds to actual policy needs and might be improved
by smart combinations of human input and ML. For example, a recommender system might be
improved through validation by the authors of papers and machine classifications might gain
accuracy through the use of larger sets of texts describing expertise, projects, and publications
classified by humans.
6.1. Limitations
Four limitations of this paper should be highlighted. First, we could not compare our results to
any benchmark. Although there have been some experiments in which supervised ML
techniques are used to classify (or study elements of ) scientific articles (see for example
Langlois et al., 2018; Matwin & Sazonova, 2012), to date no comparable applications or data
sets exist (i.e., medium-sized annotated sets of social science publications classified according
to fine-grained disciplinary categories)—at least not to our knowledge. The lack of previous
work in this line of research makes it hard to benchmark our results for this specific problem
setting.
Second, given that the records in our data set were extracted from EconLit, ERIC, or
Sociological Abstracts, each record has been assigned to only one discipline of the VODS level 2
(i.e., to economics & business, to pedagogy & educational sciences, or to sociology & anthropology),
although possibly to multiple subdisciplines of that discipline. Thus, interdisciplinary cases are not
present in our initial training data. We cannot compare the performance of the models deployed
in this study at different levels of granularity, in particular the discipline and subdiscipline levels.
However, our results do show that the subdiscipline level is, at least for articles in social sciences
and using their abstracts and titles only, the most fine-grained level that makes sense for classification
exercises.
Third, we have coupled classification systems with two entirely different functions. On the
one hand, we have the indexing systems based on the thesauri. These are systems that are
designed for information retrieval purposes and have no limit to the number of indexing terms
that can be assigned to a document. In such a system, there is no purpose in trying to fit a
document into one to six subdisciplinary categories. Thus, we have reduced the complexity
and granularity of the thesaurus-based classification to a fixed number of disciplinary groups.
This “mismatch” between the two classification systems might lead to relatively low-scoring
results when an ML algorithm is tasked with reproducing this classification.
Fourth, as discussed in Section 3.1, the queries have been manually constructed by the first
author. The indexing terms in the thesauri were coupled to VODS discipline codes based on
the semantic definition of each field in the VODS. It can be argued that this is a highly subjective
task, as previous research has shown that disagreement between indexers when annotating
records with indexing terms is commonplace. For many categories, however, the indexing terms
nicely overlapped with the categories of the VODS. This gave us confidence in the construction.
As the expert validation yielded results comparable to previous exercises of this kind, we believe
this procedure to be of sufficient quality to allow for an automated (re)classification experiment.
On the other hand, one can also interpret the relatively low IIC scores as indicative of the inherent
ambiguity at this level of granularity.
6.2. Future Research and Practical Applications
The use of a minimum of textual data makes the approach presented in this study practical to
generalize to other data sets (e.g., projects and project applications). Using additional bibliographic
metadata would presumably increase the performance of the classification algorithms.
Full-text documents would be an interesting path forward, yielding more textual data and a
better sensitivity of TF-IDF transformations. In addition, it would be interesting to study ambiguities
of the classification resulting from the predictions made by the algorithm and study
those in detail.
With regard to the ML modules used, we acknowledge that more advanced and complex
language-processing techniques have a good track record when it comes to automatically
classifying text documents (e.g., BERT and related models). Dunham and colleagues (2020) have
shown that SciBERT models outperform other NLP methods when applying them to classify
publications in the field of Artificial Intelligence. For our purposes, however, we have opted
to keep the setup relatively straightforward. The main motivation behind this study was to
investigate and compare the feasibility of using supervised ML algorithms for this particular,
challenging fine-grained classification task. We leave comparisons of other methods and
feature transformation procedures for future research.
Questions surrounding the properties of interdisciplinarity demand a clear operationalization
of disciplines, which is not straightforward. This is in itself also the main reason why
many different classification schemes are used in different contexts, each pointing to insights
about different aspects (organizational, cognitive, etc.) of a discipline (Guns, Sīle et al.,
2018). Textual approaches might lead to other insights regarding the cognitive structure of disciplines,
but these same disciplines are in constant flux (Yan, Ding et al., 2012). A fixed classification
scheme will not meet future developments in science; “… human assigned subject
categories are akin to using a rearview mirror to predict where a fast-moving car is heading”
(Suominen & Toivanen, 2016, p. 2464). To this end, the team working on the VODS has
provided a “not elsewhere classified” category for all the subfields (Vancauwenbergh &
Poelmans, 2019a). This particular category has not been studied in this article. Future deployments
of the classification system in Flanders will allow researchers to identify themselves and/or
their projects with this category and assign documents to it, and, following from this, we
could study text residing in these categories to discover emerging research problems and topics.
Once researchers employed by the Flemish universities start to label their expertise, projects,
and outputs using the VODS, supervised ML algorithms can be trained on a broader range of
disciplinary categories, allowing for a broader evaluation of the method proposed in this paper.
This approach will enable us in practice to assist with annotating unlabeled work, or it can serve to
underpin an online recommendation system for researchers, embedded in current research information
systems. The output of supervised text classifications can also be compared to other existing
classification schemes. We can, for example, contrast the publication level classification with
journal level classifications of the same publications to study the disciplinary or interdisciplinary
diversity of journals.
Finally, we should highlight that measuring IIC consistency is not straightforward. While
there exists a long tradition of research that makes use of scoring systems such as IIC or F1
scores to assess the reliability and functionality of classification systems, there have been attempts
to include semantic relations between indexing terms or categories to develop more realistic
measures of indexing accuracy. Medelyan and Witten (2006) propose calculating the cosine
similarity between word vectors of vocabularies or semantic definitions of categories. This is
an interesting approach, but, to our knowledge, there are no systematic comparisons with
other scoring systems available to date. It would be interesting to use such an approach when
assessing classification systems in which a semantic definition of categories is available. The
classification error could for example be weighted by the cosine distance between sentence
embeddings of semantic definitions of the disciplinary categories. If a classification error is made
whereby two distant categories are mistaken for each other, then the error is greater than when
these categories are closer to each other in terms of cosine distance.
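The weighting idea described above can be illustrated with a minimal sketch. It assumes a single-label setting for simplicity and uses made-up three-dimensional vectors in place of real sentence embeddings of category definitions; the category names and the helper names `cosine_distance` and `weighted_error` are hypothetical and do not come from the study.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_error(true_labels, predicted_labels, embeddings):
    """Mean error in which each mistake costs the cosine distance between
    the embeddings of the true and the predicted category.
    Correct predictions cost 0."""
    costs = [
        0.0 if t == p else cosine_distance(embeddings[t], embeddings[p])
        for t, p in zip(true_labels, predicted_labels)
    ]
    return float(np.mean(costs))

# Hypothetical embeddings of category definitions (assumption: in practice
# these would come from a sentence-embedding model).
embeddings = {
    "sociology_of_education": [0.9, 0.1, 0.0],
    "educational_policy": [0.8, 0.3, 0.1],
    "macroeconomics": [0.0, 0.2, 0.9],
}

y_true = ["sociology_of_education", "sociology_of_education"]
near_miss = ["educational_policy", "sociology_of_education"]
far_miss = ["macroeconomics", "sociology_of_education"]

near_error = weighted_error(y_true, near_miss, embeddings)
far_error = weighted_error(y_true, far_miss, embeddings)
```

With these toy vectors, confusing two semantically close categories yields a smaller error than confusing two distant ones, which is the behavior the paragraph above argues for.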
7. CONCLUSION
In this article we present a supervised ML approach to classify social science journal articles into
multiple fine-grained disciplinary categories. By making use of GB with CC we are capable of
assigning one or more disciplinary categories to text documents (i.e., abstracts and titles). To do
so, we have compiled a new data set consisting of 113,909 records originating from three disciplinary
databases in the social sciences (EconLit, ERIC, and Sociological Abstracts).
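As a rough illustration of gradient boosting inside a classifier chain, the sketch below trains a chain on a tiny invented corpus. It is not the study's pipeline: scikit-learn's `GradientBoostingClassifier` stands in for LightGBM, the six documents and three labels are made up, and the TF-IDF settings and chain order are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus of title/abstract snippets (not the study's data).
docs = [
    "inflation and monetary policy in open economies",
    "teacher training and classroom learning outcomes",
    "social stratification and class mobility",
    "economic returns to schooling and education spending",
    "school segregation and social inequality among pupils",
    "labour markets, unemployment and social networks",
]

# One column per illustrative label: economics, education, sociology.
# A document may carry more than one label (multilabel setting).
Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
])

# Each link in the chain is a binary classifier that sees the text features
# plus the labels earlier in the chain, so label dependencies can be used.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    ClassifierChain(
        GradientBoostingClassifier(n_estimators=20, random_state=0),
        order=[0, 1, 2],
        random_state=0,
    ),
)
model.fit(docs, Y)

# Predictions form a 0/1 indicator matrix with one row per document.
predictions = model.predict(docs)
```

Because the chain passes earlier label decisions on as extra features, the chosen order can affect results; the fixed order here is only for reproducibility of the sketch.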
The novelty of this study lies in two aspects: the construction of the labeled data set, based on
discipline-specific thesauri, and the application of supervised ML algorithms to classify social
science journal articles into one or more fine-grained disciplinary categories using text. We show
in detail how we have collected the data and how we have validated the labeling based on the
subject indexing terms from the thesauri. With regard to the ML methods, we compare different
feature engineering techniques and two well-established classification algorithms. The Gradient
Boosting classifier (LightGBM) in a Classifier Chaining framework is capable of predicting approx-
imately 46% of the exact label combinations correctly, with a fraction of 0.3% of labels assigned
incorrectly. The F1 score is 0.55.
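The three kinds of figures reported above (share of exactly matching label combinations, fraction of incorrectly assigned labels, and F1) can be computed for any multilabel prediction with scikit-learn, as in the following sketch. The label matrices are invented for illustration, and micro-averaging for F1 is an assumption; this passage does not state which averaging the reported score uses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical indicator matrices: rows are documents, columns are categories.
y_true = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])
y_pred = np.array([
    [1, 0, 0],  # exact match
    [0, 1, 0],  # one label missed
    [1, 1, 0],  # exact match
    [0, 1, 1],  # one extra label
])

# For multilabel input, accuracy_score is the subset accuracy: a document
# counts as correct only if its whole label combination matches.
exact_match = accuracy_score(y_true, y_pred)

# Hamming loss is the fraction of individual label slots that are wrong.
label_error = hamming_loss(y_true, y_pred)

# Micro-averaged F1 pools true/false positives over all labels (assumption).
f1_micro = f1_score(y_true, y_pred, average="micro")
```

Subset accuracy is much stricter than Hamming loss, which helps explain how a low per-label error rate can coexist with a modest share of exactly correct label combinations.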
In a previous study (Eykens et al., 2019) we assessed the performance of four different ML
algorithms for the classification of sociology and anthropology journal articles extracted from
Sociological Abstracts into fine-grained disciplinary categories (level 4 of the VODS). Making
use of the same LightGBM module (Ke et al., 2017), we were able to correctly classify over
80% of the publications. In this previous study, we made use of simple feature engineering (i.e.,
lemmas and unigrams) and we did not assess whether multilabel classification was possible. To date,
aside from the work by Read (2010), we are unaware of studies making use of similar methods to
achieve fine-grained disciplinary classifications. To our knowledge, no work exists that studies the
performance of supervised ML algorithms to classify social science documents on such a granular
level.
Because we have significantly scaled up our data set, this study adds more nuance to the pre-
vious experimental study (Eykens et al., 2019). We have added textual data from two additional
disciplinary databases, namely ERIC and EconLit, and we have assessed more complex feature
engineering techniques as well. Most importantly, we assess whether multilabel classification is manageable.
The results confirm the robustness of our previous work and expand it to additional data
sources. We further demonstrate that to a certain extent the approach is indeed generalizable to
a multilabel classification task. To achieve this, the quality of the data collection and data
validation is crucial. Thus, we encourage others to develop a well-thought-out data collection
and validation procedure to make sure that the complete ML experiment is reproducible, from
data collection and processing onwards.
To summarize, this study shows that supervised ML algorithms are capable of classifying
social science journal articles into predefined, fine-grained categories based on the limited
textual data of abstracts and titles only. However, for both human experts and machines, such
classification at the subdisciplinary level proves very hard, to the extent that the question can be
raised of whether such an attempt makes sense. Given the need for fine-grained classification in
view of assessments, evaluations, and policy, we suggest that the informetric community further
explore the possibilities for such fine-grained classification. For example, can the results
obtained in this study be improved with different or more advanced NLP techniques, and by
combining human expertise with advanced ML techniques? Like others (Boyack & Klavans,
2014; Suominen & Toivanen, 2016), we do not believe it to be fruitful to consider one or other
classification system superior. We do instead insist that each approach has its merits, especially
when contrasted to others. We hope that our work will spur others to conduct similar studies that
explore the limits of the feasibility of classification through algorithms and human experts.
ACKNOWLEDGMENTS
We would like to thank the editor and anonymous reviewers for their comments as well as the three
experts, Doctor Pieter Spooren (University of Antwerp), Professor Doctor Raf Vanderstraeten
(Ghent University), and Professor Doctor Nick Deschacht (University of Antwerp and KU
Leuven), who helped validate the data used for the analysis.
AUTHOR CONTRIBUTIONS
Joshua Eykens: Conceptualization, Methodology, Investigation, Formal analysis, Data curation,
Writing—original draft, Writing—review & editing, Visualization, Project administration. Raf
Guns: Conceptualization, Methodology, Investigation, Formal analysis, Writing—original draft,
Writing—review & editing, Visualization, Supervision. Tim C. E. Engels: Conceptualization,
Writing—original draft, Writing—review & editing, Supervision.
COMPETING INTERESTS
The authors have no conflict of interest.
FUNDING INFORMATION
The computing resources and services used in this work were provided by the VSC (Flemish
Supercomputer Center), funded by the Research Foundation Flanders (FWO) and the Flemish
Government.
This investigation has been made possible by the financial support of the Flemish Government
to the Centre for R&D Monitoring (ECOOM). The opinions in the paper are the authors’
and not necessarily those of the government.
DATA AVAILABILITY
The Zenodo dataset (Eykens & Guns, 2020) consists of the full set of queries and the number of
results for each, the data resulting from the expert evaluation, as well as the source code.
Unfortunately, due to copyright restrictions from ERIC, EconLit, and Sociological Abstracts, we
are not able to make the retrieved records themselves openly available.
REFERENCES
Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R., Patek, M.,
Biberstine, J. R., … Börner, K. (2011). Clustering more than two
million biomedical publications: Comparing the accuracies of
nine text-based similarity approaches. PLoS ONE, 6(3), e18029.
DOI: https://doi.org/10.1371/journal.pone.0018029, PMID:
21437291, PMCID: PMC3060097
Boyack, K. W., & Klavans, R. (2014). Including cited non-source
items in a large-scale map of science: What difference does it
make? Journal of Informetrics, 8(3), 569–580.
DOI: https://doi.org/10.1016/j.joi.2014.04.001
Callon, M., Courtial, J.-P., & Laville, F. (1991). Co-word analysis as a tool
for describing the network of interactions between basic and
technological research: The case of polymer chemistry. Scientometrics,
22(1), 155–205. DOI: https://doi.org/10.1007/BF02019280
Callon, M., Courtial, J.-P., Turner, W. A., & Bauin, S. (1983). From
translations to problematic networks: An introduction to co-word
analysis. Social Science Information, 22(2), 191–235. DOI:
https://doi.org/10.1177/053901883022002003
Dunham, J., Melot, J., & Murdick, D. (2020). Identifying the
development and application of artificial intelligence in scientific text.
arXiv preprint. arXiv:2002.07143.
Eykens, J., & Guns, R. (2020). Supervised classification of SSH
publications. DOI: https://doi.org/10.5281/zenodo.3822309
Eykens, J., Guns, R., & Engels, T. C. E. (2019). Article level classification
of publications in sociology: An experimental assessment of
supervised machine learning approaches. In G. Catalano, C.
Daraio, M. Gregori, H. F. Moed, & G. Ruocco (Eds.), 17th
International Conference on Scientometrics & Informetrics
(ISSI2019) (vol. 1, pp. 738–743). Sapienza University of Rome,
Italy: Edizioni Efesto.
Funk, M. E., & Reid, C. A. (1983). Indexing consistency in MEDLINE.
Bulletin of the Medical Library Association, 71(2), 176–183.
Glänzel, W., Schubert, A., & Czerwon, H.-J. (1999). An item-by-item
subject classification of papers published in multidisciplinary
and general journals using reference analysis. Scientometrics,
44(3), 427–439. DOI: https://doi.org/10.1007/BF02458488
Guns, R., Sīle, L., Eykens, J., Verleysen, F. T., & Engels, T. C. E.
(2018). A comparison of cognitive and organizational
classification of publications in the social sciences and humanities.
Scientometrics, 116(2), 1093–1111.
DOI: https://doi.org/10.1007/s11192-018-2775-x
Hammarfelt, B. (2018). What is a discipline? The conceptualization
of research areas and their operationalization in bibliometric
research. In R. Costas, T. Franssen, & A. Yegros-Yegros (Eds.),
Science, Technology and Innovation Indicators in Transition—
STI2018 (pp. 197–203). Leiden, The Netherlands: Centre for
Science and Technology Studies (CWTS).
Honnibal, M., & Montani, I. (2018). spaCy 2.0.11.
DOI: https://doi.org/10.5281/zenodo.4291179
Janssens, F., Zhang, L., De Moor, B., & Glänzel, W. (2009). Hybrid
clustering for validation and improvement of subject-classification
schemes. Information Processing and Management, 45(6), 683–702.
DOI: https://doi.org/10.1016/j.ipm.2009.06.003
Kandimalla, B., Rohatgi, S., Wu, J., & Lee Giles, C. (2020). Large
scale subject category classification of scholarly papers with deep
attentive neural networks. arXiv preprint. arXiv:2007.13826.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., … Liu, T.-Y.
(2017). LightGBM: A highly efficient gradient boosting decision
tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, & R. Garnett (Eds.), Neural Information Processing
Systems 2017 (pp. 1–9). Long Beach, CA: Curran Associates, Inc.
Kulczycki, E., Engels, T. C. E., Pölönen, J., Bruun, K., Dušková, M., …
Zuccala, A. (2018). Publication patterns in the social sciences and
humanities: Evidence from eight European countries. Scientometrics,
116, 463–486. DOI: https://doi.org/10.1007/s11192-018-2711-0
Langlois, A., Nie, J. Y., Thomas, J., Hong, Q. N., & Pluye, P. (2018).
Discriminating between empirical studies and nonempirical
works using automated text classification. Research Synthesis
Methods, 9(4), 587–601. DOI: https://doi.org/10.1002/jrsm.1317,
PMID: 30103261
Leininger, K. (2000). Interindexer consistency in PsycINFO. Journal
of Librarianship and Information Science, 32(1), 4–8. DOI:
https://doi.org/10.1177/096100060003200102
Lewis, D. D. (1998). Naïve (Bayes) at forty: The independence
assumption in information retrieval. In C. Nédellec & C. Rouveirol
(Eds.), 10th European Conference on Machine Learning—ECML-98
(vol. 1398, pp. 4–15). Chemnitz: Springer.
DOI: https://doi.org/10.1007/BFb0026666
Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. In
Proceedings of the ACL-02 Workshop on Effective tools and
methodologies for teaching natural language processing and
computational linguistics (vol. 1, pp. 63–70). Philadelphia, PA:
Association for Computational Linguistics.
DOI: https://doi.org/10.3115/1118108.1118117
Matwin, S., & Sazonova, V. (2012). Direct comparison between
support vector machine and multinomial naive Bayes algorithms
for medical abstract classification. Journal of the American
Medical Informatics Association, 19(5), 917.
DOI: https://doi.org/10.1136/amiajnl-2012-001072, PMID: 22683917, PMCID:
PMC3422847
Medelyan, O., & Witten, I. H. (2006). Measuring inter-indexer
consistency using a thesaurus. In 6th ACM/IEEE-CS joint conference
on Digital libraries (pp. 296–297). Chapel Hill, NC: ACM. DOI:
https://doi.org/10.1145/1141753.1141816
Moschitti, A., & Basili, R. (2004). Complex linguistic features for text
classification: A comprehensive study. In S. McDonald & J. Tait
(Eds.), Advances in Information Retrieval. ECIR 2004 (vol. 2997,
pp. 181–196). Berlin: Springer.
DOI: https://doi.org/10.1007/978-3-540-24752-4_14
Nederhof, A. J. (2006). Bibliometric monitoring of research
performance in the social sciences and humanities: A review.
Scientometrics, 66(1), 81–100.
DOI: https://doi.org/10.1007/s11192-006-0007-2
Norris, M., & Oppenheim, C. (2007). Comparing alternatives to the
Web of Science for coverage of the social sciences’ literature.
Journal of Informetrics, 1(2), 161–169.
DOI: https://doi.org/10.1016/j.joi.2006.12.001
OECD. (2007). Revised Fields of Science and Technology (FOS)
Classification in the Frascati Manual. Paris: OECD Publishing.
Ossenblok, T., Engels, T. C. E., & Sivertsen, G. (2012). The
representation of the social sciences and humanities in the Web of
Science—A comparison of publication patterns and incentive
structures in Flanders and Norway (2005–9). Research
Evaluation, 21(4), 280–290.
DOI: https://doi.org/10.1093/reseval/rvs019
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., …
Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12(85), 2825–2830.
Read, J. (2010). Scalable Multi-label Classification. Doctoral thesis,
University of Waikato, Hamilton, New Zealand. Retrieved from
https://hdl.handle.net/10289/4645
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier
chains for multi-label classification. In W. Buntine, M.
Grobelnik, D. Mladenić, & J. Shawe-Taylor (Eds.), Machine
Learning and Knowledge Discovery in Databases. ECML PKDD
2009. Berlin: Springer.
DOI: https://doi.org/10.1007/978-3-642-04174-7_17
Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003).
Tackling the poor assumptions of naïve Bayes text classifiers. In
T. Fawcett & N. Mishra (Eds.), Twentieth International Conference
on Machine Learning (ICML-2003) (pp. 616–623). Washington,
DC: AAAI Press.
Rip, A., & Courtial, J.-P. (1984). Co-word maps of biotechnology:
An example of cognitive scientometrics. Scientometrics, 6(6),
381–400. DOI: https://doi.org/10.1007/BF02025827
Rollin, L. (1981). Indexing consistency, quality and efficiency.
Information Processing & Management, 17(2), 69–76. DOI:
https://doi.org/10.1016/0306-4573(81)90028-5
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based
system for text categorization. Machine Learning, 39(2/3), 135–168.
DOI: https://doi.org/10.1023/A:1007649029923
Sebastiani, F. (2002). Machine learning in automated text
categorization. ACM Computing Surveys, 34(1), 1–47.
DOI: https://doi.org/10.1145/505282.505283
Sechidis, K., Tsoumakas, G., & Vlahavas, I. (2011). On the
stratification of multi-label data. In D. Gunopulos, T. Hofmann, D.
Malerba, & M. Vazirgiannis (Eds.), Machine Learning and
Knowledge Discovery in Databases. Joint European Conference
on Machine Learning and Knowledge Discovery in Databases—
ECML PKDD 2011 (vol. 6913). Berlin: Springer.
DOI: https://doi.org/10.1007/978-3-642-23808-6_10
Shneider, A. M. (2009). Four stages of a scientific discipline; four types
of scientist. Trends in Biochemical Sciences, 34(5), 217–223. DOI:
https://doi.org/10.1016/j.tibs.2009.02.002, PMID: 19362484
Sievert, M. C., & Andrews, M. J. (1991). Indexing consistency in
Information Science Abstracts. Journal of the American Society for
Information Science, 42(1), 1–6.
DOI: https://doi.org/10.1002/(SICI)1097-4571(199101)42:1<1::AID-ASI1>3.0.CO;2-9
Sjögårde, P., & Ahlgren, P. (2018). Granularity of algorithmically
constructed publication-level classifications of research publications:
Identification of topics. Journal of Informetrics, 12(1), 133–152. DOI:
https://doi.org/10.1016/j.joi.2017.12.006
Sjögårde, P., & Ahlgren, P. (2019). Granularity of algorithmically
constructed publication-level classifications of research publications:
Identification of specialties. Quantitative Science Studies, 1(1),
207–238. DOI: https://doi.org/10.1162/qss_a_00004
Sorower, M. S. (2010). A literature survey on algorithms for
multi-label learning. Technical Report, Corvallis: Oregon State
University.
Stichweh, R. (1992). The sociology of scientific disciplines: On the
genesis and stability of the disciplinary structure of modern science.
Science in Context, 5(1), 3–15.
DOI: https://doi.org/10.1017/S0269889700001071
Stichweh, R. (2003). Differentiation in science: Causes and
consequences. In G. H. Hardon (Ed.), Unity of Knowledge in
Transdisciplinary Research for Sustainable Development (vol. 1, pp. 82–90).
Oxford: EOLSS Publishers.
Sugimoto, C. R., & Weingart, S. (2015). The kaleidoscope of
disciplinarity. Journal of Documentation, 77(4), 775–794. DOI:
https://doi.org/10.1108/JD-06-2014-0082
Suominen, A., & Toivanen, H. (2016). Map of science with topic
modeling: Comparison of unsupervised learning and
human-assigned subject classification. Journal of the Association for
Information Science and Technology, 67(10), 2464–2476. DOI:
https://doi.org/10.1002/asi.23596
Suominen, O. (2019). Annif: DIY automated subject indexing using
multiple algorithms. LIBER Quarterly, 29(1), 1–25.
DOI: https://doi.org/10.18352/lq.10285
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An
overview. International Journal for Data Warehousing and Mining,
3(3), 1–13. DOI: https://doi.org/10.4018/jdwm.2007070101
van den Besselaar, P., & Heimeriks, G. (2006). Mapping research
topics using word-reference co-occurrences: A method and an
exploratory case study. Scientometrics, 68(3), 377–393. DOI:
https://doi.org/10.1007/s11192-006-0118-9
Vancauwenbergh, S., & Poelmans, H. (2019a). The creation of the
Flemish research discipline list, an important step forward in
harmonising research information (systems). Procedia Computer
Science, 146, 265–278.
DOI: https://doi.org/10.1016/j.procs.2019.01.075
Vancauwenbergh, S., & Poelmans, H. (2019b). The Flemish
research discipline classification standard: A practical approach.
Knowledge Organisation, 46, 354–363.
DOI: https://doi.org/10.5771/0943-7444-2019-5-354
Waltman, L., & Van Eck, N. J. (2012). A New Methodology for
Constructing a Publication-Level Classification System of Science.
Journal of the American Society for Information Science and
Technology, 63(12), 2378–2392. DOI: https://doi.org/10.1002/asi.22748
Yan, E., Ding, Y., Milojević, S., & Sugimoto, C. R. (2012). Topics in
dynamic research communities: An exploratory study for the field
of information retrieval. Journal of Informetrics, 6(1), 140–153.
DOI: https://doi.org/10.1016/j.joi.2011.10.001
Yau, C.-K., Porter, A., Newman, N., & Suominen, A. (2014).
Clustering scientific documents with topic modeling.
Scientometrics, 100, 767–786.
DOI: https://doi.org/10.1007/s11192-014-1321-8
Zhang, M.-L., Li, Y.-K., Liu, X.-Y., & Geng, X. (2018). Binary relevance
for multi-label learning: an overview. Frontiers of Computer
Science, 12, 191–202.
DOI: https://doi.org/10.1007/s11704-017-7031-7
Zhang, M.-L., & Zhou, Z.-H. (2014). A review on multi-label learning
algorithms. IEEE Transactions on Knowledge and Data
Engineering, 26(8), 1819–1837.
DOI: https://doi.org/10.1109/TKDE.2013.39