RESEARCH ARTICLE
The Microsoft Academic Knowledge Graph
enhanced: Author name disambiguation,
publication classification, and embeddings
Michael Färber and Lin Ao
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Citation: Färber, M., & Ao, L. (2022). The Microsoft Academic Knowledge Graph enhanced: Author name disambiguation, publication classification, and embeddings. Quantitative Science Studies, 3(1), 51–98. https://doi.org/10.1162/qss_a_00183
DOI: https://doi.org/10.1162/qss_a_00183
Peer Review: https://publons.com/publon/10.1162/qss_a_00183
Received: 16 June 2021; Accepted: 16 October 2021
Corresponding Author: Michael Färber, michael.faerber@kit.edu
Handling Editor: Ludo Waltman
Copyright: © 2022 Michael Färber and Lin Ao. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. The MIT Press
Keywords: linked open data, open science, scholarly data, scientific knowledge graph
ABSTRACT
Although several large knowledge graphs have been proposed in the scholarly field, such
graphs are limited with respect to several data quality dimensions such as accuracy and
coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge
Graph (MAKG), a recently published large-scale knowledge graph containing metadata about
scientific publications and associated authors, venues, and affiliations. Based on a qualitative
analysis of the MAKG, we address three aspects. First, we adopt and evaluate unsupervised
approaches for large-scale author name disambiguation. Second, we develop and evaluate
methods for tagging publications by their discipline and by keywords, facilitating enhanced
search and recommendation of publications and associated entities. Third, we compute and
evaluate embeddings for all 239 million publications, 243 million authors, 49,000 journals,
and 16,000 conference entities in the MAKG based on several state-of-the-art embedding
techniques. Finally, we provide statistics for the updated MAKG. Our final MAKG is publicly
available at https://makg.org and can be used for the search or recommendation of scholarly
entities, as well as enhanced scientific impact quantification.
1. INTRODUCTION
In recent years, knowledge graphs have been proposed and made publicly available in the
scholarly field, covering information about entities such as publications, authors, and venues.
They can be used for a variety of use cases: (1) Using the semantics encoded in the knowledge
graphs and RDF as a common data format, which allows easy data integration from different
data sources, scholarly knowledge graphs can be used for providing advanced search and rec-
ommender systems (Noia, Mirizzi et al., 2012) in academia (e.g., recommending publications
(Beel, Langer et al., 2013), citations (Färber & Jatowt, 2020), and data sets (Färber & Leisinger,
2021a, 2021b)). (2) The representation of knowledge as a graph and the interlinkage of entities
of various entity types (e.g., publications, authors, institutions) allows us to propose novel ways to
scientific impact quantification (Färber, Albers, & Schüber, 2021). (3) If scholarly knowledge
graphs model the key content of publications, such as data sets, methods, claims, and research
contributions (Jaradeh, Oelen et al., 2019b), they can be used as a reference point for scientific
knowledge (e.g., claims) (Fathalla, Vahdati et al., 2017), similar to DBpedia and Wikidata in the
case of cross-domain knowledge. In light of the FAIR principles (Wilkinson, Dumontier et al.,
2016) and the overload of scientific information resulting from the increasing publishing rate
in the various fields (Johnson, Watkinson, & Mabe, 2018), one can envision that researchers’
working styles will change considerably over the next few decades (Hoffman, Ibáñez et al., 2018;
Jaradeh, Auer et al., 2019a) and that, in addition to PDF documents, scientific knowledge might
be provided manually or semiautomatically via appropriate forms (Jaradeh et al., 2019b) or auto-
matically based on information extraction on the publications’ full-texts (Färber et al., 2021).
The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019), AMiner (Tang, Zhang
et al., 2008), OpenCitations (Peroni, Dutton et al., 2015), AceKG (Wang, Yan et al., 2018),
and OpenAIRE (OpenAIRE, 2021) are examples of large domain-specific knowledge graphs
with millions or sometimes billions of facts about publications and associated entities, such as
authors, venues, and fields of study. In addition, scholarly knowledge graphs edited by the crowd
(Jaradeh et al., 2019b) and providing scholarly key content (Färber & Lamprecht, 2022; Jaradeh
et al., 2019b) have been proposed. Finally, freely available cross-domain knowledge graphs
such as Wikidata (https://wikidata.org/) provide an increasing amount of information about
the academic world, although not as systematic as the domain-specific offshoots.
The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019) was published in its first
version in 2019 and is peculiar in the sense that (1) it is one of the largest freely available schol-
arly knowledge graphs (over 8 billion RDF triples as of September 2019), (2) it is linked to other
data sources in the Linked Open Data cloud, and (3) it provides metadata for entities that
are—particularly in combination—often missing in other scholarly knowledge graphs (e.g.,
authors, institutions, journals, fields of study, in-text citations). As of June 2020, the MAKG con-
tains metadata for more than 239 million publications from all scientific disciplines, as well as
over 1.38 billion references between publications. As outlined in Section 2.2, since 2019, the
MAKG has already been used in various scenarios, such as recommender systems (Kanakia,
Shen et al., 2019), data analytics, bibliometrics, and scientific impact quantification (Färber,
2020; Färber et al., 2021; Schindler, Zapilko, & Krüger, 2020; Tzitzikas, Pitikakis et al., 2020),
as well as knowledge graph query processing optimization (Ajileye, Motik, & Horrocks, 2021).
Despite its data richness, the MAKG suffers from data quality issues arising primarily due to
the application of automatic information extraction methods from the publications (see further
analysis in Section 2). We highlight as major issues (1) the containment of author duplicates in
the range of hundreds of thousands, (2) the inaccurate and limited tagging (i.e., assignment) of
publications with keywords given by the fields of study (Färber, 2019), and (3) the lack of
embeddings for the majority of MAKG entities, which hinders the development of machine
learning approaches based on the MAKG.
In this article, we present methods for solving these issues and apply them to the MAKG,
resulting in an enhanced MAKG.
First, we perform author name disambiguation on the MAKG’s author set. To this end, we
adopt an unsupervised approach to author name disambiguation that uses the rich publication
representations in the MAKG and that scales for hundreds of millions of authors. We use
ORCID iDs to evaluate our approach.
Second, we develop a method for tagging all publications with fields of study and with a
newly generated set of keywords based on the publications’ abstracts. While the existing
field of study labels assigned to papers are often misleading (see Wang, Shen et al. (2019) and
Section 4) and, thus, often not beneficial for search and recommender systems, the enhanced
field of study labels assigned to publications can be used, for instance, to search for and rec-
ommend publications, authors, and venues, as our evaluation results show.
Third, we create embeddings for all 239 million publications, 243 million authors, 49,000
journals, and 16,000 conference entities in the MAKG. We experimented with various state-of-
the-art embedding approaches. Our evaluations show that the ComplEx embedding method
(Trouillon, Welbl et al., 2016) outperforms other embeddings in all metrics. To the best of our
knowledge, RDF knowledge graph embeddings have not yet been computed for such a large
(scholarly) knowledge graph. For instance, RDF2Vec (Ristoski, Rosati et al., 2019) was trained
on 17 million Wikidata entities. Even DGL-KE (Zheng, Song et al., 2020), a recently published
package optimized for training knowledge graph embeddings at a large scale, was evaluated
on a benchmark with only 86 million entities.
Finally, we provide statistics concerning the authors, papers, and fields of study in the
newly created MAKG. For instance, we analyze the authors’ citing behaviors, the number
of authors per paper over time, and the distribution of fields of study using the disambiguated
author set and the new field of study assignments. We incorporate the results of all mentioned
tasks into a final knowledge graph, which we provide online to the public at https://makg.org
(formerly: http://ma-graph.org) and http://doi.org/10.5281/zenodo.4617285. Thanks to the
disambiguated author set, the new paper tags, and the entity embeddings, the enhanced
MAKG opens the door to improved scholarly search and recommender systems and advanced
scientific impact quantification.
Overall, our contributions are as follows:
▪ We present and evaluate an approach for large-scale author name disambiguation, which
can deal with the peculiarities of large knowledge graphs, such as heterogeneous entity
types and 243 million author entries.
▪ We propose and evaluate transformer-based methods for classifying publications according
to their fields of study based on the publications’ abstracts.
▪ We apply state-of-the-art entity embedding approaches to provide entity embeddings for
243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences,
and evaluate them.
▪ We provide a statistical analysis of the newly created MAKG.
Our implementation for enhancing scholarly knowledge graphs can be found online at
https://github.com/lin-ao/enhancing_the_makg.
The remainder of this article is structured as follows. In Section 2, we describe the MAKG,
along with typical application scenarios and its wide usage in the real world. We also outline
the MAKG’s limitations regarding its data quality, thereby providing our motivation for
enhancing the MAKG. Subsequently, in Sections 3, 4, and 5, we describe in detail our
approaches to author name disambiguation, paper classification, and knowledge graph
embedding computation. In Section 6, we describe the schema of the updated MAKG, infor-
mation regarding the knowledge graph provisioning and statistical key figures of the enhanced
MAKG. We provide a conclusion and give an outlook in Section 7.
2. OVERVIEW OF THE MICROSOFT ACADEMIC KNOWLEDGE GRAPH
2.1. Schema and Key Statistics
We can differentiate between three data sets:

1. the Microsoft Academic Graph (MAG) provided by Microsoft (Sinha, Shen et al., 2015),
2. the Microsoft Academic Knowledge Graph (MAKG) in its original version provided by Färber since 2019 (Färber, 2019), and
3. the enhanced MAKG outlined in this article.
The initial MAKG (Färber, 2019) was derived from the MAG, a database consisting of tab-
separated text files (Sinha et al., 2015). The MAKG is based on the information provided by the
MAG and enriches the content by modeling the data according to linked data principles to gen-
erate a Linked Open Data source (i.e., an RDF knowledge graph with resolvable URIs, a public
SPARQL endpoint, and links to other data sources). During the creation of the MAKG, the data
originating from the MAG is not modified (except for minor tasks, such as data cleaning, linking
locations to DBpedia, and providing sameAs links to DOI and Wikidata). As such, the data quality
of the MAKG is mainly equivalent to the data quality of the MAG provided by Microsoft.
Table 1 shows the number of entities in the MAG as of May 29, 2020. Accordingly, the
MAKG created from the MAG would also exhibit these numbers. This MAKG impresses with
its size: It contains the metadata for 239 million publications (including 139 million abstracts),
243 million authors, and more than 1.64 billion references between publications (see also
https://makg.org/).
It is remarkable that the MAKG contains more authors than publications. The high number of
authors (243 million) appears to be too high given that there were eight million scientists in the
world in 2013 according to UNESCO (Baskaran, 2017). For more information about the increase
in the number of scientists worldwide, we can refer to Shaver (2018). In addition, the number
of affiliations in the MAKG (about 26,000) appears to be relatively low, given that all research
institutions in all fields should be represented and that there exist 20,000 officially accredited
or recognized higher education institutions (World Higher Education Database, 2021).
Compared to a previous analysis of the MAG in 2016 (Herrmannova & Knoth, 2016),
whose statistics would be identical to the MAKG counterpart if it had existed in 2016, the
number of instances has increased for all entity types (including the number of conference
series from 1,283 to 4,468), except for the number of conference instances, which has
dropped from 50,202 to 16,142. An obvious reason for this reduction is the data cleaning pro-
cess as a part of the MAG generation at Microsoft. Although the numbers of journals, authors,
and papers have doubled in size compared to the 2016 version (Herrmannova & Knoth,
2016), the number of conference series and fields of study have nearly quadrupled.
Figure 1 shows how many publications represented in the MAKG have been published per
discipline (i.e., level-0 field of study). Medicine, materials science, and computer science
Table 1. General statistics for MAG/MAKG entities as of June 2020

Key                    # in MAG/MAKG
Papers                 238,670,900
Papers with Link       224,325,750
Papers with Abstract   139,227,097
Authors                243,042,675
Affiliations           25,767
Journals               48,942
Conference Series      4,468
Conference Instances   16,142
Fields of Study        740,460
Figure 1. Number of publications per discipline.
occupy the top positions. This was not always the case. According to the analysis of the MAG
in 2016 (Herrmannova & Knoth, 2016), physics, computer science, and engineering were the
disciplines with the highest numbers of publications. We assume that additional and changing
data sources of the MAG resulted in this change.
Figure 2 presents the overall number of publication citations per discipline. The descending
order of the disciplines is, to a large extent, similar to the descending order of the disciplines
considering their associated publication counts (see Figure 1). However, specific disciplines,
such as biology, exhibit a large publication citation count compared to their publication count,
while the opposite is the case for disciplines such as computer science. The paper citation
count per discipline is not provided by the 2016 MAG analysis (Herrmannova & Knoth, 2016).
Table 2 shows the frequency of instances per subclass of mag:Paper, generated by means
of a SPARQL query using the MAKG SPARQL endpoint. Listing 1 shows an example of how the
MAKG can be queried using SPARQL.
Figure 2. Paper citation count per discipline (i.e., level-0 field of study).

Table 2. Number of publications by document type

Document type   Number
Journal         85,759,950
Patent          52,873,589
Conference      4,702,268
Book chapter    2,713,052
Book            2,143,939
No type given   90,478,102
Listing 1. Querying the top 100 institutions in the area of machine learning according to their
overall number of citations.
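As an illustration of how such a query can be issued programmatically, the following sketch sends a query of this kind to the public MAKG SPARQL endpoint from Python. It is a minimal sketch only: the endpoint URL, the prefixes, and the property names are assumptions for illustration and may deviate from the actual MAKG schema, and the SPARQLWrapper package is assumed to be available.

# Minimal sketch: querying the MAKG SPARQL endpoint from Python.
# Endpoint URL, prefixes, and property names below are illustrative
# assumptions; consult https://makg.org for the actual schema.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX magp: <https://makg.org/property/>        # assumed prefix
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?affiliation (SUM(?citations) AS ?totalCitations)
WHERE {
  ?paper magp:citationCount ?citations ;          # assumed property names
         magp:hasFieldOfStudy ?fos ;
         dcterms:creator ?author .
  ?fos foaf:name "Machine learning" .
  ?author magp:memberOf ?affiliation .
}
GROUP BY ?affiliation
ORDER BY DESC(?totalCitations)
LIMIT 100
"""

def run_query(endpoint: str = "https://makg.org/sparql") -> list:
    """Send the query and return the JSON result bindings."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

if __name__ == "__main__":
    for row in run_query():
        print(row["affiliation"]["value"], row["totalCitations"]["value"])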
2.2. Current Usage and Application Scenarios

The MAKG RDF dumps on Zenodo have been viewed almost 6,000 times and downloaded more than 42,000 times (as of June 15, 2021). As the RDF dumps were also available directly
at https://makg.org/rdf-dumps/ (formerly: http://ma-graph.org/rdf-dumps/) until January 2021,
the 21,725 visits (since April 4, 2019) to this web page are also relevant.
Figures 3, 4, and 5 were created based on the log files of the SPARQL endpoint. They show
the number of SPARQL queries per day, the number of unique users per day, and which user
agents were used to which extent. Given these figures and a further analysis of the SPARQL
endpoint log files, the following facts are observable:
▪ Except for two months, the number of daily requests increased steadily.
▪ The number of unique user agents remained fairly constant, apart from a period between
October 2019 and January 2020.
▪ The frequency of more complex queries (based on query length) is increasing.
Within only one year of its publication in November 2019, the MAKG has been used in
diverse ways by various third parties. Below we list some of them based on citations of the
MAKG publication (Färber, 2019).
2.2.1. Search and recommender systems and data analytics
▪ The MAKG has been used for recommender systems, such as paper recommendation
(Kanakia et al., 2019).
Figure 3. Number of queries.
Figure 4. Number of unique users.
Figure 5. User agents.
▪ Scholarly data is becoming increasingly important for businesses. Due to its large number
of items (e.g., publications, researchers), the MAKG has been discussed as a data source
in enterprises (Schubert, Jäger et al., 2019).
▪ The MAKG has been used by nonprofit organizations for data analytics. For instance,
Nesta uses the MAKG in its business intelligence tools (see https://www.nesta.org.uk and
https://github.com/michaelfaerber/MAG2RDF/issues/1).
▪ As a unique data source for scholarly data, the MAKG has been used as one of several
publicly available knowledge graphs to build a custom domain-specific knowledge
graph that considers specific domains of interest (Qiu, 2020).
2.2.2. Bibliometrics and scientific impact quantification
▪ The Data Set Knowledge Graph (Färber & Lamprecht, 2022) provides information about
data sets as linked open data source and contains links to MAKG publications in which
the data sets are mentioned. Utilizing the publications’ metadata in the MAKG allows
researchers to employ novel methods for scientific impact quantification (e.g., working
on an “h-index” for data sets).
▪ SoftwareKG (Schindler et al., 2020) is a knowledge graph that links about 50,000 scientific
articles from the social sciences to the software mentioned in those articles. The knowl-
edge graph also contains links to other knowledge graphs, such as the MAKG. In this way,
the SoftwareKG provides the means to assess the current state of software usage.
▪ Publications modeled in the MAKG have been linked to the GitHub repositories contain-
ing the source code associated with the publications (Färber, 2020). For instance, this
facilitates the detection of trends on the implementation level and monitoring of how
the FAIR principles are followed by which people (e.g., considering who provides the
source code to the public in a reproducible way).
▪ According to Tzitzikas et al. (2020), the scholarly data of the MAKG can be used to measure institutions’ research output.
▪ In Färber et al. (2021), an approach for extracting scientific methods and data sets used
by the authors is presented. The extracted methods and data sets are linked to the pub-
lications in the MAKG enabling novel scientific impact quantification tasks (e.g., mea-
suring how often which data sets and methods have been reused by researchers) and the
recommendation of methods and data sets. Overall, linking the key content of scientific
publications as modeled in knowledge graphs or integrating such information into the
MAKG can be considered as a natural extension of the MAKG in the future.
▪ The MAKG has inspired other researchers to use it in the context of data-driven history of
science (see https://www.downes.ca/post/69870), (i.e., for science of science [Fortunato,
Bergstrom et al., 2018]).
▪ Daquino, Peroni et al. (2020) present the OpenCitations data model and evaluate the
representation of citation data in several knowledge graphs, such as the MAKG.
2.2.3. Benchmarking
▪ As a very large RDF knowledge graph, the MAKG has served as a data set for evaluating
novel approaches to streaming partitioning of RDF graphs (Ajileye et al., 2021).
2.3. Current Limitations
Based on the statistical analysis of the MAKG and the analysis of the usage scenarios of the
MAKG so far, we have identified the following shortcomings:
▪ Author name disambiguation is apparently one of the most pressing needs for enhancing
the MAKG.
▪ The assigned fields of study associated with the papers in the MAKG are not accurate
(e.g., architecture), and the field of study hierarchy is quite erroneous.
▪ The use cases of the MAKG show that the MAKG has not been used extensively for
machine learning tasks. So far, only entity embeddings for the MAKG as of 2019 con-
cerning the entity type paper are available, and these have not been evaluated. Thus, we
perceive a need to provide state-of-the-art embeddings for the MAKG covering many
instance types, such as papers, authors, journals, and conferences.
3. AUTHOR NAME DISAMBIGUATION
3.1. Motivation
The MAKG is a highly comprehensive data set containing more than 243 million author enti-
ties alone. As is the case with any large database, duplicate entries cannot be easily avoided
(Wang, Shen et al., 2020). When adding a new publication to the database, the maintainers
must determine whether the authors of the new paper already exist within the database or if a
new author entity is to be created. This process is highly susceptible to errors, as certain names
are common. Given a large enough sample size, it is not rare to find multiple people with
identical surnames and given names. Thus, a plain string-matching algorithm is not sufficient
for detecting duplicate authors. Table 3 showcases the 10 most frequently occurring author
names in the MAKG to further emphasize the issue, using the December 2019 version of
the MAKG for this analysis. All author names are of Asian origin. While it is true that romanized
Asian names are especially susceptible to causing duplicate entries within a database (Roark,
Wolf-Sonkin et al., 2020), the problem is not limited to any geographical or cultural origin and
is, in fact, a common problem shared by Western names as well (Sun, Zhang et al., 2017).
The goal of the author name disambiguation task is to identify the maximum number of
duplicate authors, while minimizing the number of “false positives”; that is, it aims to limit the
number of authors classified as duplicates even though they are distinct persons in the real world.
In Section 3.2, we dive into the existing literature concerning author name disambiguation
and, more generally, entity resolution. In Section 3.3, we define our problem formally. In
Section 3.4, we introduce our approach, and we present our evaluation in Section 3.5. Finally,
we conclude with a discussion of our results and lessons learned in Section 3.6.
Table 3. Most frequently occurring author names in the MAKG

Author name   Frequency
Wang Wei      20,235
Zhang Wei     19,944
Li Li         19,049
Wang Jun      16,598
Li Jun        15,975
Li Wei        15,474
Wei Wang      14,020
Liu Wei       13,578
Zhang Jun     13,553
Wei Zhang     13,366
3.2. Related Work
3.2.1. Entity resolution
Entity resolution is the task of identifying and removing duplicate entries in a data set that refer
to the same real-world entity. This problem persists across many domains and, ironically, is
itself affected by duplicate names: “object identification” in computer vision, “coreference res-
olution” in natural language processing, “database merging,” “merge/purge processing,”
“deduplication,” “data alignment,” or “entity matching” in the database domain, and “entity
resolution” in the machine learning domain (Maidasani, Namata et al., 2012). The entities to
be resolved are either part of the same data set or may reside in multiple data sources.
Newcombe, Kennedy et al. (1959) were the first ones to define the entity linking problem,
which was later modeled mathematically by Fellegi and Sunter (1969). They derived a set of
formulas to determine the probabilities of two entities being “matching” based on given precon-
ditions (i.e., similarities between feature pairs). Later studies refer to the probabilistic formulas
as equivalent to a naïve Bayes classifier (Quass & Starkey, 2003; Singla & Domingos, 2006).
Generally speaking, there exist two approaches to dealing with entity resolution (Wang, Li
et al., 2011). In statistics and machine learning, the task is formulated as a classification problem,
in which all pairs of entries are compared to each other and classified as matching or non-
matching by an existing classifier. In the database community, a rule-based approach is usually
used to solve the task. Rule-based approaches can often be transformed into probabilistic
classifiers, such as naïve Bayes, and require certain prior domain knowledge for their setup.
3.2.2. Author name disambiguation
Author name disambiguation is a subcategory of entity resolution and is performed on collec-
tions of authors. Table 4 provides an overview of papers specifically approaching the task of
author name disambiguation in the scholarly field in the last decade.
Ferreira, Gonçalves, and Laender (2012) surveyed existing methods for author name disam-
biguation. They categorized existing methods by their types of approach, such as author
grouping or author assignment methods, as well as their clustering features, such as citation
information, web information, or implicit evidence.
Caron and van Eck (2014) applied a strict set of rules for scoring author similarities, such as
100 points for identical email addresses. Author pairs scoring above a certain threshold are
classified as identical. Although the creation of such a rule set requires specific domain knowl-
edge, the approach is still very simplistic in nature compared to other supervised learning
approaches. In addition, it outperforms other clustering-based unsupervised approaches signif-
icantly (Tekles & Bornmann, 2019). For these reasons, we base our approach on the one pre-
sented in their paper.
3.3. Problem Formulation
Existing papers usually aim to introduce a new fundamental approach to author name disam-
biguation and do not focus on the general applicability of their approaches. As a result, these
approaches are often impractical when applied to a large data set. For example, some
clustering-based approaches require the prior knowledge of the number of clusters (Sun et al.,
2017) and other approaches require the pairwise comparison of all entities (Qian et al., 2015),
whereas some require external information gathered through web queries (Pooja et al., 2018),
which cannot be feasibly done when dealing with millions of entries, as the inherent bottleneck
of web requests greatly limits the speed of the overall processes.

Table 4. Approaches to author name disambiguation in 2011–2021

Authors                             Year   Approach                                                                             Supervised
Pooja, Mondal, and Chandra (2020)   2020   Graph-based combination of author similarity and topic graph                        ✗
Wang, Wang et al. (2020)            2020   Adversarial representation learning                                                  ✓
Kim, Kim, and Owen-Smith (2019)     2019   Matching email address, self-citation and coauthorship with iterative clustering    ✗
Zhang, Xinhua, and Pan (2019)       2019   Hierarchical clustering with edit distances                                          ✗
Ma, Wang, and Zhang (2019)          2019   Graph-based approach                                                                 ✗
Kim, Rohatgi, and Giles (2019)      2019   Deep neural network                                                                  ✓
Zhang, Yan, and Zheng (2019)        2019   Graph-based approach and clustering                                                  ✗
Zhang et al. (2019)                 2019   Molecular cross clustering                                                           ✗
Xu, Li et al. (2018)                2018   Combination of single features                                                       ✓
Pooja, Mondal, and Chandra (2018)   2018   Rule-based clustering                                                                ✗
Sun et al. (2017)                   2017   Multi-level clustering                                                               ✗
Lin, Zhu et al. (2017)              2017   Hierarchical clustering with combination of similarity metrics                       ✗
Müller (2017)                       2017   Neural network using embeddings                                                      ✓
Kim, Khabsa, and Giles (2016)       2016   DBSCAN with random forest                                                            ✗
Momeni and Mayr (2016)              2016   Clustering based on coauthorship                                                     ✗
Protasiewicz and Dadas (2016)       2016   Rule-based heuristic, linear regression, support vector machines and AdaBoost       ✓
Qian, Zheng et al. (2015)           2015   Support vector machines                                                              ✓
Tran, Huynh, and Do (2014)          2014   Deep neural network                                                                  ✓
Caron and van Eck (2014)            2014   Rule-based scoring                                                                   ✗
Schulz, Mazloumian et al. (2014)    2014   Pairwise comparison and clustering                                                   ✗
Kastner, Choi, and Jung (2013)      2013   Random forest, support vector machines and clustering                               ✓
Wilson (2011)                       2011   Single layer perceptron                                                              ✓

Therefore, instead of choosing
a single approach, we aim to select features from different models and combine them to fit to
our target data set containing millions of author names.
We favor the use of unsupervised learning for the reasons mentioned above: lack of training
data, lack of need for maintaining and updating of training data, and generally more favorable
time and space complexity. Thus, in our approach, we chose the hierarchical agglomerative
clustering algorithm (HAC). We formulate the problem as follows.
Let A = {a1, a2, a3, …, an} be a set of n authors, where ai represents an individual entry in the
data set. Each individual author ai consists of m features (i.e., ai = {ai1, ai2, ai3, …, aim}),
where aik is the kth feature of the ith author. The goal of our approach is to eliminate duplicate
entries in the data set that describe the same real-world entity, in this case the same person. To
this end, we introduce a matching function f that determines whether two given input entities
are “matching” (i.e., describe the same real-world person) or “nonmatching” (i.e., describe two
distinct people). Given an input of two authors ai and aj, the function returns the following:
$$
f(a_i, a_j) = \begin{cases} 1 & \text{if } a_i \text{ and } a_j \text{ refer to the same real-world entity (i.e., are ``matching'')} \\ 0 & \text{if } a_i \text{ and } a_j \text{ refer to different real-world entities (i.e., are ``nonmatching'')} \end{cases}
$$

The goal of our entity resolution task is therefore to reduce the given set of authors A to a subset
Ã where ∀ai, aj ∈ Ã: f(ai, aj) = 0.
3.4. Approach
We follow established procedures from existing research for unsupervised author name disambigua-
tion (Caron & van Eck, 2014; Ferreira et al., 2012) and utilize a two-part approach consisting of
pairwise similarity measurement using author and paper metadata and clustering. Additionally, we
use blocking (see Section 3.4.3) to reduce the complexity considerably. Figure 6 shows the entire
system used for the author name disambiguation process. The system’s steps are as follows:
1. Preprocessing. We preprocess the data by aggregating all relevant information (e.g.,
concerning authors, publications, and venues) into one single file for easier access.
We then sort our data by author name for the final input.
2. Disambiguation. We apply blocking to significantly reduce the complexity of the task.
We then use hierarchical agglomerative clustering with a rule-based binary classifier as
our distance function to group authors into distinct disambiguated clusters.
3. Postprocessing. We aggregate the output clusters into our final disambiguated author set.
Below, the most important aspects of these steps are outlined in more detail.
3.4.1. Feature selection
We use both author and publication metadata for disambiguation. We choose the features
based on their availability in the MAKG and on their previous use in similar works from
Table 4. Overall, we use the following features:
▪ Author name: This is not used explicitly for disambiguation, but rather as a feature for
blocking to reduce the complexity of the overall algorithm.
▪ Affiliation: This determines whether two authors share a common affiliation.
▪ Coauthors: This determines whether two authors share common coauthors.
▪ Titles: This calculates the most frequently used keywords in each author’s published titles
in order to determine common occurrences.
▪ Years: This compares the time frame in which authors published works.
▪ Journals and conferences: These compare the journals and conferences where each author published.
▪ References: This determines whether two authors share common referenced publications.

Figure 6. Author name disambiguation process.
Although email has proven to be a highly effective distinguishing feature for author name
disambiguation (Caron & van Eck, 2014; Kim, 2018; Schulz et al., 2014), this information is
not available to us directly and therefore omitted from our setup. Coauthorship, on the other
hand, is one of the most important features for author name disambiguation (Han, Giles et al.,
2004). Affiliation could be an important feature, though we could not rely solely on it, as
researchers often change their place of work. In addition, as the affiliation information is auto-
matically extracted from the publications, it might be on varying levels (e.g., department vs.
university) and written in different ways (e.g., full name vs. abbreviation). Journals and confer-
ences could be effective features, as many researchers tend to publish in places familiar to
them. For a similar reason, references can be an effective measure as well.
3.4.2. Binary classifier
We adapt a rule-based binary classifier as seen in the work of Caron and van Eck (2014). We
choose this type of classifier because of its simplicity, interpretability, and scalability.
The unsupervised approach does not require any training data and is therefore well suited for
our situation. Furthermore, it is easily adapted and fine-tuned to achieve the best performance
based on our data set. Its lack of necessary training time, as well as fast run time, makes it ideal
when working with large-scale data sets containing millions of authors.
The binary classifier uses as input two feature vectors representing two author entities.
Given two authors ai, aj, each consisting of m features ai = {ai1, ai2, ai3, …, aim}, the similarity
sim(ai, aj) between these two authors is the sum of similarities between each of their respective
features where simk is the similarity between the kth feature of two authors.
$$
\mathrm{sim}(a_i, a_j) = \sum_{k=1}^{m} \mathrm{sim}_k(a_{ik}, a_{jk})
$$
The classifier then compares the similarity sim(ai, aj) with a predetermined threshold
θmatching to determine whether two authors are “matching” or “nonmatching.” Our classifier
function takes the following shape:
$$
f(a_i, a_j) = \begin{cases} 1 & \text{if } \mathrm{sim}(a_i, a_j) \geq \theta_{\mathrm{matching}} \\ 0 & \text{if } \mathrm{sim}(a_i, a_j) < \theta_{\mathrm{matching}} \end{cases}
$$
For each feature, the similarity function consists of rule-based scoring. Below, we briefly
describe how similarities between each individual feature are calculated.
1. For features with one individual value, as is the case with affiliations because it does not
record historical data, the classifier determines whether both entries match and assigns
a fixed score saffiliation.
$$
\mathrm{sim}_{\mathrm{affiliation}}(a_i, a_j) = \begin{cases} s_{\mathrm{affiliation}} & \text{if } a_{i,\mathrm{affiliation}} = a_{j,\mathrm{affiliation}} \\ 0 & \text{else} \end{cases}
$$
2. For other features consisting of multiple values such as coauthors, the classifier deter-
mines the intersection of both value sets. Here, we assign scores using a stepping func-
tion (i.e., fixed scores for an intersection of one, two, three, etc.).
The following formula represents the similarity function for calculating similarities
between two authors for the feature coauthors, though the same formula holds for fea-
tures journals, conferences, titles, and references with their respective values.
$$
\mathrm{sim}_{\mathrm{coauthors}}(a_i, a_j) = \begin{cases} s_{\mathrm{coauthors},1} & \text{if } |a_{i,\mathrm{coauthors}} \cap a_{j,\mathrm{coauthors}}| = 1 \\ s_{\mathrm{coauthors},2} & \text{if } |a_{i,\mathrm{coauthors}} \cap a_{j,\mathrm{coauthors}}| = 2 \\ s_{\mathrm{coauthors},3} & \text{if } |a_{i,\mathrm{coauthors}} \cap a_{j,\mathrm{coauthors}}| \geq 3 \\ 0 & \text{else} \end{cases}
$$
Papers’ titles are a special case for scoring, as they must be numericalized to allow a
comparison. Ideally, we would use a form of word embeddings to measure the true
semantic similarity between two titles, but, based on the results of preliminary experi-
ments, we did not find it worth doing, as the added computation necessary would be
significant and would most likely not translate directly into huge performance
increases. We therefore adopt a plain surface-form string comparison. Specifically,
we extract the top 10 most frequently used words from the tokenized and lemmatized
titles of works published by an author and calculate their intersection with the set of
another author.
3. A special case exists for the references feature. A bonus score sself-reference is applied to
the case of self-referencing, that is if two compared authors directly reference each
other in their respective works, as can be seen in the work of Caron and van Eck (2014).
4. For some features, such as journals and conferences, a large intersection between two
authors may be uncommon. We only assign a nonzero value if both items share a com-
mon value.
$$
\mathrm{sim}_{\mathrm{journals}}(a_i, a_j) = \begin{cases} s_{\mathrm{journals}} & \text{if } |a_{i,\mathrm{journals}} \cap a_{j,\mathrm{journals}}| \geq 1 \\ 0 & \text{else} \end{cases}
$$
5. Other features such as publication year also consist of multiple values, though we inter-
pret them as extremes of a time span. Based on their feature values, we construct a time
span for each author in which they were active and check for overlap in active years
when comparing two authors (similar to Qian et al. (2015)). Again, a fixed score is
assigned based on the binary decision. For example, if author A published papers in
2002, 2005, and 2009, we extrapolate the active research period for author A as
2002–2009. If another author B was active during the same time period or within 10
years of both ends of the time span (i.e., 1992–2019), we assign a score syears as the
output. We expect most author comparisons to share an overlap in research time span
and thus receive a score of greater than zero. Therefore, this feature is more aimed at
“punishing” obvious nonmatches. The scoring function takes the following shape:
$$
\mathrm{sim}_{\mathrm{years}}(a_i, a_j) = \begin{cases} s_{\mathrm{years}} & \text{if } a_i \text{ and } a_j \text{ were active within 10 years of one another} \\ 0 & \text{else} \end{cases}
$$
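To make the scoring concrete, the following sketch implements the classifier described above in Python, using the high-precision values from Table 5. The data structures (dictionaries holding sets of coauthors, title keywords, journals, conferences, references, and paper IDs) and the handling of missing values are illustrative assumptions rather than the exact implementation.

# Sketch of the rule-based binary classifier; scores follow the
# high-precision setup in Table 5, data structures are illustrative.

THETA_MATCHING = 10  # decision threshold theta_matching from Table 5

def step_score(overlap: int, s1: float, s2: float, s3: float) -> float:
    """Stepping function: fixed scores for an intersection of 1, 2, or >= 3."""
    if overlap >= 3:
        return s3
    if overlap == 2:
        return s2
    if overlap == 1:
        return s1
    return 0.0

def similarity(a: dict, b: dict) -> float:
    """Sum of per-feature similarities between two author records.

    Each record is assumed to hold sets (coauthors, title_words, journals,
    conferences, references, papers), an affiliation ID, and the first and
    last publication year.
    """
    score = 0.0
    if a["affiliation"] and a["affiliation"] == b["affiliation"]:
        score += 1                                              # s_affiliation
    score += step_score(len(a["coauthors"] & b["coauthors"]), 3, 5, 8)
    score += step_score(len(a["title_words"] & b["title_words"]), 3, 5, 8)
    if a["journals"] & b["journals"]:
        score += 3                                              # s_journals
    if a["conferences"] & b["conferences"]:
        score += 3                                              # s_conferences
    # Active-years overlap, with a 10-year margin on both ends of the span.
    if a["first_year"] - 10 <= b["last_year"] and b["first_year"] - 10 <= a["last_year"]:
        score += 3                                              # s_years
    score += step_score(len(a["references"] & b["references"]), 2, 3, 5)
    if a["papers"] & b["references"] or b["papers"] & a["references"]:
        score += 8                                              # s_self-references bonus
    return score

def is_match(a: dict, b: dict) -> bool:
    """Binary classifier f(a_i, a_j): match iff the summed similarity reaches the threshold."""
    return similarity(a, b) >= THETA_MATCHING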
3.4.3. Blocking
Due to the high complexity of traditional clustering algorithms (e.g., O(n²)), a blocking mechanism is needed to improve the scalability of the algorithm and accommodate large amounts of input data. We implement sorted neighborhood (Hernández & Stolfo, 1995) as a blocking mechanism. We sort authors based on their names as provided by the MAKG and measure name similarity using the Jaro-Winkler distance (Jaro, 1989; Winkler, 1999), which performs well for name-matching tasks while also being a fast heuristic (Cohen, Ravikumar, & Fienberg, 2003).
The Jaro-Winkler similarity returns values between 0 and 1, where a greater value signifies a
closer match. We choose 0.95 as the threshold θblocking, based on performance on our eval-
uation data set, and we choose 0.1 as the standard value for the scaling factor p. Similar names
will be formed into blocks where we perform pairwise comparison and cluster authors that
were classified as similar by our binary classifier.
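One plausible reading of this blocking step is sketched below. The jellyfish package is assumed for the Jaro-Winkler similarity, and the maximum block size of 500 (introduced in Section 3.5) is made explicit; the exact rule for closing a block is our interpretation, not necessarily the original implementation.

# Sketch of sorted-neighborhood blocking over author names.
import jellyfish

THETA_BLOCKING = 0.95   # similarity threshold for staying in the same block
MAX_BLOCK_SIZE = 500    # cap on block size to bound the number of comparisons

def build_blocks(authors):
    """Sort authors by name and group adjacent, near-identical names into blocks."""
    authors = sorted(authors, key=lambda a: a["name"])
    blocks, current = [], []
    for author in authors:
        similar = (current and jellyfish.jaro_winkler_similarity(
            current[-1]["name"], author["name"]) >= THETA_BLOCKING)
        if similar and len(current) < MAX_BLOCK_SIZE:
            current.append(author)
        else:
            if current:
                blocks.append(current)
            current = [author]
    if current:
        blocks.append(current)
    return blocks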
3.4.4. Clustering
The final step of our author name disambiguation approach consists of clustering the authors.
To this end, we choose the traditional hierarchical agglomerative clustering approach. We
generate all possible pairs between authors for each block and apply our binary classifier to
distinguish matching and nonmatching entities. We then aggregate the resulting disambigu-
ated blocks and receive the final collection of unique authors as output.
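Within one block, this corresponds to comparing all author pairs with the binary classifier and transitively merging every pair labeled as matching, which amounts to single-link agglomeration. The following sketch spells this out with a small union-find structure; the classifier is passed in as a callable, and the data structures are again illustrative assumptions.

# Sketch: cluster the authors of one block by transitively merging all pairs
# that the binary classifier labels as matching (single-link merging).
from itertools import combinations

def cluster_block(block, is_match):
    """Return groups of author records judged to refer to the same person."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in combinations(range(len(block)), 2):
        if is_match(block[i], block[j]):
            union(i, j)

    clusters = {}
    for idx, author in enumerate(block):
        clusters.setdefault(find(idx), []).append(author)
    return list(clusters.values())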
3.5. Evaluation
3.5.1. Evaluation data
The MAKG contains bibliographical data on scientific publications, researchers, organiza-
tions, and their relationships. We use the version published in December 2019 for evaluation,
though our final published results were computed on an updated version (with only minor
changes) from June 2020 consisting of 243,042,675 authors.
Table 5. Hyperparameter values for high precision setup

Hyperparameter      Value
s_affiliation       1
s_coauthors1        3
s_coauthors2        5
s_coauthors3        8
s_titles1           3
s_titles2           5
s_titles3           8
s_journals          3
s_conferences       3
s_years             3
s_references1       2
s_references2       3
s_references3       5
s_self-references   8
θ_matching          10
θ_blocking          0.95
p                   0.1
3.5.2. Evaluation setup
For the evaluation, we use ORCID, a persistent digital identifier for researchers, as a ground truth, following Kim (2019). ORCID iDs have been established as a common way to identify researchers. Although ORCID is still in the process of being adopted, it is already widely used: more than 7,000 journals already collect ORCID iDs from authors (see https://info.orcid.org/requiring-orcid-in-publications/). Our ORCID evaluation set consists of 69,742 author entities.
Although we use ORCID as a ground truth, we are aware that this data set may be charac-
terized by imbalanced metadata. First of all, ORCID became widely adopted only a few years
ago. Thus, primarily author names from publications published in recent years are considered
in our evaluation. Furthermore, we can assume that ORCID is more likely to be used by active
researchers with a comparatively higher number of publications and that the more publica-
tions’ metadata we have available for one author, the higher the probability is for a correct
author name disambiguation.
We set the parameters as given in Table 5. We refer to these as the high precision config-
uration. These values were chosen based on choices in other similar approaches (Caron & van
Eck, 2014) and adjusted through experimentation with our evaluation data as well as an analysis
of the relevance of each individual feature (see Section 3.5.3).
We rely on the traditional metrics of precision, recall, and accuracy for our evaluation.
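With TP, TN, FP, and FN denoting the true positive, true negative, false positive, and false negative counts of the pairwise classification, these metrics are defined as

$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
$$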
3.5.3. Evaluation results
Due to blocking, the total number of pairwise comparisons was reduced from 2,431,938,411
to 1,475. Out of them, 49 pairs were positive according to our ORCID labels (i.e., they refer to
the same real-world person); the other 1,426 were negative. Full classification results can be
found in Table 6. We have a heavily imbalanced evaluation set, with a majority of pairings
being negative. Nevertheless, we were able to correctly classify the majority of negative labels
(1,424 out of 1,426). The great number of false negative classifications is immediately notice-
able. This is due to the selection of features or lack of distinguishing features overall to classify
certain difficult pairings.
We have therefore chosen to opt for a high percentage of false negatives to minimize the
amount of false positive classifications, as those are tremendously more damaging to an author
disambiguation result.
Table 7 showcases the average scores for each feature separated into each possible category
of outcome. For example, the average score for the feature titles from all comparisons falling
under the true positive class was 0.162, and the average score for the feature years for compar-
isons from the true negative class was 2.899. Based on these results, journals and references play
a significant role in identifying duplicate author entities within the MAKG; that is, they contribute
high scores for true positives and true negatives. Every single author pair from the true positive
Table 6. Confusion matrix of the high-precision setup

                          Positive label   Negative label   Total
Positive classification   37               2                39
Negative classification   12               1,424            1,436
Total                     49               1,426            1,475
Table 7. Average disambiguation score per feature for high precision setup (TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative)

                   TP      TN      FP    FN
s_affiliation      0.0     0.004   0.0   0.083
s_coauthors        0.0     0.0     0.0   0.0
s_titles           0.162   0.0     0.0   0.25
s_years            3.0     2.89    3.0   3.0
s_journals         3.0     0.034   3.0   1.75
s_conferences      3.0     2.823   3.0   3.0
s_self-reference   0.0     0.0     0.0   0.0
s_references       2.027   0.023   2.0   0.167
classification cluster shared a common journal value, whereas almost none from the true neg-
ative class did. Similar observations can be made for the feature references as well.
Our current setup results in a precision of 0.949, recall of 0.755 and an accuracy of 0.991.
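These values follow directly from the confusion matrix in Table 6:

$$
\mathrm{Precision} = \frac{37}{37 + 2} \approx 0.949, \qquad \mathrm{Recall} = \frac{37}{37 + 12} \approx 0.755, \qquad \mathrm{Accuracy} = \frac{37 + 1{,}424}{1{,}475} \approx 0.991.
$$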
By varying the scores assigned by each feature level distance function, we can affect the
focus of the entire system from achieving a high level of precision to a high level of recall.
To improve our relatively poor recall value, we have experimented with different setups for
distance scores. At high performance levels, a tradeoff persists between precision and recall.
By applying changes to score assignment as seen in Table 8, we arrive at the results in Table 9.
Table 8. Updated disambiguation scores for high recall setup

                    High precision   High recall
s_affiliation       1                5
s_coauthors,1       3                3
s_coauthors,2       5                5
s_coauthors,3       8                8
s_titles,1          3                3
s_titles,2          5                5
s_titles,3          8                8
s_years             3                3
s_journals          3                4
s_conferences       3                4
s_self-references   8                8
s_references,1      2                2
s_references,2      3                3
s_references,3      5                5
Table 9. Confusion matrix for the high recall setup

                          Positive label   Negative label   Total
Positive classification   45               13               58
Negative classification   4                1,413            1,417
Total                     49               1,426            1,475
We were able to increase the recall from 0.755 to 0.918. At the same time, our precision
plummeted from the original 0.949 to 0.776. As a result, the accuracy stayed at a similar level
of 0.988. The exact confusion matrix can be found in Table 9. With our new setup, we were
able to identify the majority of all duplicates (45 out of 49), though at the cost of a significant
increase in the number of false positives (from 2 to 13). By further analyzing the exact reason-
ing behind each type of classification through analysis of individual feature scores in Table 10,
we can see that the true positive and false positive classifications result from the same feature
similarities, therefore creating a theoretical upper limit to the performance of our specific
approach and data set. We hypothesize that additional external data may be necessary to
exceed this upper limit of performance.
We must consider the heavily imbalanced nature of our classification labels when evalu-
ating the results to avoid falling into the trap of the “high accuracy paradox”: that is, the result-
ing high accuracy score of a model on highly imbalanced data sets, where negative labels
significantly outnumber positive labels. The model’s favorable ability to predict the true neg-
atives outweighs its shortcomings for identifying the few positive labels.
Ultimately, we decided to use the high-precision setup to create the final knowledge graph,
as precision is a much more meaningful metric for author name disambiguation as opposed to
recall. It is often preferable to avoid removing nonduplicate entities rather than identifying all
duplicates at the cost of false positives.
We also analyzed the average feature density per author in the MAKG and the ORCID eval-
uation data set to gain deeper insight into the validity of our results. Feature density here refers
to the average number of data entries within an individual feature, such as the number of
papers for the feature “published papers.” The results can be found in Table 11.
Table 10. Average disambiguation score per feature for the high recall setup (TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative). As we consider the scores for disambiguation and not the confusion matrix for the classification, values can be greater than 1.

                       TP      TN      FP      FN
score_affiliation      0.111   0.004   1.538   0.0
score_coauthors        0.0     0.0     0.0     0.0
score_titles           0.133   0.0     0.0     0.75
score_years            3.0     2.89    3.0     3.0
score_journals         3.911   0.023   3.077   0.0
score_conferences      4.0     3.762   4.0     4.0
score_self-reference   0.0     0.0     0.0     0.0
score_references       1.667   0.023   0.308   0.5
Table 11. Comparison between the overall MAKG and the evaluation set

                         MAKG     Evaluation
AuthorID                 1.0      1.0
Rank                     1.0      1.0
NormalizedName           1.0      1.0
DisplayName              1.003    1.0
LastKnownAffiliationID   0.172    0.530
PaperCount               1.0      1.0
CitationCount            1.0      1.0
CreateDate               1.0      1.0
PaperID                  2.612    1.196
DOI                      1.240    1.0
Coauthors                11.187   4.992
Titles                   2.620    1.198
Year                     1.528    1.107
Journal                  0.698    0.819
Conference               0.041    0.025
References               20.530   26.590
ORCID                    0.0003   1.0
As we can observe, there is a variation in “feature richness” between the evaluation set and
the overall data set. However, for the most important features used for disambiguation—
namely journals, conferences, and references—the difference is not as pronounced. Therefore,
we can assume that the disambiguation results will not be strongly affected by this variation.
Performing our author name disambiguation approach on the whole MAKG containing
243,042,675 authors (MAKG version from June 2020) resulted in a reduced set of
151,355,324 authors. This is a reduction by 37.7% and shows that applying author name dis-
ambiguation is highly beneficial.
Importantly, we introduced a maximum block size of 500 in our final approach. Without it,
the number of authors grouped into the same block would theoretically be unlimited. The
introduction of a limit to block size further improves performance significantly, reducing the
runtime from over a week down to about 48 hours, using an Intel Xeon E5-2660 v4 processor
and 128 GB of RAM. We have therefore opted to keep the limit, as the tradeoff in performance
decrease is manageable and as we aimed to provide an approach for real application rather
than a proof of concept. However, the limit can be easily removed or adjusted.
3.6. Discussion
Due to the high number of authors with identical names within the MAG and, thus, the MAKG,
our blocking algorithm sometimes still generates large blocks with more than 20,000 authors.
The number of pairwise classifications necessary equates to the number of combinations, namely $\binom{n}{2}$, leading to high computational complexity for larger block sizes. One way of dealing with this issue would be to manually limit the maximum number of entities within one block, as we have done. Doing so will split potential duplicate entities into distinct blocks, meaning they will never be subject to comparison by the binary classifier, although the entire process may be sped up significantly depending on the exact size limit selected. To highlight the challenge, Table 12 showcases the author names with the largest block sizes created by our blocking algorithm (i.e., author names generating the most complexity). For the name block of “Wang Wei”, the total number of comparisons with no block size limit would be $\binom{20{,}235}{2} = 204{,}717{,}495$, compared to $40 \times \binom{500}{2} + \binom{235}{2} = 5{,}017{,}495$ comparisons with a block size limit of 500 authors. We have found the difference in performance to be negligible compared to the total amount of duplicate authors found, as it differs by less than 2 million authors compared to the almost 100 million duplicate authors found.
Our approach can be further optimized through hand-crafted rules for dealing with certain
author names. Names of certain origins, such as Chinese or Korean names, possess certain
nuances. While the alphabetized Romanized forms of two Chinese names may be similar
or identical, the original language text often shows a distinct difference. Furthermore, under-
standing the composition of surnames and given names in this case may also help further
reduce the complexity. As an example, the names “Zhang Lei” and “Zhang Wei” only differ
by a single character in their Romanized forms and would be classified as potential dupli-
cates or typos due to their similarity, even though for native Chinese speakers such names
signify two distinctly separate names, especially when written in the original Chinese charac-
ter form. Chinese research publications have risen in number in recent years (Johnson et al.,
2018). Given their susceptibility to creating duplicate entries as well as their significant
Table 12. Largest author name blocks during disambiguation

Author name    Block size
Wang Wei       20,235
Zhang Wei      19,944
Li Li          19,049
Wang Jun       16,598
Li Jun         15,975
Li Wei         15,474
Wei Wang       14,020
Liu Wei        13,580
Zhang Jun      13,553
Wei Zhang      13,366
presence in the MAKG already, future researchers would be well advised to treat this problem as a focal point.
Additionally, there is the possibility to apply multiple classifiers and combine their results in
a hybrid approach. If we were able to generate training data of sufficient volume and quality,
we would be able to apply certain supervised learning approaches, such as neural networks or
support vector machines, using our generated feature vectors as input.
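As a sketch of such a hybrid, supervised variant (not something we have implemented), a support vector machine could be trained on the pairwise feature vectors if labeled author pairs were available; the feature vectors and labels below are random placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder pairwise feature vectors (e.g., name similarity, shared journals,
# shared conferences, overlapping references) and duplicate/non-duplicate labels.
rng = np.random.default_rng(42)
X = rng.random((1000, 6))
y = rng.integers(0, 2, size=1000)

clf = SVC(kernel="rbf", probability=True)
clf.fit(X, y)

# Estimated probability that a new author pair refers to the same person.
pair_features = rng.random((1, 6))
print(clf.predict_proba(pair_features)[0, 1])
```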
4. FIELD OF STUDY CLASSIFICATION
4.1. Motivation
Publications modeled in the MAKG are assigned to specific fields of study. Additionally, the
fields of study are organized in a hierarchy. In the MAKG as of June 2020, 709,940 fields of
study are organized in a multilevel hierarchical system (see Table 13). Both the field of study
paper assignments and the field of study hierarchy in the MAKG originate from the MAG data
provided by Microsoft Research. The entire classification scheme is highly comprehensive and
covers a huge variety of research areas, but the labeling of papers contains many shortcom-
ings. Thus, the second task in this article for improving the MAKG is the revision of field of
study assignment of individual papers.
Many of the higher-level fields of study in the hierarchical system are highly specific, and
therefore lead to many misclassifications purely based on certain matching keywords in the
paper’s textual information. For instance, papers on the topic of machine learning architecture
are sometimes classified as “Architecture.” Because the MAG does not contain any full texts of
papers, but is limited to the titles and abstracts only, we do not believe that the information
provided in the MAG is comprehensive enough for effective classification on such a sophis-
ticated level.
On top of that, an organized structure is highly rigid and difficult to change. When intro-
ducing a previously unincorporated field of study, we have to not only modify the entire clas-
sification scheme, but ideally also relabel all papers in case some fall under the new label.
We believe the underlying problem to be the complexity of the entire classification scheme.
We aim to create a simpler structure that is extendable. Our idea is not aimed at replacing the
existing structure and field of study labels, but rather enhancing and extending the current system.
Instead of limiting each paper to being part of a comprehensive structured system, we (1) merely
assign a single field of study label at the top level (also called “discipline” in the following, level 0
in the MAKG), such as computer science, physics, or mathematics. We then (2) assign to each
Table 13. Overview of MAG field of study hierarchy

Level    # of fields of study
0        19
1        292
2        138,192
3        208,368
4        135,913
5        167,676
publication a list of keywords (i.e., tags), which are used to describe the publication in further
detail. Our system is therefore essentially descriptive in nature rather than restrictive.
Compared to the classification scheme of the original MAKG and the MAG so far, our pro-
posed system is more fluid and extendable as its labels or tags are not constrained to a rigid
hierarchy. New concepts are freely introduced without affecting existing labels.
Our idea therefore is to classify papers on a basic level, then extract keywords in the form of
tags for each paper. These can be used to describe the content of a specific work, while leav-
ing the structuring of concepts to domain experts in each field. We classify papers into their
respective fields of study using a transformer-based classifier and generate tags for papers using
keyword extraction from the publications’ abstracts.
In Section 4.2, we introduce related work concerning text classification and tagging. We
describe our approach in Section 4.3. In Section 4.4, we present our evaluation of existing
field of study labels, the MAKG field of study hierarchy, and the newly created field of study
labels. Finally, we discuss our findings in Section 4.5.
4.2. Related Work
4.2.1. Text classification
The tagging of papers based on their abstracts can be regarded as a text classification task. Text
classification aims to categorize given texts into distinct subgroups according to predefined
characteristics. As with any classification task, text classification can be separated into binary,
multilabel, and multiclass classification.
Kowsari, Meimandi et al. (2019) provide a recent survey of text classification approaches.
Traditional approaches include techniques such as the Rocchio algorithm (Rocchio, 1971),
boosting (Schapire, 1990), bagging (Breiman, 1996), and logistic regression (Cox & Snell,
1989), as well as naïve Bayes. Clustering-based approaches include k-nearest neighbor and
support vector machines (Vapnik & Chervonenkis, 1964). More recent approaches mostly uti-
lize deep learning. Recurrent neural networks (Rumelhart, Hinton, & Williams, 1986) and long
short-term memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) had been the pre-
dominant approaches for representing language and solving language-related tasks until the
rise of transformer-based models.
Transformer-based models can be generally separated into autoregressive and autoencod-
ing models. Autoregressive models such as Transformer-XL (Dai, Yang et al., 2019) learn rep-
resentations for individual word tokens sequentially, whereas autoencoding models such as
BERT (Devlin, Chang et al., 2019) are able to learn representations in parallel using the entirety
of the document, even words found after the word token. Newer autoregressive models such
as XLNet (Yang, Dai et al., 2019) combine features from both categories and are able to
achieve state-of-the-art performance. Additionally, other variants of the BERT model exist, such
as ALBERT (Lan, Chen et al., 2020) and RoBERTa (Liu, Ott et al., 2019). Furthermore, special-
ized BERT variants have been created. One such variant is SciBERT (Beltagy, Lo, & Cohan,
2019), which specializes in academic texts.
4.2.2. Tagging
Tagging—based on extracting the tags from a text—can be considered synonymous with key-
word extraction. To extract keywords from publications’ full texts, several approaches and
challenges have been proposed (Alzaidy, Caragea, & Giles, 2019; Florescu & Caragea,
2017; Kim, Medelyan et al., 2013), exploiting publications’ structures, such as citation
Quantitative Science Studies
72
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
q
s
s
/
a
r
t
i
c
e
–
p
d
l
f
/
/
/
/
/
3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
a
_
0
0
1
8
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
The Microsoft Academic Knowledge Graph enhanced
networks (Caragea, Bulgarov et al., 2014). In our scenario, we use publications’ abstracts, as
the full texts are not available in the MAKG. Furthermore, we focus on keyphrase extraction
methods requiring no additional background information and not designed for specific tasks,
such as text summarization.
TextRank (Mihalcea & Tarau, 2004) is a graph-based ranking model for text processing. It
performs well for tasks such as keyword extraction as it does not rely on local context to deter-
mine the importance of a word, but rather uses the entire context through a graph. For every
input text, the algorithm splits the input into fundamental units (words or phrases depending on
the task) and structures them into a graph. Afterwards, an algorithm similar to PageRank deter-
mines the relevance of each word or phrase to extract the most important ones.
Another popular algorithm for keyword extraction is RAKE, which stands for rapid auto-
matic keyword extraction (Rose, Engel et al., 2010). In RAKE, the text is split by a previously defined list of stop words and phrase delimiters. Thus, a less comprehensive list would lead to longer phrases. In con-
trast, TextRank splits the text into individual words first and combines words which benefit
from each other’s context at a later stage in the algorithm. Overall, RAKE is more suitable
for text summarization tasks due to its longer extracted key phrases, whereas TextRank is suit-
able for extracting shorter keywords used for tagging, in line with our task. In their original
publication, the authors of TextRank applied their algorithm for keyword extraction from pub-
lications’ abstracts. Due to all these reasons, we use TextRank for publication tagging.
4.3. Approach
Our approach is to fine-tune a state-of-the-art transformer model for the task of text classifica-
tion. We use the given publications’ abstracts as input to classify each paper into one of 19
top-level field of study labels (i.e., level 0) predefined by the MAG (see Table 14). After that, we
apply TextRank to extract keyphrases and assign them to papers.
4.4. Evaluation
4.4.1. Evaluation data
For the evaluation, we produce three labeled data sets in an automatic fashion. Two of the data
sets are used to evaluate the current field of study labels in the MAKG (and MAG) and the
given MAKG field of study hierarchy, while the last data set acts as our source for training
and evaluating our approach for the field of study classification.
In the following, we describe our approaches for generating our three data sets.
1. For our first data set, we select field of study labels directly from the MAKG. As men-
tioned previously, the MAKG’s fields of study are provided in a hierarchical structure
(i.e., fields of study, such as research topics) can have several fields of study below
them. We filter the field of study labels associated with papers for level-0 labels only;
that is, we consider only the 19 top-level labels and their assignments to papers.
Table 14 lists all 19 level-0 fields of study in the MAKG; these, associated with the
papers, are also our 19 target labels for our classifier. This data set will be representative
of the field of study assignment quality of the MAKG overall as we compare its field of
study labels with our ground truth (see Section 4.4).
2. For our second data set, we extrapolate field of study labels from the MAKG/MAG using
the field of study hierarchy—that is, we relabel the papers using their associated
top-level fields of study on level 0. For example, if a paper is currently labeled as
“neural network,” we identify its associated level-0 field of study (the top-level field
Table 14. List of level-0 fields of study from the MAG

MAG ID       Field of study
41008148     Computer Science
86803240     Biology
17744445     Political Science
192562407    Materials Science
205649164    Geography
185592680    Chemistry
162324750    Economics
33923547     Mathematics
127313418    Geology
127413603    Engineering
121332964    Physics
144024400    Sociology
144133560    Business
71924100     Medicine
15744967     Psychology
142362112    Art
95457728     History
138885662    Philosophy
39432304     Environmental Science
of study in the MAKG). In this case, the paper would be assigned the field of study of
“computer science.”
We prepare our data set by first replacing all field of study labels using their respective
top-level fields of study. Each field of study assignment in the MAKG has a correspond-
ing confidence score. We thus sort all labels by their corresponding level-0 fields of
study and calculate the final field of study of a given paper by summing their individual scores. For example, consider a paper that originally has the field of study labels
“neural network” with a confidence score of 0.6, “convolutional neural network” with a
confidence score of 0.5, and “graph theory” with a confidence score of 0.8. The labels
“neural network” and “convolutional neural network” are mapped back to the top-level
field of study of “computer science,” whereas “graph theory” is mapped back to “math-
ematics.” To calculate the final score for each discipline, we total the weights of every
occurrence of a given label. In our example, “computer science” would have a score of
0.5 + 0.6 = 1.1, and “mathematics” a score of 0.8, resulting in the paper being labeled
as “computer science.”
This approach can be interpreted as an addition of weights on the direct labels we gen-
erated for our previous approach. By analyzing the differences in results from these two
data sets, we aim to gather some insights into the validity of the hierarchical structure of
the fields of study found in the MAG.
3. Our third data set is created by utilizing the papers’ journal information. We first select a
specific set of journals from the MAKG for which the journal papers’ fields of study can
easily be identified. This is achieved through simple string matching between the names
of top-level fields of study and the names of journals. For instance, if the phrase “com-
puter science” occurs in the name of a journal, we assume it publishes papers in the
field of computer science.
We expect the data generated by this approach to be highly accurate, as the journal is
an identifying factor of the field of study. We cannot rely on this approach to match all
papers from the MAKG, as a majority of papers were published in journals whose main
disciplines could not be discerned directly from their names. In addition, a portion of papers do not have any associated journal entries in the MAKG.
We are able to label 2,553 journals in this fashion. We then label all 2,863,258 papers
from these given journals using their journal-level field of study labels. We use the
resulting data set to evaluate the fields of study in the MAKG as well as to generate
training data for the classifier.
In the latter case, we randomly selected 20,000 abstracts per field of study label, resulting
in 333,455 training samples (i.e., paper–field-of-study assignment pairs). The mismatch
compared to the theoretical training data size of 380,000 comes from the fact that
some labels had fewer than 20,000 papers available to select from.
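To make the relabeling via the hierarchy (second data set) concrete, the following sketch aggregates the confidence scores of a paper’s field of study labels per top-level discipline; the mapping and the scores are the placeholder values from the example above.

```python
from collections import defaultdict

# Hypothetical mapping from fine-grained fields of study to level-0 disciplines.
to_level0 = {
    "neural network": "computer science",
    "convolutional neural network": "computer science",
    "graph theory": "mathematics",
}

# (field of study, confidence score) pairs of one paper, as in the example above.
paper_labels = [
    ("neural network", 0.6),
    ("convolutional neural network", 0.5),
    ("graph theory", 0.8),
]

scores = defaultdict(float)
for label, confidence in paper_labels:
    scores[to_level0[label]] += confidence

final_discipline = max(scores, key=scores.get)
print(final_discipline, scores[final_discipline])  # computer science 1.1
```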
Our data for evaluating the classifier comes from our third approach, namely the field of
study assignment based on journal names. We randomly drew 2,000 samples for each label
from the labeled set to form our test data set. Note that the test set does not overlap in any way
with the training data set generated through the same approach, as both consist of distinctly
separate samples (covering all scientific disciplines). In total, the evaluation set consists of
38,000 samples spread over the 19 disciplines.
4.4.2. Evaluation setup
All our implementations use the Python module Simple Transformers (https://github.com/ThilinaRajapakse/simpletransformers; based on Transformers, https://github.com/huggingface/transformers), which provides a ready-made implementation of transformer-based models for the task of multiclass classification. We set the number of output classes to 19, corresponding to the number of top-level fields of study we are trying to label. As mentioned in Section 4.4.1, we prepare our evaluation data set based on labels generated via journal names. We also prepare our training set from the same data set.
We choose the following model variants for each architecture:
1. bert-large-uncased for BERT,
2. scibert_scivocab_uncased for SciBERT,
3. albert-base-v2 for ALBERT,
4. roberta-large for RoBERTa, and
5. xlnet-large-cased for XLNet.
All transformer models were trained on the bwUnicluster using GPU nodes containing four
Nvidia Tesla V100 GPUs and an Intel Xeon Gold 6230 processor.
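For illustration, the following minimal sketch shows how such a 19-class classifier can be fine-tuned with Simple Transformers; the Hugging Face model identifier, the training arguments, and the placeholder data frames are assumptions and do not reproduce our exact configuration.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Abstracts and integer labels (0-18) derived from the journal-based data set.
train_df = pd.DataFrame({"text": ["An abstract about proteins ..."], "labels": [1]})
eval_df = pd.DataFrame({"text": ["An abstract about superconductors ..."], "labels": [10]})

model = ClassificationModel(
    "bert",
    "allenai/scibert_scivocab_uncased",  # assumed Hugging Face name for SciBERT
    num_labels=19,
    args={"num_train_epochs": 2, "overwrite_output_dir": True},
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
print(result["mcc"])
```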
4.4.3. Evaluation metrics
We evaluate our model performances using two specific metrics: the micro-F1 score and the Matthews correlation coefficient.
The micro-F1 as an extension to the F1 score is calculated as follows:
$$\text{micro-F1} = \frac{\sum \text{true positives}}{\sum \text{true positives} + \sum \text{false positives}}$$
The micro-F1 score is herein identical to microprecision, microrecall, and accuracy; though it
does not take the distribution of classes into consideration, that aspect is irrelevant for our
case, as all our target labels have an equal number of samples and are therefore identically
weighted.
The Matthews correlation coefficient (MCC), also known as the phi coefficient, is another
standard metric used for multiclass classifications. It is often preferred for binary classification
or multiclass classification with unevenly distributed class sizes. The MCC only achieves high values if all four cells of the confusion matrix show good results, and it is therefore preferred for evaluating unbalanced data sets (Chicco & Jurman, 2020). Even though our
evaluation set is balanced, we nevertheless provide MCC as an alternative metric. The
MCC is calculated as follows:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
with TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives.
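Both metrics are available in scikit-learn; the following sketch shows how they could be computed from predicted and true labels (the label vectors shown are placeholders).

```python
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [0, 3, 3, 7, 18, 0]   # true level-0 discipline indices
y_pred = [0, 3, 5, 7, 18, 1]   # indices predicted by the classifier

micro_f1 = f1_score(y_true, y_pred, average="micro")  # equals accuracy for single-label data
mcc = matthews_corrcoef(y_true, y_pred)
print(micro_f1, mcc)
```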
4.4.4. Evaluation results
4.4.4.1. Evaluation of existing field of study labels. In the following, we outline our evaluation
concerning the validity of the existing MAG field of study labels. We take our two labeled sets
generated by our direct labeling (first data set; 2,863,258 papers) as well as labeling through
journal names (third data set) and compare the associated labels on level 0.
As we can see from the results in Table 15, the quality of top-level labels in the MAG can be
improved. Out of the 2,863,258 papers, 1,595,579 matching labels were found, correspond-
ing to a 55.73% match, meaning 55.73% of fields of study were labeled correctly according to
our ground truth. Table 15 also showcases an in-depth view of the quality of labels for each
discipline. We show the total number of papers for each field of study and the number of
papers that are correctly classified according to our ground truth, followed by the percentage.
4.4.4.2. Evaluation of MAKG field of study hierarchy. To determine the validity of the existing
field of study hierarchy, we compare the indirectly labeled data set (second data set) with
our ground truth based on journal names (third data set). The indirectly labeled data set is
labeled using inferred information based on the overall MAKG field of study hierarchy (see
Section 4.4.1). Here, we want to examine the effect the hierarchical structure would have
on the truthfulness of field of study labels. The results can be found in Table 16.
Our result based on this approach is very similar to the previous evaluation. Out of the
2,863,258 papers, we found 1,514,840 labels matching those based on journal names, result-
ing in a 52.91% match (compared to 55.73% in the previous evaluation). Including the MAKG
field of study hierarchy did not improve the quality of labels. For many disciplines, the number
of mislabelings increased significantly, further devaluing the quality of existing MAG labels.
Table 15. Evaluation results of existing field of study labels

Label                 # labels     # matching    % matching
Computer Science      21,157       15,056        71.163
Biology               212,356      132,203       62.255
Political Science     12,043       4,083         33.904
Materials Science     23,561       18,475        78.413
Geography             4,286        575           13.416
Chemistry             339,501      285,569       84.114
Economics             91,411       62,482        68.353
Mathematics           109,797      92,519        84.264
Geology               22,600       18,377        81.314
Engineering           731,505      187,807       25.674
Physics               694,631      500,723       72.085
Sociology             10,725       9,245         86.200
Business              141,498      33,641        23.775
Medicine              311,197      186,194       59.832
Psychology            36,080       31,834        88.232
Art                   23,728       4,336         18.274
History               39,938       5,161         12.923
Philosophy            19,517       6,363         32.602
Environm. Science     17,727       936           5.280
Total                 2,863,258    1,595,579     55.726
4.4.4.3. Evaluation of classification. In the following, we evaluate the newly created field of
study labels for papers determined by our transformer-based classifiers.
We first analyze the effect of training size on the overall results. Although we observe a
steady increase in performance with each increase in the size of our training set, the marginal improvement diminishes beyond a certain size. Therefore, with training time in mind, we decided to limit the training input size to 20,000 samples per label, leading to a theoretical training data size of 380,000 samples. The actual number is slightly smaller, however, because certain labels have fewer than 20,000 training samples in total.
We then compared the performances of various transformer-based models for our task.
Table 17 shows performances of our models trained on the same training set after one epoch.
As we can see, SciBERT and BERTbase outperform other models significantly, with SciBERT
slightly edging ahead in comparison. Surprisingly, the larger BERT variant performs signifi-
cantly worse than its smaller counterpart.
We then compare the effect of training epochs on performance. We limit our comparison to
the SciBERT model in this case. We choose SciBERT as it achieves the best performance after
Table 16. Evaluation results of the field of study hierarchy

Label                 # labels     # matching    % matching
Computer Science      21,157       13,055        61.705
Biology               212,356      145,671       68.598
Political Science     12,043       8,035         66.719
Materials Science     23,561       13,618        57.799
Geography             4,286        285           6.650
Chemistry             339,501      239,576       70.567
Economics             91,411       62,025        67.853
Mathematics           109,797      79,959        72.824
Geology               22,600       15,777        69.810
Engineering           731,505      207,063       28.306
Physics               694,631      464,083       66.810
Sociology             10,725       4,418         41.193
Business              141,498      26,095        18.442
Medicine              311,197      192,397       61.825
Psychology            36,080       25,548        70.809
Art                   23,728       4,901         20.655
History               39,938       3,391         8.491
Philosophy            19,517       8,641         44.274
Environm. Science     17,727       302           1.704
Total                 2,863,258    1,514,840     52.906
one epoch of training. We fine-tune the same SciBERT model using an identical training set
(20,000 samples per label) as well as the same evaluation set. We observe a peak in perfor-
mance after two epochs (see Table 18). Although performance for certain individual labels
keeps improving steadily afterward, the overall performance starts to deteriorate. Therefore,
Table 17. Result comparison of various transformer-based classifiers

Model       MCC       F1-score
BERTbase    0.7452    0.7584
BERTlarge   0.6853    0.7014
SciBERT     0.7552    0.7678
Albert      0.7037    0.7188
RoBERTa     0.7170    0.7316
XLNet       0.6755    0.6920
Table 18. Comparison between various numbers of training epochs

# of epochs    MCC       F1-score
1              0.7552    0.7678
2              0.7708    0.7826
3              0.7665    0.7787
4              0.7615    0.7739
5              0.7558    0.7685
training was stopped after two epochs for our final classifier. Note that we have performed
a similar analysis with some other models in a limited fashion as well. The best performance was generally achieved after two or three epochs, depending on the model.
Table 19 showcases the performance per label for our SciBERT model after two training epochs on the evaluation set. On average, the classifier achieves a macro average F1-score
Table 19. Detailed evaluation results per label
Label                 Precision    Recall    F1      # samples
Computer Science      0.77         0.83      0.80    2,000
Biology               0.83         0.84      0.84    2,000
Political Science     0.83         0.81      0.82    2,000
Materials Science     0.78         0.83      0.80    2,000
Geography             0.96         0.67      0.79    2,000
Chemistry             0.79         0.80      0.80    2,000
Economics             0.66         0.68      0.67    2,000
Mathematics           0.79         0.81      0.80    2,000
Geology               0.90         0.94      0.92    2,000
Engineering           0.58         0.49      0.53    2,000
Physics               0.84         0.81      0.83    2,000
Sociology             0.81         0.70      0.75    2,000
Business              0.65         0.69      0.67    2,000
Medicine              0.84         0.84      0.84    2,000
Psychology            0.85         0.89      0.87    2,000
Art                   0.68         0.76      0.72    2,000
History               0.70         0.75      0.72    2,000
Philosophy            0.81         0.81      0.81    2,000
Environm. Science     0.79         0.86      0.82    2,000
Macro average         0.78         0.78      0.78    38,000
of 0.78. In the detailed results for each label, we highlighted labels that achieved scores one
standard deviation above and below the average.
Classification performances for the majority of labels are similar to the overall average,
though some outliers can be found.
Overall, the setup is especially adept at classifying papers from the fields of geology (0.94),
psychology (0.87), medicine (0.84), and biology (0.84); whereas it performs the worst for engi-
neering (0.53), economics (0.67), and business (0.67). The values in parentheses are the
respective F1-scores achieved during classification.
We suspect the performance differences to be a result of the breadth of vocabularies used in
each discipline. Disciplines for which the classifier performs well usually use highly specific
and technical vocabularies. Engineering especially fits this explanation: as an agglomeration of a multitude of disciplines, such as physics, chemistry, and biology, it encompasses their respective vocabularies as well.
4.4.5. Keyword extraction
As outlined in Section 4.3, we apply TextRank to extract keywords from text and assign them
to publications. We use “pytextrank” (https://github.com/DerwenAI/pytextrank/), a Python
implementation of the TextRank algorithm, as our keyword extractor. Due to the generally
smaller text size of an abstract, we limit the number of keywords/key phrases to five. A greater
number of keywords would inevitably introduce additional “filler phrases,” which are not con-
ducive for representing the content of a given abstract. Further statistics about the keywords
are given in Section 6.
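A minimal sketch of this tagging step, assuming the spaCy-based pytextrank pipeline and an installed en_core_web_sm model, could look as follows; the abstract is a placeholder.

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = "We present methods for enhancing a scholarly knowledge graph ..."
doc = nlp(abstract)

# Keep the five highest-ranked keyphrases as tags for the publication.
tags = [phrase.text for phrase in doc._.phrases[:5]]
print(tags)
```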
4.5. Discussion
In the following, we discuss certain challenges faced, lessons learned, and future outlooks.
Our classification approach relied on the existing top-level fields of study (level-0) found in
the MAKG. Instead, we could have established an entirely new selection of disciplines as our
label set. It is also possible to adapt an established classification scheme, such as the ACM
Computing Classification System (https://dl.acm.org/ccs) or the Computer Science Ontology
(Salatino, Thanapalasingam et al., 2018). However, to the best of our knowledge, there is
not an equivalent classification scheme covering the entirety of research topics found in the
MAKG, which was a major factor leading us to adapt the field of study system.
Regarding keyword extraction, grouping of extracted keywords and key phrases and building
a taxonomy or ontology are natural continuations of the work. We suggest categories to be
constructed on an individual discipline level, rather than having a fixed category scheme
for all possible fields of study. For instance, within the discipline of computer science, we
could try to categorize tasks, data sets, approaches and so forth from the list of extracted key-
words. Brack, D’Souza et al. (2020) and Färber et al. (2021) recently published such an entity
recognition approach. Both have also adapted the SciBERT architecture to extract scientific
concepts from paper abstracts.
Future researchers can expand our extracted tags by enriching them with additional rela-
tionships to recreate a similar structure to the current MAKG field of study hierarchy.
Approaches such as the Scientific Information Extractor (Luan, He et al., 2018) could be
applied to categorize or to establish relationships between keywords, building an ontology
or rich knowledge graph.
5. KNOWLEDGE GRAPH EMBEDDINGS
5.1. Motivation
Embeddings provide an implicit knowledge representation for otherwise symbolic information.
They are often used to represent concepts in a fixed low-dimensional space. Traditionally,
embeddings are used in the field of natural language processing to represent vocabularies,
allowing computer models to capture the context of words and, thus, the contextual meaning.
Knowledge graph embeddings follow a similar principle, in which the vocabulary consists
of entities and relation types. The final embedding encompasses the relationships between
specific entities but also generalizes relations for entities of similar types. The embeddings
retain the structure and relationships of information from the original knowledge graph and
facilitate a series of tasks, such as knowledge graph completion, relation extraction, entity clas-
sification, question answering, and entity resolution (Wang, Mao et al., 2017).
Färber (2019) published pretrained embeddings for MAKG publications using RDF2Vec
(Ristoski, 2017) as an “add-on” to the MAKG. Here, we provide an updated version of embed-
dings for a newer version of the MAG data set and for a variety of entity types instead of papers
alone. We experiment with various types of embeddings and provide evaluation results for
each approach. Finally, we provide embeddings for millions of papers and thousands of jour-
nals and conferences, as well as millions of disambiguated authors.
In the following, we introduce related work in Section 5.2. Section 5.3 describes our approach
to knowledge graph embedding computation, followed by our evaluation in Section 5.4. We
conclude in Section 5.5.
5.2. Related Work
Generally, knowledge graphs are described using triplets in the form of (h, r, t), referring to the head entity h ∈ E, the relationship r ∈ R between both entities, and the tail entity t ∈ E.
Nguyen (2017) and Wang et al. (2017) provide overviews of existing approaches for creating
knowledge graph embeddings, as well as differences in complexity and performance.
Within the existing literature, there have been numerous approaches to train embeddings
for knowledge graphs. Generally speaking, the main difference between the approaches lies in
the scoring function used to calculate the similarity or distance between two triplets. Overall,
two major families of algorithms exist: ones using translational distance models and ones using
semantic matching models.
Translational distance models use distance function scores to determine the plausibility of
specific sets of triplets existing within a given knowledge graph context (Wang et al., 2017).
More specifically, the head entity of a triplet is projected as a point in a fixed dimensional
space; the relationship entity is herein, for example, a directional vector originating from
the head entity. The distance between the end point of the relationship entity and the tail entity
in this given fixed dimensional space describes the accuracy or quality of the embeddings.
One such example is the TransE (Bordes, Usunier et al., 2013) algorithm. The standard TransE
model does not perform well on knowledge graphs with one-to-many, many-to-one, or many-
to-many relationships (Wang, Zhang et al., 2014) because the tail entities’ embeddings are
heavily influenced by the relations. Two tail entities that share the same head entity as well
as relation are therefore similar in the embedding space created by TransE, even if they may be
different concepts entirely in the real world. As an effort to overcome the deficits of TransE,
TransH (Wang et al., 2014) was introduced to distinguish the subtleties of tail entities sharing a
common head entity as well as relation. Later on, TransR was introduced to further model
relations as separate vectors rather than hyperplanes, as is the case with TransH. The efficiency
was later improved with the TransD model (Ji, He et al., 2015).
Semantic matching models compare similarity scores to determine the plausibility of a
given triplet. Here, relations are not modeled as vectors similar to entities, but rather as matri-
ces describing interactions between entities. Such approaches include RESCAL (Nickel, Tresp,
& Kriegel, 2011), DistMult (Yang, Yih et al., 2015), HolE (Nickel, Rosasco, & Poggio, 2016),
ComplEx (Trouillon et al., 2016), and others.
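The difference between the two families can be illustrated by their scoring functions; the following sketch uses randomly initialized 100-dimensional embeddings and is not tied to any particular implementation.

```python
import numpy as np

def transe_score(h, r, t):
    # Translational distance: a triple is plausible if h + r lies close to t.
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    # Semantic matching: trilinear product of head, relation, and tail embeddings.
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    # ComplEx: real part of the trilinear product with the conjugated tail,
    # using complex-valued embeddings.
    return float(np.real(np.sum(h * r * np.conj(t))))

rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=100) for _ in range(3))
print(transe_score(h, r, t), distmult_score(h, r, t))
```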
More recent approaches use neural network architectures to represent relation embeddings.
ConvE, for instance, represents head entity and relations as input and tail entity as output of a
convolutional neural network (Dettmers, Minervini et al., 2018). ParamE extends the approach
by representing relations as parameters of a neural network used to “translate” the input of
head entity into the corresponding output of tail entity (Che, Zhang et al., 2020).
In addition, there are newer variations of knowledge graph embeddings, for example using
textual information (Lu, Cong, & Huang, 2020) and literals (Gesese, Biswas et al., 2019;
Kristiadi, Khan et al., 2019). Overall, we decided to use established methods to generate
our embeddings for stability in results, performance during training, and compatibility with
file formats and graph structure.
5.3. Approach
We experiment with various embedding types and compare their performances on our data
set. We include both translational distance models and semantic matching models of the fol-
lowing types: TransE (Bordes et al., 2013), TransR (Lin, Liu et al., 2015), DistMult (Yang et al.,
2015), ComplEx (Trouillon et al., 2016), and RESCAL (Nickel et al., 2011) (see Section 5.2 for
an overview of how these approaches differ from each other). The reasoning behind the
choices is as follows: The embedding types need to be state-of-the-art and widespread, therein
acting as the basis of comparison. In addition, there needs to be an efficient implementation to
train each embedding type, as runtime is a limiting factor. For example, the paper embeddings
by Färber (2019) were trained using RDF2Vec (Ristoski, 2017) and took 2 weeks to complete.
RDF2Vec did not scale well enough using all authors and other entities in the MAKG. Also
current implementations of RDF2Vec, such as pyRDF2Vec, are not designed for such a large
scale: “Loading large RDF files into memory will cause memory issues as the code is not
optimized for larger files” (https://github.com/IBCNServices/pyRDF2Vec). This turned out to
be true when running RDF2Vec on the MAKG. For the difference between RDF2Vec and other
algorithms, such as TransE, we can refer to Portisch, Heist, and Paulheim (2021).
5.4. Evaluation
5.4.1. Evaluation data
Our aim is to generate knowledge graph embeddings for the entities of type papers, journals,
conferences, and authors to solve machine learning-based tasks, such as search and recom-
mendation tasks. The RDF representations can be downloaded from the MAKG website
(https://makg.org/).
We first select the required data files containing the entities of our chosen entity types and
combine them into a single input. Ideally, we would train paper and author embeddings simul-
taneously, such that they benefit from each other’s context. However, the required memory
space proved to be a limiting factor given the more than 200 million authors and more than
200 million papers. Ultimately, we train embeddings for papers, journals, and conferences
together; we train the embeddings for authors separately.
Due to the large number of input entities within the knowledge graph, we try to minimize
the overall input size and thereby the memory requirement for training. We first restrict the input to the relationship types we aim to model. To further reduce memory consumption, we “abbreviate” relations by removing their prefixes.
Furthermore, we use a mapping for entities and relations to further reduce memory con-
sumption. All entities and relations are mapped to a specific index in the form of an integer.
In this way, all statements within the knowledge graph are reduced to a triple of integers and
used as input for training together with the mapping files.
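The mapping step can be sketched as follows; the triples and relation names are hypothetical and only illustrate how identifiers are replaced by integer ids before training.

```python
def build_mappings(triples):
    """Map entity and relation identifiers to integers, as described above."""
    entity2id, relation2id, encoded = {}, {}, []
    for head, relation, tail in triples:
        h = entity2id.setdefault(head, len(entity2id))
        t = entity2id.setdefault(tail, len(entity2id))
        r = relation2id.setdefault(relation, len(relation2id))
        encoded.append((h, r, t))
    return entity2id, relation2id, encoded

# Hypothetical triples with "abbreviated" relations (prefixes removed).
triples = [
    ("makg:Paper/1", "cites", "makg:Paper/2"),
    ("makg:Paper/1", "appearsInJournal", "makg:Journal/7"),
]
entity2id, relation2id, encoded = build_mappings(triples)
print(encoded)  # [(0, 0, 1), (0, 1, 2)]
```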
5.4.2. Evaluation setup
We use the Python package DGL-KE (Zheng et al., 2020) for our implementation of knowledge
graph embedding algorithms. DGL-KE is a recently published package optimized for training
knowledge graph embeddings at a large scale. It outperforms other state-of-the-art packages
while achieving linear scaling with machine resources as well as high model accuracies. We
set the dimension size of our output embeddings to 100. We set this limit because training higher-dimensional embeddings requires considerably more memory. We experimented with a dimension size of 150 and did not observe any improvements to our metrics; any higher embedding sizes resulted in out-of-memory errors on our setup. The exact choices of hyperparameters are listed in Table 20. We perform the evaluation by randomly masking entities and relations and trying to repredict the missing part.
We perform training on the bwUnicluster using GPU nodes with eight Nvidia Tesla V100
GPUs and 752 GB of RAM. We use standard ranking metrics Hit@k, mean rank (MR), and
mean reciprocal rank (MRR).
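For reference, these metrics follow directly from the rank of the correct entity for each masked test triple; a minimal sketch of their computation is shown below, with hypothetical ranks.

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hits@k from 1-based ranks of the true entities."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in ks:
        metrics[f"HITS@{k}"] = float((ranks <= k).mean())
    return metrics

# Ranks of the masked entities among all candidate predictions.
print(ranking_metrics([1, 2, 1, 5, 30]))
```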
5.4.3. Evaluation results
Our evaluation results can be found in Table 21. Note that performing a full-scale analysis of
the effects of the hyperparameters on the embedding quality was out of the scope of this paper.
Results are based on embeddings trained on paper, journal, and conference entities. We
observed an average mean rank of 1.301 and a mean reciprocal rank of 0.958 for the best-
performing embedding type.
Interestingly, TransE and TransR greatly outperform other algorithms during fewer training
steps (1,000). For higher training steps, the more modern models, such as ComplEx and
DistMult, achieve state-of-the-art performance. Across all metrics, ComplEx, which is based
on complex embeddings instead of real-valued embeddings, achieves the best results (e.g.,
MRR of 0.958 and HITS@1 of 0.937) while having competitive training times to other
Table 20. Hyperparameters for training embeddings

Hyperparameter            Value
Embedding size            100
Maximum training step     1,000,000
Batch size                1,000
Negative sampling size    1,000
Table 21. Evaluation results of various embedding types

                   TransR*     TransE     RESCAL      ComplEx     DistMult
Average MR         105.598     15.224     4.912       1.301       2.094
Average MRR        0.388       0.640      0.803       0.958       0.923
Average HITS@1     0.338       0.578      0.734       0.937       0.893
Average HITS@3     0.403       0.659      0.851       0.975       0.945
Average HITS@10    0.474       0.769      0.920       0.992       0.977
Training time      10 hours    8 hours    18 hours    8 hours     8 hours
methods. A direct comparison of these evaluation results with the evaluation results for link
prediction with embeddings in the general domain is not possible, in our view, because the
performance depends heavily on the used training data and test data. However, it is remark-
able that embedding methods that perform quite well on our tasks (e.g., RESCAL) do not per-
form so well in the general domain (e.g., using the data sets WN18 and FB15K) (Dai, Wang
et al., 2020), while the embedding method that performs best in our case, namely ComplEx,
also counts as state-of-the-art in the general domain (Dai et al., 2020).
It is important to note that we train the TransR embedding type on 250,000 maximum train-
ing steps compared to 1,000,000 for all other embedding types. This is due to the extremely long training time for this specific embedding; we were unable to finish training within 48 hours and, therefore, had to adjust the training steps manually. The effect can be seen in its performance, although for fewer training steps TransR performed similarly to TransE.
Table 22 shows the quality of our final embeddings, which we published at https://makg.org/.
Table 22. Evaluation of final embeddings

                   Author    Paper/Journal/Conference
Average MR         2.644     1.301
Average MRR        0.896     0.958
Average HITS@1     0.862     0.937
Average HITS@3     0.918     0.975
Average HITS@10    0.960     0.992

5.5. Discussion
The main challenge of the task lies in the hardware requirement for training embeddings on such a large scale. Even after the steps we took to reduce memory consumption, training still required a significant amount of memory; for example, we were not able to train publication and author embeddings simultaneously given 750 GB of memory space. Given additional resources, future researchers could increase the dimensionality of embeddings, which might increase performance.
Other embedding approaches may be suitable for our case as well, though the limiting factor here is the large file size of the input graph. Any approach needs to be scalable and perform
efficiently on such large data sets. One of the limiting factors for choosing embedding types (e.g.,
TransE) is the availability of an efficient implementation. The DGL-KE provides such implemen-
tations, but only for a select number of embedding types. In the future, as other implementations
become publicly available, further evaluations may be performed. Alternatively, custom imple-
mentations can also be developed, though such tasks are not the subject of our paper.
Future researchers might further experiment with various combinations of hyperparameters.
We have noticed a great effect of training steps on the embedding qualities of various models.
Other effects might be uncovered with additional experimentation.
6. KNOWLEDGE GRAPH PROVISIONING AND STATISTICAL ANALYSIS
In this section, we outline how we provide the enhanced MAKG. Furthermore, we show the
results of a statistical analysis on various aspects of the MAKG.
6.1. Knowledge Graph Provisioning
For creating the enhanced MAKG, we followed the initial schema and data model of Färber
(2019). However, we introduced new properties to model novel relationships and data attributes.
A list of all new properties to the MAKG ontology can be found in Table 23. An updated schema
for the MAKG is in Figure 7 and on the MAKG homepage, together with the updated ontology.
Besides the MAKG, Wikidata models millions of scientific publications. Thus, similar to the
initial MAKG (Färber, 2019), we created mappings between the MAKG and Wikidata in the
form of owl:sameAs statements. Using the DOI as unique identifier for publications, we were
able to create 20,872,925 links between the MAKG and Wikidata.
The MAKG RDF files—containing 8.7 billion RDF triples as the core part—are available at
https://doi.org/10.5281/zenodo.4617285. The updated SPARQL endpoint is available at
https://makg.org/sparql.
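For illustration, the endpoint can be queried from Python with SPARQLWrapper; the generic query below merely retrieves a small sample of triples and makes no assumptions about the schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://makg.org/sparql")
sparql.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```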
6.2. General Statistics
Similar to analyses performed by Herrmannova and Knoth (2016) and Färber (2019), we aim to
provide some general data set statistics regarding the content of the MAKG. Since the last pub-
lication, the MAG has received many updates in the form of additional data entries, as well as
some small to moderate data schema changes. Therefore, we aim to provide some up-to-date
statistics of the MAKG and further detailed analyses of other areas.
We carried out all analysis using the MAKG based on the MAG data as of June 2020 and our
modified variants (i.e., custom fields of study and enhanced author set). Table 24 shows general
statistics of the enhanced MAKG. In the following, we describe key statistics in more detail.
6.2.1. Authors
The original MAKG encompasses 243,042,675 authors, of which 43,514,250 had an affilia-
tion given in the MAG. Our disambiguation approach reduced this set to 151,355,324 authors.
Table 25 showcases certain author statistics with respect to publication and cooperation.
The average paper in the MAG has 2.7 authors with the most having 7,545 authors. On aver-
age, an author published 2.65 papers according to the MAKG. The author with the highest
number of papers published 8,551 papers. The average author cooperated with 10.69 other authors across their combined work, with the most “connected” author having 65,793 coauthors overall; the latter figure might be plausible, but is likely inflated to some extent by unclean data.
Table 23. Properties added to the MAKG using the prefixes shown in Figure 7

Property                                                    Domain                                                                                       Range
https://makg.org/property/paperFamilyCount                  :Author, :Affiliation, :Journal, :ConferenceSeries, :ConferenceInstance, :FieldOfStudy      xsd:integer
https://makg.org/property/ownResource                       :Paper                                                                                       :Resource
https://makg.org/property/citedResource                     :Paper                                                                                       :Resource
https://makg.org/property/resourceType                      :Resource                                                                                    xsd:integer
https://www.w3.org/1999/02/22-rdf-syntax-ns#type            :Resource                                                                                    Resource
https://purl.org/spar/fabio/hasURL                          fabio:Work                                                                                   xsd:anyURI
https://makg.org/property/familyId                          :Paper                                                                                       xsd:integer
https://makg.org/property/isRelatedTo                       :Affiliation, :Journal, :ConferenceSeries, :FieldOfStudy                                     :Affiliation, :Journal, :ConferenceSeries, :FieldOfStudy
https://makg.org/property/recommends                        :Paper                                                                                       :Paper
https://prismstandard.org/namespaces/basic/2.0/keyword      :Paper                                                                                       xsd:string
https://www.w3.org/2003/01/geo/wgs84_pos#lat                :Affiliation                                                                                 xsd:float
https://www.w3.org/2003/01/geo/wgs84_pos#long               :Affiliation                                                                                 xsd:float
https://dbpedia.org/ontology/location                       :ConferenceInstance                                                                          dbp:location
https://dbpedia.org/ontology/publisher                      :Paper                                                                                       dbp:Publisher
https://dbpedia.org/ontology/patent                         :Paper                                                                                       epo:EPOID, justia:JustiaID
https://purl.org/spar/fabio/hasPatentNumber                 :Paper                                                                                       xsd:string
https://purl.org/spar/fabio/hasPubMedId                     :Paper                                                                                       pm:PubMedID
https://purl.org/spar/fabio/hasPubMedCentrialId             :Paper                                                                                       pmc:PMCID
https://www.w3.org/2000/01/rdf-schema#seeAlso               :FieldOfStudy                                                                                gn:WikipediaArticle, nih:NihID
Figure 7. Updated MAKG schema.
Table 24. General statistics for the MAG/MAKG and the enhanced MAKG as of June 19, 2020

                          # in MAG/MAKG    # in enhanced MAKG
Papers                    238,670,900      238,670,900
Paper abstracts           139,227,097      139,227,097
Authors                   243,042,675      151,355,324
Affiliations              25,767           25,767
Journals                  48,942           48,942
Conference series         4,468            4,468
Conference instances      16,142           16,142
Unique fields of study    740,460          740,460
ORCID iDs                 –                34,863
Table 25. General author and paper statistics

Metric                          Value
Average authors per paper       2.6994
Maximum authors per paper       7,545
Average papers per author       2.6504
Maximum papers per author       8,551
Average coauthors per author    10.6882
Maximum coauthors per author    65,793

Table 26. General reference and citation statistics

Key statistics                   Value
Average references               6.8511
At least one reference           78,684,683
Average references (filtered)    20.7813
Median references (filtered)     12
Most references                  26,690
Average citations                6.8511
At least one citation            90,887,343
Average citations (filtered)     17.9912
Median citations (filtered)      4
Most citations                   252,077
Table 27. Detailed reference and citation statistics

                                 Journal       Conference   Patent       Book        BookSection   Repository   Data Set   No data
Average references               13.089        10.309       3.470        2.460       3.286         11.649       0.063      2.782
At least one reference           42,660,071    3,913,744    19,023,288   93,644      339,439       1,305,000    130        11,349,367
Average references (filtered)    26.313        12.400       9.643        56.315      26.268        14.988       18.969     21.758
Median references (filtered)     20            10           5            15          6             7            7          10
Most references                  13,220        4,156        19,352       5,296       7,747         2,092        196        26,690
Average citations                14.729        9.024        3.225        29.206      0.813         2.251        0.188      1.019
At least one citation            50,599,935    3,063,123    22,591,991   1,299,728   351,448       549,526      1,187      12,430,405
Average citations (filtered)     24.963        13.869       7.547        48.177      6.277         6.878        6.240      7.274
Median citations (filtered)      8             4            3            7           2             2            1          2
Most citations                   252,077       34,134       32,096       137,596     4,119         20,503       633        103,540
6.2.2. Papers
We first analyze the composition of paper entities by their associated type (see Table 2). The
most frequently found document type is journal articles, followed by patents. A huge propor-
tion of paper entities in the MAKG do not have a document type.
In the following, we analyze the number of citations and references for papers within the
MAKG. The results can be found in Table 26.
The average paper in the MAKG references 6.85 papers and received 6.85 citations. The
exact match in numbers here seems too unlikely to be coincidental. Therefore, we suspect
these numbers to be a result of a closed referencing system of the original MAG, meaning
references for a paper are only counted if they reference another paper within the MAG;
and citations are only counted if a paper is cited by another paper found in the MAKG. When
we remove papers with zero references, we are left with a set of 78,684,683 papers. The aver-
age references per paper from the filtered paper set is now 20.78. In the MAKG, 90,887,343
papers are cited at least once, with the average among this new set being 17.99. As averages
are highly susceptible to outliers, which were frequent in our data set due to unclean data and
the power law distribution of scientific output, we also calculated the median of references
and citations. These values should give us a more representative picture of reality. The paper
with the most references from the MAG has 26,690 references, whereas the paper with the
most citations received 252,077 citations as of June 2020.
Table 27 showcases detailed reference and citation statistics for each document type found
in our (enhanced) MAKG. Unsurprisingly, books have the largest number of references on average due to their significant length, followed by journal papers (and book sections). However, the median value for books is lower than for journals, likely due to outliers. Citation-wise, books and journal papers are again the most cited document types on average. Again, journal papers have fewer citations on average but a higher median value.
Figure 8 shows the number of papers published each year in the time span recorded by the
MAKG (1800–present). The number of publications has been on a steady exponential trajec-
tory. This is, of course, partly due to advances in the digitalization of libraries and journals, as
well as the increasing ease of accessing new research papers. However, we can certainly attri-
bute a large part of the growth to the increasing number of publications every year (Johnson
et al., 2018).
Interestingly, the average number of references per paper has been on a steady increase (see
Figure 9 and Johnson et al. (2018)). This could be due to a couple of reasons. First, as scientific
Figure 8. Number of papers published per year (starting with 1900).
Figure 9. Average number of references of a paper per year.
fields develop and grow, novel work becomes increasingly rare. Rather, researchers publish
work built on top of previous research (“on the shoulders of giants”), leading to a growing
number of references for new publications. Furthermore, the increasing number of research papers contributes to more works being considered for referencing. Second, develop-
ments in technology, such as digital libraries, enable the spread of research and ease the shar-
ing of ideas and communication between researchers (see, for example, the open access
efforts (Piwowar, Priem et al., 2018)). Therefore, a researcher from the modern age has a huge
advantage in accessing other papers and publications. The ease of access could contribute to
more works being referenced in this way. Third, as the MAKG is (most likely) a closed refer-
ence system, meaning papers referenced are only included if they are part of the MAKG, and
as modern publications are more likely to be included in the MAKG, newer papers will auto-
matically have a higher number of recorded references in the MAKG. Although this is a pos-
sibility, we do not suspect it to be the main reason behind the rising number of references.
Most likely, the cause is a combination of several factors.
Surprisingly, the average number of citations a paper receives has increased over time, as shown in Figure 10. Intuitively, one would expect older papers to have accumulated more citations on average simply because they have been citable for longer. However, as our graph shows, the average number of citations per paper has risen since the turn of the last century. The curve peaks around 1996, after which the age effect appears to set in: more recent papers have had less time to be cited and, coupled with the exponential growth in the number of publications, the average number of citations per paper drops sharply.
Figure 10. Average number of citations of a paper per year.

Figure 11 shows the average number of authors per paper per year and per publication type, based on each paper's publication year in the MAKG.
Figure 11. Average number of authors per paper and paper type over the years, including standard deviation.
As the figure shows, there has been a clear upward trend in the average number of authors per paper since the 1970s, especially for journal articles, conference papers, and patents. The level of cooperation within the scientific community has grown, partly driven by technological developments that make it easy for researchers to connect and collaborate. This finding confirms the results of the STM Report 2018 (Johnson et al., 2018).
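The aggregation behind Figure 11 can be sketched in the same spirit; the column names (year, doc_type, author_count) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-paper table with publication year, document type, and author count.
papers = pd.read_csv("papers.csv")

# Average number of authors per paper and its standard deviation,
# grouped by publication year and document type (cf. Figure 11).
authors_per_paper = (papers
                     .groupby(["year", "doc_type"])["author_count"]
                     .agg(["mean", "std"])
                     .reset_index())

print(authors_per_paper[authors_per_paper["doc_type"] == "Journal"].tail())
```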
6.2.3. Fields of study
In the following, we analyze the development of fields of study over time. First, Figure 12 shows the current number of publications per top-level field of study within the MAKG. Each field of study has two values: The blue bars represent papers as labeled by the original MAKG, whereas the red bars represent the labels generated by our custom classifier.
Figure 12. Number of papers per field of study.
Importantly, the total number of labeled papers differs between the original MAKG field-of-study labels and our custom labels. The original MAG hierarchy includes labels for 199,846,956 papers. Our custom labels are created by classifying paper abstracts and are therefore limited by the number of abstracts available in the data set; thus, we only generated labels for 139,227,097 papers. Rather surprisingly, medicine and materials science are the most common fields of study within the MAG according to the original MAG field-of-study labels, whereas according to our classification, engineering and medicine are the most represented disciplines.
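As a rough sketch of how the two bar series in Figure 12 can be derived, one can count papers per top-level field of study separately for the original MAKG labels and for our classifier's labels; the file and column names below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical label files; columns: paper_id, field. Names are assumptions.
makg_labels = pd.read_csv("makg_fields.csv")
custom_labels = pd.read_csv("custom_fields.csv")

# Number of papers per top-level field of study for both label sets (cf. Figure 12).
comparison = pd.DataFrame({
    "MAKG": makg_labels["field"].value_counts(),
    "Custom classifier": custom_labels["field"].value_counts(),
}).fillna(0).astype(int)

print(comparison.sort_values("MAKG", ascending=False))
```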
Evaluating the cumulative number of papers associated with the different fields of study
over the years, we can confirm the exponential growth of scientific output shown by Larsen
and von Ins (2010). In many areas, our data show greater rates of growth than previously
anticipated.
Figure 13 shows the interdisciplinary work of authors. Here, we model the relationships between fields of study as a chord diagram. Each chord between two fields of study represents the authors who have published papers in both disciplines, and the thickness of a chord reflects the number of such authors. We observe strong relationships between biology and medicine, materials science and engineering, and computer science and engineering. Furthermore, there are moderately strong relationships between chemistry and medicine, biology and engineering, and chemistry and biology. The multitude of links between engineering and other disciplines could be due to mislabeled engineering papers, as our classifier is less adept at classifying engineering papers than papers from other fields of study, as shown in Table 19.
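The counts underlying such a chord diagram can be derived, in a minimal sketch, by collecting the set of top-level fields each author has published in and counting every unordered pair of distinct fields once per author; the input file format below (one author_id,field row per pair, no header) is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations
import csv

# Hypothetical input: one "author_id,field" row per (author, top-level field) pair, no header.
fields_per_author = {}
with open("author_fields.csv", newline="") as f:
    for author_id, field in csv.reader(f):
        fields_per_author.setdefault(author_id, set()).add(field)

# Count each unordered pair of distinct fields once per author;
# the resulting counts correspond to the chord thicknesses in Figure 13.
pair_counts = Counter()
for fields in fields_per_author.values():
    for pair in combinations(sorted(fields), 2):
        pair_counts[pair] += 1

for (field_a, field_b), n_authors in pair_counts.most_common(10):
    print(f"{field_a} - {field_b}: {n_authors} authors")
```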
Figure 13. Interdisciplinary researchers in the form of authors who publish in multiple fields of study.
7. CONCLUSION AND OUTLOOK
In this paper, we developed and applied several methods for enhancing the MAKG, a large-
scale scholarly knowledge graph. First, we performed author name disambiguation on the set
of 243 million authors using background information, such as the metadata of 239 million
publications. Our classifier achieved a precision of 0.949, a recall of 0.755, and an accuracy
of 0.991. As a result, we reduced the total number of author entities from 243 million to 151 million.
Second, we reclassified the papers in the MAKG into a distinct set of 19 disciplines (i.e., level-0 fields of study). In our evaluation, 55% of the existing labels were accurate, whereas our newly generated labels achieved an accuracy of approximately 78%. In addition, we assigned keyword tags to papers based on their abstracts to describe paper content more flexibly than the preexisting rigid field-of-study hierarchy in the MAKG.
Third, we generated entity embeddings for all paper, journal, conference, and author
entities. Our evaluation showed that ComplEx was the best performing large-scale entity
embedding method that we could apply to the MAKG.
Finally, we performed a statistical analysis on key features of the enhanced MAKG. We
updated the MAKG based on our results and provided all data sets, as well as the updated
MAKG, online at https://makg.org and https://doi.org/10.5281/zenodo.4617285.
Future research could build upon our results. For author name disambiguation, we believe the results could be further improved by incorporating additional author information from other sources. For field-of-study classification, future approaches could organize our generated paper tags into a hierarchical system. For the trained entity embeddings, future research could generate embeddings at a higher dimensionality; this was not feasible here owing to the lack of efficient, scalable implementations of most algorithms. Beyond these enhancements, the MAKG should be enriched with the key content of scientific publications, such as research data sets (Färber & Lamprecht, 2022), scientific methods (Färber et al., 2021), and research contributions (Jaradeh et al., 2019b).
AUTHOR CONTRIBUTIONS
Michael Färber: Conceptualization, Data curation, Investigation, Methodology, Resources,
Supervision, Visualization, Writing—review & editing. Lin Ao: Conceptualization, Data cura-
tion, Investigation, Methodology, Resources, Software, Visualization, Writing—original draft.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
The authors did not receive any funding for this research.
DATA AVAILABILITY
We provide all generated data online to the public at https://makg.org and https://doi.org/10
.5281/zenodo.4617285 under the ODC-BY license (https://opendatacommons.org/licenses/by
/1-0/). Our code is available online at https://github.com/lin-ao/enhancing_the_makg.
REFERENCES
Ajileye, T., Motik, B., & Horrocks, I. (2021). Streaming partitioning
of RDF graphs for datalog reasoning. In Proceedings of the 18th
Extended Semantic Web Conference. https://doi.org/10.1007/978
-3-030-77385-4_1
Alzaidy, R., Caragea, C., & Giles, C. L. (2019). Bi-LSTM-CRF
sequence labeling for keyphrase extraction from scholarly docu-
ments. In Proceedings of the 28th World Wide Web Conference
(pp. 2551–2557). https://doi.org/10.1145/3308558.3313642
Baskaran, A. (2017). UNESCO science report: Towards 2030.
Institutions and Economies, 125–127.
Beel, J., Langer, S., Genzmehr, M., Gipp, B., Breitinger, C., &
Nürnberger, A. (2013). Research paper recommender system
evaluation: A quantitative literature survey. In Proceedings of
the International Workshop on Reproducibility and Replication
in Recommender Systems Evaluation (pp. 15–22). https://doi.org
/10.1145/2532508.2532512
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained
language model for scientific text. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural
Language Processing (pp. 3613–3618). https://doi.org/10.18653
/v1/D19-1371
Bordes, A., Usunier, N., García-Durán, A., Weston, J., & Yakhnenko,
O. (2013). Translating embeddings for modeling multi-relational
data. In Proceedings of the 27th Annual Conference on Neural
Information Processing Systems (pp. 2787–2795).
Brack, A., D’Souza, J., Hoppe, A., Auer, S., & Ewerth, R. (2020).
Domain-independent extraction of scientific concepts from research
articles. In Proceedings of the 42nd European Conference on IR
(pp. 251–266). https://doi.org/10.1007/978-3-030-45439-5_17
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2),
123–140. https://doi.org/10.1007/BF00058655
Caragea, C., Bulgarov, F. A., Godea, A., & Gollapalli, S. D. (2014).
Citation-enhanced keyphrase extraction from research papers: A
supervised approach. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (pp. 1435–1446).
https://doi.org/10.3115/v1/D14-1150
Caron, E., & van Eck, N. J. (2014). Large scale author name disam-
biguation using rule-based scoring and clustering. In Proceedings
of the 19th International Conference on Science and Technology
Indicators (pp. 79–86).
Che, F., Zhang, D., Tao, J., Niu, M., & Zhao, B. (2020). ParamE:
Regarding neural network parameters as relation embeddings
for knowledge graph completion. In Proceedings of the 34th
AAAI Conference on Artificial Intelligence (pp. 2774–2781).
https://doi.org/10.1609/aaai.v34i03.5665
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews
correlation coefficient (MCC) over F1 score and accuracy in binary
classification evaluation. BMC Genomics, 21(1), 6. https://doi.org
/10.1186/s12864-019-6413-7, PubMed: 31898477
Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison
of string distance metrics for name-matching tasks. In Proceedings
of IJCAI-03 Workshop on Information Integration on the Web
(pp. 73–78).
Cox, D. R., & Snell, E. J. (1989). Analysis of binary data (Vol. 32).
CRC Press.
Dai, Y., Wang, S., Xiong, N. N., & Guo, W. (2020). A survey on knowl-
edge graph embedding: Approaches, applications and benchmarks.
Electronics, 9(5). https://doi.org/10.3390/electronics9050750
Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., & Salakhutdinov,
R. (2019). Transformer-XL: Attentive language models beyond a
fixed-length context. In Proceedings of the 57th Conference of
the Association for Computational Linguistics (pp. 2978–2988).
https://doi.org/10.18653/v1/P19-1285
Daquino, M., Peroni, S., Shotton, D. M., Colavizza, G., Ghavimi,
B., … Zumstein, P. (2020). The OpenCitations Data Model. In
Proceedings of the 19th International Semantic Web Conference
(pp. 447–463). https://doi.org/10.1007/978-3-030-62466-8_28
Dettmers, T., Minervini, P., Stenetorp, P., & Riedel, S. (2018). Convo-
lutional 2D knowledge graph embeddings. In Proceedings of the
32nd AAAI Conference on Artificial Intelligence (pp. 1811–1818).
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT:
Pretraining of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (pp. 4171–4186).
Färber, M. (2019). The Microsoft Academic Knowledge Graph: A
linked data source with 8 billion triples of scholarly data. In
Proceedings of the 18th International Semantic Web Conference
(pp. 113–129). Springer. https://doi.org/10.1007/978-3-030
-30796-7_8
Färber, M. (2020). Analyzing the GitHub repositories of research
papers. In Proceedings of the ACM/ IEEE Joint Conference on
Digital Libraries (pp. 491–492). https://doi.org/10.1145/3383583
.3398578
Färber, M., Albers, A., & Schüber, F. (2021). Identifying used
methods and datasets in scientific publications. In Proceedings
of the AAAI-21 Workshop on Scientific Document Understanding
(SDU’21)@AAAI’21.
Färber, M., & Jatowt, A. (2020). Citation recommendation: Approaches
and datasets. International Journal on Digital Libraries, 21(4),
375–405. https://doi.org/10.1007/s00799-020-00288-2
Färber, M., & Lamprecht, D. (2022). The Data set knowledge graph:
Creating a linked open data source for data sets. Quantitative
Science Studies, 2(4), 1324–1355. https://doi.org/10.1162/qss_a
_00161
Färber, M., & Leisinger, A. (2021a). Datahunter: A system for
finding datasets based on scientific problem descriptions. In
Proceedings of the 15th ACM Conference on Recommender
Systems (pp. 749–752). https://doi.org/10.1145/3460231
.3478882
Färber, M., & Leisinger, A. (2021b). Recommending datasets for scientific problem descriptions. In Proceedings of the 30th
ACM International Conference on Information and Knowledge
Management. https://doi.org/10.1145/3459637.3482166
Fathalla, S., Vahdati, S., Auer, S., & Lange, C. (2017). Towards a
knowledge graph representing research findings by semantifying
survey articles. In Proceedings of the 21st International Confer-
ence on Theory and Practice of Digital Libraries (pp. 315–327).
https://doi.org/10.1007/978-3-319-67008-9_25
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage.
Journal of the American Statistical Association, 64(328), 1183–1210.
https://doi.org/10.1080/01621459.1969.10501049
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief
survey of automatic methods for author name disambiguation.
ACM SIGMOD Record, 41(2), 15–26. https://doi.org/10.1145
/2350036.2350040
Florescu, C., & Caragea, C. (2017). Positionrank: An unsupervised
approach to keyphrase extraction from scholarly documents. In
Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (pp. 1105–1115). https://doi.org/10
.18653/v1/P17-1102
Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D.,
… Barabási, A.-L. (2018). Science of science. Science, 359(6379).
https://doi.org/10.1126/science.aao0185, PubMed: 29496846
Gesese, G. A., Biswas, R., Alam, M., & Sack, H. (2019). A survey on
knowledge graph embeddings with literals: Which model links
better literal-ly? CoRR, abs/1910.12507.
Han, H., Giles, C. L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004).
Two supervised learning approaches for name disambiguation in
author citations. In Proceedings of the ACM/IEEE Joint Confer-
ence on Digital Libraries (pp. 296–305). https://doi.org/10.1145
/996350.996419
Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge problem
for large databases. ACM SIGMOD Record, 24(2), 127–138.
https://doi.org/10.1145/568271.223807
Herrmannova, D., & Knoth, P. (2016). An analysis of the Microsoft
Academic Graph. D-Lib Magazine, 22(9/10). https://doi.org/10
.1045/september2016-herrmannova
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, PubMed: 9377276
Hoffman, M. R., Ibáñez, L. D., Fryer, H., & Simperl, E. (2018). Smart
papers: Dynamic publications on the blockchain. In Proceedings
of the 15th Extended Semantic Web Conference (pp. 304–318).
https://doi.org/10.1007/978-3-319-93417-4_20
Jaradeh, M. Y., Auer, S., Prinz, M., Kovtun, V., Kismihók, G., &
Stocker, M. (2019a). Open research knowledge graph: Towards
machine actionability in scholarly communication. CoRR,
abs/1901.10816.
Jaradeh, M. Y., Oelen, A., Farfar, K. E., Prinz, M., D’Souza, J., …
Auer, S. (2019b). Open research knowledge graph: Next gener-
ation infrastructure for semantic scholarly knowledge. In Pro-
ceedings of the 10th International Conference on Knowledge
Capture (pp. 243–246). https://doi.org/10.1145/3360901
.3364435
Jaro, M. A. (1989). Advances in record-linkage methodology as
applied to matching the 1985 census of Tampa, Florida. Journal
of the American Statistical Association, 84(406), 414–420. https://
doi.org/10.1080/01621459.1989.10478785
Ji, G., He, S., Xu, L., Liu, K., & Zhao, J. (2015). Knowledge graph
embedding via dynamic mapping matrix. In Proceedings of the
53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural
Language Processing of the Asian Federation of Natural Language
Processing (pp. 687–696). https://doi.org/10.3115/v1/P15-1067
Johnson, R., Watkinson, A., & Mabe, M. (2018). The STM report: An
overview of scientific and scholarly publishing (5th ed.). The
Hague: International Association of Scientific, Technical and
Medical Publishers.
Kanakia, A., Shen, Z., Eide, D., & Wang, K. (2019). A scalable
hybrid research paper recommender system for Microsoft Aca-
demic. In Proceedings of the 28th World Wide Web Conference
(pp. 2893–2899). https://doi.org/10.1145/3308558.3313700
Kastner, S., Choi, S., & Jung, H. (2013). Author name disambig-
uation in technology trend analysis using SVM and random
forests and novel topic based features. In Proceedings of the
2013 IEEE International Conference on Green Computing and
Communications (GreenCom) and IEEE Internet of Things
(iThings) and IEEE Cyber, Physical and Social Computing
(CPSCom) (pp. 2141–2144). https://doi.org/10.1109/GreenCom
-iThings-CPSCom.2013.403
Kim, J. (2018). Evaluating author name disambiguation for digital
libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886.
https://doi.org/10.1007/s11192-018-2824-5
Kim, J. (2019). Scale-free collaboration networks: An author name
disambiguation perspective. Journal of the Association for Infor-
mation Science and Technology, 70(7), 685–700. https://doi.org
/10.1002/asi.24158
Kim, J., Kim, J., & Owen-Smith, J. (2019). Generating automatically
labeled data for author name disambiguation: An iterative clus-
tering method. Scientometrics, 118(1), 253–280. https://doi.org
/10.1007/s11192-018-2968-3
Kim, K., Khabsa, M., & Giles, C. L. (2016). Random forest
DBSCAN for USPTO inventor name disambiguation. CoRR,
abs/1602.01792.
Kim, K., Rohatgi, S., & Giles, C. L. (2019). Hybrid deep pairwise
classification for author name disambiguation. In Proceedings
of the 28th ACM International Conference on Information and
Knowledge Management (pp. 2369–2372). https://doi.org/10
.1145/3357384.3358153
Kim, S. N., Medelyan, O., Kan, M., & Baldwin, T. (2013). Auto-
matic keyphrase extraction from scientific articles. Language
Resources and Evaluation, 47(3), 723–742. https://doi.org/10
.1007/s10579-012-9210-3
Kowsari, K., Meimandi, K. J., Heidarysafa, M., Mendu, S., Barnes, L. E.,
& Brown, D. E. (2019). Text classification algorithms: A survey.
Information, 10(4), 150. https://doi.org/10.3390/info10040150
Kristiadi, A., Khan, M. A., Lukovnikov, D., Lehmann, J., & Fischer,
A. (2019). Incorporating literals into knowledge graph embed-
dings. In Proceedings of the 18th International Semantic Web
Conference (pp. 347–363). https://doi.org/10.1007/978-3-030
-30793-6_20
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R.
(2020). ALBERT: A lite BERT for self-supervised learning of
language representations. In Proceedings of the 8th International
Conference on Learning Representations (pp. 1–17).
Larsen, P. O., & von Ins, M. (2010). The rate of growth in scientific
publication and the decline in coverage provided by science
citation index. Scientometrics, 84(3), 575–603. https://doi.org
/10.1007/s11192-010-0202-z, PubMed: 20700371
Lin, X., Zhu, J., Tang, Y., Yang, F., Peng, B., & Li, W. (2017). A novel
approach for author name disambiguation using ranking confi-
dence. In Proceedings of the 2017 International Workshops on
Database Systems for Advanced Applications (pp. 169–182).
https://doi.org/10.1007/978-3-319-55705-2_13
Lin, Y., Liu, Z., Sun, M., Liu, Y., & Zhu, X. (2015). Learning entity
and relation embeddings for knowledge graph completion. In
Proceedings of the 29th AAAI Conference on Artificial Intelli-
gence (pp. 2181–2187).
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., … Stoyanov, V. (2019).
RoBERTa: A robustly optimized BERT pretraining approach.
CoRR, abs/1907.11692. Retrieved from https://arxiv.org/abs
/1907.11692
Lu, F., Cong, P., & Huang, X. (2020). Utilizing textual information in
knowledge graph embedding: A survey of methods and applica-
tions. IEEE Access, 8, 92072–92088. https://doi.org/10.1109
/ACCESS.2020.2995074
Luan, Y., He, L., Ostendorf, M., & Hajishirzi, H. (2018). Multi-task
identification of entities, relations, and coreference for scientific
knowledge graph construction. In Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Language Processing
(pp. 3219–3232). https://doi.org/10.18653/v1/D18-1360
Ma, X., Wang, R., & Zhang, Y. (2019). Author name disambiguation
in heterogeneous academic networks. In Proceedings of the 16th
International Conference on Web Information Systems and Appli-
cations (pp. 126–137). https://doi.org/10.1007/978-3-030-30952
-7_15
Maidasani, H., Namata, G., Huang, B., & Getoor, L. (2012). Entity
resolution evaluation measure (Technical Report). Retrieved from
https://web.archive.org/web/20180414024919/https://honors.cs
.umd.edu/reports/hitesh.pdf
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text.
In Proceedings of the 2004 Conference on Empirical Methods in
Natural Language Processing (pp. 404–411).
Momeni, F., & Mayr, P. (2016). Using co-authorship networks for author
name disambiguation. In Proceedings of the 16th ACM/IEEE-CS
on Joint Conference on Digital Libraries (pp. 261–262). https://doi
.org/10.1145/2910896.2925461
Müller, M. (2017). Semantic author name disambiguation with
word embeddings. In Proceedings of the 21st International Con-
ference on Theory and Practice of Digital Libraries (pp. 300–311).
https://doi.org/10.1007/978-3-319-67008-9_24
Newcombe, H. B., Kennedy, J. M., Axford, S., & James, A. P. (1959).
Automatic linkage of vital records. Science, 130(3381), 954–959.
https://doi.org/10.1126/science.130.3381.954, PubMed:
14426783
Nguyen, D. Q. (2017). An overview of embedding models of
entities and relationships for knowledge base completion. CoRR,
abs/1703.08098. Retrieved from https://arxiv.org/abs/1703
.08098
Nickel, M., Rosasco, L., & Poggio, T. A. (2016). Holographic
embeddings of knowledge graphs. In Proceedings of the 30th
AAAI Conference on Artificial Intelligence (pp. 1955–1961).
Nickel, M., Tresp, V., & Kriegel, H. (2011). A three-way model for
collective learning on multi-relational data. In Proceedings of the
28th International Conference on Machine Learning (pp. 809–816).
Noia, T. D., Mirizzi, R., Ostuni, V. C., Romito, D., & Zanker, M.
(2012). Linked open data to support content-based recommender
systems. In Proceedings of the 8th International Conference on
Semantic Systems (pp. 1–8). https://doi.org/10.1145/2362499
.2362501
OpenAIRE. (2021). OpenAIRE Research Graph. https://graph
.openaire.eu/. Accessed: June 11, 2021.
Peroni, S., Dutton, A., Gray, T., & Shotton, D. M. (2015). Setting
our bibliographic references free: Towards open citation data.
Journal of Documentation, 71(2), 253–277. https://doi.org/10
.1108/JD-12-2013-0166
Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., …
Haustein, S. (2018). The state of OA: A large-scale analysis of
the prevalence and impact of open access articles. PeerJ, 6,
e4375. https://doi.org/10.7717/peerj.4375, PubMed: 29456894
Pooja, K. M., Mondal, S., & Chandra, J. (2018). An unsupervised
heuristic based approach for author name disambiguation. In
Proceedings of the 10th International Conference on Communi-
cation Systems & Networks (pp. 540–542). https://doi.org/10
.1109/COMSNETS.2018.8328267
Pooja, K. M., Mondal, S., & Chandra, J. (2020). A graph combina-
tion with edge pruning-based approach for author name disam-
biguation. Journal of the Association for Information Science and
Technology, 71(1), 69–83. https://doi.org/10.1002/asi.24212
Portisch, J., Heist, N., & Paulheim, H. (2021). Knowledge graph
embedding for data mining vs. knowledge graph embedding for
link prediction—Two sides of the same coin? Semantic Web—
Interoperability, Usability, Applicability. https://doi.org/10.3233
/SW-212892
Protasiewicz, J., & Dadas, S. (2016). A hybrid knowledge-based
framework for author name disambiguation. In Proceedings of
the 2016 IEEE International Conference on Systems, Man, and
Cybernetics (pp. 594–600). https://doi.org/10.1109/SMC.2016
.7844305
Qian, Y., Zheng, Q., Sakai, T., Ye, J., & Liu, J. (2015). Dynamic
author name disambiguation for growing digital libraries. Infor-
mation Retrieval Journal, 18(5), 379–412. https://doi.org/10
.1007/s10791-015-9261-3
Qiu, Y. (2020). Data wrangling: Using publicly available knowledge
graphs (kgs) to construct a domain-specific kg. https://cs.anu.edu
.au/courses/CSPROJECTS/20S1/reports/u5776733_report.pdf
Quass, D., & Starkey, P. (2003). Record linkage for genealogical
databases. In Proceedings of the ACM SIGKDD 2003 Workshop
on Data Cleaning, Record Linkage, and Object Consolidation
(pp. 40–42).
Ristoski, P. (2017). Exploiting Semantic Web Knowledge Graphs in
Data Mining (Unpublished doctoral dissertation).
Ristoski, P., Rosati, J., Noia, T. D., Leone, R. D., & Paulheim, H.
(2019). RDF2Vec: RDF graph embeddings and their applications.
Semantic Web, 10(4), 721–752. https://doi.org/10.3233/SW
-180317
Roark, B., Wolf-Sonkin, L., Kirov, C., Mielke, S. J., Johny, C., … Hall,
K. B. (2020). Processing South Asian languages written in the Latin
script: The Dakshina dataset. In Proceedings of the 12th Language
Resources and Evaluation Conference (pp. 2413–2423).
Rocchio, J. J. (1971). Relevance feedback in information retrieval.
In G. Salton (Ed.), The smart retrieval system—Experiments in
automatic document processing. Englewood Cliffs, NJ: Prentice
Hall.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic
keyword extraction from individual documents. In M. W. Berry &
J. Kogan (Eds.), Text mining: Applications and theory (pp. 1–20).
John Wiley & Sons. https://doi.org/10.1002/9780470689646.ch1
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning
representations by back-propagating errors. Nature, 323(6088),
533–536. https://doi.org/10.1038/323533a0
Salatino, A. A., Thanapalasingam, T., Mannocci, A., Osborne, F., &
Motta, E. (2018). The computer science ontology: A large-scale
taxonomy of research areas. In Proceedings of the 17th Interna-
tional Semantic Web Conference (pp. 187–205). https://doi.org
/10.1007/978-3-030-00668-6_12
Schapire, R. E. (1990). The strength of weak learnability. Machine
Learning, 5, 197–227. https://doi.org/10.1007/BF00116037
Schindler, D., Zapilko, B., & Krüger, F. (2020). Investigating soft-
ware usage in the social sciences: A knowledge graph approach.
In Proceedings of the 17th Extended Semantic Web Conference
(pp. 271–286). https://doi.org/10.1007/978-3-030-49461-2_16
Schubert, T., Jäger, A., Türkeli, S., & Visentin, F. (2019). Addressing
the productivity paradox with big data. A literature review and
adaptation of the CDM econometric model. Technical Report,
Maastricht University.
Schulz, C., Mazloumian, A., Petersen, A. M., Penner, O., & Helbing,
D. (2014). Exploiting citation networks for large-scale author
name disambiguation. EPJ Data Science, 3(1), 11. https://doi.org
/10.1140/epjds/s13688-014-0011-3
Shaver, P. (2018). Science today. In The rise of science: From pre-
history to the far future (pp. 129–209). Cham: Springer Interna-
tional Publishing. https://doi.org/10.1007/978-3-319-91812-9_4
Singla, P., & Domingos, P. M. (2006). Entity resolution with Markov
logic. In Proceedings of the 6th IEEE International Conference on
Data Mining (pp. 572–582). IEEE Computer Society. https://doi
.org/10.1109/ICDM.2006.65
Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., … Wang, K. (2015).
An overview of Microsoft Academic Service (MAS) and applica-
tions. In Proceedings of the 24th International Conference on
World Wide Web Companion (pp. 243–246). https://doi.org/10
.1145/2740908.2742839
Sun, S., Zhang, H., Li, N., & Chen, Y. (2017). Name disambiguation
for Chinese scientific authors with multi-level clustering. In
Proceedings of the 2017 IEEE International Conference on
Computational Science and Engineering and IEEE International
Conference on Embedded and Ubiquitous Computing
(pp. 176–182). IEEE Computer Society. https://doi.org/10.1109
/CSE-EUC.2017.39
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008).
ArnetMiner: Extraction and mining of academic social networks.
In Proceedings of the 14th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining (pp. 990–998).
https://doi.org/10.1145/1401890.1402008
Tekles, A., & Bornmann, L. (2019). Author name disambiguation of
bibliometric data: A comparison of several unsupervised
approaches. In Proceedings of the 17th International Conference
on Scientometrics and Informetrics (pp. 1548–1559).
Tran, H. N., Huynh, T., & Do, T. (2014). Author name disambigua-
tion by using deep neural network. In Proceedings of the 6th Asian
Conference on Intelligent Information and Database Systems
(pp. 123–132). https://doi.org/10.1007/978-3-319-05476-6_13
Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., & Bouchard, G. (2016).
Complex embeddings for simple link prediction. In Proceedings
of
the 33rd International Conference on Machine Learning
(pp. 2071–2080).
Tzitzikas, Y., Pitikakis, M., Giakoumis, G., Varouha, K., & Karkanaki,
E. (2020). How can a university take its first steps in open data?
In Proceedings of the 14th Metadata and Semantics Research
Conference. https://doi.org/10.1007/978-3-030-71903-6_16
Vapnik, V., & Chervonenkis, A. Y. (1964). A class of algorithms for
pattern recognition learning. Avtomat. i Telemekh, 25(6), 937–945.
Wang, H., Wang, R., Wen, C., Li, S., Jia, Y., … Wang, X. (2020).
Author name disambiguation on heterogeneous information net-
work with adversarial representation learning. In Proceedings of
the 34th AAAI Conference on Artificial Intelligence (pp. 238–245).
https://doi.org/10.1609/aaai.v34i01.5356
Wang, J., Li, G., Yu, J. X., & Feng, J. (2011). Entity matching: How
similar is similar. Proceedings of the VLDB Endowment, 4(10),
622–633. https://doi.org/10.14778/2021017.2021020
Wang, K., Shen, Z., Huang, C., Wu, C., Eide, D., … Rogahn, R.
(2019). A review of Microsoft Academic Services for science of
science studies. Frontiers in Big Data, 2, 45. https://doi.org/10
.3389/fdata.2019.00045, PubMed: 33693368
Wang, K., Shen, Z., Huang, C., Wu, C.-H., Dong, Y., & Kanakia, A.
(2020). Microsoft Academic Graph: When experts are not
enough. Quantitative Science Studies, 1(1), 396–413. https://doi
.org/10.1162/qss_a_00021
Wang, Q., Mao, Z., Wang, B., & Guo, L. (2017). Knowledge
graph embedding: A survey of approaches and applications.
IEEE Transactions on Knowledge and Data Engineering, 29(12),
2724–2743. https://doi.org/10.1109/TKDE.2017.2754499
Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., … Wang, X. (2018).
AceKG: A large-scale knowledge graph for academic data min-
ing. In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1487–1490).
https://doi.org/10.1145/3269206.3269252
Wang, Z., Zhang, J., Feng, J., & Chen, Z. (2014). Knowledge graph
embedding by translating on hyperplanes. In Proceedings of the
28th AAAI Conference on Artificial Intelligence (pp. 1112–1119).
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G.,
Axton, M., … Mons, B. (2016). The FAIR Guiding Principles for
scientific data management and stewardship. Scientific Data, 3(1),
1–9. https://doi.org/10.1038/sdata.2016.18, PubMed: 26978244
Wilson, D. R. (2011). Beyond probabilistic record linkage: Using
neural networks and complex features to improve genealogical
record linkage. In Proceedings of the 2011 International Joint
Conference on Neural Networks (pp. 9–14). https://doi.org/10
.1109/IJCNN.2011.6033192
Winkler, W. E. (1999). The state of record linkage and current research problems. Statistical Research Division, US Census Bureau.
World Higher Education Database. (2021). https://www.whed.net/home.php.
Xu, X., Li, Y., Liptrott, M., & Bessis, N. (2018). NDFMF: An author
name disambiguation algorithm based on the fusion of multiple
features. In Proceedings of the 2018 IEEE 42nd Annual Computer
Software and Applications Conference (pp. 187–190). https://doi
.org/10.1109/COMPSAC.2018.10226
Yang, B., Yih, W., He, X., Gao, J., & Deng, L. (2015). Embedding
entities and relations for learning and inference in knowledge
bases. In Proceedings of the 3rd International Conference on
Learning Representations.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Annual Conference on Neural Information Processing Systems (pp. 5754–5764).
Zhang, S., Xinhua, E., & Pan, T. (2019). A multi-level author name
disambiguation algorithm. IEEE Access, 7, 104250–104257.
https://doi.org/10.1109/ACCESS.2019.2931592
Zhang, W., Yan, Z., & Zheng, Y. (2019). Author name disambigua-
tion using graph node embedding method. In Proceedings of the
23rd IEEE International Conference on Computer Supported
Cooperative Work in Design (pp. 410–415). https://doi.org/10
.1109/CSCWD.2019.8791898
Zheng, D., Song, X., Ma, C., Tan, Z., Ye, Z., … Karypis, G. (2020).
DGL-KE: Training knowledge graph embeddings at scale. In Pro-
ceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 739–748).
https://doi.org/10.1145/3397271.3401172