RESEARCH ARTICLE

The Microsoft Academic Knowledge Graph
enhanced: Author name disambiguation,
publication classification, and embeddings

开放访问

杂志

Karlsruhe Institute of Technology (KIT), Karlsruhe, 德国

Michael Färber

and Lin Ao

关键词: linked open data, open science, scholarly data, scientific knowledge graph

抽象的

Although several large knowledge graphs have been proposed in the scholarly field, 这样的
graphs are limited with respect to several data quality dimensions such as accuracy and
覆盖范围. 在本文中, we present methods for enhancing the Microsoft Academic Knowledge
图形 (MAKG), a recently published large-scale knowledge graph containing metadata about
scientific publications and associated authors, venues, and affiliations. Based on a qualitative
analysis of the MAKG, we address three aspects. 第一的, we adopt and evaluate unsupervised
approaches for large-scale author name disambiguation. 第二, we develop and evaluate
methods for tagging publications by their discipline and by keywords, facilitating enhanced
search and recommendation of publications and associated entities. 第三, we compute and
evaluate embeddings for all 239 million publications, 243 million authors, 49,000 journals,
和 16,000 conference entities in the MAKG based on several state-of-the-art embedding
技巧. 最后, we provide statistics for the updated MAKG. Our final MAKG is publicly
available at https://makg.org and can be used for the search or recommendation of scholarly
实体, as well as enhanced scientific impact quantification.

介绍

最近几年, knowledge graphs have been proposed and made publicly available in the
scholarly field, covering information about entities such as publications, authors, and venues.
They can be used for a variety of use cases: (1) Using the semantics encoded in the knowledge
graphs and RDF as a common data format, which allows easy data integration from different
data sources, scholarly knowledge graphs can be used for providing advanced search and rec-
ommender systems (Noia, Mirizzi et al., 2012) in academia (例如, recommending publications
(Beel, Langer et al., 2013), citations (Färber & Jatowt, 2020), and data sets (Färber & Leisinger,
2021A, 2021乙)). (2) The representation of knowledge as a graph and the interlinkage of entities
of various entity types (例如, 出版物, authors, 机构) allows us to propose novel ways to
scientific impact quantification (Färber, Albers, & Schüber, 2021). (3) If scholarly knowledge
graphs model the key content of publications, such as data sets, 方法, 索赔, and research
contributions (Jaradeh, Oelen et al., 2019乙), they can be used as a reference point for scientific
知识 (例如, 索赔) (Fathalla, Vahdati et al., 2017), similar to DBpedia and Wikidata in the
case of cross-domain knowledge. In light of the FAIR principles (Wilkinson, Dumontier et al.,
2016) and the overload of scientific information resulting from the increasing publishing rate
in the various fields (约翰逊, Watkinson, & Mabe, 2018), one can envision that researchers’

引文: Färber, M。, & Ao, L. (2022).
The Microsoft Academic Knowledge
Graph enhanced: Author name
disambiguation, 出版物
classification, and embeddings.
Quantitative Science Studies, 3(1),
51–98. https://doi.org/10.1162/qss_a
_00183

DOI:
https://doi.org/10.1162/qss_a_00183

Peer Review:
https://publons.com/publon/10.1162
/qss_a_00183

已收到: 16 六月 2021
公认: 16 十月 2021

通讯作者:
Michael Färber
michael.faerber@kit.edu

处理编辑器:
Ludo Waltman

版权: © 2022 Michael Färber and
Lin Ao. Published under a Creative
Commons Attribution 4.0 国际的
(抄送 4.0) 执照.

麻省理工学院出版社

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

working styles will change considerably over the next few decades (Hoffman, Ibáñez et al., 2018;
Jaradeh, Auer et al., 2019A) 然后, in addition to PDF documents, scientific knowledge might
be provided manually or semiautomatically via appropriate forms (Jaradeh et al., 2019乙) or auto-
matically based on information extraction on the publications’ full-texts (Färber et al., 2021).

The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019), AMiner (唐, 张
等人。, 2008), OpenCitations (Peroni, Dutton et al., 2015), AceKG (王, Yan et al., 2018),
and Open-AIRE (OpenAIRE, 2021) are examples of large domain-specific knowledge graphs
with millions or sometimes billions of facts about publications and associated entities, 例如
authors, venues, and fields of study. 此外, scholarly knowledge graphs edited by the crowd
(Jaradeh et al., 2019乙) and providing scholarly key content (Färber & Lamprecht, 2022; Jaradeh
等人。, 2019乙) have been proposed. 最后, freely available cross-domain knowledge graphs
such as Wikidata (https://wikidata.org/) provide an increasing amount of information about
the academic world, although not as systematic as the domain-specific offshoots.

The Microsoft Academic Knowledge Graph (MAKG) (Färber, 2019) was published in its first
version in 2019 and is peculiar in the sense that (1) it is one of the largest freely available schol-
arly knowledge graphs (超过 8 billion RDF triples as of September 2019), (2) it is linked to other
data sources in the Linked Open Data cloud, 和 (3) it provides metadata for entities that
are—particularly in combination—often missing in other scholarly knowledge graphs (例如,
authors, 机构, journals, fields of study, in-text citations). As of June 2020, the MAKG con-
tains metadata for more than 239 million publications from all scientific disciplines, 也
超过 1.38 billion references between publications. As outlined in Section 2.2, 自从 2019, 这
MAKG has already been used in various scenarios, such as recommender systems (Kanakia,
Shen et al., 2019), data analytics, bibliometrics, and scientific impact quantification (Färber,
2020; Färber et al., 2021; Schindler, Zapilko, & Krüger, 2020; Tzitzikas, Pitikakis et al., 2020),
as well as knowledge graph query processing optimization (Ajileye, Motik, & Horrocks, 2021).

Despite its data richness, the MAKG suffers from data quality issues arising primarily due to
the application of automatic information extraction methods from the publications (see further
章节分析 2). We highlight as major issues (1) the containment of author duplicates in
the range of hundreds of thousands, (2) the inaccurate and limited tagging (IE。, assignment) 的
publications with keywords given by the fields of study (Färber, 2019), 和 (3) the lack of
embeddings for the majority of MAKG entities, which hinders the development of machine
learning approaches based on the MAKG.

在本文中, we present methods for solving these issues and apply them to the MAKG,

resulting in an enhanced MAKG.

第一的, we perform author name disambiguation on the MAKG’s author set. 为此, 我们
adopt an unsupervised approach to author name disambiguation that uses the rich publication
representations in the MAKG and that scales for hundreds of millions of authors. We use
ORCID iDs to evaluate our approach.

第二, we develop a method for tagging all publications with fields of study and with a
newly generated set of keywords based on the publications’ abstracts. While the existing
field of study labels assigned to papers are often misleading (see Wang, 沉等人. (2019) 和
部分 4) 和, 因此, often not beneficial for search and recommender systems, the enhanced
field of study labels assigned to publications can be used, 例如, to search for and rec-
ommend publications, authors, and venues, as our evaluation results show.

第三, we create embeddings for all 239 million publications, 243 million authors, 49,000
journals, 和 16,000 conference entities in the MAKG. We experimented with various state-of-

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

the-art embedding approaches. Our evaluations show that the ComplEx embedding method
(Trouillon, Welbl et al., 2016) outperforms other embeddings in all metrics. To the best of our
知识, RDF knowledge graph embeddings have not yet been computed for such a large
(scholarly) knowledge graph. 例如, RDF2Vec (Ristoski, Rosati et al., 2019) was trained
在 17 million Wikidata entities. Even DGL-KE (郑, Song et al., 2020), a recently published
package optimized for training knowledge graph embeddings at a large scale, was evaluated
on a benchmark with only 86 million entities.

最后, we provide statistics concerning the authors, 文件, and fields of study in the
newly created MAKG. 例如, we analyze the authors’ citing behaviors, 号码
of authors per paper over time, and the distribution of fields of study using the disambiguated
author set and the new field of study assignments. We incorporate the results of all mentioned
tasks into a final knowledge graph, which we provide online to the public at https://makg.org
(以前: http://ma-graph.org) and http://doi.org/10.5281/zenodo.4617285. Thanks to the
disambiguated author set, the new paper tags, and the entity embeddings, the enhanced
MAKG opens the door to improved scholarly search and recommender systems and advanced
scientific impact quantification.

全面的, our contributions are as follows:

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

▪ We present and evaluate an approach for large-scale author name disambiguation, 哪个
can deal with the peculiarities of large knowledge graphs, such as heterogeneous entity
types and 243 million author entries.

▪ We propose and evaluate transformer-based methods for classifying publications according

to their fields of study based on the publications’ abstracts.

▪ We apply state-of-the-art entity embedding approaches to provide entity embeddings for
243 million authors, 239 million publications, 49,000 journals, 和 16,000 conferences,
and evaluate them.

▪ We provide a statistical analysis of the newly created MAKG.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

Our implementation for enhancing scholarly knowledge graphs can be found online at

https://github.com/lin-ao/enhancing_the_makg.

本文其余部分的结构如下. 在部分 2, we describe the MAKG,
along with typical application scenarios and its wide usage in the real world. We also outline
the MAKG’s limitations regarding its data quality, thereby providing our motivation for
enhancing the MAKG. 随后, in Sections 3, 4, 和 5, we describe in detail our
approaches to author name disambiguation, paper classification, and knowledge graph
embedding computation. 在部分 6, we describe the schema of the updated MAKG, infor-
mation regarding the knowledge graph provisioning and statistical key figures of the enhanced
MAKG. We provide a conclusion and give an outlook in Section 7.

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

2. OVERVIEW OF THE MICROSOFT ACADEMIC KNOWLEDGE GRAPH

2.1. Schema and Key Statistics

We can differentiate between three data sets:

1.
2.

the Microsoft Academic Graph (MAG) provided by Microsoft (Sinha, Shen et al., 2015),
the Microsoft Academic Knowledge Graph (MAKG) in its original version provided by
Färber since 2019 (Färber, 2019), 和
the enhanced MAKG outlined in this article.

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

The initial MAKG (Färber, 2019) was derived from the MAG, a database consisting of tab-
separated text files (Sinha et al., 2015). The MAKG is based on the information provided by the
MAG and enriches the content by modeling the data according to linked data principles to gen-
erate a Linked Open Data source (IE。, an RDF knowledge graph with resolvable URIs, 公众
SPARQL endpoint, and links to other data sources). During the creation of the MAKG, 数据
originating from the MAG is not modified (except for minor tasks, such as data cleaning, linking
locations to DB-pedia, and providing sameAs-links to DOI and Wikidata). 像这样, the data quality
of the MAKG is mainly equivalent to the data quality of the MAG provided by Microsoft.

桌子 1 shows the number of entities in the MAG as of May 29, 2020. 因此, 这
MAKG created from the MAG would also exhibit these numbers. This MAKG impresses with
its size: It contains the metadata for 239 million publications (包括 139 million abstracts),
243 million authors, 以及超过 1.64 billion references between publications (see also
https://makg.org/).

It is remarkable that the MAKG contains more authors than publications. The high number of
authors (243 百万) appears to be too high given that there were eight million scientists in the
world in 2013 according to UNESCO (Baskaran, 2017). For more information about the increase
in the number of scientists worldwide, we can refer to Shaver (2018). 此外, 号码
of affiliations in the MAKG (关于 26,000) appears to be relatively low, given that all research
institutions in all fields should be represented and that there exist 20,000 officially accredited
or recognized higher education institutions ( World Higher Education Database, 2021).

Compared to a previous analysis of the MAG in 2016 (Herrmannova & Knoth, 2016),
whose statistics would be identical to the MAKG counterpart if it had existed in 2016, 这
number of instances has increased for all entity types (including the number of conference
series from 1,283 到 4,468), except for the number of conference instances, which has
dropped from 50,202 到 16,142. An obvious reason for this reduction is the data cleaning pro-
cess as a part of the MAG generation at Microsoft. Although the numbers of journals, authors,
and papers have doubled in size compared to the 2016 version (Herrmannova & Knoth,
2016), the number of conference series and fields of study have nearly quadrupled.

数字 1 shows how many publications represented in the MAKG have been published per
discipline (IE。, 等级-0 field of study). 药品, materials science, and computer science

桌子 1. General statistics for MAG/MAKG entities as of June 2020

Key
文件

Papers with Link

Papers with Abstract

Authors

Affiliations

Journals

Conference Series

Conference Instances

Fields of Study

Quantitative Science Studies

# in MAG/MAKG
238,670,900

224,325,750

139,227,097

243,042,675

25,767

48,942

4,468

16,142

740,460

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

数字 1. Number of publications per discipline.

occupy the top positions. This was not always the case. According to the analysis of the MAG
在 2016 (Herrmannova & Knoth, 2016), 物理, 计算机科学, and engineering were the
disciplines with the highest numbers of publications. We assume that additional and changing
data sources of the MAG resulted in this change.

数字 2 presents the overall number of publication citations per discipline. The descending
order of the disciplines is, to a large extent, similar to the descending order of the disciplines
considering their associated publication counts (见图 1). 然而, specific disciplines,
such as biology, exhibit a large publication citation count compared to their publication count,
while the opposite is the case for disciplines such as computer science. The paper citation
count per discipline is not provided by the 2016 MAG analysis (Herrmannova & Knoth, 2016).

桌子 2 shows the frequency of instances per subclass of mag:纸, generated by means
of a SPARQL query using the MAKG SPARQL endpoint. Listing 1 shows an example of how the
MAKG can be queried using SPARQL.

2.2. Current Usage and Application Scenarios

The MAKG RDF dumps on Zenodo have been viewed almost 6,000 times and downloaded
多于 42,000 次 (as of June 15, 2021). As the RDF dumps were also available directly

数字 2. Paper citation count per discipline (IE。, 等级-0 field of study).

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 2. Number of publications by document type

Document type
杂志

Patent

会议

Book chapter

书

No type given

数字
85,759,950

52,873,589

4,702,268

2,713,052

2,143,939

90,478,102

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

Listing 1. Querying the top 100 institutions in the area of machine learning according to their
overall number of citations.

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

在https://makg.org/rdf-dumps/ (以前: http://ma-graph.org/rdf-dumps/) until January 2021,
这 21,725 访问 (since April 4, 2019) to this web page are also relevant.

数字 3, 4, 和 5 were created based on the log files of the SPARQL endpoint. They show
the number of SPARQL queries per day, the number of unique users per day, and which user
agents were used to which extent. Given these figures and a further analysis of the SPARQL
endpoint log files, the following facts are observable:

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

▪ Except for in 2 月, the number of daily requests increased steadily.
▪ The number of unique user agents remained fairly constant, apart from a period between

十月 2019 and January 2020.

▪ The frequency of more complex queries (based on query length) is increasing.

Within only one year of its publication in November 2019, the MAKG has been used in
diverse ways by various third parties. Below we list some of them based on citations of the
MAKG publication (Färber, 2019).

2.2.1.

Search and recommender systems and data analytics

▪ The MAKG has been used for recommender systems, such as paper recommendation

(Kanakia et al., 2019).

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

数字 3. Number of queries.

数字 4. Number of unique users.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

数字 5. User agents.

▪ Scholarly data is becoming increasingly important for businesses. Due to its large number
of items (例如, 出版物, 研究人员), the MAKG has been discussed as a data source
in enterprises (Schubert, Jäger et al., 2019).

▪ The MAKG has been used by nonprofit organizations for data analytics. 例如,
Nesta uses the MAKG in its business intelligence tools (see https://www.nesta.org.uk and
https://github.com/michaelfaerber/MAG2RDF/issues/1).

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

▪ As a unique data source for scholarly data, the MAKG has been used as one of several
publicly available knowledge graphs to build a custom domain-specific knowledge
graph that considers specific domains of interest (Qiu, 2020).

2.2.2. Bibliometrics and scientific impact quantification

▪ The Data Set Knowledge Graph (Färber & Lamprecht, 2022) provides information about
data sets as linked open data source and contains links to MAKG publications in which
the data sets are mentioned. Utilizing the publications’ metadata in the MAKG allows
researchers to employ novel methods for scientific impact quantification (例如, 在职的
on an “h-index” for data sets).

▪ SoftwareKG (Schindler et al., 2020) is a knowledge graph that links about 50,000 scientific
articles from the social sciences to the software mentioned in those articles. The knowl-
edge graph also contains links to other knowledge graphs, such as the MAKG. 这样,
the SoftwareKG provides the means to assess the current state of software usage.

▪ Publications modeled in the MAKG have been linked to the GitHub repositories contain-
ing the source code associated with the publications (Färber, 2020). 例如, 这
facilitates the detection of trends on the implementation level and monitoring of how
the FAIR principles are followed by which people (例如, considering who provides the
source code to the public in a reproducible way).

▪ According to Tzitzikas et al. (2020), the scholarly data of the MAKG can be used to mea-

▪

sure institutions’ research output.
In Färber et al. (2021), an approach for extracting scientific methods and data sets used
by the authors is presented. The extracted methods and data sets are linked to the pub-
lications in the MAKG enabling novel scientific impact quantification tasks (例如, 事物-
suring how often which data sets and methods have been reused by researchers) 和
recommendation of methods and data sets. 全面的, linking the key content of scientific
publications as modeled in knowledge graphs or integrating such information into the
MAKG can be considered as a natural extension of the MAKG in the future.

▪ The MAKG has inspired other researchers to use it in the context of data-driven history of
科学 (see https://www.downes.ca/post/69870), (IE。, for science of science [Fortunato,
Bergstrom et al., 2018]).

▪ Daquino, Peroni et al. (2020) present the OpenCitations data model and evaluate the

representation of citation data in several knowledge graphs, such as the MAKG.

2.2.3. Benchmarking

▪ As a very large RDF knowledge graph, the MAKG has served as a data set for evaluating

novel approaches to streaming partitioning of RDF graphs (Ajileye et al., 2021).

2.3. Current Limitations

Based on the statistical analysis of the MAKG and the analysis of the usage scenarios of the
MAKG so far, we have identified the following shortcomings:

▪ Author name disambiguation is apparently one of the most pressing needs for enhancing

the MAKG.

▪ The assigned fields of study associated with the papers in the MAKG are not accurate

(例如, 建筑学), and the field of study hierarchy is quite erroneous.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

▪ The use cases of the MAKG show that the MAKG has not been used extensively for
machine learning tasks. 迄今为止, only entity embeddings for the MAKG as of 2019 骗局-
cerning the entity type paper are available, and these have not been evaluated. 因此, 我们
perceive a need to provide state-of-the-art embeddings for the MAKG covering many
instance types, such as papers, authors, journals, and conferences.

3. AUTHOR NAME DISAMBIGUATION

3.1. 动机

The MAKG is a highly comprehensive data set containing more than 243 million author enti-
ties alone. As is the case with any large database, duplicate entries cannot be easily avoided
(王, Shen et al., 2020). When adding a new publication to the database, the maintainers
must determine whether the authors of the new paper already exist within the database or if a
new author entity is to be created. This process is highly susceptible to errors, as certain names
are common. Given a large enough sample size, it is not rare to find multiple people with
identical surnames and given names. 因此, a plain string-matching algorithm is not sufficient
for detecting duplicate authors. 桌子 3 showcases the 10 most frequently occurring author
names in the MAKG to further emphasize the issue, using the December 2019 version of
the MAKG for this analysis. All author names are of Asian origin. While it is true that romanized
Asian names are especially susceptible to causing duplicate entries within a database (Roark,
Wolf-Sonkin et al., 2020), the problem is not limited to any geographical or cultural origin and
是, 实际上, a common problem shared by Western names as well (Sun, 张等人。, 2017).

The goal of the author name disambiguation task is to identify the maximum number of
duplicate authors, while minimizing the number of “false positives”; 那是, it aims to limit the
number of authors classified as duplicates even though they are distinct persons in the real world.

在部分 3.2, we dive into the existing literature concerning author name disambiguation
和, 更普遍, entity resolution. 在部分 3.3, we define our problem formally. 在
部分 3.4, we introduce our approach, and we present our evaluation in Section 3.5. 最后,
we conclude with a discussion of our results and lessons learned in Section 3.6.

桌子 3. Most frequently occurring author names in the MAKG

Author name
Wang Wei

Zhang Wei

Li Li

Wang Jun

Li Jun

Li Wei

Wei Wang

Liu Wei

Zhang Jun

Wei Zhang

Quantitative Science Studies

Frequency
20,235

19,944

19,049

16,598

15,975

15,474

14,020

13,578

13,553

13,366

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

3.2. 相关工作

3.2.1.

Entity resolution

Entity resolution is the task of identifying and removing duplicate entries in a data set that refer
to the same real-world entity. This problem persists across many domains and, 讽刺地, 是
itself affected by duplicate names: “object identification” in computer vision, “coreference res-
olution” in natural language processing, “database merging,” “merge/purge processing,”
“deduplication,” “data alignment,” or “entity matching” in the database domain, and “entity
resolution” in the machine learning domain (Maidasani, Namata et al., 2012). The entities to
be resolved are either part of the same data set or may reside in multiple data sources.

Newcombe, Kennedy et al. (1959) were the first ones to define the entity linking problem,
which was later modeled mathematically by Fellegi and Sunter (1969). They derived a set of
formulas to determine the probabilities of two entities being “matching” based on given precon-
版本 (IE。, similarities between feature pairs). Later studies refer to the probabilistic formulas
as equivalent to a naïve Bayes classifier (Quass & Starkey, 2003; Singla & Domingos, 2006).

Generally speaking, there exist two approaches to dealing with entity resolution (王, 李
等人。, 2011). In statistics and machine learning, the task is formulated as a classification problem,
in which all pairs of entries are compared to each other and classified as matching or non-
matching by an existing classifier. In the database community, a rule-based approach is usually
used to solve the task. Rule-based approaches can often be transformed into probabilistic
classifiers, such as naïve Bayes, and require certain previous domain knowledge for its setup.

3.2.2. Author name disambiguation

Author name disambiguation is a subcategory of entity resolution and is performed on collec-
tions of authors. 桌子 4 provides an overview of papers specifically approaching the task of
author name disambiguation in the scholarly field in the last decade.

费雷拉, Gonçalves, and Laender (2012) surveyed existing methods for author name disam-
歧义. They categorized existing methods by their types of approach, such as author
grouping or author assignment methods, as well as their clustering features, such as citation
信息, web information, or implicit evidence.

Caron and van Eck (2014) applied a strict set of rules for scoring author similarities, 例如
100 points for identical email addresses. Author pairs scoring above a certain threshold are
classified as identical. Although the creation of such a rule set requires specific domain knowl-
边缘, the approach is still very simplistic in nature compared to other supervised learning
方法. 此外, it outperforms other clustering-based unsupervised approaches signif-
icantly (Tekles & Bornmann, 2019). 由于这些原因, we base our approach on the one pre-
sented in their paper.

3.3. Problem Formulation

Existing papers usually aim to introduce a new fundamental approach to author name disam-
biguation and do not focus on the general applicability of their approaches. 因此, 这些
approaches are often impractical when applied to a large data set. 例如, 一些
clustering-based approaches require the prior knowledge of the number of clusters (孙等人。,
2017) and other approaches require the pairwise comparison of all entities (Qian et al., 2015),
whereas some require external information gathered through web queries (Pooja et al., 2018),
which cannot be feasibly done when dealing with millions of entries, as the inherent bottleneck
of web requests greatly limits the speed of the overall processes. 所以, instead of choosing

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 4.

Approaches to author name disambiguation in 2011–2021

Authors
Pooja, Mondal, and Chandra (2020)

王, Wang et al. (2020)

Kim, Kim, and Owen-Smith (2019)

张, Xinhua, and Pan (2019)

Ma, 王, and Zhang (2019)

Kim, Rohatgi, and Giles (2019)

张, 严, and Zheng (2019)

张等人. (2019)

徐, 李等人. (2018)

Pooja, Mondal, and Chandra (2018)

Sun et al. (2017)

林, Zhu et al. (2017)

穆勒 (2017)

Kim, Khabsa, and Giles (2016)

Momeni and Mayr (2016)

Protasiewicz and Dadas (2016)

Qian, Zheng et al. (2015)

特兰, Huynh, and Do (2014)

Caron and van Eck (2014)

Schulz, Mazloumian et al. (2014)

Kastner, Choi, and Jung (2013)

Wilson (2011)

年
2020

2020

2019

2018

2017

2016

2015

2014

2013

2011

Approach
Graph-based combination of author similarity and topic graph

Supervised
✗

Adversarial representation learning

Matching email address, self-citation and coauthorship

with iterative clustering

Hierarchical clustering with edit distances

Graph-based approach

Deep neural network

Graph-based approach and clustering

Molecular cross clustering

Combination of single features

Rule-based clustering

Multi-level clustering

Hierarchical clustering with combination of similarity metrics

Neural network using embeddings

DBSCAN with random forest

Clustering based on coauthorship

Rule-based heuristic, linear regression, support vector

machines and AdaBoost

Support vector machines

Deep neural network

Rule-based scoring

Pairwise comparison and clustering

Random forest, support vector machines and clustering

Single layer perceptron

✓

✗

✓

✗

✓

✗

✓

✗

✓

✗

✓

a single approach, we aim to select features from different models and combine them to fit to
our target data set containing millions of author names.

We favor the use of unsupervised learning for the reasons mentioned above: lack of training
数据, lack of need for maintaining and updating of training data, and generally more favorable
time and space complexity. 因此, in our approach, we chose the hierarchical agglomerative
clustering algorithm (HAC). We formulate the problem as follows.

Given a set of n authors A = {a1, a2, a3, ……, 一个}, where ai represents an individual entry in the
数据集. 此外, each individual author ai consists of m features (IE。, ai = {ai1, ai2, ai3, ……,
目的}). aik is the kth feature of the ith author. The goal of our approach is to eliminate duplicate
entries in the data set that describe the same real-world entity, in this case the same person. 到
这结束, we introduce a matching function f which determines whether two given input entities

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

are “matching” (IE。, describe the same real-world person) or “nonmatching” (IE。, describe two
distinct people). Given an input of two authors ai and aj, the function returns the following:
(西德:1)
¼ 1 if ai and aj refer to the same real world entity i:e:; are “matching”
f ai; aj

0 if ai and aj refer to different real world entities

我:e:; are “nonmatching”

ð
ð

(西德:4)

(西德:3)

The goal of our entity resolution task is therefore to reduce the given set of authors A into a subset
Ã where ∀ai, aj 2 Ã, F (人工智能, aj) = 0.

3.4. Approach

We follow established procedures from existing research for unsupervised author name disambigua-
的 (Caron & van Eck, 2014; 费雷拉等人。, 2012) and utilize a two-part approach consisting of
pairwise similarity measurement using author and paper metadata and clustering. 此外, 我们
use blocking (参见章节 3.4) to reduce the complexity considerably. 数字 6 shows the entire
system used for the author name disambiguation process. The system’s steps are as follows:

1. 预处理. We preprocess the data by aggregating all relevant information (例如,
concerning authors, 出版物, and venues) into one single file for easier access.
We then sort our data by author name for the final input.

2. Disambiguation. We apply blocking to significantly reduce the complexity of the task.
We then use hierarchical agglomerative clustering with a rule-based binary classifier as
our distance function to group authors into distinct disambiguated clusters.

3. Postprocessing. We aggregate the output clusters into our final disambiguated author set.

以下, the most important aspects of these steps are outlined in more detail.

3.4.1.

Feature selection

We use both author and publication metadata for disambiguation. We choose the features
based on their availability in the MAKG and on their previous use in similar works from
桌子 4. 全面的, we use the following features:

▪ Author name: This is not used explicitly for disambiguation, but rather as a feature for

blocking to reduce the complexity of the overall algorithm.

▪ Affiliation: This determines whether two authors share a common affiliation.
▪ Coauthors: This determines whether two authors share common coauthors.
▪ Titles: This calculates the most frequently used keywords in each author’s published titles

in order to determine common occurrences.

数字 6. Author name disambiguation process.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

▪ Years: This compares the time frame in which authors published works.
▪

Journals and conferences: These compare the journals and conferences where each
author published.

▪ References: This determines whether two authors share common referenced publications.

Although email has proven to be a highly effective distinguishing feature for author name
disambiguation (Caron & van Eck, 2014; Kim, 2018; Schulz et al., 2014), this information is
not available to us directly and therefore omitted from our setup. Coauthorship, 在另一
手, is one of the most important features for author name disambiguation (Han, Giles et al.,
2004). Affiliation could be an important feature, though we could not rely solely on it, 作为
researchers often change their place of work. 此外, as the affiliation information is auto-
matically extracted from the publications, it might be on varying levels (例如, department vs.
university) and written in different ways (例如, full name vs. abbreviation). Journals and confer-
ences could be effective features, as many researchers tend to publish in places familiar to
他们. For a similar reason, references can be an effective measure as well.

3.4.2. Binary classifier

We adapt a rule-based binary classifier as seen in the work of Caron and van Eck (2014). 我们
choose a simple rule-based classifier because of its simplicity, interpretability, and scalability.
The unsupervised approach does not require any training data and is therefore well suited for
our situation. 此外, it is easily adapted and fine-tuned to achieve the best performance
based on our data set. Its lack of necessary training time, as well as fast run time, makes it ideal
when working with large-scale data sets containing millions of authors.

The binary classifier uses as input two feature vectors representing two author entities.
Given two authors ai, aj, each consisting of m features ai = {ai1, ai2, ai3, ……, 目的}, the similarity
模拟(人工智能, aj) between these two authors is the sum of similarities between each of their respective
features where simk is the similarity between the kth feature of two authors.

(西德:1)
sim ai; aj

(西德:3)

(西德:1)
simk aik ; ajk

(西德:3)

k¼1

The classifier then compares the similarity sim(人工智能, aj) with a predetermined threshold
θmatching to determine whether two authors are “matching” or “nonmatching.” Our classifier
function takes the following shape:

(西德:3)

(西德:1)
f ai; aj

(西德:4)
¼ 1;
0;

(西德:1)
if sim ai; aj
(西德:1)
if sim ai; aj

(西德:3)
(西德:3)

≥ θmatching
< θmatching For each feature, the similarity function consists of rule-based scoring. Below, we briefly describe how similarities between each individual feature are calculated. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u q s s / a r t i c e - p d l f / / / / / 3 1 5 1 2 0 0 8 2 8 0 q s s _ a _ 0 0 1 8 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 1. For features with one individual value, as is the case with affiliations because it does not record historical data, the classifier determines whether both entries match and assigns a fixed score saffiliation. (cid:1) simaffiliation ai; aj (cid:3) ¼ saffiliation 0 if ai;affiliation ¼ aj;affiliation else (cid:4) 2. For other features consisting of multiple values such as coauthors, the classifier deter- mines the intersection of both value sets. Here, we assign scores using a stepping func- tion (i.e., fixed scores for an intersection of one, two, three, etc.). Quantitative Science Studies 63 The Microsoft Academic Knowledge Graph enhanced The following formula represents the similarity function for calculating similarities between two authors for the feature coauthors, though the same formula holds for fea- tures journals, conferences, titles, and references with their respective values. (cid:1) simcoauthors ai; aj (cid:3) ¼ 8 >>< >>:

scoauthors1
scoauthors2
scoauthors3
0

(西德:5)
(西德:5)
(西德:5)
(西德:5)
(西德:5)
(西德:5)

if ai;coauthors ∩ aj;共同作者
if ai;coauthors ∩ aj;共同作者
if ai;coauthors ∩ aj;共同作者
别的

(西德:5)
(西德:5) ¼ 1
(西德:5)
(西德:5) ¼ 2
(西德:5)
(西德:5) ≥ 3

Papers’ titles are a special case for scoring, as they must be numericalized to allow a
比较. 理想情况下, we would use a form of word embeddings to measure the true
semantic similarity between two titles, 但, based on the results of preliminary experi-
评论, we did not find it worth doing, as the added computation necessary would be
significant and would most likely not translate directly into huge performance
增加. We therefore adapt a plain surface form string comparison. 具体来说,
we extract the top 10 most frequently used words from the tokenized and lemmatized
titles of works published by an author and calculate their intersection with the set of
another author.

3. A special case exists for the references feature. A bonus score sself-reference is applied to
the case of self-referencing, that is if two compared authors directly reference each
other in their respective works, as can be seen in the work of Caron and van Eck (2014).
4. For some features, such as journals and conferences, a large intersection between two
authors may be uncommon. We only assign a nonzero value if both items share a com-
mon value.

(西德:1)

simjournals ai; aj

(西德:3)

¼ sjournals;
0;

if ai;journals ∩ aj;journals
别的

(西德:4)

(西德:5)
(西德:5)

(西德:5)
(西德:5) ≥ 1

5. Other features such as publication year also consist of multiple values, though we inter-
pret them as extremes of a time span. Based on their feature values, we construct a time
span for each author in which they were active and check for overlap in active years
when comparing two authors (similar to Qian et al. (2015)). 再次, a fixed score is
assigned based on the binary decision. 例如, if author A published papers in
2002, 2005, 和 2009, we extrapolate the active research period for author A as
2002–2009. If another author B was active during the same time period or within 10
years of both ends of the time span (IE。, 1992–2019), we assign a score syears as the
输出. We expect most author comparisons to share an overlap in research time span
and thus receive a score of greater than zero. 所以, this feature is more aimed at
“punishing” obvious nonmatches. The scoring function takes the following shape:

if ai and aj were active within 10 years of one another
别的

(西德:4)
¼ syears
0

(西德:1)
simyears ai; aj

(西德:3)

3.4.3. Blocking

Due to the high complexity of traditional clustering algorithms (例如, 氧(n2)), there is a need to
implement a blocking mechanism to improve the scalability of the algorithm to accommodate
large amounts of input data. We implement sorted neighborhood (Hernández & Stolfo, 1995)
as a blocking mechanism. We sort authors based on their names as provided to us by the
MAKG and measure the similarity using the Jaro-Winkler distance (Jaro, 1989), as Winkler
(1999) provides good performances for name-matching tasks on top of being a fast heuristic
(科恩, Ravikumar, & Fienberg, 2003).

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

The Jaro-Winkler similarity returns values between 0 和 1, where a greater value signifies a
closer match. We choose 0.95 as the threshold θblocking, based on performance on our eval-
uation data set, and we choose 0.1 as the standard value for the scaling factor p. Similar names
will be formed into blocks where we perform pairwise comparison and cluster authors that
were classified as similar by our binary classifier.

3.4.4. Clustering

The final step of our author name disambiguation approach consists of clustering the authors.
为此, we choose the traditional hierarchical agglomerative clustering approach. 我们
generate all possible pairs between authors for each block and apply our binary classifier to
distinguish matching and nonmatching entities. We then aggregate the resulting disambigu-
ated blocks and receive the final collection of unique authors as output.

3.5. 评估

3.5.1.

Evaluation data

The MAKG contains bibliographical data on scientific publications, 研究人员, organiza-
系统蒸发散, and their relationships. We use the version published in December 2019 for evaluation,
though our final published results were performed on an updated version (with only minor
变化) from June 2020 consisiting of 243,042,675 authors.

桌子 5. Hyperparameter values for high precision setup

Hyperparameter
saffiliation

scoauthors1

scoauthors2

scoauthors3

stitles1

stitles2

stitles3

sjournals

sconferences

syears

sreferences1

sreferences2

sreferences3

sself-references

θmatching

θblocking

Quantitative Science Studies

Value
1

0.95

0.1

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

3.5.2.

Evaluation setup

For the evaluation, we use the ORCID, a persistent digital identifier for researchers, 作为一个
ground truth, following Kim (2019). ORCID have been established as a common way to
identify researchers. Although the ORCID is still in the process of being adopted, 这是
already widely used. 多于 7,000 journals already collect ORCID from authors (看
https://info.orcid.org/requiring-orcid-in-publications/). Our ORCID evaluation set consists of
69,742 author entities.

Although using ORCID as a ground truth, we are aware that this data set may be charac-
terized by imbalanced metadata. First of all, ORCID became widely adopted only a few years
前. 因此, primarily author names from publications published in recent years are considered
in our evaluation. 此外, we can assume that ORCID is more likely to be used by active
researchers with a comparatively higher number of publications and that the more publica-
tions’ metadata we have available for one author, the higher the probability is for a correct
author name disambiguation.

We set the parameters as given in Table 5. We refer to these as the high precision config-
uration. These values were chosen based on choices in other similar approaches (Caron & van
埃克, 2014) and adjusted through experimentations with our evaluation data as well as analysis
of the relevancy of each individual feature (参见章节 3.5, Evaluation Results).

We rely on the traditional metrics of precision, 记起, and accuracy for our evaluation.

3.5.3.

Evaluation results

Due to blocking, the total number of pairwise comparisons was reduced from 2,431,938,411
到 1,475. Out of them, 49 pairs were positive according to our ORCID labels (IE。, they refer to
the same real-world person); 另一个 1,426 were negative. Full classification results can be
found in Table 6. We have a heavily imbalanced evaluation set, with a majority of pairings
being negative. 尽管如此, we were able to correctly classify the majority of negative labels
(1,424 在......之外 1,426). The great number of false negative classifications is immediately notice-
有能力的. This is due to the selection of features or lack of distinguishing features overall to classify
certain difficult pairings.

We have therefore chosen to opt for a high percentage of false negatives to minimize the
amount of false positive classifications, as those are tremendously more damaging to an author
disambiguation result.

桌子 7 showcases the average scores for each feature separated into each possible category
of outcome. 例如, the average score for the feature titles from all comparisons falling
under the true positive class was 0.162, and the average score for the feature years for compar-
isons from the true negative class was 2.899. Based on these results, journals and references play
a significant role in identifying duplicate author entities within the MAKG; 那是, they contribute
high scores for true positives and true negatives. Every single author pair from the true positive

桌子 6. Diffusion matrix of high-precision setup

Positive classification

Negative classification

全部的

Positive label
37

Negative label

1,424

1,426

全部的
39

1,436

1,475

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 7. Average disambiguation score per feature for high precision setup (TP = True Positive;
TN = True Negative; FP = False Positive; FN = False Negative)

saffiliation

scoauthors

stitles

syears

sjournals

sconferences

sself-reference

sreferences

TP
0.0

0.0

0.162

3.0

0.0

2.027

TN
0.004

0.0

2.89

0.034

2.823

0.0

0.023

FP
0.0

0.0

3.0

0.0

2.0

FN
0.083

0.0

0.25

3.0

1.75

3.0

0.0

0.167

classification cluster shared a common journal value, whereas almost none from the true neg-
ative class did. Similar observations can be made for the feature references as well.

Our current setup results in a precision of 0.949, recall of 0.755 and an accuracy of 0.991.

By varying the scores assigned by each feature level distance function, we can affect the

focus of the entire system from achieving a high level of precision to a high level of recall.

To improve our relatively poor recall value, we have experimented with different setups for
distance scores. At high performance levels, a tradeoff persists between precision and recall.
By applying changes to score assignment as seen in Table 8, we arrive at the results in Table 9.

桌子 8. Updated disambiguation scores for high recall setup

High precision
1

High recall
5

saffiliation

scoauthors,1

scoauthors,2

scoauthors,3

stitles,1

stitles,2

stitles,3

syears

sjournals

sconferences

sself-references

sreferences,1

sreferences,2

sreferences,3

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 9. Diffusion matrix for high recall setup

Positive classification

Negative classification

全部的

Positive label
45

Negative label

1,413

1,426

全部的
58

1,417

1,475

We were able to increase the recall from 0.755 到 0.918. 同时, our precision
plummeted from the original 0.949 到 0.776. 因此, the accuracy stayed at a similar level
的 0.988. The exact diffusion matrix can be found in Table 9. With our new setup, we were
able to identify the majority of all duplicates (45 在......之外 49), though at the cost of a significant
increase in the number of false positives (从 2 到 13). By further analyzing the exact reason-
ing behind each type of classification through analysis of individual feature scores in Table 10,
we can see that the true positive and false positive classifications result from the same feature
similarities, therefore creating a theoretical upper limit to the performance of our specific
approach and data set. We hypothesize that additional external data may be necessary to
exceed this upper limit of performance.

We must consider the heavily imbalanced nature of our classification labels when evalu-
ating the results to avoid falling into the trap of the “high accuracy paradox”: 那是, the result-
ing high accuracy score of a model on highly imbalanced data sets, where negative labels
significantly outnumber positive labels. The model’s favorable ability to predict the true neg-
atives outweighs its shortcomings for identifying the few positive labels.

最终, we decided to use the high-precision setup to create the final knowledge graph,
as precision is a much more meaningful metric for author name disambiguation as opposed to
记起. It is often preferable to avoid removing nonduplicate entities rather than identifying all
duplicates at the cost of false positives.

We also analyzed the average feature density per author in the MAKG and the ORCID eval-
uation data set to gain deeper insight into the validity of our results. Feature density here refers
to the average number of data entries within an individual feature, such as the number of
papers for the feature “published papers.” The results can be found in Table 11.

桌子 10. Average disambiguation score per feature for the high recall setup (TP = True Positive;
TN = True Negative; FP = False Positive; FN = False Negative). As we consider the scores for
disambiguation and not the confusion matrix for the classification, values can be greater than 1.

score_affiliation

score_coauthors

score_titles

score_years

score_journals

score_conferences

score_self-reference

score_references

TP
0.111

0.0

0.133

3.0

3.911

4.0

0.0

1.667

TN
0.004

0.0

2.89

0.023

3.762

0.0

0.023

FP
1.538

0.0

3.0

3.077

4.0

0.0

0.308

FN
0.0

0.0

0.75

3.0

0.0

4.0

0.0

0.5

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 11. Comparison between the overall MAKG and the evaluation set

AuthorID

Rank

NormalizedName

DisplayName

LastKnownAffiliationID

PaperCount

CitationCount

CreateDate

PaperID

DOI

Coauthors

Titles

年

杂志

会议

参考

ORCID

MAKG
1.0

1.0

1.003

0.172

1.0

2.612

1.240

11.187

2.620

1.528

0.698

0.041

20.530

0.0003

评估
1.0

1.0

0.530

1.0

1.196

1.0

4.992

1.198

1.107

0.819

0.025

26.590

1.0

As we can observe, there is a variation in “feature richness” between the evaluation set and
the overall data set. 然而, for the most important features used for disambiguation—
namely journals, conferences, and references—the difference is not as pronounced. 所以,
we can assume that the disambiguation results will not be strongly affected by this variation.

Performing our author name disambiguation approach on the whole MAKG containing
243,042,675 authors (MAKG version from June 2020) resulted in a reduced set of
151,355,324 authors. This is a reduction by 37.7% and shows that applying author name dis-
ambiguation is highly beneficial.

重要的, we introduced a maximum block size of 500 in our final approach. Without it,
the number of authors grouped into the same block would theoretically be unlimited. 这
introduction of a limit to block size further improves performance significantly, reducing the
runtime from over a week down to about 48 小时, using an Intel Xeon E5-2660 v4 processor
和 128 GB of RAM. We have therefore opted to keep the limit, as the tradeoff in performance
decrease is manageable and as we aimed to provide an approach for real application rather
than a proof of concept. 然而, the limit can be easily removed or adjusted.

3.6. 讨论

Due to the high number of authors with identical names within the MAG and, 因此, the MAKG,
our blocking algorithm sometimes still generates large blocks with more than 20,000 authors.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

The number of pairwise classifications necessary equates to the number of combinations,

即

, leading to high computational complexity for larger block sizes. One way of

(西德:6) (西德:7)
n
2

dealing with this issue would be to manually limit the maximum number of entities within one
block, as we have done. Doing so will split potential duplicate entities into distinct blocks,
meaning they will never be subject to comparison by the binary classifier, although the entire
process may be sped up significantly depending on the exact size limit selected. To highlight the
challenge, 桌子 12 showcases the author names with the largest block sizes created by our
blocking algorithm (IE。, author names generating the most complexity). 差异在于
total comparisons for the name block of “Wang Wei” would be 204,717,495 comparisons

(total comparisons for 20,235 authors with no block size limit:

= 204,717,495)

(西德:6)

(西德:7)

20; 235
2

with no block size limit, 相比 5,017,495 comparisons (total comparisons for 20,235

(西德:6)

(西德:7)

(西德:6)

(西德:7)

authors with a block size limit of 500: 40 ×

500
2

235
2

= 5,017,495) for a block limit

的 500 authors. We have found the difference in performance to be negligible compared to
the total amount of duplicate authors found, as it differs by less than 2 million authors compared
to the almost 100 million duplicate authors found.

Our approach can be further optimized through hand-crafted rules for dealing with certain
author names. Names of certain origins, such as Chinese or Korean names, possess certain
nuances. While the alphabetized Romanized forms of two Chinese names may be similar
or identical, the original language text often shows a distinct difference. 此外, 在下面-
standing the composition of surnames and given names in this case may also help further
reduce the complexity. 举个例子, the names “Zhang Lei” and “Zhang Wei” only differ
by a single character in their Romanized forms and would be classified as potential dupli-
cates or typos due to their similarity, even though for native Chinese speakers such names
signify two distinctly separate names, especially when written in the original Chinese charac-
ter form. Chinese research publications have risen in number in recent years (Johnson et al.,
2018). Given their susceptibility to creating duplicate entries as well as their significant

桌子 12.

Largest author name blocks during disambiguation

Author name
Wang Wei

Zhang Wei

Li Li

Wang Jun

Li Jun

Li Wei

Wei Wang

Liu Wei

Zhang Jun

Wei Zhang

Quantitative Science Studies

Block size
20,235

19,944

19,049

16,598

15,975

15,474

14,020

13,580

13,553

13,366

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

presence in the MAKG already, future researchers might be well suited to isolate this problem
as a focal point.

此外, there is the possibility to apply multiple classifiers and combine their results in
a hybrid approach. If we were able to generate training data of sufficient volume and quality,
we would be able to apply certain supervised learning approaches, such as neural networks or
support vector machines using our generate feature vectors as input.

4. FIELD OF STUDY CLASSIFICATION

4.1. 动机

Publications modeled in the MAKG are assigned to specific fields of study. 此外, 这
fields of study are organized in a hierarchy. In the MAKG as of June 2020, 709,940 fields of
study are organized in a multilevel hierarchical system (见表 13). Both the field of study
paper assignments and the field of study hierarchy in the MAKG originate from the MAG data
provided by Microsoft Research. The entire classification scheme is highly comprehensive and
covers a huge variety of research areas, but the labeling of papers contains many shortcom-
英格斯. 因此, the second task in this article for improving the MAKG is the revision of field of
study assignment of individual papers.

Many of the higher-level fields of study in the hierarchical system are highly specific, 和
therefore lead to many misclassifications purely based on certain matching keywords in the
paper’s textual information. 例如, papers on the topic of machine learning architecture
are sometimes classified as “Architecture.” Because the MAG does not contain any full texts of
文件, but is limited to the titles and abstracts only, we do not believe that the information
provided in the MAG is comprehensive enough for effective classification on such a sophis-
ticated level.

On top of that, an organized structure is highly rigid and difficult to change. When intro-
ducing a previously unincorporated field of study, we have to not only modify the entire clas-
sification scheme, but ideally also relabel all papers in case some fall under the new label.

We believe the underlying problem to be the complexity of the entire classification scheme.
We aim to create a simpler structure that is extendable. Our idea is not aimed at replacing the
existing structure and field of study labels, but rather enhancing and extending the current system.
Instead of limiting each paper to being part of a comprehensive structured system, 我们 (1) merely
assign a single field of study label at the top level (also called “discipline” in the following, 等级 0
in the MAKG), such as computer science, 物理, or mathematics. We then (2) assign to each

桌子 13. Overview of MAG field of study hierarchy

等级
0

# of fields of study

292

138,192

208,368

135,913

167,676

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

publication a list of keywords (IE。, tags), which are used to describe the publication in further
细节. Our system is therefore essentially descriptive in nature rather than restrictive.

Compared to the classification scheme of the original MAKG and the MAG so far, our pro-
posed system is more fluid and extendable as its labels or tags are not constrained to a rigid
等级制度. New concepts are freely introduced without affecting existing labels.

Our idea therefore is to classify papers on a basic level, then extract keywords in the form of
tags for each paper. These can be used to describe the content of a specific work, while leav-
ing the structuring of concepts to domain experts in each field. We classify papers into their
respective fields of study using a transformer-based classifier and generate tags for papers using
keyword extraction from the publications’ abstracts.

在部分 4.2, we introduce related work concerning text classification and tagging. 我们
describe our approach in Section 4.3. 在部分 4.4, we present our evaluation of existing
field of study labels, the MAKG field of study hierarchy, and the newly created field of study
labels. 最后, we discuss our findings in Section 4.5.

4.2. 相关工作

4.2.1. Text classification

The tagging of papers based on their abstracts can be regarded as a text classification task. Text
classification aims to categorize given texts into distinct subgroups according to predefined
特征. As with any classification task, text classification can be separated into binary,
multilabel, and multiclass classification.

Kowsari, Meimandi et al. (2019) provide a recent survey of text classification approaches.
Traditional approaches include techniques such as the Rocchio algorithm (Rocchio, 1971),
boosting (Schapire, 1990), bagging (布赖曼, 1996), and logistic regression (考克斯 & Snell,
1989), as well as naïve Bayes. Clustering-based approaches include k-nearest neighbor and
support vector machines (Vapnik & Chervonenkis, 1964). More recent approaches mostly uti-
lize deep learning. Recurrent neural networks (Rumelhart, 欣顿, & 威廉姆斯, 1986) and long
short-term memory networks (LSTMs) (Hochreiter & 施米德胡贝尔, 1997) had been the pre-
dominant approaches for representing language and solving language-related tasks until the
rise of transformer-based models.

Transformer-based models can be generally separated into autoregressive and autoencod-
ing models. Autoregressive models such as Transformer-XL (Dai, 杨等人。, 2019) learn rep-
resentations for individual word tokens sequentially, whereas autoencoding models such as
BERT (Devlin, Chang et al., 2019) are able to learn representations in parallel using the entirety
of the document, even words found after the word token. Newer autoregressive models such
as XLNet (哪个, Dai et al., 2019) combine features from both categories and are able to
achieve state-of-the-art performance. 此外, other variants of the BERT model exist, 这样的
as ALBERT (兰, 陈等人。, 2020) and RoBERTa (刘, Ott et al., 2019). 此外, special-
ized BERT variants have been created. One such variant is SciBERT (Beltagy, Lo, & Cohan,
2019), which specializes in academic texts.

4.2.2. Tagging

Tagging—based on extracting the tags from a text—can be considered synonymous with key-
word extraction. To extract keywords from publications’ full texts, several approaches and
challenges have been proposed (Alzaidy, Caragea, & 贾尔斯, 2019; Florescu & Caragea,
2017; Kim, Medelyan et al., 2013), exploiting publications’ structures, such as citation

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

网络 (Caragea, Bulgarov et al., 2014). In our scenario, we use publications’ abstracts, 作为
the full texts are not available in the MAKG. 此外, we focus on keyphrase extraction
methods requiring no additional background information and not designed for specific tasks,
such as text summarization.

TextRank (Mihalcea & Tarau, 2004) is a graph-based ranking model for text processing. 它
performs well for tasks such as keyword extraction as it does not rely on local context to deter-
mine the importance of a word, but rather uses the entire context through a graph. For every
input text, the algorithm splits the input into fundamental units (words or phrases depending on
the task) and structures them into a graph. Afterwards, an algorithm similar to PageRank deter-
mines the relevance of each word or phrase to extract the most important ones.

Another popular algorithm for keyword extraction is RAKE, which stands for rapid auto-
matic keyword extraction (Rose, Engel et al., 2010). In RAKE, the text is split by a previously
defined list of keywords. 因此, a less comprehensive list would lead to longer phrases. 在骗子-
特拉斯特, TextRank splits the text into individual words first and combines words which benefit
from each other’s context at a later stage in the algorithm. 全面的, RAKE is more suitable
for text summarization tasks due to its longer extracted key phrases, whereas TextRank is suit-
able for extracting shorter keywords used for tagging, in line with our task. In their original
出版物, the authors of TextRank applied their algorithm for keyword extraction from pub-
lications’ abstracts. Due to all these reasons, we use TextRank for publication tagging.

4.3. Approach

Our approach is to fine-tune a state-of-the-art transformer model for the task of text classifica-
的. We use the given publications’ abstracts as input to classify each paper into one of 19
top-level field of study labels (IE。, 等级 0) predefined by the MAG (见表 11). After that, 我们
apply TextRank to extract keyphrases and assign them to papers.

4.4. 评估

4.4.1.

Evaluation data

For the evaluation, we produce three labeled data sets in an automatic fashion. Two of the data
sets are used to evaluate the current field of study labels in the MAKG (and MAG) 和
given MAKG field of study hierarchy, while the last data set acts as our source for training
and evaluating our approach for the field of study classification.

In the following, we describe our approaches for generating our three data sets.

1. For our first data set, we select field of study labels directly from the MAKG. 作为男人-
tioned previously, the MAKG’s fields of study are provided in a hierarchical structure
(IE。, fields of study, such as research topics) can have several fields of study below
他们. We filter the field of study labels associated with papers for level-0 labels only;
那是, we consider only the 19 top-level labels and their assignments to papers.
桌子 14 lists all 19 等级-0 fields of study in the MAKG; 这些, associated with the
文件, are also our 19 target labels for our classifier. This data set will be representative
of the field of study assignment quality of the MAKG overall as we compare its field of
study labels with our ground truth (参见章节 4.4).

2. For our second data set, we extrapolate field of study labels from the MAKG/MAG using
the field of study hierarchy—that is, we relabel the papers using their associated
top-level fields of study on level 0. 例如, if a paper is currently labeled as
“neural network,” we identify its associated level-0 field of study (the top-level field

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 14.

List of level-0 fields of study from the MAG

MAG ID
41008148

86803240

17744445

192562407

205649164

185592680

162324750

33923547

127313418

127413603

121332964

144024400

144133560

71924100

15744967

142362112

95457728

138885662

39432304

Field of study
计算机科学

生物学

政治学

Materials Science

Geography

Chemistry

经济学

Mathematics

Geology

Engineering

Physics

Sociology

商业

药品

心理学

Art

历史

Philosophy

Environmental Science

of study in the MAKG). 在这种情况下, the paper would be assigned the field of study of
“computer science.”
We prepare our data set by first replacing all field of study labels using their respective
top-level fields of study. Each field of study assignment in the MAKG has a correspond-
ing confidence score. We thus sort all labels by their corresponding level-0 fields of
study and calculate the final field of study of a given paper by summarizing their indi-
vidual scores. 例如, consider a paper that originally has the field of study labels
“neural network” with a confidence score of 0.6, “convolutional neural network” with a
confidence score of 0.5, and “graph theory” with a confidence score of 0.8. The labels
“neural network” and “convolutional neural network” are mapped back to the top-level
field of study of “computer science,” whereas “graph theory” is mapped back to “math-
ematics.” To calculate the final score for each discipline, we totaled the weights of every
occurrence of a given label. In our example, “computer science” would have a score of
0.5 + 0.6 = 1.1, and “mathematics” a score of 0.8, resulting in the paper being labeled
as “computer science.”
This approach can be interpreted as an addition of weights on the direct labels we gen-
erated for our previous approach. By analyzing the differences in results from these two

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

data sets, we aim to gather some insights into the validity of the hierarchical structure of
the fields of study found in the MAG.

3. Our third data set is created by utilizing the papers’ journal information. We first select a
specific set of journals from the MAKG for which the journal papers’ fields of study can
easily be identified. This is achieved through simple string matching between the names
of top-level fields of study and the names of journals. 例如, if the phrase “com-
puter science” occurs in the name of a journal, we assume it publishes papers in the
field of computer science.
We expect the data generated by this approach to be highly accurate, as the journal is
an identifying factor of the field of study. We cannot rely on this approach to match all
papers from the MAKG, as a majority of papers were published in journals whose main
disciplines could not be discerned directly from their names. 还, there exist a portion
of papers that do not have any associated journal entries in the MAKG.
We are able to label 2,553 journals in this fashion. We then label all 2,863,258 文件
from these given journals using their journal-level field of study labels. We use the
resulting data set to evaluate the fields of study in the MAKG as well as to generate
training data for the classifier.
In the latter case, we randomly selected 20,000 abstracts per field of study label, resulting
在 333,455 training samples (IE。, paper–field-of-study assignment pairs). The mismatch
compared to the theoretical training data size of 380,000 comes from the fact that
some labels had fewer than 20,000 papers available to select from.

Our data for evaluating the classifier comes from our third approach, namely the field of
study assignment based on journal names. We randomly drew 2,000 samples for each label
from the labeled set to form our test data set. Note that the test set does not overlap in any way
with the training data set generated through the same approach, as both consist of distinctly
separate samples (covering all scientific disciplines). 总共, the evaluation set consists of
38,000 samples spread over the 19 学科.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

4.4.2.

Evaluation setup

All our implementations use the Python module Simple Transformers (https://github.com
/ ThilinaRajapakse/simpletransformers; based on Transformers, https://github.com
/huggingface/transformers), which provides a ready-made implementation of transformer-
based models for the task of multiclass classification. We set the number of output classes
到 19, corresponding to the number of top-level fields of study we are trying to label. 作为
mentioned in Section 4.4.1, we prepare our evaluation data set based on labels generated
via journal names. We also prepare out training set from the same data set.

We choose the following model variants for each architecture:

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

scibert_scivocab_uncased for SciBERT,

1. bert-large-uncased for BERT,
2.
3. albert-base-v2 for ALBERT,
4.
5. xlnet-large-cased for XLNet.

roberta-large for RoBERTa, 和

All transformer models were trained on the bwUnicluster using GPU nodes containing four

Nvidia Tesla V100 GPUs and an Intel Xeon Gold 6230 processor.

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

4.4.3.

Evaluation metrics

We evaluate our model performances using two specific metrics: the micro-F1 score and Math-
ews correlation coefficient.

The micro-F1 as an extension to the F1 score is calculated as follows:

micro‐F1 ¼

磷

true positives
磷

true positives

false positives

The micro-F1 score is herein identical to microprecision, microrecall, and accuracy; though it
does not take the distribution of classes into consideration, that aspect is irrelevant for our
案件, as all our target labels have an equal number of samples and are therefore identically
weighted.

The Matthews correlation coefficient (MCC), also known as the phi coefficient, is another
standard metric used for multiclass classifications. It is often preferred for binary classification
or multiclass classification with unevenly distributed class sizes. The MCC only achieves high
values if all four classes of the diffusion matrix are classified accurately, and is therefore
preferred for evaluating unbalanced data sets (Chicco & Jurman, 2020). Even though our
evaluation set is balanced, we nevertheless provide MCC as an alternative metric. 这
MCC is calculated as follows:

MCC ¼

TP (西德:2) TN − FP (西德:2) FN
p
ﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
ð
Þ
Þ (西德:2) TN þ FN
ð
TP þ FP

Þ (西德:2) TP þ FN

Þ (西德:2) TN þ FP

with TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives.

4.4.4.

Evaluation results

Evaluation of existing field of study labels

In the following, we outline our evaluation
4.4.4.1.
concerning the validity of the existing MAG field of study labels. We take our two labeled sets
generated by our direct labeling (first data set; 2,863,258 文件) as well as labeling through
journal names (third data set) and compare the associated labels on level 0.

As we can see from the results in Table 15, the quality of top-level labels in the MAG can be
改进的. Out of the 2,863,258 文件, 1,595,579 matching labels were found, correspond-
ing to a 55.73% 匹配, 意义 55.73% of fields of study were labeled correctly according to
our ground truth. 桌子 15 also showcases an in-depth view of the quality of labels for each
discipline. We show the total number of papers for each field of study and the number of
papers that are correctly classified according to our ground truth, followed by the percentage.

Evaluation of MAKG field of study hierarchy To determine the validity of the existing
4.4.4.2.
field of study hierarchy, we compare the indirectly labeled data set (second data set) 和
our ground truth based on journal names (third data set). The indirectly labeled data set is
labeled using inferred information based on the overall MAKG field of study hierarchy (看
部分 4.4.1). 这里, we want to examine the effect the hierarchical structure would have
on the truthfulness of field of study labels. The results can be found in Table 16.

Our result based on this approach is very similar to the previous evaluation. Out of the
2,863,258 文件, we found 1,514,840 labels matching those based on journal names, 结果-
ing in a 52.91% 匹配 (相比 55.73% in the previous evaluation). Including the MAKG
field of study hierarchy did not improve the quality of labels. For many disciplines, 号码
of mislabelings increased significantly, further devaluing the quality of existing MAG labels.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 15.

Evaluation results of existing field of study labels

Label
计算机科学

生物学

政治学

Materials Science

Geography

Chemistry

经济学

Mathematics

Geology

Engineering

Physics

Sociology

商业

药品

心理学

Art

历史

Philosophy

Environm. 科学

# labels

21,157

212,356

12,043

23,561

4,286

339,501

91,411

109,797

22600

731,505

694,631

10,725

141,498

311,197

36,080

23,728

39,938

19,517

17,727

# matching
15,056

132,203

4,083

18,475

575

285,569

62,482

92,519

18,377

187,807

500,723

9,245

33,641

186,194

31,834

4,336

5,161

6,363

936

全部的

2,863,258

1,595,579

% matching
71.163

62.255

33.904

78.413

13.416

84.114

68.353

84.264

81.314

25.674

72.085

86.200

23.775

59.832

88.232

18.274

12.923

32.602

5.280

55.726

Evaluation of classification In the following, we evaluate the newly created field of

4.4.4.3.
study labels for papers determined by our transformer-based classifiers.

We first analyze the effect of training size on the overall results. Although we observe a
steady increase in performance with each increase in size of our training set, the marginal
increment deteriorates after a certain value. 所以, with training time in mind, we decided
to limit the training input size to 20,000 samples per label, leading a theoretical training data
大小 390,000 样品. The number is slightly smaller in reality, 然而, due to certain
labels having fewer than 20,000 training samples in total.

We then compared the performances of various transformer-based models for our task.
桌子 17 shows performances of our models trained on the same training set after one epoch.
As we can see, SciBERT and BERTbase outperform other models significantly, with SciBERT
slightly edging ahead in comparison. 出奇, the larger BERT variant performs signifi-
cantly worse than its smaller counterpart.

We then compare the effect of training epochs on performance. We limit our comparison to
the SciBERT model in this case. We choose SciBERT as it achieves the best performance after

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 16.

Evaluation results of the field of study hierarchy

Label
计算机科学

生物学

政治学

Materials Science

Geography

Chemistry

经济学

Mathematics

Geology

Engineering

Physics

Sociology

商业

药品

心理学

Art

历史

Philosophy

Environm. 科学

# labels

21,157

212,356

12,043

23,561

4,286

339,501

91,411

109,797

22,600

731,505

694,631

10,725

141,498

311,197

36,080

23,728

39,938

19,517

17,727

# matching
13,055

145,671

8,035

13,618

285

239,576

62,025

79,959

15,777

207,063

464,083

4,418

26,095

192,397

25,548

4,901

3,391

8,641

302

全部的

2,863,258

1,514,840

% matching
61.705

68.598

66.719

57.799

6.650

70.567

67.853

72.824

69.810

28.306

66.810

41.193

18.442

61.825

70.809

20.655

8.491

44.274

1.704

52.906

one epoch of training. We fine-tune the same SciBERT model using an identical training set
(20,000 samples per label) as well as the same evaluation set. We observe a peak in perfor-
mance after two epochs (见表 18). Although performance for certain individual labels
keeps improving steadily afterward, the overall performance starts to deteriorate. 所以,

桌子 17.

Result comparison of various transformer-based classifiers

模型
BERTbase

BERTlarge

SciBERT

阿尔伯特

RoBERTa

XLNet

MCC
0.7452

0.6853

0.7552

0.7037

0.7170

0.6755

F1-score
0.7584

0.7014

0.7678

0.7188

0.7316

0.6920

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 18.

Comparison between various number of training epochs

# of epoch
1

MCC
0.7552

0.7708

0.7665

0.7615

0.7558

F1-score
0.7678

0.7826

0.7787

0.7739

0.7685

training was stopped after two epochs for our final classifier. Note that we have performed
similar analysis with some other models in a limited fashion as well. The best performance
was generally achieved after two or three epochs, depending on the model.

桌子 19 showcases the performance per label for our SciBERT model after two training
epochs on the evaluation set. 平均而言, the classifier achieves an macro average F1-score

桌子 19. Detailed evaluation results per label

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

Label
计算机科学

Precision
0.77

Recall
0.83

生物学

政治学

Materials Science

Geography

Chemistry

经济学

Mathematics

Geology

Engineering

Physics

Sociology

商业

药品

心理学

Art

历史

Philosophy

Environm. 科学

Macro average

0.83

0.78

0.96

0.79

0.66

0.79

0.90

0.58

0.84

0.81

0.65

0.84

0.85

0.68

0.70

0.81

0.79

0.78

0.84

0.81

0.83

0.67

0.80

0.68

0.81

0.94

0.49

0.81

0.70

0.69

0.84

0.89

0.76

0.75

0.81

0.86

0.78

F1
0.80

0.84

0.82

0.80

0.79

0.80

0.67

0.80

0.92

0.53

0.83

0.75

0.67

0.84

0.87

0.72

0.81

0.82

0.78

# 样品
2,000

2,000

38,000

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

的 0.78. In the detailed results for each label, we highlighted labels that achieved scores one
standard deviation above and below the average.

Classification performances for the majority of labels are similar to the overall average,

though some outliers can be found.

全面的, the setup is especially adept at classifying papers from the fields of geology (0.94),
心理学 (0.87), medicine (0.84), and biology (0.84); whereas it performs the worst for engi-
neering (0.53), 经济学 (0.67), 和商业 (0.67). The values in parentheses are the
respective F1-scores achieved during classification.

We suspect the performance differences to be a result of the breadth of vocabularies used in
each discipline. Disciplines for which the classifier performs well usually use highly specific
and technical vocabularies. Engineering especially follows this assumption, as engineering is
an agglomeration of a multitude of disciplines, such as physics, 化学, 生物学, 并且会
encompass their respective vocabularies as well.

4.4.5. Keyword extraction

As outlined in Section 4.3, we apply TextRank to extract keywords from text and assign them
to publications. We use “pytextrank” (https://github.com/DerwenAI/pytextrank/), a Python
implementation of the TextRank algorithm, as our keyword extractor. Due to the generally
smaller text size of an abstract, we limit the number of keywords/key phrases to five. A greater
number of keywords would inevitably introduce additional “filler phrases,” which are not con-
ducive for representing the content of a given abstract. Further statistics about the keywords
are given in Section 6.

4.5. 讨论

In the following, we discuss certain challenges faced, lessons learned, and future outlooks.

Our classification approach relied on the existing top-level fields of study (等级-0) found in
the MAKG. 反而, we could have established an entirely new selection of disciplines as our
label set. It is also possible to adapt an established classification scheme, such as the ACM
Computing Classification System (https://dl.acm.org/ccs) or the Computer Science Ontology
(Salatino, Thanapalasingam et al., 2018). 然而, to the best of our knowledge, 有
not an equivalent classification scheme covering the entirety of research topics found in the
MAKG, which was a major factor leading us to adapt the field of study system.

Regarding keyword extraction, grouping of extracted keywords and key phrases and building
a taxonomy or ontology are natural continuations of the work. We suggest categories to be
constructed on an individual discipline level, rather than having a fixed category scheme
for all possible fields of study. 例如, within the discipline of computer science, 我们
could try to categorize tasks, data sets, approaches and so forth from the list of extracted key-
字. Brack, D’Souza et al. (2020) and Färber et al. (2021) recently published such an entity
recognition approach. Both have also adapted the SciBERT architecture to extract scientific
concepts from paper abstracts.

Future researchers can expand our extracted tags by enriching them with additional rela-
tionships to recreate a similar structure to the current MAKG field of study hierarchy.
Approaches such as the Scientific Information Extractor (Luan, He et al., 2018) 可能
applied to categorize or to establish relationships between keywords, building an ontology
or rich knowledge graph.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

5. KNOWLEDGE GRAPH EMBEDDINGS

5.1. 动机

Embeddings provide an implicit knowledge representation for otherwise symbolic information.
They are often used to represent concepts in a fixed low-dimensional space. 传统上,
embeddings are used in the field of natural language processing to represent vocabularies,
allowing computer models to capture the context of words and, 因此, the contextual meaning.

Knowledge graph embeddings follow a similar principle, in which the vocabulary consists
of entities and relation types. The final embedding encompasses the relationships between
specific entities but also generalizes relations for entities of similar types. The embeddings
retain the structure and relationships of information from the original knowledge graph and
facilitate a series of tasks, such as knowledge graph completion, relation extraction, entity clas-
sification, question answering, and entity resolution (王, Mao et al., 2017).

Färber (2019) published pretrained embeddings for MAKG publications using RDF2Vec
(Ristoski, 2017) as an “add-on” to the MAKG. 这里, we provide an updated version of embed-
dings for a newer version of the MAG data set and for a variety of entity types instead of papers
独自的. We experiment with various types of embeddings and provide evaluation results for
each approach. 最后, we provide embeddings for millions of papers and thousands of jour-
nals and conferences, as well as millions of disambiguated authors.

In the following, we introduce related work in Section 5.2. 部分 5.3 describes our approach
to knowledge graph embedding computation, followed by our evaluation in Section 5.4. 我们
conclude in Section 5.5.

5.2. 相关工作

一般来说, knowledge graphs are described using triplets in the form of (H, r, t), referring to the
head entity h 2 , the relationship between both entities r 2 ℝ, and the tail entity t 2 .
阮 (2017) and Wang et al. (2017) provide overviews of existing approaches for creating
knowledge graph embeddings, as well as differences in complexity and performance.

Within the existing literature, there have been numerous approaches to train embeddings
for knowledge graphs. Generally speaking, the main difference between the approaches lies in
the scoring function used to calculate the similarity or distance between two triplets. 全面的,
two major families of algorithms exist: ones using translational distance models and ones using
semantic matching models.

Translational distance models use distance function scores to determine the plausibility of
specific sets of triplets existing within a given knowledge graph context (王等人。, 2017).
进一步来说, the head entity of a triplet is projected as a point in a fixed dimensional
空间; the relationship entity is herein, 例如, a directional vector originating from
the head entity. The distance between the end point of the relationship entity and the tail entity
in this given fixed dimensional space describes the accuracy or quality of the embeddings.
One such example is the TransE (Bordes, Usunier et al., 2013) algorithm. The standard TransE
model does not perform well on knowledge graphs with one-to-many, many-to-one, or many-
to-many relationships (王, 张等人。, 2014) because the tail entities’ embeddings are
heavily influenced by the relations. Two tail entities that share the same head entity as well
as relation are therefore similar in the embedding space created by TransE, even if they may be
different concepts entirely in the real world. As an effort to overcome the deficits of TransE,
TransH (王等人。, 2014) was introduced to distinguish the subtleties of tail entities sharing a

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

common head entity as well as relation. Later on, TransR was introduced to further model
relations as separate vectors rather than hyperplanes, as is the case with TransH. The efficiency
was later improved with the TransD model (吉, He et al., 2015).

Semantic matching models compare similarity scores to determine the plausibility of a
given triplet. 这里, relations are not modeled as vectors similar to entities, but rather as matri-
ces describing interactions between entities. Such approaches include RESCAL (Nickel, Tresp,
& Kriegel, 2011), DistMult (哪个, Yih et al., 2015), HolE (Nickel, Rosasco, & Poggio, 2016),
ComplEx (Trouillon et al., 2016), 和别的.

More recent approaches use neural network architectures to represent relation embeddings.
ConvE, 例如, represents head entity and relations as input and tail entity as output of a
convolutional neural network (Dettmers, Minervini et al., 2018). ParamE extends the approach
by representing relations as parameters of a neural network used to “translate” the input of
head entity into the corresponding output of tail entity (Che, 张等人。, 2020).

此外, there are newer variations of knowledge graph embeddings, for example using
textual information (鲁, Cong, & 黄, 2020) and literals (Gesese, Biswas et al., 2019;
Kristiadi, Khan et al., 2019). 全面的, we decided to use established methods to generate
our embeddings for stability in results, performance during training, and compatibility with
file formats and graph structure.

5.3. Approach

We experiment with various embedding types and compare their performances on our data
放. We include both translational distance models and semantic matching models of the fol-
lowing types: TransE (Bordes et al., 2013), TransR (林, 刘等人。, 2015), DistMult (杨等人。,
2015), ComplEx (Trouillon et al., 2016), and RESCAL (Nickel et al., 2016) (参见章节 5.2 为了
an overview of how these approaches differ from each other). The reasoning behind the
choices is as follows: The embedding types need to be state-of-the-art and widespread, therein
acting as the basis of comparison. 此外, there needs to be an efficient implementation to
train each embedding type, as runtime is a limiting factor. 例如, the paper embeddings
by Färber (2019) were trained using RDF2Vec (Ristoski, 2017) and took 2 weeks to complete.
RDF2Vec did not scale well enough using all authors and other entities in the MAKG. 还
current implementations of RDF2Vec, such as pyRDF2Vec, are not designed for such a large
规模: “Loading large RDF files into memory will cause memory issues as the code is not
optimized for larger files” (https://github.com/IBCNServices/pyRDF2Vec). This turned out to
be true when running RDF2Vec on the MAKG. For the difference between RDF2Vec and other
算法, such as TransE, we can refer to Portisch, Heist, and Paulheim (2021).

5.4. 评估

5.4.1.

Evaluation data

Our aim is to generate knowledge graph embeddings for the entities of type papers, journals,
conferences, and authors to solve machine learning-based tasks, such as search and recom-
mendation tasks. The RDF representations can be downloaded from the MAKG website
(https://makg.org/).

We first select the required data files containing the entities of our chosen entity types and
combine them into a single input. 理想情况下, we would train paper and author embeddings simul-
一致地, such that they benefit from each other’s context. 然而, the required memory
space proved to be a limiting factor given the more than 200 million authors and more than

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

200 million papers. 最终, we train embeddings for papers, journals, and conferences
一起; we train the embeddings for authors separately.

Due to the large number of input entities within the knowledge graph, we try to minimize
the overall input size and thereby the memory requirement for training. We first filter out the
relationships we aim to model. To further reduce memory consumption, we “abbreviate” rela-
tions by removing their prefixes.

此外, we use a mapping for entities and relations to further reduce memory con-
消费. All entities and relations are mapped to a specific index in the form of an integer.
这样, all statements within the knowledge graph are reduced to a triple of integers and
used as input for training together with the mapping files.

5.4.2.

Evaluation setup

We use the Python package DGL-KE (Zheng et al., 2020) for our implementation of knowledge
graph embedding algorithms. DGL-KE is a recently published package optimized for training
knowledge graph embeddings at a large scale. It outperforms other state-of-the-art packages
while achieving linear scaling with machine resources as well as high model accuracies. 我们
set the dimension size of our output embeddings to 100. We set the limit due to greater mem-
ory constraints for training higher-dimensional embeddings. We experiment with a dimension
大小 150 and did not observe any improvements to our metrics. Embedding sizes any higher
will result in out-of-memory errors on our setup. The exact choices of hyperparameters are in
桌子 20. We perform evaluation through randomly masking entities and relations and trying
to repredict the missing part.

We perform training on the bwUnicluster using GPU nodes with eight Nvidia Tesla V100
GPUs and 752 GB of RAM. We use standard ranking metrics Hit@k, mean rank (MR), 和
mean reciprocal rank (MRR).

5.4.3.

Evaluation results

Our evaluation results can be found in Table 21. Note that performing a full-scale analysis of
the effects of the hyperparameters on the embedding quality was out of the scope of this paper.
Results are based on embeddings trained on paper, journal, and conference entities. 我们
observed an average mean rank of 1.301 and a mean reciprocal rank of 0.958 for the best-
performing embedding type.

有趣的是, TransE and TransR greatly outperform other algorithms during fewer training
脚步 (1,000). For higher training steps, the more modern models, such as ComplEx and
DistMult, achieve state-of-the-art performance. Across all metrics, ComplEx, which is based
on complex embeddings instead of real-valued embeddings, achieves the best results (例如,
MRR of 0.958 and HITS@1 of 0.937) while having competitive training times to other

桌子 20. Hyperparameters for training embeddings

Hyperparameter
Embedding size

Maximum training step

Batch size

Negative sampling size

Quantitative Science Studies

Value

100

1,000,000

1,000

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 21.

Evaluation results of various embedding types

Average MR

Average MRR

Average HITS@1

Average HITS@3

Average HITS@10

TransR*
105.598

0.388

0.338

0.403

0.474

TransE
15.224

0.640

0.578

0.659

0.769

RESCAL
4.912

ComplEx
1.301

DistMult
2.094

0.803

0.734

0.851

0.920

0.958

0.937

0.975

0.992

0.923

0.893

0.945

0.977

Training time

10 小时

8 小时

18 小时

8 小时

方法. A direct comparison of these evaluation results with the evaluation results for link
prediction with embeddings in the general domain is not possible, in our view, 因为
performance depends heavily on the used training data and test data. 然而, it is remark-
able that embedding methods that perform quite well on our tasks (例如, RESCAL) do not per-
form so well in the general domain (例如, using the data sets WN18 and FB15K) (Dai, 王
等人。, 2020), while the embedding method that performs best in our case, namely ComplEx,
also counts as state-of-the-art in the general domain (Dai et al., 2020).

It is important to note that we train the TransR embedding type on 250,000 maximum train-
ing steps compared to 1,000,000 for all others embedding types. This is due to the extremely
long training time for this specific embedding; we were unable to finish training in 48 小时,
和, 所以, had to adjust the training steps manually. The effect can be seen in its perfor-
曼斯; although for fewer training steps, TransR performed similarly to TransE.

桌子 22 shows the quality of our final embeddings, which we published at https://makg.org/.

5.5. 讨论

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

The main challenge of the task lies in the hardware requirement for training embeddings on
such a large scale. For publications, 例如, even after the approaches we have carried
out for reducing memory consumption, it still required a significant amount of memory. 为了
例子, we were not able to train publications and author embeddings simultaneously given
750 GB of memory space. Given additional resources, future researchers could increase the
dimensionality of embeddings, which might increase performance.

Other embedding approaches may be suitable for our case as well, though the limiting fac-
tor here is the large file size of the input graph. Any approach needs to be scalable and perform

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

Average MR

Average MRR

Average HITS@1

Average HITS@3

Average HITS@10

桌子 22.

Evaluation of final embeddings

作者
2.644

0.896

0.862

0.918

0.960

Paper/Journal/Conference
1.301

0.958

0.937

0.975

0.992

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

efficiently on such large data sets. One of the limiting factors for choosing embedding types (例如,
TransE) is the availability of an efficient implementation. The DGL-KE provides such implemen-
tations, but only for a select number of embedding types. 将来, as other implementations
become publicly available, further evaluations may be performed. 或者, custom imple-
mentations can also be developed, though such tasks are not the subject of our paper.

Future researchers might further experiment with various combinations of hyperparameters.
We have noticed a great effect of training steps on the embedding qualities of various models.
Other effects might be learnable with additional experimentations.

6. KNOWLEDGE GRAPH PROVISIONING AND STATISTICAL ANALYSIS

在这个部分, we outline how we provide the enhanced MAKG. 此外, we show the
results of a statistical analysis on various aspects of the MAKG.

6.1. Knowledge Graph Provisioning

For creating the enhanced MAKG, we followed the initial schema and data model of Färber
(2019). 然而, we introduced new properties to model novel relationships and data attributes.
A list of all new properties to the MAKG ontology can be found in Table 23. An updated schema
for the MAKG is in Figure 7 and on the MAKG homepage, together with the updated ontology.

Besides the MAKG, Wikidata models millions of scientific publications. 因此, similar to the
initial MAKG (Färber, 2019), we created mappings between the MAKG and Wikidata in the
form of owl:sameAs statements. Using the DOI as unique identifier for publications, we were
able to create 20,872,925 links between the MAKG and Wikidata.

The MAKG RDF files—containing 8.7 billion RDF triples as the core part—are available at
https://doi.org/10.5281/zenodo.4617285. The updated SPARQL endpoint is available at
https://makg.org/sparql.

6.2. General Statistics

Similar to analyses performed by Herrmannova and Knoth (2016) and Färber (2019), we aim to
provide some general data set statistics regarding the content of the MAKG. Since the last pub-
应用, the MAG has received many updates in the form of additional data entries, 也
some small to moderate data schema changes. 所以, we aim to provide some up-to-date
statistics of the MAKG and further detailed analyses of other areas.

We carried out all analysis using the MAKG based on the MAG data as of June 2020 和我们的
modified variants (IE。, custom fields of study and enhanced author set). 桌子 24 shows general
statistics of the enhanced MAKG. In the following, we describe key statistics in more detail.

6.2.1. Authors

The original MAKG encompasses 243,042,675 authors, 其中 43,514,250 had an affilia-
tion given in the MAG. Our disambiguation approach reduced this set to 151,355,324 authors.

桌子 25 showcases certain author statistics with respect to publication and cooperation.
The average paper in the MAG has 2.7 authors with the most having 7,545 authors. On aver-
年龄, an author published 2.65 papers according to the MAKG. The author with the highest
number of papers published 8,551 文件. The average author cooperated with 10.69 其他
authors in their combined work, with the most “connected” author having 65,793 共同作者
全面的, which might be plausible, but is likely misleading due to unclean data to some extent.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

桌子 23.

Properties added to the MAKG using the prefixes shown in Figure 7

财产
https://makg.org/property/paperFamilyCount

https://makg.org/property/ownResource

https://makg.org/property/citedResource

https://makg.org/property/resourceType

https://www.w3.org/1999/02/22-rdf-syntax-ns#type

https://purl.org/spar/fabio/hasURL

https://makg.org/property/familyId

https://makg.org/property/isRelatedTo

Domain

Range

:作者

:Affiliation

:杂志

xsd:integer

:ConferenceSeries

xsd:integer

:ConferenceInstance

xsd:integer

:FieldOfStudy

:纸

:Resource

:纸

:Affiliation

:杂志

xsd:integer

Resource

:Resource

xsd:integer

fabio:工作

xsd:anyURI

xsd:integer

:Affiliation

:杂志

:ConferenceSeries

:FieldOfStudy

https://makg.org/property/recommends

https://prismstandard.org/namespaces/basic/2.0/keyword

https://www.w3.org/2003/01/geo/wgs84_pos#lat

https://www.w3.org/2003/01/geo/wgs84_pos#long

:纸

:Affiliation

:纸

xsd:string

xsd:float

https://dbpedia.org/ontology/location

:ConferenceInstance

dbp:地点

https://dbpedia.org/ontology/publisher

https://dbpedia.org/ontology/patent

https://purl.org/spar/fabio/hasPatentNumber

https://purl.org/spar/fabio/hasPubMedId

https://purl.org/spar/fabio/hasPubMedCentrialId

:纸

dbp:出版商

epo:EPOID

justia:JustiaID

xsd:string

pm:PubMedID

pmc:PMCID

https://www.w3.org/2000/01/rdf-schema#seeAlso

:FieldOfStudy

gn:WikipediaArticle

nih:NihID

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

.
A
米
e
H
C
s
G
K
A
中号
d
e
t
A
d
p
U

e
r
你
G
我
F

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

桌子 24. General statistics for the MAG/MAKG and the enhanced MAKG as of June 19, 2020

文件

Paper abstracts

Authors

Affiliations

Journals

Conference series

Conference instances

Unique fields of study

ORCID iDs

# in MAG/MAKG
238,670,900

139,227,097

243,042,675

# in enhanced MAKG
238,670,900

139,227,097

151,355,324

25,767

48,942

4,468

16,142

740,460

–

25,767

48,942

4,468

16,142

740,460

34,863

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

桌子 25. General author and paper statistics

Metric
Average authors per paper

Maximum authors per paper

Average papers per author

Maximum papers per author

Average coauthors per author

Maximum coauthors per author

桌子 26. General reference and citation statistics

Key statistics
Average references

At least one reference

Average references (filtered)

Median references (filtered)

Most references

Average citations

At least one citation

Average citations (filtered)

Median citations (filtered)

Most citations

Value
2.6994

7,545

2.6504

8,551

10.6882

65,793

Value

6.8511

78,684,683

20.7813

26,690

6.8511

90,887,343

17.9912

252,077

Quantitative Science Studies

问
你
A
n

t
我
t

我

t
我
v
e
S
C
e
n
C
e
S
你
d
e
s

我

桌子 27. Detailed reference and citation statistics

Average references

杂志

13.089

会议
10.309

Patent

3.470

At least one reference

42,660,071

3,913,744

19,023,288

Average references (filtered)

26.313

12.400

Median references (filtered)

Most references

Average citations

13,220

14.729

4,156

9.024

9.643

19,352

书

2.460

93,644

56.315

5,296

BookSection
3.286

Repository
11.649

Data Set
0.063

No data

2.782

339,439

1,305,000

130

11,349,367

26.268

14.988

18.969

21.758

7,747

0.813

2,092

2.251

196

0.188

26,690

1.019

3.225

29.206

At least one citation

50,599,935

3,063,123

22,591,991

1,299,728

351,448

549,526

1,187

12,430,405

Average citations (filtered)

24.963

13.869

7.547

48.177

6.277

6.878

6.240

Median citations (filtered)

Most citations

252,077

34,134

32,096

137,596

4,119

20,503

633

7.274

103,540

8
9

时间
H
e

中号

我
C
r
哦
s
哦
F
t
A
C
A
d
e
米
我
C
K
n
哦
w
我
e
d
G
e
G

r
A
p
H

e
n
H
A
n
C
e
d

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

6.2.2. 文件

We first analyze the composition of paper entities by their associated type (见表 2). 这
most frequently found document type is journal articles, followed by patents. A huge propor-
tion of paper entities in the MAKG do not have a document type.

In the following, we analyze the number of citations and references for papers within the

MAKG. The results can be found in Table 26.

The average paper in the MAKG references 6.85 papers and received 6.85 citations. 这
exact match in numbers here seems too unlikely to be coincidental. 所以, we suspect
these numbers to be a result of a closed referencing system of the original MAG, 意义
references for a paper are only counted if they reference another paper within the MAG;
and citations are only counted if a paper is cited by another paper found in the MAKG. 什么时候
we remove papers with zero references, we are left with a set of 78,684,683 文件. The aver-
age references per paper from the filtered paper set is now 20.78. In the MAKG, 90,887,343
papers are cited at least once, with the average among this new set being 17.99. As averages
are highly susceptible to outliers, which were frequent in our data set due to unclean data and
the power law distribution of scientific output, we also calculated the median of references
and citations. These values should give us a more representative picture of reality. The paper
with the most references from the MAG has 26,690 参考, whereas the paper with the
most citations received 252,077 citations as of June 2020.

桌子 27 showcases detailed reference and citation statistics for each document type found
in our (enhanced) MAKG. 不出所料, books have the most amount of references on aver-
age due to their significant lengths, followed by journal papers (and book sections). 然而,
the median value for books is less than for journals, likely due to outliers. Citation wise, 图书
and journal papers again are the most cited document types on average. 再次, journal papers
have fewer citations on average but a higher median value.

数字 8 shows the number of papers published each year in the time span recorded by the
MAKG (1800–present). The number of publications has been on a steady exponential trajec-
保守党. 这是, 当然, partly due to advances in the digitalization of libraries and journals, 作为
well as the increasing ease of accessing new research papers. 然而, we can certainly attri-
bute a large part of the growth to the increasing number of publications every year (约翰逊
等人。, 2018).

有趣的是, the average number of references per paper has been on a steady increase (看
数字 9 and Johnson et al. (2018)). This could be due to a couple of reasons. 第一的, as scientific

数字 8. Number of papers published per year (starting with 1900).

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

数字 9. Average number of references of a paper per year.

fields develop and grow, novel work becomes increasingly rare. 相当, researchers publish
work built on top of previous research (“on the shoulders of giants”), leading to a growing
number of references for new publications. 此外, the increasing number of research
papers further contribute to more works being considered for referencing. 第二, 发展-
ments in technology, such as digital libraries, enable the spread of research and ease the shar-
ing of ideas and communication between researchers (看, 例如, the open access
努力 (Piwowar, Priem et al., 2018)). 所以, a researcher from the modern age has a huge
advantage in accessing other papers and publications. The ease of access could contribute to
more works being referenced in this way. 第三, as the MAKG is (most likely) a closed refer-
ence system, meaning papers referenced are only included if they are part of the MAKG, 和
as modern publications are more likely to be included in the MAKG, newer papers will auto-
matically have a higher number of recorded references in the MAKG. Although this is a pos-
能力, we do not suspect it to be the main reason behind the rising number of references.
Most likely, the cause is a combination of several factors.

出奇, the average number of citations a paper receives has increased, 如图所示
数字 10. 直观地, one would assume older papers to receive more citations on average
purely due to longevity. 然而, as our graph shows, the number of citations an average
paper receives has increased since the turn of the last century. We observe a peak of growth
大约 1996, which might be where the age of a paper exhibits its effect. Coupled with the
exponential growth of publications, the average citations per paper plummets.

数字 11 shows the average number of authors per paper per year and publication type,
using the MAKG paper’s publication year. As we can observe, there has been a clear upward

数字 10. Average number of citations of a paper per year.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

数字 11. Average number of authors per paper and paper type over the years, including standard deviation.

trend for the average number of authors per paper specifically concerning journal articles,
conference papers, and patents since the 1970s. The level of cooperation within the scientific
community has grown, partly led by the technological developments that enable researchers
to easily connect and cooperate. This finding reconfirms the results from the STM report 2018
(Johnson et al., 2018).

6.2.3.

Fields of study

In the following, we analyze the development of fields of study over time. 第一的, 数字 12
showcases the current number of publications per top-level field of study within the MAKG.
Each field of study here has two distinct values. The blue bars represent the field of study as
labeled by the MAKG, whereas the red bars are labels as generated by our custom classifier.

数字 12. Number of papers per field of study.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

重要的, there is a discrepancy between the total number of paper labels between the orig-
inal MAKG field of study labels and our custom labels. The original MAG hierarchy includes
labels for 199,846,956 文件. Our custom labels are created through classification of paper
abstracts and are therefore limited by the number of abstracts available in the data set; 因此, 我们
only generated labels for 139,227,097 文件. Rather surprisingly, the disciplines of medicine
and materials science are the most common fields of study within the MAG, according to the
original MAG field of study labels. According to our classification, engineering and medicine
are the most represented disciplines.

Evaluating the cumulative number of papers associated with the different fields of study
over the years, we can confirm the exponential growth of scientific output shown by Larsen
and von Ins (2010). In many areas, our data show greater rates of growth than previously
预期的.

数字 13 shows the interdisciplinary works of authors. 这里, we modeled the relationships
between fields of study in a chord graph. Each chord between two fields of study represents
authors who have published papers in both disciplines. The thickness of each chord is repre-
sentative of the number of authors who have done so. We observe strong relationships
between the disciplines of biology and medicine, materials science and engineering, 和
computer science and engineering. 此外, there is a moderately strong relationship
between the disciplines of chemistry and medicine, biology and engineering, and chemistry
and biology. The multitude of links between engineering and other disciplines could be due
to mislabeling of engineering papers, as our classifier is not adept at classifying papers from
engineering in comparison to other fields of study, 如表所示 19.

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

数字 13.

Interdisciplinary researchers in form of authors who publish in multiple fields of study.

Quantitative Science Studies

The Microsoft Academic Knowledge Graph enhanced

7. CONCLUSION AND OUTLOOK

在本文中, we developed and applied several methods for enhancing the MAKG, 一个大的-
scale scholarly knowledge graph. 第一的, we performed author name disambiguation on the set
的 243 million authors using background information, such as the metadata of 239 百万
出版物. Our classifier achieved a precision of 0.949, a recall of 0.755, and an accuracy
的 0.991. We managed to reduce the number of total author entities from 243 百万至
151 百万.

第二, we reclassified existing papers from the MAKG into a distinct set of 19 学科
(IE。, 等级-0 fields of study). We performed an evaluation of existing labels and determined
55% of the existing labels to be accurate, whereas our newly generated labels achieved an
accuracy of approximately 78%. We then assigned tags to papers based on the papers’
abstracts to create a more suitable description of paper content in comparison to the preexist-
ing rigid field of study hierarchy in the MAKG.

第三, we generated entity embeddings for all paper, journal, conference, and author
实体. Our evaluation showed that ComplEx was the best performing large-scale entity
embedding method that we could apply to the MAKG.

最后, we performed a statistical analysis on key features of the enhanced MAKG. 我们
updated the MAKG based on our results and provided all data sets, as well as the updated
MAKG, online at https://makg.org and https://doi.org/10.5281/zenodo.4617285.

Future researchers could further improve upon our results. For author name disambigua-
的, we believe the results could be further improved by incorporating additional author infor-
mation from other sources. For field of study classification, future approaches could develop
ways to organize our generated paper tags into a more hierarchical system. For the trained
entity embeddings, future research could generate embeddings at a higher dimensionality. 这
was not possible because of the lack of existing efficient scalable implementations of most
算法. Beyond these enhancements, the MAKG should be enriched with the key content
of scientific publications, such as research data sets (Färber & Lamprecht, 2022), scientific
方法 (Färber et al., 2021), and research contributions (Jaradeh et al., 2019乙).

作者贡献

Michael Färber: 概念化, 数据管理, 调查, 方法, 资源,
监督, 可视化, Writing—review & 编辑. Lin Ao: 概念化, Data cura-
的, 调查, 方法, 资源, 软件, 可视化, Writing—original draft.

COMPETING INTERESTS

The authors have no competing interests.

资金信息

The authors did not receive any funding for this research.

DATA AVAILABILITY

We provide all generated data online to the public at https://makg.org and https://doi.org/10
.5281/zenodo.4617285 under the ODC-BY license (https://opendatacommons.org/licenses/by
/1-0/). Our code is available online at https://github.com/lin-ao/enhancing_the_makg.

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

参考

Ajileye, T。, Motik, B., & Horrocks, 我. (2021). Streaming partitioning
of RDF graphs for datalog reasoning. In Proceedings of the 18th
Extended Semantic Web Conference. https://doi.org/10.1007/978
-3-030-77385-4_1

Alzaidy, R。, Caragea, C。, & 贾尔斯, C. L. (2019). Bi-LSTM-CRF
sequence labeling for keyphrase extraction from scholarly docu-
评论. In Proceedings of the 28th World Wide Web Conference
(PP. 2551–2557). https://doi.org/10.1145/3308558.3313642
Baskaran, A. (2017). UNESCO science report: Towards 2030.

Institutions and Economies, 125–127.

Beel, J。, Langer, S。, Genzmehr, M。, Gipp, B., Breitinger, C。, &
Nürnberger, A. (2013). Research paper recommender system
评估: A quantitative literature survey. 在诉讼程序中
the International Workshop on Reproducibility and Replication
in Recommender Systems Evaluation (PP. 15–22). https://doi.org
/10.1145/2532508.2532512

Beltagy, 我。, Lo, K., & Cohan, A. (2019). SciBERT: A pretrained
language model for scientific text. 在诉讼程序中 2019
Conference on Empirical Methods in Natural Language Pro-
cessing and the 9th International Joint Conference on Natural
语言处理 (PP. 3613–3618). https://doi.org/10.18653
/v1/D19-1371

Bordes, A。, Usunier, N。, García-Durán, A。, Weston, J。, & Yakhnenko,
氧. (2013). Translating embeddings for modeling multi-relational
数据. In Proceedings of the 27th Annual Conference on Neural
Information Processing Systems (PP. 2787–2795).

Brack, A。, D’Souza, J。, Hoppe, A。, Auer, S。, & Ewerth, 右. (2020).
Domain-independent extraction of scientific concepts from research
文章. In Proceedings of the 42nd European Conference on IR
(PP. 251–266). https://doi.org/10.1007/978-3-030-45439-5_17
布赖曼, L. (1996). Bagging predictors. Machine Learning, 24(2),

123–140. https://doi.org/10.1007/BF00058655

Caragea, C。, Bulgarov, F. A。, Godea, A。, & Gollapalli, S. D. (2014).
Citation-enhanced keyphrase extraction from research papers: A
supervised approach. 在诉讼程序中 2014 会议
Empirical Methods in Natural Language Processing (PP. 1435–1446).
https://doi.org/10.3115/v1/D14-1150

Caron, E., & van Eck, 氮. J. (2014). Large scale author name disam-
biguation using rule-based scoring and clustering. In Proceedings
of the 19th International Conference on Science and Technology
指标 (PP. 79–86).

Che, F。, 张, D ., Tao, J。, Niu, M。, & 赵, 乙. (2020). ParamE:
Regarding neural network parameters as relation embeddings
for knowledge graph completion. In Proceedings of the 34th
AAAI 人工智能会议 (PP. 2774–2781).
https://doi.org/10.1609/aaai.v34i03.5665

Chicco, D ., & Jurman, G. (2020). The advantages of the Matthews
相关系数 (MCC) over F1 score and accuracy in binary
classification evaluation. BMC Genomics, 21(1), 6. https://doi.org
/10.1186/s12864-019-6413-7, 考研: 31898477

科恩, 瓦. W., Ravikumar, P。, & Fienberg, S. 乙. (2003). A comparison
of string distance metrics for name-matching tasks. In Proceedings
of IJCAI-03 Workshop on Information Integration on the Web
(PP. 73–78).

考克斯, D. R。, & Snell, 乙. J. (1989). Analysis of binary data (卷. 32).

CRC Press.

Dai, Y。, 王, S。, Xiong, 氮. N。, & Guo, 瓦. (2020). A survey on knowl-
edge graph embedding: Approaches, applications and benchmarks.
Electronics, 9(5). https://doi.org/10.3390/electronics9050750

Dai, Z。, 哪个, Z。, 哪个, Y。, Carbonell, J. G。, Le, 问. 五、, & Salakhutdinov,
右. (2019). Transformer-XL: Attentive language models beyond a

fixed-length context. In Proceedings of the 57th Conference of
the Association for Computational Linguistics (PP. 2978–2988).
https://doi.org/10.18653/v1/P19-1285

Daquino, M。, Peroni, S。, Shotton, D. M。, Colavizza, G。, Ghavimi,
B., … Zumstein, 磷. (2020). The OpenCitations Data Model. 在
Proceedings of the 19th International Semantic Web Conference
(PP. 447–463). https://doi.org/10.1007/978-3-030-62466-8_28
Dettmers, T。, Minervini, P。, Stenetorp, P。, & Riedel, S. (2018). Convo-
lutional 2D knowledge graph embeddings. 在诉讼程序中
32nd AAAI Conference on Artificial Intelligence (PP. 1811–1818).
Devlin, J。, 张, M。, 李, K., & Toutanova, K. (2019). BERT:
Pretraining of deep bidirectional transformers for language
理解. 在诉讼程序中 2019 Conference of the
North American Chapter of the Association for Computational
语言学: 人类语言技术 (PP. 4171–4186).
Färber, 中号. (2019). The Microsoft Academic Knowledge Graph: A
linked data source with 8 billion triples of scholarly data. 在
Proceedings of the 18th International Semantic Web Conference
(PP. 113–129). 施普林格. https://doi.org/10.1007/978-3-030
-30796-7_8

Färber, 中号. (2020). Analyzing the GitHub repositories of research
文件. In Proceedings of the ACM/ IEEE Joint Conference on
Digital Libraries (PP. 491–492). https://doi.org/10.1145/3383583
.3398578

Färber, M。, Albers, A。, & Schüber, F. (2021). Identifying used
methods and datasets in scientific publications. In Proceedings
of the AAAI-21 Workshop on Scientific Document Understanding
(SDU’21)@AAAI’21.

Färber, M。, & Jatowt, A. (2020). Citation recommendation: Approaches
and datasets. International Journal on Digital Libraries, 21(4),
375–405. https://doi.org/10.1007/s00799-020-00288-2

Färber, M。, & Lamprecht, D. (2022). The Data set knowledge graph:
Creating a linked open data source for data sets. Quantitative
Science Studies, 2(4), 1324–1355. https://doi.org/10.1162/qss_a
_00161

Färber, M。, & Leisinger, A. (2021A). Datahunter: A system for
finding datasets based on scientific problem descriptions. 在
Proceedings of the 15th ACM Conference on Recommender
系统 (PP. 749–752). https://doi.org/10.1145/3460231
.3478882

Färber, M。, & Leisinger, A. (2021乙). Recommending datasets based
for scientific problem descriptions. In Proceedings of the 30th
ACM International Conference on Information and Knowledge
管理. https://doi.org/10.1145/3459637.3482166

Fathalla, S。, Vahdati, S。, Auer, S。, & Lange, C. (2017). Towards a
knowledge graph representing research findings by semantifying
survey articles. In Proceedings of the 21st International Confer-
ence on Theory and Practice of Digital Libraries (PP. 315–327).
https://doi.org/10.1007/978-3-319-67008-9_25

Fellegi, 我. P。, & Sunter, A. 乙. (1969). A theory for record linkage.
Journal of the American Statistical Association, 64(328), 1183–1210.
https://doi.org/10.1080/01621459.1969.10501049

费雷拉, A. A。, Gonçalves, 中号. A。, & Laender, A. H. F. (2012). A brief
survey of automatic methods for author name disambiguation.
ACM SIGMOD Record, 41(2), 15–26. https://doi.org/10.1145
/2350036.2350040

Florescu, C。, & Caragea, C. (2017). Positionrank: An unsupervised
approach to keyphrase extraction from scholarly documents. 在
Proceedings of the 55th Annual Meeting of the Association for
计算语言学 (PP. 1105–1115). https://doi.org/10
.18653/v1/P17-1102

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

Fortunato, S。, 伯格斯特罗姆, C. T。, Börner, K., 埃文斯, J. A。, Helbing, D .,
… Barabási, A.-L. (2018). Science of science. 科学, 359(6379).
https://doi.org/10.1126/science.aao0185, 考研: 29496846
Gesese, G. A。, Biswas, R。, Alam, M。, & Sack, H. (2019). A survey on
knowledge graph embeddings with literals: Which model links
better literal-ly? CoRR, abs/1910.12507.

Han, H。, 贾尔斯, C. L。, Zha, H。, 李, C。, & Tsioutsiouliklis, K. (2004).
Two supervised learning approaches for name disambiguation in
author citations. In Proceedings of the ACM/IEEE Joint Confer-
ence on Digital Libraries (PP. 296–305). https://doi.org/10.1145
/996350.996419

Hernández, 中号. A。, & Stolfo, S. J. (1995). The merge/purge problem
for large databases. ACM SIGMOD Record, 24(2), 127–138.
https://doi.org/10.1145/568271.223807

Herrmannova, D ., & Knoth, 磷. (2016). An analysis of the Microsoft
Academic Graph. D-Lib Magazine, 22(9/10). https://doi.org/10
.1045/september2016-herrmannova

Hochreiter, S。, & 施米德胡贝尔, J. (1997). Long short-term memory.
神经计算, 9(8), 1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, 考研: 9377276

Hoffman, 中号. R。, Ibáñez, L. D ., Fryer, H。, & Simperl, 乙. (2018). Smart
文件: Dynamic publications on the blockchain. In Proceedings
of the 15th Extended Semantic Web Conference (PP. 304–318).
https://doi.org/10.1007/978-3-319-93417-4_20

Jaradeh, 中号. Y。, Auer, S。, Prinz, M。, Kovtun, 五、, Kismihók, G。, &
Stocker, 中号. (2019A). Open research knowledge graph: Towards
machine actionability in scholarly communication. CoRR,
abs/1901.10816.

Jaradeh, 中号. Y。, Oelen, A。, Farfar, K. E., Prinz, M。, D’Souza, J。, ……
Auer, S. (2019乙). Open research knowledge graph: Next gener-
ation infrastructure for semantic scholarly knowledge. In Pro-
ceedings of the 10th International Conference on Knowledge
Capture (PP. 243–246). https://doi.org/10.1145/3360901
.3364435

Jaro, 中号. A. (1989). Advances in record-linkage methodology as
applied to matching the 1985 census of Tampa, Florida. 杂志
of the American Statistical Association, 84(406), 414–420. https://
doi.org/10.1080/01621459.1989.10478785

吉, G。, 他, S。, 徐, L。, 刘, K., & 赵, J. (2015). Knowledge graph
embedding via dynamic mapping matrix. 在诉讼程序中
53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural
Language Processing of the Asian Federation of Natural Language
加工 (PP. 687–696). https://doi.org/10.3115/v1/P15-1067
约翰逊, R。, Watkinson, A。, & Mabe, 中号. (2018). The STM report: 一个
overview of scientific and scholarly publishing (5第三版。). 这
Hague: International Association of Scientific, Technical and
Medical Publishers.

Kanakia, A。, 沉, Z。, Eide, D ., & 王, K. (2019). A scalable
hybrid research paper recommender system for Microsoft Aca-
demic. In Proceedings of the 28th World Wide Web Conference
(PP. 2893–2899). https://doi.org/10.1145/3308558.3313700
Kastner, S。, Choi, S。, & 荣格, H. (2013). Author name disambig-
uation in technology trend analysis using SVM and random
forests and novel topic based features. 在诉讼程序中
2013 IEEE International Conference on Green Computing and
通讯 (GreenCom) and IEEE Internet of Things
(iThings) and IEEE Cyber, Physical and Social Computing
(CPSCom) (PP. 2141–2144). https://doi.org/10.1109/GreenCom
-iThings-CPSCom.2013.403

Kim, J. (2018). Evaluating author name disambiguation for digital
libraries: A case of DBLP. Scientometrics, 116(3), 1867–1886.
https://doi.org/10.1007/s11192-018-2824-5

Kim, J. (2019). Scale-free collaboration networks: An author name
disambiguation perspective. Journal of the Association for Infor-
mation Science and Technology, 70(7), 685–700. https://doi.org
/10.1002/asi.24158

Kim, J。, Kim, J。, & Owen-Smith, J. (2019). Generating automatically
labeled data for author name disambiguation: An iterative clus-
tering method. Scientometrics, 118(1), 253–280. https://doi.org
/10.1007/s11192-018-2968-3

Kim, K., Khabsa, M。, & 贾尔斯, C. L. (2016). Random forest
DBSCAN for USPTO inventor name disambiguation. CoRR,
abs/1602.01792.

Kim, K., Rohatgi, S。, & 贾尔斯, C. L. (2019). Hybrid deep pairwise
classification for author name disambiguation. In Proceedings
of the 28th ACM International Conference on Information and
Knowledge Management (PP. 2369–2372). https://doi.org/10
.1145/3357384.3358153

Kim, S. N。, Medelyan, 奥。, 能, M。, & Baldwin, 时间. (2013). 汽车-
matic keyphrase extraction from scientific articles. 语言
Resources and Evaluation, 47(3), 723–742. https://doi.org/10
.1007/s10579-012-9210-3

Kowsari, K., Meimandi, K. J。, Heidarysafa, M。, Mendu, S。, 巴恩斯, L. E.,
& 棕色的, D. 乙. (2019). Text classification algorithms: 一项调查.
信息, 10(4), 150. https://doi.org/10.3390/info10040150
Kristiadi, A。, 汗, 中号. A。, Lukovnikov, D ., Lehmann, J。, & Fischer,
A. (2019). Incorporating literals into knowledge graph embed-
丁斯. In Proceedings of the 18th International Semantic Web
会议 (PP. 347–363). https://doi.org/10.1007/978-3-030
-30793-6_20

兰, Z。, 陈, M。, 古德曼, S。, Gimpel, K., 夏尔马, P。, & Soricut, 右.
(2020). ALBERT: A lite BERT for self-supervised learning of
language representations. In Proceedings of the 8th International
Conference on Learning Representations (PP. 1–17).

Larsen, 磷. 奥。, & von Ins, 中号. (2010). The rate of growth in scientific
publication and the decline in coverage provided by science
citation index. Scientometrics, 84(3), 575–603. https://doi.org
/10.1007/s11192-010-0202-z, 考研: 20700371

林, X。, 朱, J。, 唐, Y。, 哪个, F。, 彭, B., & 李, 瓦. (2017). A novel
approach for author name disambiguation using ranking confi-
登塞. 在诉讼程序中 2017 International Workshops on
Database Systems for Advanced Applications (PP. 169–182).
https://doi.org/10.1007/978-3-319-55705-2_13

林, Y。, 刘, Z。, Sun, M。, 刘, Y。, & 朱, X. (2015). Learning entity
and relation embeddings for knowledge graph completion. 在
Proceedings of the 29th AAAI Conference on Artificial Intelli-
根杰斯 (PP. 2181–2187).

刘, Y。, Ott, M。, Goyal, N。, Du, J。, Joshi, M。, … Stoyanov, V. (2019).
RoBERTa: A robustly optimized BERT pretraining approach.
CoRR, abs/1907.11692. Retrieved from https://arxiv.org/abs
/1907.11692

鲁, F。, Cong, P。, & 黄, X. (2020). Utilizing textual information in
knowledge graph embedding: A survey of methods and applica-
系统蒸发散. IEEE Access, 8, 92072–92088. https://doi.org/10.1109
/ACCESS.2020.2995074

Luan, Y。, 他, L。, Ostendorf, M。, & Hajishirzi, H. (2018). Multi-task
identification of entities, 关系, and coreference for scientific
knowledge graph construction. 在诉讼程序中 2018 骗局-
ference on Empirical Methods in Natural Language Processing
(PP. 3219–3232). https://doi.org/10.18653/v1/D18-1360

Ma, X。, 王, R。, & 张, 是. (2019). Author name disambiguation
in heterogeneous academic networks. In Proceedings of the 16th
International Conference on Web Information Systems and Appli-
阳离子 (PP. 126–137). https://doi.org/10.1007/978-3-030-30952
-7_15

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

Maidasani, H。, Namata, G。, 黄, B., & Getoor, L. (2012). Entity
resolution evaluation measure (技术报告). Retrieved from
https://web.archive.org/web/20180414024919/https://honors.cs
.umd.edu/reports/hitesh.pdf

Mihalcea, R。, & Tarau, 磷. (2004). TextRank: Bringing order into text.
在诉讼程序中 2004 实证方法会议
自然语言处理 (PP. 404–411).

Momeni, F。, & Mayr, 磷. (2016). Using co-authorship networks for author
name disambiguation. In Proceedings of the 16th ACM/IEEE-CS
on Joint Conference on Digital Libraries (PP. 261–262). https://土井
.org/10.1145/2910896.2925461

穆勒, 中号. (2017). Semantic author name disambiguation with
word embeddings. In Proceedings of the 21st International Con-
ference on Theory and Practice of Digital Libraries (PP. 300–311).
https://doi.org/10.1007/978-3-319-67008-9_24

Newcombe, H. B., 肯尼迪, J. M。, Axford, S。, & James, A. 磷. (1959).
Automatic linkage of vital records. 科学, 130(3381), 954–959.
https://doi.org/10.1126/science.130.3381.954, 考研:
14426783

阮, D. 问. (2017). An overview of embedding models of
entities and relationships for knowledge base completion. CoRR,
abs/1703.08098. Retrieved from https://arxiv.org/abs/1703
.08098

Nickel, M。, Rosasco, L。, & Poggio, 时间. A. (2016). Holographic
embeddings of knowledge graphs. In Proceedings of the 30th
AAAI 人工智能会议 (PP. 1955–1961).
Nickel, M。, Tresp, 五、, & Kriegel, H. (2011). A three-way model for
collective learning on multi-relational data. 在诉讼程序中
28th International Conference on Machine Learning (PP. 809–816).
Noia, 时间. D ., Mirizzi, R。, Ostuni, V. C。, Romito, D ., & Zanker, 中号.
(2012). Linked open data to support content-based recommender
系统. In Proceedings of the 8th International Conference on
Semantic Systems (PP. 1–8). https://doi.org/10.1145/2362499
.2362501

OpenAIRE. (2021). OpenAIRE Research Graph. https://图形

.openaire.eu/. Accessed: 六月 11, 2021.

Peroni, S。, Dutton, A。, Gray, T。, & Shotton, D. 中号. (2015). Setting
our bibliographic references free: Towards open citation data.
Journal of Documentation, 71(2), 253–277. https://doi.org/10
.1108/JD-12-2013-0166

Piwowar, H。, Priem, J。, Larivière, 五、, Alperin, J. P。, 马蒂亚斯, L。, ……
Haustein, S. (2018). The state of OA: A large-scale analysis of
the prevalence and impact of open access articles. 同行杂志, 6,
e4375. https://doi.org/10.7717/peerj.4375, 考研: 29456894
Pooja, K. M。, Mondal, S。, & Chandra, J. (2018). An unsupervised
heuristic based approach for author name disambiguation. 在
Proceedings of the 10th International Conference on Communi-
cation Systems & 网络 (PP. 540–542). https://doi.org/10
.1109/COMSNETS.2018.8328267

Pooja, K. M。, Mondal, S。, & Chandra, J. (2020). A graph combina-
tion with edge pruning-based approach for author name disam-
歧义. Journal of the Association for Information Science and
技术, 71(1), 69–83. https://doi.org/10.1002/asi.24212
Portisch, J。, Heist, N。, & 保尔海姆, H. (2021). Knowledge graph
embedding for data mining vs. knowledge graph embedding for
link prediction—Two sides of the same coin? Semantic Web—
Interoperability, Usability, Applicability. https://doi.org/10.3233
/SW-212892

Protasiewicz, J。, & Dadas, S. (2016). A hybrid knowledge-based
framework for author name disambiguation. 在诉讼程序中
这 2016 IEEE International Conference on Systems, 男人, 和
Cybernetics (PP. 594–600). https://doi.org/10.1109/SMC.2016
.7844305

Qian, Y。, 郑, Q., Sakai, T。, 叶, J。, & 刘, J. (2015). 动态的
author name disambiguation for growing digital libraries. Infor-
mation Retrieval Journal, 18(5), 379–412. https://doi.org/10
.1007/s10791-015-9261-3

Qiu, 是. (2020). Data wrangling: Using publicly available knowledge
图表 (kgs) to construct a domain-specific kg. https://cs.anu.edu
.au/courses/CSPROJECTS/20S1/reports/u5776733_report.pdf
Quass, D ., & Starkey, 磷. (2003). Record linkage for genealogical
databases. In Proceedings of the ACM SIGKDD 2003 作坊
on Data Cleaning, Record Linkage, and Object Consolidation
(PP. 40–42).

Ristoski, 磷. (2017). Exploiting Semantic Web Knowledge Graphs in

Data Mining (未发表的博士论文).

Ristoski, P。, Rosati, J。, Noia, 时间. D ., 莱昂内, 右. D ., & 保尔海姆, H.
(2019). RDF2Vec: RDF graph embeddings and their applications.
语义网, 10(4), 721–752. https://doi.org/10.3233/SW
-180317

Roark, B., Wolf-Sonkin, L。, Kirov, C。, Mielke, S. J。, Johny, C。, … Hall,
K. 乙. (2020). Processing South Asian languages written in the Latin
script: The Dakshina dataset. In Proceedings of the 12th Language
Resources and Evaluation Conference (PP. 2413–2423).

Rocchio, J. J. (1971). Relevance feedback in information retrieval.
In G. Salton (埃德。), The smart retrieval system—Experiments in
automatic document processing. Englewood, Cliffs, 新泽西州: Prentice
大厅.

Rose, S。, 恩格尔, D ., Cramer, N。, & Cowley, 瓦. (2010). Automatic
keyword extraction from individual documents. 在米. 瓦. Berry &
J. Kogan (编辑。), Text mining: Applications and theory (PP. 1–20).
约翰·威利 & Sons. https://doi.org/10.1002/9780470689646.ch1
Rumelhart, D. E., 欣顿, G. E., & 威廉姆斯, 右. J. (1986). 学习
representations by back-propagating errors. 自然, 323(6088),
533–536. https://doi.org/10.1038/323533a0

Salatino, A. A。, Thanapalasingam, T。, Mannocci, A。, Osborne, F。, &
Motta, 乙. (2018). The computer science ontology: 一个大规模的
taxonomy of research areas. In Proceedings of the 17th Interna-
tional Semantic Web Conference (PP. 187–205). https://doi.org
/10.1007/978-3-030-00668-6_12

Schapire, 右. 乙. (1990). The strength of weak learnability. 机器
学习, 5, 197–227. https://doi.org/10.1007/BF00116037
Schindler, D ., Zapilko, B., & Krüger, F. (2020). Investigating soft-
ware usage in the social sciences: A knowledge graph approach.
In Proceedings of the 17th Extended Semantic Web Conference
(PP. 271–286). https://doi.org/10.1007/978-3-030-49461-2_16
Schubert, T。, Jäger, A。, Türkeli, S。, & Visentin, F. (2019). Addressing
the productivity paradox with big data. A literature review and
adaptation of the CDM econometric model. 技术报告,
Maastricht University.

Schulz, C。, Mazloumian, A。, 彼得森, A. M。, Penner, 奥。, & Helbing,
D. (2014). Exploiting citation networks for large-scale author
name disambiguation. EPJ Data Science, 3(1), 11. https://doi.org
/10.1140/epjds/s13688-014-0011-3

Shaver, 磷. (2018). Science today. In The rise of science: From pre-
history to the far future (PP. 129–209). 占婆: Springer Interna-
tional Publishing. https://doi.org/10.1007/978-3-319-91812-9_4
Singla, P。, & Domingos, 磷. 中号. (2006). Entity resolution with Markov
逻辑. In Proceedings of the 6th IEEE International Conference on
Data Mining (PP. 572–582). IEEE Computer Society. https://土井
.org/10.1109/ICDM.2006.65

Sinha, A。, 沉, Z。, 歌曲, Y。, Ma, H。, Eide, D ., … Wang, K. (2015).
An overview of Microsoft Academic Service (MAS) and applica-
系统蒸发散. In Proceedings of the 24th International Conference on
World Wide Web Companion (PP. 243–246). https://doi.org/10
.1145/2740908.2742839

Quantitative Science Studies

我

D
哦
w
n
哦
A
d
e
d

F
r
哦
米
H

t
t

:
/
/

d
我
r
e
C
t
.

米

我
t
.

e
d
你
q
s
s
/
A
r
t
我
C
e
–
p
d

我

F
/

3
1
5
1
2
0
0
8
2
8
0
q
s
s
_
A
_
0
0
1
8
3
p
d

乙
y
G
你
e
s
t

哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3

The Microsoft Academic Knowledge Graph enhanced

Sun, S。, 张, H。, 李, N。, & 陈, 是. (2017). Name disambiguation
for Chinese scientific authors with multi-level clustering. 在
诉讼程序 2017 IEEE International Conference on
Computational Science and Engineering and IEEE International
Conference on Embedded and Ubiquitous Computing
(PP. 176–182). IEEE Computer Society. https://doi.org/10.1109
/CSE-EUC.2017.39

唐, J。, 张, J。, Yao, L。, 李, J。, 张, L。, & Su, Z. (2008).
ArnetMiner: Extraction and mining of academic social networks.
In Proceedings of the 14th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining (PP. 990–998).
https://doi.org/10.1145/1401890.1402008

Tekles, A。, & Bornmann, L. (2019). Author name disambiguation of
bibliometric data: A comparison of several unsupervised
方法. In Proceedings of the 17th International Conference
on Scientometrics and Informetrics (PP. 1548–1559).

特兰, H. N。, Huynh, T。, & 做, 时间. (2014). Author name disambigua-
tion by using deep neural network. In Proceedings of the 6th Asian
Conference on Intelligent Information and Database Systems
(PP. 123–132). https://doi.org/10.1007/978-3-319-05476-6_13
Trouillon, T。, Welbl, J。, Riedel, S。, Gaussier, É., & Bouchard, G. (2016).
Complex embeddings for simple link prediction. In Proceedings
的
the 33rd International Conference on Machine Learning
(PP. 2071–2080).

Tzitzikas, Y。, Pitikakis, M。, Giakoumis, G。, Varouha, K., & Karkanaki,
乙. (2020). How can a university take its first steps in open data?
In Proceedings of the 14th Metadata and Semantics Research
会议. https://doi.org/10.1007/978-3-030-71903-6_16
Vapnik, 五、, & Chervonenkis, A. 是. (1964). A class of algorithms for
pattern recognition learning. Avtomat. i Telemekh, 25(6), 937–945.
王, H。, 王, R。, Wen, C。, 李, S。, Jia, Y。, … Wang, X. (2020).
Author name disambiguation on heterogeneous information net-
work with adversarial representation learning. 在诉讼程序中
the 34th AAAI Conference on Artificial Intelligence (PP. 238–245).
https://doi.org/10.1609/aaai.v34i01.5356

王, J。, 李, G。, 于, J. X。, & 冯, J. (2011). Entity matching: 如何
similar is similar. Proceedings of the VLDB Endowment, 4(10),
622–633. https://doi.org/10.14778/2021017.2021020

王, K., 沉, Z。, 黄, C。, 吴, C。, Eide, D ., … Rogahn, 右.
(2019). A review of Microsoft Academic Services for science of
science studies. Frontiers in Big Data, 2, 45. https://doi.org/10
.3389/fdata.2019.00045, 考研: 33693368

王, K., 沉, Z。, 黄, C。, 吴, C.-H., Dong, Y。, & Kanakia, A.
(2020). Microsoft Academic Graph: When experts are not
足够的. Quantitative Science Studies, 1(1), 396–413. https://土井
.org/10.1162/qss_a_00021

王, Q., Mao, Z。, 王, B., & Guo, L. (2017). 知识
graph embedding: A survey of approaches and applications.
IEEE Transactions on Knowledge and Data Engineering, 29(12),
2724–2743. https://doi.org/10.1109/TKDE.2017.2754499

王, R。, 严, Y。, 王, J。, Jia, Y。, 张, Y。, … Wang, X. (2018).
AceKG: A large-scale knowledge graph for academic data min-
英. In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (PP. 1487–1490).
https://doi.org/10.1145/3269206.3269252

王, Z。, 张, J。, 冯, J。, & 陈, Z. (2014). Knowledge graph
embedding by translating on hyperplanes. 在诉讼程序中
28th AAAI Conference on Artificial Intelligence (PP. 1112–1119).
Wilkinson, 中号. D ., Dumontier, M。, Aalbersberg, 我. J。, Appleton, G。,
Axton, M。, … Mons, 乙. (2016). The FAIR Guiding Principles for
scientific data management and stewardship. Scientific Data, 3(1),
1–9. https://doi.org/10.1038/sdata.2016.18, 考研: 26978244
Wilson, D. 右. (2011). Beyond probabilistic record linkage: 使用
neural networks and complex features to improve genealogical
record linkage. 在诉讼程序中 2011 国际联合
Conference on Neural Networks (PP. 9–14). https://doi.org/10
.1109/IJCNN.2011.6033192

Winkler, 瓦. 乙. (1999). The state of record linkage and current
research problems. In Statistical Research Division, US Census
局. World Higher Education Database (2021). https://万维网
.whed.net/home.php.

World Higher Education Database. (2021). https://www.whed.net

/home.php.

徐, X。, 李, Y。, Liptrott, M。, & Bessis, 氮. (2018). NDFMF: 一位作家
name disambiguation algorithm based on the fusion of multiple
特征. 在诉讼程序中 2018 IEEE 42nd Annual Computer
Software and Applications Conference (PP. 187–190). https://土井
.org/10.1109/COMPSAC.2018.10226

哪个, B., Yih, W., 他, X。, 高, J。, & Deng, L. (2015). 嵌入
entities and relations for learning and inference in knowledge
bases. In Proceedings of the 3rd International Conference on
Learning Representations.

哪个, Z。, Dai, Z。, 哪个, Y。, Carbonell, J. G。, Salakhutdinov, R。,
& Le, 问. V.
(2019). XLNet: Generalized autoregressive pre-
这
training for language understanding.
Annual Conference on Neural Information Processing Systems
(PP. 5754–5764).

在诉讼程序中

张, S。, Xinhua, E., & Pan, 时间. (2019). A multi-level author name
disambiguation algorithm. IEEE Access, 7, 104250–104257.
https://doi.org/10.1109/ACCESS.2019.2931592

张, W., 严, Z。, & 郑, 是. (2019). Author name disambigua-
tion using graph node embedding method. 在诉讼程序中
23rd IEEE International Conference on Computer Supported
Cooperative Work in Design (PP. 410–415). https://doi.org/10
.1109/CSCWD.2019.8791898

郑, D ., 歌曲, X。, Ma, C。, Tan, Z。, 叶, Z。, … Karypis, G. (2020).
DGL-KE: Training knowledge graph embeddings at scale. In Pro-
ceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval (PP. 739–748).
https://doi.org/10.1145/3397271.3401172