ARTICLE DE RECHERCHE

ARTICLE DE RECHERCHE

A supervised machine learning approach to trace
doctorate recipients’ employment trajectories

Dominik P. Heinisch1

, Johannes Koenig1,2

, and Anne Otto2

1University of Kassel, Institute of Economics and INCHER-Kassel (Allemagne)
2Institute of Employment Research (IAB) Rhineland-Palatinate-Saarland (Allemagne)

Mots clés: PhD, employment biographies, administrative data, record linkage, supervised machine
learning

ABSTRAIT
Only scarce information is available on doctorate recipients’ career outcomes (BuWiN, 2013).
With the current information base, graduate students cannot make an informed decision on
whether to start a doctorate or not (Benderly, 2018; Blank et al., 2017). Cependant,
administrative labor market data, which could provide the necessary information, sont
incomplete in this respect. In this paper, we describe the record linkage of two data sets to
close this information gap: data on doctorate recipients collected in the catalog of the German
National Library (DNB), and the German labor market biographies (IEB) from the German
Institute of Employment Research. We use a machine learning-based methodology, lequel (un)
improves the record linkage of data sets without unique identifiers, et (b) evaluates the
quality of the record linkage. The machine learning algorithms are trained on a synthetic
training and evaluation data set. In an exemplary analysis, we compare the evolution of the
employment status of female and male doctorate recipients in Germany.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

1. RECORD LINKAGE OF INTEGRATED EMPLOYMENT BIOGRAPHY DATA

Au cours des dernières années, the availability of comprehensive new administrative data sets on individual
labor market biographies has enabled numerous studies in economics and other social sci-
ences covering a wide range of labor market topics. Cependant, administrative labor market re-
cords comprise a limited set of variables, thus narrowing the scope of potential research
questions that can be addressed. Only scarce information is available about the career out-
comes of doctorate recipients in Germany (BuWiN, 2013). This holds particularly for those
doctorate recipients who pursue careers in the nonacademic sector. Knowing more about their
labor market biographies is not only important for universities and policymakers. Without
knowledge about potential career outcomes, students cannot make an informed decision on
whether to start doctoral training or leave academia (Benderly, 2018; Blank et al., 2017).

The objective of the IAB-INCHER project of earned doctorates (IIPED) is to construct a com-
prehensive data set on labor market biographies of German doctorate recipients. The Integrated
Employment Biographies (IEB) of the Institute for Employment Research (IAB) cover labor market
records of about 80% of the German workforce. They comprise detailed individual-level infor-
mation on sociodemographic characteristics, qualification levels, and job characteristics.
Cependant, there is no information about earned doctoral degrees. The catalog of the
German National Library (DNB) provides this information. The DNB covers almost all
German universities’ doctorate recipients from 1970 to today. The DNB only provides

un accès ouvert

journal

Citation: Heinisch, D. P., Koenig, J., &
Othon, UN. (2020). A supervised machine
learning approach to trace doctorate
recipients’ employment trajectories.
Études scientifiques quantitatives, 1(1),
94–116. https://est ce que je.org/10.1162/
qss_a_00001

EST CE QUE JE:
https://doi.org/10.1162/qss_a_00001

Reçu: 12 Avril 2019
Accepté: 3 Août 2019

Auteur correspondant:
Johannes Koenig
Koenig@uni-kassel.de

Éditeur de manipulation:
Ludo Waltman

droits d'auteur: © 2019 Dominik P. Heinisch,
Johannes Koenig, and Anne Otto.
Publié sous Creative Commons
Attribution 4.0 International (CC PAR 4.0)
Licence.

La presse du MIT

Doctorate recipients’ employment trajectories

sufficient information for conventional record linkage (par exemple., exact dates of birth) for a minority
of individuals. To be able to link both data sets on a large scale, we apply a record linkage
procedure that utilizes supervised machine learning algorithms, which are trained on a syn-
thetic training and evaluation data set.

Numerous prior studies have used record linkage methods (Rapide, 2013) to supplement
administrative labor market data. In many cases, the record linkage could be based on unique
identifiers available in both data sets (par exemple., name–surname combination, exact birth date, sex).
If identifiers are incomplete or not fully reliable, more advanced “Merge Toolboxes” are avail-
capable, which utilize string-comparison functions to calculate similarities between key words (par exemple.,
employer’s name) in both data sets (Schnell et al., 2004). Even if conventional approaches are
able to successfully link two data sets, a proper evaluation of the linked data set’s quality (dans
terms of recall and precision) would be advisable, rather than only reporting the number of final
matched entities. Multiple matches between entries are another problem that our approach is
able to take into account.

To overcome the limitations of existing record linkage methods, we develop and assess a
set of supervised machine learning algorithms. This approach has several advantages: D'abord, it is
not restricted to data with high-quality identifiers. Deuxième, the quality of the linked data set is
assessable and comparable across different algorithms, as well as to conventional record link-
age approaches. Troisième, our approach is applicable under strict data security requirements and
ensures the rigorous anonymity of individual records, which are indispensable requirements in
any use of social security data in Germany. Fourth, we utilize a synthetic training and evalu-
ation data set, which allows us to evaluate the quality of the record linkage in the absence of
external training and evaluation data.

Even though unique identifiers are absent in both data sets, the final linked data set meets
high quality standards in terms of precision and recall. All tested supervised machine learning
algorithms outperform heuristic (rule-based) approaches. Achieving a high recall rate not only
allows researchers to address questions requiring larger and more complete samples, it also
enables differentiation among subgroups. En outre, as the algorithm uses multiple features to
predict true-positive matches, it is less likely to introduce bias into the sample. While the syn-
thetic test and evaluation data set might by itself act as a source of bias, we do not find any
distortions on observables. Depending on the parameter settings, the quality of the linked data
sets can vary for each algorithm, which highlights the necessity of independent training and
test data for selecting the best parameter specifications.

The obtained linked data set allows us to investigate the labor market trajectories of German
doctorate recipients from 1975 à 2015 before, pendant, and after their graduation. As a practical
application, we use the final data set to analyze the employment status of doctorate recipients
at different points of time in their careers. En particulier, we analyze gender-specific differences
in the share of full-time and part-time employment during doctorate recipients’ careers. Nous
find that few doctorate recipients are unemployed after graduation. Cependant, a substantial
share of female doctorate recipients work part time. While female and male doctorate recip-
ients show similar employment patterns during their graduation period, the share of part-time
and full-time employed women diverges after that.

Our study is not solely limited to Germany. From a methodological point, the introduced
method could be applied by further studies to improve the quality of record linkage ap-
proaches for the combination of micro data sets. From an empirical standpoint, Germany is
one of the biggest “producers” of doctorate recipients among the OECD countries (OECD,
2018) and a huge labor market with a great variety of job positions for graduates.

Études scientifiques quantitatives

95

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Investigating the career trajectories of doctorate recipients in Germany contributes to increasing
the required transparency for graduates’ potential career outcomes in the academic and private
sector. This evidence can thus also help students in other countries to make better informed
decisions for the planning of their further careers.

The paper is structured as follows: In Section 2, the data sets of the record linkage approach
are described. Section 3 presents the supervised machine learning algorithms in detail, aussi
as their implementation, and evaluates the different approaches we tested. In Section 4, le
linked data set is used to investigate the employment status of doctorate recipients over time.
In Section 5, we discuss some limitations of the proposed approach and draw implications for
further research. Section 6 concludes.

2. DATA SOURCES

Dans cette section, we introduce the two data sets that are integrated by record linkage: the Integrated
Employment Biographies (German: Integrierte Erwerbsbiographien [IEB]) and the data set of
doctorate recipients from the German National Library (Deutsche Nationalbibliothek [DNB]).
Both data sets provide a nearly complete picture of the corresponding populations: Le
German workforce (subject to social security payments) is represented in the IEB and doctorate
recipients who graduated from German universities are represented in the DNB. Par conséquent, le
DNB data provide a suitable supplement for the IEB, where information about tertiary education
is incomplete. Both data sets are collected by public institutions following standardized proce-
dures and regularities in the data preparation process, which makes them highly reliable and
suitable for research purposes. While the DNB data have only been merged via record linkage
with publication data (Heinisch & Buenstorf, 2018), the IEB data have been merged via record
linkage with a number of external micro databases in the past (par exemple., Antoni & Seth, 2012;
Dorner et al., 2014; Wydra-Somaggio, 2015; Teichert et al., 2018).

2.1. Doctorate Recipients Data of the German National Library (DNB)

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

The DNB catalog covers almost the entire population of individuals who completed doctoral
training at German universities—doctorate recipients—encompassing about one million au-
thors of dissertations.1 Two peculiarities lead to this. D'abord, all German publications (published
in Germany or by Germans) are held by the German National Library, which is “entrusted with
the task of collecting, permanently archiving, bibliographically classifying and making avail-
able to the general public all German and German-language publications from 1913” (DNB,
2018). According to §§14 to 16 of the Act on the German National Library, media works are to
be delivered to the library if a holder of the original distribution right has their registered office,
a permanent establishment, or the main place of residence in Germany. Deuxième, en Allemagne,
doctoral students are obliged to publish their thesis in order to be awarded a doctorate from a
German university, and the German National Library tracks thesis publications.2

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Within the catalog of the German National Libraray, a separate note provides additional
information on the type of publication, the year of submission, and the corresponding univer-
sity name. Since data are selected by librarians for the purpose of archiving and classifying

1 The German National Library makes its data accessible under the Creative Commons Zero license (CCO

1.0).

2 The DNB data set has been used for various analyses. Par exemple, Buenstorf and Geissler (2014) studied
advisor effects based on laser-related dissertations, and Heinisch and Buenstorf (2018) identified the doctoral
advisors of doctorate recipients. Both studies confirm the high reliability and completeness of the DNB data.

Études scientifiques quantitatives

96

Doctorate recipients’ employment trajectories

Tableau 1. Illustration of the DNB data

dnb_id
87640472

nom
Marta

surname
Musterfrau

birth_year
NA

genre
female

nationalité
German

uni_name
Kiel

publication_year
2010

sujet
Economics

12342124

Max

Maulwurf

07986678

Martine

Mustermann

1979

NA

male

male

German

Jena

Italian

Kassel

2008

1993

Medicine

Engineering

Note: The table provides fictitious examples of the DNB data set.

these publications, bibliographic information is documented with an overall high degree of
accuracy.3 The coverage is almost complete for all years and disciplines.

Depuis 1995 à 1997 onwards, the DNB created the Personennormdatei, a data set compris-
ing all authors as separated entities. This additional catalog improves the information available
on authors. Beginning in 1997, the year of birth is recorded for the majority of authors in the
data set, as well as additional information on authors’ nationality. Cependant, most of these var-
iables cannot be used as identifiers (variables) for the linkage procedure, because the coverage
rates vary strongly over time. A stylized example of the DNB data is provided in Table 1.

2.2.

Integrated Employment Biographies (IEB)

The IEB unites data from five different historic data sources, each capturing a different segment
of the German social security system.4 It contains detailed information on all individuals who
are liable to social insurance contributions in Germany (c'est à dire., employees, unemployed individ-
uals, job seekers, recipients of social benefits and participants in active labor market pro-
grams). Civil servants, self-employed, family workers, and doctoral candidates financed
solely by scholarships etc. are not part of the social security system and therefore not reported
in the IEB. Taken together, the data cover approximately 80% of the German workforce.

The IEB data comprise the starting and ending dates of all spells (c'est à dire., periods of unemploy-
ment, benefit receipt, employment) pour chaque individu (vom Berge et al., 2013). En plus, pour
each individual a range of sociodemographic characteristics is documented (par exemple., sex, date of
birth, nationalité, qualification level, job features [type of employment, occupation, industry af-
filiation, region of workplace]). While, although incomplete, information of vocational training
certificates obtained, or bachelor’s and master’s degrees, is part of the IEB, no information on
doctoral degrees exists. Information is available on a daily basis from 1975 to the most current

3 Néanmoins, some effort was necessary to clean the data. The names of the individuals were standardized.
Par exemple, name information was coded in UTF-8 and separated by commas into first and last names.
Plus loin, all variants of misspelled university names were checked manually and assigned to the correspond-
ing institution. Year information in the database was corrected for nonplausible cases. Electronic resources
were also added to the database. Au cours des dernières années, some dissertations have been included exclusively as elec-
tronic resources. Cependant, many electronic doctorate theses are also listed as a physical book. Plus loin, dif-
ferent versions of the same work are possible, such as university deposit copies and commercial publisher
editions, with possible later new editions. The database was cleaned for these duplicates, which were iden-
tified by two different approaches. If a reference is made in the DNB’s title holdings to an identical publi-
cation other than the original publication, these publication are considered identical. Cependant, an explicit
reference to identical publications is not given for all double-listed publications. Donc, duplicates were
detected based on title information. Titles and subtitles were standardized (c'est à dire., punctuation marks, upper
and lower case, les espaces, etc.. were removed) and cleaned (c'est à dire., names and other nontitle information were
removed) and a fuzzy string comparison was used to take care of small variations. Plus loin, we excluded all
authors with incomplete name information (par exemple., entries with missing first name or surname). See also
Heinisch and Buenstorf (2018) for further details.

4 These five data sources are the Employee History, Benefit Recipient History, Unemployment Benefit II

Recipient History, Participants-in-Measures History, and the Jobseeker History.

Études scientifiques quantitatives

97

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

D
o
c
t
o
r
un
t
e

r
e
c
je
p
je
e
n
t
s
'

e
m
p
je
o
oui
m
e
n
t

t
r
un
j
e
c
t
o
r
je
e
s

iab_id
92240472

employment
Mini-job

begin_date
01/01/1996

end_date
31/12/1996

place_work
Kiel

school_degree
A level

apprenticeship
No qualification

class_econ_activity

49.32 Taxi operation

Tableau 2. Illustration of the IEB data

92240472

Part-time

01/01/1997

31/12/1997

92240472

Part-time

01/01/1998

31/12/1998

92240472

Unemployed

01/01/1999

31/01/1999

Kiel

Kiel

Kiel

92240472

Full-time

01/02/1999

31/12/1999

Berlin

A level

A level

A level

A level

University degree

85.42 Tertiary education

University degree

85.42 Tertiary education

University degree

University degree

72.11 Research and

experimental development
on biotechnology

92240472

Full-time

01/01/2000

31/12/2000

Berlin

A level

University degree

72.11 Research and

experimental development
on biotechnology

32134444

Mini-job

01/06/2003

31/08/2003

Buxtehude

No qualification

No qualification

55.20 Holiday and other

short-stay accommodation

32134444

Mini-job

01/07/2004

31/09/2004

Jena

École primaire

No qualification

55.10 Hotels and similar

accommodation

32134444

Part-time

01/01/2007

31/12/2007

Jena

32134444

Full-time

01/01/2008

31/12/2008

Halle

A level

A level

University degree

86.10 Hospital activities

University degree

86.10 Hospital activities

20347523

Part-time

01/08/1980

31/12/1980

Frankfurt

École primaire

Vocational training

4.11 Central banking

20347523

Full-time

01/01/1981

31/12/1981

Frankfurt

École primaire

Vocational training

66.11 Administration of

financial markets

Note: The table provides fictitious examples of the IEB data set.

Q
toi
un
n

t
je
t

un

je

t
je
v
e
S
c
e
n
c
e
S
toi
d
e
s

t

je

9
8

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Chiffre 1. Overview of the data processing and record linkage procedure.

year for West Germany, et de 1993 for East Germany. Ainsi, the IEB enables labor market
biographies of individuals in the public and the private sector to be tracked over time.

The IEB data are highly reliable for all variables that are directly relevant for social insur-
ance contributions. Cependant, some information in the data, such as information on secondary
schooling, is less reliable, as it is transmitted by the employer solely for statistical purposes
(Fitzenberger et al., 2005). En outre, some variables contain missing values, which vary
au fil du temps (par exemple., Antoni et al., 2016). Confidential information that would make individuals
identifiable (par exemple., name and address) is not accessible for researchers (Rapide, 2013). An anon-
ymized system-independent individual identifier links social security registers and administra-
tive data of the Federal Employment Agency (Dorner et al., 2014).5 Tableau 2 shows a fictitious
example of the preprocessed IEB data.

3. CLASSIFYING DOCTORATE RECIPIENTS IN THE GERMAN LABOR MARKET DATA

3.1. Problem Description

Dans cette section, we first describe the general record linkage problem, and then expand on it in
terms of its applicability to social security data, where researchers have to deal with large vol-
umes of highly sensitive data. The record linkage procedure aims at identifying as many entries
in both data sets that belong to the same entity. This target function is optimized under the
constraint of keeping the number of incorrect matched entries as low as possible. To achieve
this target, a two-step procedure is applied: D'abord, entries in both data sets are matched by using
an imperfect identifier (c'est à dire., the names of individuals). Deuxième, falsely matched combinations
are eliminated. Chiffre 1 presents an overview of the record linkage approach described in this
section.

5 The IEB and its scientific use file have been extensively discussed in the past. Voir, Par exemple, Dorner et al.
(2010) for a brief discussion of the IEB, Oberschachtsiek et al. (2009) for a more detailed description of the
IEB sample, and Zimmermann et al. (2007) for the scientific use file.

Études scientifiques quantitatives

99

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

The first step aims to match as many entries as possible of both data sets that might belong
to one entity. Autrement dit, in the first step, the data sets are actually linked. This can be
achieved, Par exemple, by exact string matching between entries’ names, or by calculating
distances between the entries’ names using a fuzzy string matching algorithm. The second step
aims to identify as reliably as possible true linked entities among the matched entry pairs. Dans
autres mots, in the second step, correctly linked entries that belong to one entity are filtered
from incorrectly matched entries. As social security data comprise large volumes of data with
many homonyms (in our case the entire German workforce), the filtering of true-positive
matched entries is a more serious problem, in particular, as incorrectly spelled names are less
frequent in administrative data. Donc, this paper is primarily focused on improving the
second step of the record linkage procedure.

The linked entries of both data sets by a specific identifier will result in 0-to-n possible com-
binations of matched entries, of which 0-to-1 combinations truly belong to one entity. In those
cases, where multiple entries match into one entity, many-to-many (n-to-m) matched entries
occur. Identifying the true matched entities in a set of n-to-m matched entries can be described
as a classification problem. The following description of the classification problem is based on
Gareth et al. (2013) and Bishop (2006). Officiellement, the classification task is to find a function f(X)
that correctly classifies two matched entries of both data sets as one entity. With a quantitative
response variable Y 2 c(Same, Different) and using a set of p different predictors,
(cid:1)

(cid:3)

X ¼ X1; X2; ; Xp

où (cid:2) is the error term.

Y ¼ f Xð Þ þ (cid:2);

In practice, there are numerous restrictions that complicate the estimation of the classifica-
tion function f: Unique entity identifiers (or keys) and reliable predictors such as combinations
of name, birthday, and birthplace may be lacking. Even if the available data are generally of
high quality, information may be imprecise, misreported, or incomplete for individual entries.
Even in cases where reliable predictors exist, privacy requirements may restrict the number of
predictor variables X that are accessible to researchers.

If the reliability of a single or multiple predictors cannot be ascertained, or if only a set of
weak predictors is available, machine learning algorithms can improve the record linkage
qualité. Machine learning algorithms have been applied to a number of record linkage prob-
lems and several solutions are available (par exemple., Christen, 2012b). In this paper, we use machine
learning algorithms to solve the classification problem described above in accurately filtering
true matched entries. In this case the classification problem can be described as the best com-

bination of available input variables X that predict

^
Oui:

Y ¼ ^
^

f Xð Þ;

avec

^
Y as classification output and

^
f as our estimation equation for the classification function f.

The accuracy of
and irreducible errors:

^
Y depends on two aspects, as the following equation shows: the reducible

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

h

¼ E f Xð Þ þ (cid:2) − ^

f Xð Þ

¼ f Xð Þ − ^

f Xð Þ

þ Var (cid:2)ð Þ:

i2

h

i2

(cid:3)2

(cid:1)
E Y − ^
Oui
h
f Xð Þ − ^

The reducible error

i2

results from

^
f not being a perfect estimation for f. As the

f Xð Þ

name implies, the reducible error can be reduced by more sophisticated statistical learning

Études scientifiques quantitatives

100

Doctorate recipients’ employment trajectories

methods or by increasing the input variables’ X predictive power. In contrast, the irreducible
^
error Var((cid:2)) would persist even if
f were a perfect approximation of f. The set of input variables
X entering into function f cannot predict (cid:2) by definition, as they result from errors in measuring
X. A suitable classification procedure identifies the best functional relation of X in
que

^
F

approximates f, by minimizing the reducible error

h
f Xð Þ − ^

f Xð Þ

i2

.

Solving classification problems is a traditional field of application for machine learning
techniques. Machine learning algorithms can help to find suitable approximations of the clas-
sification function f (Christen, 2012un). Cependant, these approaches have not found much use in
research using administrative labor market data. Record linkage procedures used in this con-
text have mostly been based on heuristic approaches. Data are linked by calculating simi-
larities between names (Rapide, 2013) and “rules-based” heuristics (par exemple., information on
whether two entries originate from the same or different regions). Applying heuristic ap-
proaches requires high-quality data. Even then, heuristic approaches do not exploit the full

potential of the data, because they do not use the optimal functional form of
representation of X.

^
f or the best

^
F (X).
A wide selection of sophisticated classification algorithms is available to estimate
These can broadly be categorized into deterministic, probabilistic, et (machine) learning-
based approaches (Christen, 2012b). Higher predictive power can be expected for supervised
machine learning techniques. Supervised machine learning algorithms require training data to
approximate the best representation of f by a specific representation of the input variables X. UN
wide variety of machine learning algorithms have been developed, and the choice of specific
algorithms involves a trade-off between classification quality and computational demands. Dans
addition, not all algorithms are implemented in the statistical software packages available in
the settings where administrative data may be accessed.6 Reflecting these considerations, notre
approach utilizes three well-known machine learning algorithms: regularized logistic regres-
sions, AdaBoost, and Random Forests.7

For these machine learning algorithms, we do not know the best model specification for our
classification problem a priori. Donc, our machine learning algorithms need to be tuned to
discover the parameter setting that results in the most powerful prediction to correctly classify
entities. Different ways exist to identify the best tuning parameters. Ici, our approach is
based on trial and error. A regularized logistic regression estimates a logistic regression model
with an additional penalty term to avoid overfitting. This requires ex ante specification of both
the penalty parameter and a threshold probability value above which estimated matches are
classified as belonging to the same entity. The Random Forest algorithm uses decision trees for
classification. By randomly selecting a set of m variables, a specific number of n decision trees
is constructed. Each decision tree uses these m variables to split the data set-specific thresholds
to classify the data into matches and nonmatches. A sequence of multiple splits divides the
data into distinct decision regions. A majority vote over the n decision trees decides on the
class of each entry in the matched data set. The number of randomly drawn variables (m) et

6 The respective administrative data can only be used on secured machines available at IAB. More advanced
méthodes, such as multilayer neuronal networks, are computationally intensive and their application is not
technically feasible in our case.

7 All algorithms used are available as R Packages. We used the programming language R Version: 3.3.2 (R.
Core Team, 2017) and the following R packages: for AdaBoost the package ada (Culp et al., 2006), for reg-
ularized logistic regressions the package glmnet (Friedman et al., 2010), and for Random Forest the package
randomForest (Liaw and Wiener, 2002).

Études scientifiques quantitatives

101

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

the number of trees (n) must be specified ex ante. AdaBoost is a boosting method developed
for binary outcome variables. Similar to Random Forests, it is based on decision trees, mais le
classifiers are trained sequentially. After each iteration, the classification output is weighted by
its classification success, giving a higher weight to misclassified matched entries in the next
itération. After converging, all decision trees give a majority vote on the matched entries class.
The number of iterations and weights must be set as parameters ex ante.8

In our approach, these machine learning algorithms are tested against a heuristic (rule-
based) classification. For the heuristic classification approach the number of variables consid-
ered in classification needs to be specified ex ante. For the heuristic, we generated all possible
combinations of the matching variables used. Par conséquent, we get a number of possible deci-
sions where only one of the possible matching variables, up to all of these matching variables,
needs to take the value 1. The heuristic then classifies pairs of entries by comparing one or
several variables. Par exemple, one relative restrictive heuristic approach could classify entries
as belonging together if they have the same name and surname in both databases, they were
employed in the university region 5 years before or after graduation, they had a proper edu-
cation, and they are working at a university or research institue. A less restrictive approach
could link all entries with same name-surname combination and a proper age. The common
^
f that accurately separates the spaces
objective of all these approaches is to develop a function
of same versus different entities in both data sets. Applying different model specifications en-
ables us to select from a range of models with different properties. The aim of this task is to find
an optimum between precision and recall; c'est, to link as many entries of both data sets as
possible (high recall) while minimizing the number of false classification decisions (haut
precision).

Overfitting is a serious risk when the best algorithm is selected. Overfitting means that the
^
f follows the error term Var((cid:2)), generating estimates for f that are as close as
prediction function
possible to the observed training data, but not allowing accurate estimates for new observa-
tions outside the training data. Dans ce cas, the trained algorithm is useless, as the trained
model is an exact representation of the training data but cannot be generalized to other data.
^
f that predicts our outcome variable Y as well as
This would fail the task of finding a function
possible: Y ≈ ^

F (X) for any observation.

To overcome overfitting, out-of-bag predictions are used to evaluate the algorithms’ classi-
fication success. These require an independent data set that has not been used in training the
algorithms. The training data are split into several data sets that are specifically used first for
entraînement, second for identification of the correct parameters, and third for evaluation. For train-
ing and evaluation, data are required for which true outcomes of the quantitative response
variable Y 2 c(Same, Different) are known to the researcher.

3.2. Preprocessing and Record Linkage

Dans cette section, we discuss the application of the record linkage procedure described in
section 3.1 to classify correctly dissertation authors from the DNB data set in the IEB data set.

8 The regularized logistic regression was estimated with values for the penalty parameter of 0, 0.3, 0.5, 0.7,
et 1. For the threshold probability, we selected values ranging from 0.1, 0.2, 0.5, et 0.6 à 0.8. For the
Random Forest algorithm, the number of randomly drawn variables (m) was set to 2, 3, et 5. The number of
trees (n) was specified as 20, 100, 200, et 500. For AdaBoost, the number of iterations used for estimation
were set to 50, 100, 250, et 500 and the weights were set as parameters to 0.01, 0.2, 0.5, 0.9, 1.

Études scientifiques quantitatives

102

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

3.2.1. Data preprocessing

Even though both data sets are of high quality, several preprocessing steps were required be-
fore the actual record linkage (see footnote 3). The cleaned dissertation data set includes
984,359 doctorate recipients. In a second step, the DNB data set is merged with the IEB data.
For this step, confidential name-surname information is required, which is both not contained
in the anonymized IEB and not accessible for researchers. In the IAB, it is only possible to use this
information for data linkage with a reasoned data request and if the data linkage is conducted in
a secured technical environment, assuring data protection of the confidential information
(Dorner et al., 2014).9 For this reason, the Data Information Management (DIM) Department
of the IAB, which fulfills these technical prerequisites, is working as a data trustee for the data
linkage.10 First, the data linkage was conducted for exact name-surname combinations. Unlike
other data sets (par exemple., patent data), both data sets are of comparable high quality regarding the
spelling of names, including spellings using German umlauts. We therefore used a naïve
string-matching algorithm to minimize the number of false-positive matched pairs. With naïve
string matching for 876,927 entries, at least one corresponding individual with the same name-
surname combination was identified in the DNB data, avec 18,787,699 corresponding entries in
the IEB. The IEB includes only individuals covered by the German social security system, but not
others such as civil servants or students receiving scholarships, which could explain why some
names of doctorate recipients do not match with any entry in the IEB (see above).

To ensure data security, each researcher working with the IEB is only allowed to use a re-
stricted sample of the IEB. For this reason, the maximum number of multiple matched entries to
individuals was limited to not more than 300 namesakes in the IEB. This excludes doctorate
recipients with very common name-surname combinations (par exemple., “Werner Müller”). If we had
included all matches that exceed the threshold in the matching process, it would have been
necessary to use an extraordinarily large sample of the IEB, since some doctorate recipients
had up to 73,212 name twins. The final data set is further limited to doctorate recipients who
graduated between 1975 et 2015. East German doctorate recipients graduating before 1990
also had to be excluded because reliable IEB employment periods are only available for East
Germany beginning in 1993. To save computational power and reduce the number of false-
positive matched pairs, we deleted all matched pairs aged below 20 in the year of submission.
In Germany individuals usually receive their doctoral degree at the age of 32.5 années. If an
entry in the DNB database is connected to a number of entries in the IEB database while some
of them are aged below 20 in the year of submission, these entries most certainly do not be-
long to the same entity. Summing up, the final database contains information about 687,979
doctorate recipients from the DNB and the corresponding 15,468,638 IEB entries.

9 The IAB as a whole fulfills the legal requirements for data security, as it is a department of the Federal
Employment Agency in Germany, which in turn is obliged to ensure data security as a social service provider
in accordance with the standards of §78 Social Security Code X.

10 The DIM Department carried out the record linkage using individual identifiers (par exemple., first name, surname) dans
both data sets, and it alone stores this information. Alors, the DIM Department pseudonymized the personal
data according to the legal definition of §3 para. 6a Federal Data Protection Act and replaced them with iden-
tification numbers. The correspondence tables of this data linkage were only provided to the researchers as
anonymized data sets. The subsequent steps of data processing and matching were carried out only based upon
this anonymized data. The risk of restoration of the personal reference is countered by administering the con-
fidential personal data, which are required for the identification of the cases, only from the data trustee. In the
end, the researcher only has access at IAB to the final anonymized data set for further scientific work. Quand
publishing results, care is taken to ensure that only sufficiently large case numbers that do not allow conclu-
sions to be drawn about individuals are presented.

Études scientifiques quantitatives

103

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

3.2.2. Generation of synthetic test and training data

Supervised (machine) learning algorithms require training data to approximate the best predic-
tive model. Par conséquent, for training and evaluation of the algorithm, a set of reliable observa-
tions is necessary where matched entries belonging to one entity (true-positive matches) peut
be distinguished from false-positive matched entries (true-negative matches). Several strategies
can be applied to identify a “gold standard” sample that can be used to train and evaluate the
algorithme (Christen, 2012un). An ideal solution would require surveying a selection of doctorate
recipients asking about their realized career paths, or asking them to identify which career
trajectory belongs to them among all the matched entries. The responses would provide the
“gold standard” data set, which can be generalized to predict other matched entries. Cependant,
data security and practical reasons make this infeasible. D'abord, social security data are subject to
stringent data privacy requirements. The data are strictly anonymized, and contacting individ-
uals based on their private addresses is restricted as well. Deuxième, even if individuals could be
directly asked, mistakes as well as low response rates might reduce the representativeness of
the sample obtained. Donc, we created a synthetic training and evaluation data set from
the available data. One important aspect in creating a synthetic training and evaluation data
set is its representativeness of the overall (matched) population. It should contain the same
variables, which should moreover follow a similar frequency distribution and similar error
characteristics. In our approach, we use name-surname combinations, as we believe the fre-
quencies of name-surname combinations are independent of the variables used as classifiers.

For training the algorithm, we need both true-positive matches and true-negative matches.
For our synthetic training and evaluation data set, our true-positive matches (Oui 2 c(Same)) sont
based on unique name-surname combinations. These are doctorate recipients whose name-
surname combination appears only once in both databases: the Integrated Employment
Biography Data and the data set received from the catalog of the German National Library.
Since both data sets cover the underlying populations almost completely, these matched en-
tries are expected to belong to the same entity.11 For this approach, it is of only limited im-
portance that the IEB data only contain information for individuals that are liable to social
insurance contributions in Germany. Since it is expected that during their employment trajec-
tories the overwhelming majority of people are captured at least once in one of the different
segments of the German social security system, potential pairs are collected from the almost
complete underlying population. Par conséquent, entries that are linked based on name-surname
combination in both databases and where exactly one-to-one possible name-surname combi-
nation occurs, can be expected to be very likely to belong to the same entity.

For our true-negative matches, these uniqe DNB entries were merged with a random set of
entries from the IEB data set. As the name of an individual is highly gender dependent, we limit
the randomly matched sample to entries with the same name but different surname. This leads
to a sample where individuals were linked on same surname but different name. This proce-
dure leads to a large number of wrongly matched entries. To specify a representative number
of true-negative matched entries, we follow the overall distribution of matched entries and
randomly draw a similar number of matched entries for each wrongly matched DNB entry.

11 We performed a number of plausibility checks, which provided support to our conjecture. For example on an
aggregated level, we investigated the career paths of this unique name-surname combinations for different
sujets, genre, and years and compared their career paths to known career paths of doctorate recipients
from previous studies (par exemple., BuWiN, 2017). The identified career trajectories indicate plausibility of these
matches on an aggregated level.

Études scientifiques quantitatives

104

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Nom
spell_research

spell_hospital

prop_educ

age_sub

right_age

Tableau 3. Variables for machine learning

Description

Dummy, valeur 1 if individual has/had a spell at a university or research institute.

European statistical classification for economic activities was used.
Values were extended by record linkage for research institutions and universities.

Dummy, valeur 1 if individual has a spell in a hospital/medical practice.
European statistical classification for economic activities was used.

Source
IEB

IEB

Dummy, valeur 1 if education of individual belongs to university entrance qualification.

IEB

Continuous, age in submission year.

Dummy, valeur 1 if individual is between 25 et 40 years old in submission year.

Used for heuristic approach instead of age_sub.

same_ror_y5

Dummy, valeur 1 if individual was employed in university region 5 années

before/after graduation.

first_spell_before

Continuous, first year in IEB subtracted from year of submission.

right_first_spell_before

Dummy, valeur 1 if first_spell_before is between −10 and 5.
Used for heuristic approach instead of first_spell_before.

year_diss

eastern

Continuous, year of submission.

Dummy, valeur 1 if individual graduated in new federal states.

social science

Dummy, valeur 1 if individual graduated in social science.

natural science

Dummy, valeur 1 if individual graduated in natural science.

engineering

medicine

Dummy, valeur 1 if individual graduated in engineering.

Dummy, valeur 1 if individual graduated in medicine.

law/economics

Dummy, valeur 1 if individual graduated in economics/business studies/law.

nbr

Continuous, number of common namesakes in IEB Data.

Using this strategy, we obtain a synthetic training and evaluation data set, for which the true
matching status is known and which is representative of the overall matched population.12

3.2.3. Classification variables

Three types of variables are created that are used as classifications. The first set of variables
contains information on entries in the IEB data set (par exemple., an employment spell at a university);
the second one contains information on entries in the DNB data set (par exemple., the year of submis-
sion), and the third one contains information calculated from both data sets (par exemple., the lag be-
tween dissertation submission and the first employment spell). Tableau 3 gives an overview of

^
Oui. In Table 4, a stylized sample illus-
the classification variables X, which are used to predict
trates the final data set. Tables A1, A2, A3 (in Appendix A) provide descriptive statistics for an

12 The creation of the artificial training and evaluation database was technically executed by the DIM

Department of the IAB, which was working as a data trustee. See also footnote 10.

Études scientifiques quantitatives

105

IEB/DNB

IEB/DNB

IEB/DNB

IEB/DNB

IEB/DNB

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

DNB

DNB

DNB

DNB

DNB

DNB

DNB

IEB

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Tableau 4. Illustration of DNB-IAB record linkage

d

je

b
n
d

d

je

b
un
je

12342124

92240472

12342124

32134444

12342124

20347523

87640472

08898092

87640472

90980983

h
c
r
un
e
s
e
r

je
je
e
p
s

1

0

0

0

1

je
un
t
je

p
s
o
h

je
je
e
p
s

0

1

0

0

0

c
toi
d
e

p
o
r
p

1

1

0

0

1

b
toi
s

e
g
un

40

29

45

66

31

5
oui

r
o
r

e
m
un
s

0

1

0

0

1

e
r
o
F
e
b

je
je
e
p
s

t
s
r
je
F

−11

−5

−27

5

−10

s
s
je
d

r
un
e
oui

2007

2007

2007

2010

2010

n
r
e
t
s
un
e

1

1

1

0

0

.
s

je
un
je
c
o
s

0

0

0

0

0

.
s

je
un
r
toi
t
un
n

0

0

0

0

0

s
c
je
m
o
n
o
c
e
/
w
un
je

0

0

0

1

1

e
n
je
c
je
d
e
m

1

1

1

0

0

g
n
je
r
e
e
n
je
g
n
e

0

0

0

0

0

Note: The table shows the stylized IAB-DNB linkage in fictitious examples.

r
b
n

3

3

3

2

2

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Tableau 5. Descriptive statistics for the classification variables in the synthetic training and evaluation
data separated for true-negative and true-positive

Variable
spell_research

spell_research

spell_hospital

spell_hospital

prop_educ

prop_educ

age_sub

age_sub

right_age

right_age

same_ror_y5

same_ror_y5

first_spell_before

first_spell_before

right_first_spell_before

right_first_spell_before

Same
1

0

1

0

1

0

1

0

1

0

1

0

1

0

1

0

Median

1

0

0

0

1

0

31

36

1

0

1

0

−6

−11

1

0

Mean
0.6368

0.0657

0.3745

0.1008

0.9507

0.3238

32.5199

37.8844

0.8996

0.4546

0.7297

0.0156

−6.9672

−11.4541

0.7112

0.4242

Min
0

Max
1

0

0

0

0

0

20

20

0

0

0

0

−40

−45

0

0

1

1

1

1

1

91

102

1

1

1

1

37

39

1

1

Note: Descriptive statistics on the distribution of features used to classify true-positive matched entries in the IEB
and DNB data in the synthetic training and evaluation data set. The data are split into two samples: true-positive
matches based on unique name-surname combinations and true-negative matches based on entries with the
same name, but different surname. The true-positive matches are indicated by “Same” = 1.

Études scientifiques quantitatives

106

Doctorate recipients’ employment trajectories

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Chiffre 2. Recall-precision plots for estimated algorithms under different tuning parameters.

assessment of the representativeness of the synthetic training data set and the full (matched)
population.

Tableau 5 reports the general descriptive statistics for the classification variables separately for
the true-positive and true-negative matched entries in the synthetic training and evaluation
data set. Par exemple, à propos 63.68% of the individuals in the true-positive sample had one
employment spell at a university or other research institution (spell_research), as compared to
6.57% of the individuals in the true-negative sample, indicating a high predictive power for
the spell_research variable. This synthetic training and evaluation data set contains some
50,000 matched doctorate recipients with up to 300 potential matched IEB entries. We di-
vided this data set into two equal parts: a training data set and an evaluation data set. A block
randomization was applied to divide the data set into the two subsets. Block randomization is
a technique that reduces bias and balances the allocation of individuals into different subsets.
This increases the probability that each subset contains an equal number of multiple matched
entries.

Tableau 6. Classification results − best parameter settings (on training data set)

Model
Logistic

Random Forest

AdaBoost

Heuristic

Precision
0.9328

0.9457

0.9246

0.8991

+1 (best parameter)
Recall
0.7099

F1
0.8062

0.8520

0.8964

0.8602

0.8912

0.6786

0.7734

Accuracy
0.9860

Precision
0.9644

0.9919

0.9914

0.9826

0.9616

0.9268

0.8991

+1 (min recall 0.6)
Recall
0.6558

F1
0.7807

0.8287

0.8902

0.8534

0.8886

0.6786

0.7734

Études scientifiques quantitatives

Accuracy
0.9848

0.9916

0.9912

0.9826

107

Doctorate recipients’ employment trajectories

Tableau 7. Evaluation of the classification results − best parameter settings

Model
Logistic

Random Forest

AdaBoost

Heuristic

Precision
0.9410

0.9584

0.9196

0.9110

3.2.4. Model selection and evaluation

+1 (best parameter)
F1
0.8040

Recall
0.7018

0.8337

0.8605

0.6742

0.8917

0.8891

0.7749

Accuracy
0.9847

0.9910

0.9904

0.9825

For model selection, each classification algorithm was trained and tested for various parameter
specifications. Algorithms were trained on three quarters of the training data set and evaluated
(by recall and precision) on the remaining quarter. The results are shown in Figure 2, lequel
shows the recall-precision curve separately for alternative classification algorithms and model
specifications. Tableau 6 shows the best training results for our evaluation measures.

All algorithms achieve satisfactory classification results and would generally be applicable.
The heuristic approach also achieves sufficiently high values in terms of precision. In some
specifications it outperforms most of the more advanced and computationally demanding al-
gorithms.13 However, the more demanding algorithms outperform the heuristic approach in
that they reach comparable rates of precision but achieve substantially higher recall.
Depending on the parameter settings, the classification success of the specific algorithms var-
ies substantially (par exemple., results for the logit model vary from a recall/precision of 0.5683/0.8805
à 0.9840/0.5219). This illustrates the advantage of using a supervised learning approach, as it
allows the evaluation of the record linkage quality not only by how many individuals are
linked, but also by the achieved quality of linked entities.

We next selected those specifications of the algorithms that achieved the highest average
values in recall and precision and those with the highest precision and a recall of at least 0.6.
For the evaluation, we took the best parametrized models and trained them again on the full
training data set. Then we evaluated the trained models on the evaluation data set. Tableau 7
shows the further evaluation results. All models show qualitatively similar results. The Random
Forest algorithm outperforms the other algorithms. The best performing algorithm was then
used to classify true-positive matched entries in the full (matched) data set.

Based on the approach outlined above, the Random Forest algorithm identifies 552,459

^
Y = c(Same). If the Random Forest algorithm identifies more than one entry
individuals as
in the IEB that matches one entry in the DNB (or vice versa), then we decided to exclude
respective cases from the final data set. Ainsi, the final data set for the IAB-INCHER project
of earned doctorates (IIPED) consists of a total of 447,606 doctorate recipients, and the overall
matching quota amounts to 45.47%.

13 Par exemple, one heuristic classified matched entries as belonging to the same entity if a matched IEB entry
had a spell in a hospital/doctor’s office, a spell at a university/research institute, one spell in the university
region at least −5/5 years before/after submission, is aged between 25 et 40 at submission, and has a labor
market entry at least 10 years before or at least 5 years after submission. This heuristic reached a precision of
0.9889. Cependant, while being very precise, the heuristic is only able to link a very selective sample of doc-
torate recipients with the IEB data set, with a recall of 0.0962.

Études scientifiques quantitatives

108

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Tableau 8. Additional quality assessment

year of birth

genre

Same value in IEB and DNB data
95.33%

Different value in IEB and DNB data
4.67%

99.08%

0.92%

4. APPLICATION

Dans cette section, we evaluate data from the IAB-INCHER project of earned doctorates (IIPED) dans
two ways. D'abord, we assess how representative the linked data set is of the total population of
doctorate recipients in Germany. Deuxième, we present an exemplary analysis of the employ-
ment status of female and male doctorate recipients over time. This example is used to check
whether the empirical results obtained with the linked data set are consistent with existing
empirical evidence. Ce faisant, we explore whether the data can be used to analyze research
questions related to the labor market biographies of doctorate recipients in Germany.

4.1. The Labor Market Sample of Doctorate Recipients

Figure B1 depicts the share of linked doctorate recipients in the total population of doctorate
recipients over time. This share increases strongly from 34.51% in the starting year 1975 à
61.70% dans 2015. For doctorate recipients in the period before and after German reunification,
the matching quota lies at 39.61% et 57.43% respectivement. À 33.08%, the share of female
doctorate recipients in the merged database is comparable to the 33.51% share in the popu-
lation of doctorate recipients received from the DNB. Reliable information on domestic and
foreign doctorate recipients is available for selected years in the DNB catalog. Dans 2013, le
share of domestic doctorate recipients in the DNB was 85.37%, while the respective share
in the merged database is 87.62%, indicating that domestic-born doctorate recipients are
slightly overrepresented. Figure B2 illustrates the average shares of merged doctorate recipi-
ents by discipline over the entire observation period. Dans l'ensemble, average matching rates vary
across fields, with values ranging from 42.81% for sports to 60.88% for sciences and mathe-
matics. As additional evidence of matching quality, we compared variables in both data sets
(IEB and DNB) that were not employed in the matching procedure. Tableau 8 depicts the con-
sistency of linked entries for year of birth and gender, which were both not used as classifica-
tion variables because of limited coverage in the DNB data set. Both variables indicate high
accuracy for our record linkage procedure on an aggregated level. Néanmoins, in some cases
the identified linked entries were not correctly matched.

4.2. The Employment Status of Doctorate Recipients

We now investigate how the employment status of doctorate recipients changes before, dur-
ing, and after their doctoral studies. We differentiate among five types of employment status:
full-time job, part-time job, mini-job,14 vocational training, and unemployment. Chiffre 3
shows the employment status of all linked doctorate recipients in the final data set at different
points in time throughout their careers. As the exact date of graduation is unknown, our point
of reference (année 0) is the final day of the year in which the dissertation was published. La plupart
doctorate recipients hold full- or part-time positions, with only small shares of graduates being
unemployed, in vocational training, or holding mini-jobs at any point in time. Doctoral

14 The monthly income in a mini-job does not exceed A 450, and the number of working hours is limited to 15

per week.

Études scientifiques quantitatives

109

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Chiffre 3. Employment status over time before/after graduation.

students are often employed in part-time positions at universities or public research organi-
zations. The shares of part-time employment range between 44.71% et 34.70% pour 3 années
à 1 year before graduation, whereas postsubmission employment changes from part-time to
full-time positions in academia, other parts of the public sector, or the private sector.

The share of full-time jobs increases from 78.29% in year 0 to a maximum of 89.59%
3 years later, and then diminishes to 86.46% in year 10 after graduation. À son tour, the share
of part-time employment increases from 8.50% 3 years after to 11.28% 10 years after gradu-
ation. This change can be explained by male and female doctorate recipients following differ-
ent career patterns over time (voir la figure 4).15 While the majority of male graduates constantly
work full-time after their doctorate education, a larger share of women also have a part-time
position after graduation. This gender-specific full-time gap increases over time. While
94.34% of men are full-time employed 10 years after graduation, the corresponding share
among female doctorate recipients declines to 62.51% après 10 années. These results are in line
with existing evidence on gender-specific employment patterns, where female part-time em-
ployment is often attributed to an uneven distribution of family-related responsibilities, tel que
childcare and care of elderly family members among men and women (Wanger, 2015). These
results clearly demonstrate that the data from IIPED is representative of the overall doctorate
recipient population entering the German labor market, particularly in more recent cohorts.
The exemplary analysis of doctorate recipients’ employment status over time is in line with
previous findings. This data set can therefore be employed to study a wide range of research
questions related to the postdoctoral careers of doctorate recipients.

5. LIMITATIONS

As shown above, machine learning provides a suitable approach to overcome the limitations
of traditional record linkage methods. Cependant, machine learning comes with limitations of its

15 For the analysis, we used a sample of the fully linked data set, but we imposed some restrictions on the data.
Since data were collected for administrative purposes, we had to correct some spell information in the data
(Kaul et al., 2016) to construct the sample for the subsequent analysis. Plus loin, we dropped unreliable very
short (un-) employment episodes (below seven days). For the analysis, we use information on all graduates at
the end of a given year (Décembre 31) pour 3 years prior to and 10 years after the publication year of the
dissertation.

Études scientifiques quantitatives

110

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Chiffre 4.

Employment status over time before and after graduation, separately for male and female doctorate recipients.

propre, which are the focus of this section. Le plus important, as noted above, the linkage is
based on a synthetic training and evaluation data set. Ici, unique name-surname combina-
tions were merged with individuals sharing the same name but a different surname to receive
the true-negative sample of matched entries. While this method allows us to create a database
for training the algorithm that is as close as possible to the original database, this method is
biased if the characteristics of surnames are dependent on (some of) the classification
variables.

De plus, we carefully controlled the plausibility of the linked data for the unique name-
surname combinations. Néanmoins, this check was only possible at an aggregated level of
different disciplines and years before and after graduation. While the results were comparable
to other findings about the labor market trajectories of doctorate recipients at these aggregate
levels of analysis (for example to information from BuWiN, 2013, 2017), the chosen approach
could nevertheless lead to misclassifications in individual cases. En outre, the algorithm was
only used for doctorate recipients with equal to or fewer than 300 namesakes. Even if it is
expected that the algorithm would work sufficiently well for more than 300 potential matches
^
f for a
for each entity, more linkage variables X would be advisable for training the function
precise classification.

De plus, the IEB does not capture individuals who are not liable to social security con-
tributions (par exemple., fonctionnaires, self-employed individuals, and family workers). Donc, le
final database may be biased towards those doctorate recipients who are part of the German
social security system. Par exemple, certain occupations (par exemple., physicians and lawyers) are tra-
ditionally self-employed or employed as civil servants (par exemple., pastors, teachers). These graduate
groups are underrepresented in the database. En outre, the DNB only contains records of
published doctoral theses for German universities, while foreign doctorate recipients from
non-German universities are not covered.

6. CONCLUSIONS

In this paper, we describe our approach using machine learning techniques to improve the
record linkage of two sets of administrative data: a list of almost all German doctorate

Études scientifiques quantitatives

111

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

recipients collected in the catalog of the German National Library (DNB), and the Integrated
Employment Biographies (IEB) of the Institute for Employment Research (IAB). Linking these
data sets was motivated by an interest in studying the labor market trajectories of German
doctorate recipients at different stages of their careers. We show that supervised machine
learning algorithms can be fruitfully applied to the linkage of social security data with other
data. The proposed method has several advantages over traditional methods. On the one
main, its application is not restricted to micro data with overall high quality (où, for exam-
ple, name-surname combinations and exact birth dates or social security numbers are avail-
able as unique identifiers). En outre, the quality of the matching algorithm can be assessed
and compared to simple heuristics. D'autre part, the approach is applicable to contexts
with strong privacy requirements, as is the case for anonymous social security data.

Bearing in mind a number of limitations, an evaluation of the method provides the follow-
ing insights, which may help inform further work. D'abord, machine learning algorithms can be
trained on a synthetic training and evaluation data set if a “gold standard” sample is not fea-
sible, and a supervised machine learning algorithm can be used for classifying individuals in
administrative data. Deuxième, in our specific application, simple heuristics (as have been used
in prior record linkage approaches for German social security data) reach sufficiently high
rates of precision. Cependant, machine learning algorithms combine comparably high precision
with drastically improved recall. Troisième, depending on the tuning parameters used, each algo-
rithm can have a number of potential classification outcomes. This indicates a need to eval-
uate results from different algorithms.

The final database allows us to investigate the labor market trajectories of German doctor-
ate recipients before, pendant, and after their graduation from 1975 up to 2015. A first evalua-
tion of the database provides the following insights: While only a few doctorate recipients are
unemployed, we find a substantial share of female doctorate recipients working part time.
While female and male doctorate recipients show similar employment states during their grad-
uation period, the shares of part-time and full-time employment diverge over the career paths
of men and women.

REMERCIEMENTS

We thank Guido Bünstorf and the entire WISKIDZ-Team, Rasmus Bode, Tom Hanika, Andreas
Rehs, and Igor Asanov for their valuable and constructive suggestions during the planning and
development of this research work, as well as Judith Heinisch for her helpful comments.

CONTRIBUTIONS DES AUTEURS
Dominik P. Heinisch: Conceptualisation; Méthodologie; Logiciel; Visualisation; Writing—
brouillon original; Writing—review & édition. Johannes König: Analyse formelle; Enquête;
Méthodologie; Gestion de projet; Logiciel; Validation; Visualisation; Writing—original
brouillon; Writing—review & édition. Anne Otto: Conservation des données; Ressources; Logiciel;
Visualisation; Writing—original draft; Writing—review & édition.

COMPETING INTERESTS

The authors have no competing interests.

INFORMATIONS SUR LE FINANCEMENT

We gratefully acknowledge support from the German Federal Ministry of Education and
Research (BMBF) under grant number 16FWN016.

Études scientifiques quantitatives

112

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

DATA AVAILABILITY

Data used in this manuscript are subject to strict requirements on social data protection in
Germany and cannot be made available in a data repository. For further details, see Section 3.2.

RÉFÉRENCES

Antoni, M., & Seth, S. (2012). ALWA-ADIAB—Linked individual

survey and administrative data for substantive and
methodological research. Schmollers Jahrbuch, 132(1),
141–146.

Antoni, M., Ganzer, UN., & vom Berge, P.. (2016). Sample of
integrated labour market biographies (SIAB) 1975–2014.
FDZ-Datenreport, 4/2016.

Benderly, B. L. (2018). A trend toward transparency for Ph.D.
career outcomes? Science. https://doi.org/10.1126/science
.caredit.aat5250

Gareth, J., Witten, D., Hastie, T., & Tibshirani, R.. (2013). Un

Introduction to Statistical Learning. New York, New York: Springer.
Heinisch, D. P., & Buenstorf, G. (2018). The next generation (plus
un): An analysis of doctoral students’ academic fecundity based
on a novel approach to advisor identification. Scientometrics,
117(1), 351–380.

Kaul, UN. Neu, N., Othon, UN., & Schieler, M.. (2016). Karrierestart,
Mobilität und Löhne von Absolventen der Informatik. IAB-
Regional—Berichte und Analysen aus dem Regionalen
Forschungsnetz. 03/2016.

Bishop, C. M.. (2006). Pattern recognition and machine learning.

Konsortium Bundesbericht Wissenschaftlicher Nachwuchs

New York, New York: Springer.

Blank, R., Daniels, R.. J., Gilliland, G., Gutmann, UN., Hawgood, S.,
Hrabowski, F. UN., Pollack, M.. P., Prix, V. & Schlissel, M.. S.
(2017). A new data effort to inform career choices in
biomedicine. Science, 358(6369), 1388–1389.

Buenstorf, G., & Geissler, M.. (2014). Like doktorvater, like son?
Tracing role model learning in the evolution of German laser
recherche. Jahrbücher für Nationalökonomie und Statistik,
234(2–3), 158–184.

Christen, P.. (2012un). Data matching: Concepts and techniques for
record linkage, entity resolution, and duplicate detection. Nouveau
York, New York: Springer.

Christen, P.. (2012b). A survey of indexing techniques for

scalable record linkage and deduplication. IEEE Transactions
on Knowledge and Data Engineering, 24(9), 1537–1555.

Culp, M., Johnson, K., & Michailidis, G. (2006). Ada: An R package
for stochastic boosting. Journal of Statistical Software, 17(2),
1–27.

Deutsche Nationalbibliothek (DNB). (2018, Novembre 13). Le
German National Library in brief. Extrait de http://www
.dnb.de/EN/Wir/ueberblick

Dorner, M., Cintreuse, S., Harhoff, D., Hoisl, K., & Scioch, P.. (2014).
The MPI-IC-IAB-Inventor data 2002 (MIID 2002): Record-linkage
of patent register data with labor market biography data of the
IAB. FDZ-Methodenreport, 06/2014.

Dorner, M., Heining, J., Jacobebbinghaus, P., & Seth, S. (2010). Le
sample of integrated labour market biographies. Schmollers
Jahrbuch, 130(4), 599–608.

Fitzenberger, B., Osikominu, UN., & Völter, R.. (2005). Imputation

rules to improve the education variable in the IAB employment
subsample. ZEW-Centre for European Economic Research
Discussion Paper (05-010).

Friedman, J., Hastie, T., & Tibshirani, R.. (2010). Regularization
paths for generalized linear models via coordinate descent.
Journal of Statistical Software, 33(1), 1–22.

(BuWiN). (2013). Bundesbericht Wissenschaftlicher Nachwuchs
2013. Bielefeld: W. Bertelsmann Verlag.

Konsortium Bundesbericht Wissenschaftlicher Nachwuchs

(BuWiN). (2017). Bundesbericht Wissenschaftlicher Nachwuchs
2017. Bielefeld: W. Bertelsmann Verlag.

Liaw, UN., & Wiener, M.. (2002). Classification and regression by

randomForest. R News, 2(3), 18–22.

Oberschachtsiek, D., Scioch, P., Seysen, C., & Heining, J.. (2009).
Stichprobe der Integrierten Erwerbsbiografien. Handbuch für die
IEBS in der Fassung 2008. FDZ-Datenreport, 03/2009.
Organisation for Economic Co-operation and Development

(OECD). (2018). Education at a Glance 2018: OECD Indicators.
Paris: OECD.

R Core Team. (2017). R.: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing.

Rapide, R., Bachteler, T., & Cintreuse, S. (2004). A toolbox for record

linkage. Austrian Journal of Statistics, 33(1), 125–133.
Rapide, R.. (2013). Getting big data but avoiding big brother.

German Record Linkage Center Working Paper Series, 2/2013.
Teichert, C., Niebuhr, UN., Othon, UN., & Rossen, UN. (2018). Graduate
migration in Germany: New evidence from an event history
analyse. IAB-Discussion Paper, 3/2018.

Vom Berge, P., Konig, M., & Seth, S. (2013). Sample of integrated
labour market biographies (SIAB) 1975–2010. FDZ-Datenreport,
1/2013.

Wanger, S. (2015). Frauen und Männer am Arbeitsmarkt:

Traditionelle Erwerbs- und Arbeitszeitmuster sind nach wie vor
verbreitet. IAB-Kurzbericht, 4/2015.

Wydra-Somaggio, G. (2015). Das Ausbildungspanel Saarland:
Dokumentation der Datenaufbereitung. IAB-Regional. IAB
Rheinland-Pfalz-Saarland, 3/2015.

Zimmermann, R., Kaimer, S., & Oberschachtsiek, D. (2007).
Dokumentation des “Scientific Use Files der Integrierten
Erwerbsbiographien” (IEBS-SUF V1), Version 1.0. FDZ
Datenreport, 1/2007.

Études scientifiques quantitatives

113

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

APPENDIX A: ASSESSMENT OF THE TRAINING DATA SET

To assess the representativeness of the synthetic training and evaluation data set, we present
descriptive statistics for both data sets. Results for the number of multiples matched entries per
entity can be seen in Table A1. Table A2 shows descriptive statistics of the variable distribu-
tions for the synthetic training and evaluation data set. Table A3 shows descriptive statistics of
the variable distribution for the full (matched) data set.

Table A1. Distributions for multiple matches of the synthetic training and evaluation data set and of
the full (matched) data set

Feature
Artificial training/evaluation data set

Full (matched) data set

Min
1

1

1stQ Median

1

1

4

4

Mean
22.0889

3rdQ Max
296

20

22.4841

20

299

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Table A2. Descriptive statistics for the synthetic training and evaluation data set

Feature
spell_research

spell_hospital

prop_educ

age_sub

same_ror_y5

first_spell_before

year_diss

eastern

nbr

social science

natural science

engineering

medicine

law/economics

Median

0

0

0

35

0

−11

2001

0

90

0

0

0

0

0

Mean

0.0911

0.1130

0.3517

37.6489

0.0475

−11.2539

2000

0.1658

103.1859

0.1048

0.2564

0.0833

0.4001

0.1187

Min
0

0

0

20

0

−45

1975

0

1

0

0

0

0

0

Max
1

1

1

102

1

39

2015

1

296

1

1

1

1

1

114

Études scientifiques quantitatives

Doctorate recipients’ employment trajectories

Table A3 Descriptive statistics for full (matched) data set

Feature
spell_research

spell_hospital

prop_educ

age_sub

same_ror_y5

first_spell_before

year_diss

eastern

nbr

social science

natural science

engineering

medicine

law/economics

Median

0

0

0

35

0

−10

1999

0

94

0

0

0

0

0

Mean

0.0846

0.0964

0.3319

37.3718

0.0573

−10.2697

1998

0.1677

106.8058

0.0855

0.2550

0.0882

0.4171

0.1118

Min
0

0

0

20

0

−62

1975

0

1

0

0

0

0

0

Max
1

1

1

115

1

40

2015

1

299

1

1

1

1

1

APPENDIX B: ASSESSMENT OF MERGED DATA SET

The following figures were created to check the quality of the matched IIPED data.

Figure B1.

Successfully identified doctorate recipients by graduation year.

Études scientifiques quantitatives

115

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Doctorate recipients’ employment trajectories

Figure B2.

Successfully identified doctorate recipients by subject field.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

/

e
d
toi
q
s
s
/
un
r
t
je
c
e

p
d

je

F
/

/

/

/

/

1
1
9
4
1
7
6
0
7
9
9
q
s
s
_
un
_
0
0
0
0
1
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Études scientifiques quantitatives

116RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image
RESEARCH ARTICLE image

Télécharger le PDF