RESEARCH ARTICLE
A longitudinal analysis of university rankings
Friso Selten1, Cameron Neylon2, Chun-Kai Huang2, and Paul Groth1
1Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
2Centre for Culture and Technology, Curtin University, Perth, Australia
an open access journal
Keywords: comparative analysis, factor analysis, longitudinal analysis, principal component analysis,
university rankings
Citation: Selten, F., Neylon, C., Huang, C.-K., & Groth, P. (2020). A longitudinal analysis of university rankings. Quantitative Science Studies, 1(3), 1109–1135. https://doi.org/10.1162/qss_a_00052
DOI:
https://doi.org/10.1162/qss_a_00052
Received: 29 August 2019
Accepted: 25 April 2020
Corresponding Author:
Paul Groth
p.groth@uva.nl
Handling Editor:
Ludo Waltman
Copyright: © 2020 Friso Selten, Cameron Neylon, Chun-Kai Huang, and Paul Groth. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The MIT Press
ABSTRACT
Pressured by globalization and demand for public organizations to be accountable, efficient,
and transparent, university rankings have become an important tool for assessing the quality of
higher education institutions. It is therefore important to assess exactly what these rankings
measure. Here, the three major global university rankings—the Academic Ranking of World
Universities, the Times Higher Education ranking and the Quacquarelli Symonds World
University Rankings—are studied. After a description of the ranking methodologies, it is shown
that university rankings are stable over time but that there is variation between the three
rankings. Furthermore, using principal component analysis and exploratory factor analysis,
we demonstrate that the variables used to construct the rankings primarily measure two
underlying factors: a university’s reputation and its research performance. By correlating these
factors and plotting regional aggregates of universities on the two factors, differences between
the rankings are made visible. Last, we elaborate on how the results from these analyses can
be viewed in light of often-voiced critiques of the ranking process. This indicates that the
variables used by the rankings might not capture the concepts they claim to measure. The
study provides evidence of the ambiguous nature of university rankings' quantification of
university performance.
1. INTRODUCTION
Over the past 30 years, the public sector has been subject to significant administrative reforms
driven by an increased demand for efficiency, effectiveness, and accountability. This demand
sparked the creation of social measures designed to evaluate the performance of organizations
and improve accountability and transparency (Romzek, 2000). Universities, as part of this
public sector, have also been subject to these reforms (Espeland & Sauder, 2007). One of
the measures taken in the higher education domain to serve this need for accountability
and transparency is the popularization of university rankings (URs). URs are “lists of certain
groupings of institutions […] comparatively ranked according to a common set of indicators in
descending order” (Usher & Savino, 2006, p. 5).
The idea of comparing universities dates to the 1980s when the US News & World Report
released the first ranking of American universities and colleges. The process however gained
major attention in 2003 with the release of the Shanghai league table (Stergiou & Lessenich,
2014). Many new URs have been established since then, with the most notable being the
THE-QS and the Webometrics Ranking of World Universities in 2004, the NTU (HEEACT)
Ranking, and the CWTS Leiden Ranking in 2007. In 2009, the THE and QS rankings split, and
they have published separate rankings since 2010. These rankings make it easy to quantify the
achievements of universities and compare them to each other. Universities therefore use the
rankings to satisfy the public demand for transparency and information (Usher & Savino, 2006).
Moreover, rankings were met with relief and enthusiasm by policy-makers and journalists,
and also by students and employers. Students use them as a qualification for the value of their
diploma, employers to assess the quality of graduates, and governments to measure a univer-
sity’s international impact and its contribution to national innovation (Hazelkorn, 2007; Van
Parijs, 2009). Also, the internationalization of higher education has increased the demand for
tools to assess the quality of university programs on a global scale (Altbach & Knight, 2007).
From this perspective the increasing importance of URs can be understood because they provide a
tool for making cross-country comparisons between institutions. The impact of rankings is un-
mistakable. They affect the judgments of university leaders and prospective students, así como
the decisions made by policy-makers and investors (Marginson, 2014). For certain politicians,
having their country’s universities at the top of the rankings has become a goal in itself (Billaut,
Bouyssou, & Vincke, 2009; Saisana, d’Hombres, & Saltelli, 2011).
University rankings also quickly became subject to criticism (Stergiou & Lessenich, 2014).
Fundamental critiques of the rankings are twofold. Primero, some researchers question whether
the indicators used to compute the ranking are actually a good proxy for the quality of a uni-
versity. It is argued that the indicators that the rankings use are not a reflection of the attributes
that make up a good university (Billaut et al., 2009; Huang, 2012). Furthermore, researchers rea-
son that URs can become a self-fulfilling prophecy; a high rank creates expectations about a
university and this causes the university to remain at the top of the rankings. For example, prior
rankings influence surveys that determine future rankings, they influence funding decisions,
and universities conform their activities to the ranking criteria (Espeland & Sauder, 2007;
Marginson, 2007).
Other criticisms focus on the methodologies employed by the rankings. This debate often
revolves around the weightings placed on the different indicators that comprise a ranking.
The amount of weight placed on certain variables is decided by the rankings’ designers, pero
research has shown that making small changes to the weights can cause a major shift in ranking
positions. Therefore, the position of a university is largely influenced by decisions made by the
rankings’ designers (Dehon, McCathie, & Verardi, 2009; Marginson, 2014; Saisana et al., 2011).
Also, the indicator normalization strategy used when creating the ranking can influence the
position of a university (Moed, 2017). Normalization is thus, next to the assignment of weight-
ings, a direct manner in which the ranking designers influence ranking order. Furthermore, it has
been suggested that rankings are biased towards universities in the United States or English-
speaking universities, for example, by using a subset of mostly English journals to measure the
number of publications and citations (Pusser & Marginson, 2013; Van Raan, 2005; Vernon,
Balas, & Momani, 2018). Last, there is evidence that suggests that there are major deficiencies
present in the collection of the ranking data; that is, the data used to construct the rankings are
incorrect (Van Raan, 2005).
The aim of this research is to better understand what it is that URs measure. This is studied by
examining the data that are used to compile the rankings. We assess longitudinal patterns,
observe regional differences, and analyze whether there are latent concepts that underlie the
data used to build the rankings: Can the variables used in the ranking be categorized into broader
concepts? The relation between the results of these analyses and the various criticisms described
above will also be discussed. Three rankings are analyzed in this study: the Academic Ranking of
World Universities (ARWU), the Times Higher Education World University Ranking (THE), and
the Quacquarelli Symonds World University Rankings (QS). These rankings are selected
because they are seen as the most influential and they claim international coverage. They are
also all general in that they measure the widest variety of variables, as they focus not only on
research but also on teaching quality (Aguillo et al., 2010; Scott, 2013).
1.1. Related Work
We take a data-driven approach to our analysis, which is somewhat uncommon in the literature.
The most notable works that study URs using such an approach are Aguillo et al. (2010), Dehon
et al. (2009), Docampo (2011), Moed (2017), Safón (2013), and Soh (2015).
Aguillo et al. (2010) study the development of the ARWU and the THE-QS rankings (at that
time still publishing a ranking together). This research shows that rankings differ quite extensively
from each other, but that they do not change much over the years. This is also confirmed by the
research of Moed (2017), which shows that, when analyzing five URs (besides the ARWU,
THE, and QS, this paper also considers the Leiden and U-Multirank rankings), only 35 universities
appear in the top 100 of every ranking. Furthermore, this research examines relations between
similar variables that the rankings measure. This analysis shows that citation measures between
the different rankings in general are strongly correlated. Also, variables that aim at measuring
reputation and teaching quality show moderate to strong correlation (Moed, 2017). Where
Moed (2017) explores the relation between the ranking variables using correlations, these
relations have also been analyzed using more sophisticated techniques: principal component
analysis (PCA) and exploratory factor analysis (EFA). Dehon et al. (2009) use this first technique
to study the underlying concepts that are measured by URs. Their research provides insights
into the ARWU ranking by showing that the 2008 edition of this ranking measured two distinct
conceptos: the volume of publications and the quality of research conducted at the highest level.
This is also found by Docampo (2011), who applies PCA to data from the ARWU and shows that
the extracted components can be used to assess the performance of a university at a country level.
Safón (2013) and Soh (2015) both apply EFA to URs. Safón (2013) shows that the ARWU ranking
measures a single factor, while the THE ranking measures three distinguishable factors. Likewise,
the study by Soh (2015) suggests that the ARWU ranking only measures academic performance,
while the THE and QS rankings also include nonacademic performance indicators.
We take inspiration from this prior work, but move beyond it by performing our analysis
longitudinally, over three rankings, using multiple analysis approaches as well as performing
geographic and sample comparisons. Específicamente, the contribution of this paper is fourfold:
1. It describes the evolution of, and gives a comparison between, the three major URs over
   the past 7 years.
2. It shows the results of a multiyear robust PCA and EFA of the UR data, expanding on the
   work of Dehon et al. (2009), Safón (2013), and Soh (2015).
3. It provides evidence that URs are primarily measuring two concepts and discusses the
   implications of this finding.
4. It demonstrates a new visualization of how the position of specific (groups of) universities
   in the rankings changes over time.
The structure of this paper is as follows. First, a general explanation of the ranking meth-
odologies and data collection is given in Sections 2 and 3. Then, our exploratory analysis of
the ranking data is discussed in Section 4. This section also studies longitudinal stability and
cross-ranking similarity. This is followed by the presentation of our analysis of the latent con-
cepts underlying the rankings using PCA and EFA (Section 5). Finally, the implications of the
results and limitations of the study are discussed.
2. RANKING METHODOLOGIES
This section briefly outlines what concepts the rankings use to compute ranking scores and the
variables they use to evaluate these concepts. In the next three sections, after each concept the
weight assigned to this concept when calculating a university’s overall ranking score is indi-
cated in parentheses.
2.1. ARWU Ranking
The Academic Ranking of World Universities aims to measure four main concepts: Quality of
Education (Alumni, 0.1), Quality of Faculty (Award, 0.2; HiCi, 0.2), Research Output (NS, 0.2;
PUB, 0.2) and Per Capita Performance (PCP, 0.1). Quality of Education is operationalized by
counting the number of university graduates that have won a Nobel Prize or Fields Medal.
Awards won since 1911 are taken into account, but less value is assigned to prizes that were
won longer ago. Quality of Faculty is similarly measured by counting (since 1921) the Nobel
Prizes in physics, chemistry, medicine, and economics, and Fields Medals in mathematics won
by staff working at the university at the time of winning the prize. Additionally, the number of staff
members that are listed on the Highly Cited Researchers list compiled by Clarivate Analytics is
used as an input variable. Research output is measured using the number of papers published in
Nature and Science and the total number of papers indexed in the Science Citation Index-
Expanded and Social Science Citation Index. The per capita performance variable is a construct
of the other five measured variables and—depending on the country a university is in—this
construct is either divided by the size of the academic staff to correct for the size of a university
or is a weighted average of the five other variables (Academic Ranking of World Universities,
2018). For a more in-depth overview of the ARWU methodology see the official ARWU website
(http://www.shanghairanking.com), or the articles from Billaut et al. (2009), Dehon et al.
(2009), Docampo (2011), Vernon et al. (2018), and Marginson (2014).
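To make the weighting concrete, the following minimal sketch (in R, which the authors also used for their analyses) combines ARWU indicator scores into an overall score using the weights listed above. The indicator values in the example are hypothetical, and the rescaling that ARWU applies to each indicator before weighting is not reproduced here.

    # Minimal sketch of how the stated ARWU weights combine indicator scores into an
    # overall score. Indicator scores are assumed to already be on ARWU's 0-100 scale;
    # the example values below are hypothetical.
    arwu_weights <- c(Alumni = 0.1, Award = 0.2, HiCi = 0.2, NS = 0.2, PUB = 0.2, PCP = 0.1)

    arwu_overall <- function(scores) {
      # scores: named numeric vector holding the six indicator scores
      sum(arwu_weights * scores[names(arwu_weights)])
    }

    arwu_overall(c(Alumni = 45, Award = 60, HiCi = 55, NS = 70, PUB = 80, PCP = 50))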
2.2. THE Ranking
The Times World Ranking of Universities is constructed from the evaluation of five different
concepts: Teaching (0.3), Research (0.3), Citations (0.3), International Outlook (0.075), and
Industry Income (0.025). Half of the Teaching indicator is constructed using a survey that aims
to measure the perceived prestige of institutions in teaching. The other half is made up by the
staff-to-student ratio, doctorate-to-bachelor’s ratio, doctorates-awarded-to-academic-staff ratio
and institutional income. Research is mostly measured using a survey that seeks to determine a
university’s reputation for research excellence among its peers. Además, research income
and research productivity (the number of publications) are taken into account when constructing
the Research variable. Citations are measured by averaging the number of times a university’s
published work is cited. The citation measure is therefore normalized with regard to the total
papers produced by the staff of the institution. Additionally, data are normalized to correct for
differences in citation rates; for example, in the life sciences and natural sciences average cita-
tions are much higher than in other research areas, such as arts and humanities. Data on citations
have been provided by Elsevier using the Scopus database since 2015. Prior to 2015 this infor-
mation was supplied by the Web of Science (WoS). International Outlook is measured by evaluating the proportion of
international students and international staff and the amount of international collaborations.
Industry Income is measured by assessing the research income that an institution earns from
industry (THE World University Ranking, 2018). For a more in-depth overview of the THE
methodology, see the official THE website (www.timeshighereducation.com), or the articles by
Vernon et al. (2018) and Marginson (2014).
2.3. QS Ranking
The QS evaluates six different concepts: Academic Reputation (0.4), Employer Reputation (0.1),
Faculty/Student Ratio (0.2), Citations per faculty (0.2), International Faculty Ratio (0.05), and
International Student Ratio (0.05). Academic Reputation is based on a survey of 80,000 individuals
who work in the higher education domain. Employer Reputation is measured by surveying 40,000
employers. The Faculty/Student Ratio variable measures the number of students per teacher and is
used as a proxy to assess teaching quality. Citations are measured, using Elsevier’s Scopus data-
base, by counting all citations received by the papers produced by the institution’s staff across
a 5-year period and dividing this by the number of faculty members at that institution. As in the
THE ranking, since 2015 the citation scores are normalized within each faculty to account for
differences in citation rates between research areas. International Faculty and Student ratios
subsequently measure the ratio of international staff and ratio of international students (QS
World University Ranking, 2018). For a more in-depth overview of the QS methodology see
the official QS website (www.topuniversities.com), or articles from Huang (2012), Docampo
(2011), Vernon et al. (2018), and Marginson (2014).
The three rankings use overlapping concepts (teaching quality, research quality) but diverse
input variables to evaluate these concepts—see Table 1. Next to these overlapping concepts
the rankings also have unique characteristics. Noticeable is the inclusion of internationality in
the THE and QS ranking. This is absent from the ARWU ranking. Also, the THE is the only
ranking to include a university’s income from industry. Furthermore, the THE and QS rankings
apply corrections to normalize the citation scores with respect to the size of a university, while
the ARWU includes uncorrected counts to measure research quality and quantity. This ranking
only corrects for university size for institutions in specific countries using the PCP variable. In
general, it can be stated that the methodologies of the THE and QS ranking are quite similar.
They use comparable concepts for assessing the quality of a university and similar methodol-
ogies for measuring them. The ARWU ranking, while partly measuring the same concepts,
uses different variables and input data to operationalize these concepts.
3. DATA COLLECTION
Data for this study have been collected from the official websites of the three URs. Data have
been retrieved for all variables that form the rankings described in the previous section by
scraping the university ranking websites.
Table 1. Comparing the indicators in the three rankings

                      ARWU            THE                      QS
Teaching Quality      Alumni & PCP    Teaching                 Faculty Student Ratio
Research Quality      Award & HiCi    Research & Citations     Citations per faculty
Research Quantity     NS & PUB        –                        –
Internationality      –               International outlook    International faculty ratio & International student ratio
Industry Reputation   –               Industry                 Employer reputation
Table 2. Number of universities measured per year

Year    ARWU    λ      THE      λ        QS       λ      All
2012    500     500    400      364      869      392    324
2013    500     498    400      367      903      400    326
2014    500     497    401      381      888      395    326
2015    500     498    800      763      918      96     413
2016    500     497    981      981      936      140    405
2017    500     497    1,103    1,103    980      129    414
2018    500     497    1,258    1,258    1,021    498    419
This research focuses on the ranking years 2012 to 2018 because for these years it was
possible to obtain data from the website of all selected rankings. This ensures that for all years
analyzed, official data about all three rankings are available. Table 2 shows the number of
universities present in the rankings per year. The lambda (λ) column shows the number of
universities present in each respective ranking for which all data measured by the rankings
is available. The last column (Todo) shows the number of institutions that are present in all three
of the rankings in that specific year.
Different rankings use different names for universities, and also within a ranking name changes
were observed over the years. Therefore, to compare universities between years and rankings it
was necessary to link all universities to their associated Global Research Identifier Database entry
(GRID) (Digital-science, 2019). Records were linked using data retrieved from Wikidata (Wikidata
Contributors, 2019). Wikidata includes the IDs that are assigned by the three rankings for many
universities alongside the related GRID. By linking an institution’s unique ranking ID to the
Wikidata database and extracting the relevant GRID, it was possible to match almost all univer-
sities. This linkage proved effective; manual inspection of several universities did not detect mis-
matches. A small number of missing GRIDs were linked by hand.
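The sketch below illustrates this linkage step. The GRID property on Wikidata (P2427) and the SPARQL label service are real; the ranking-ID property is left as a placeholder, and the WikidataQueryServiceR package is our choice for running the query, not necessarily the tooling used for the published scripts.

    # Hedged sketch of the Wikidata-based linkage: retrieve institutions that carry
    # both a GRID ID (property P2427) and a ranking-specific identifier, then join the
    # result to the scraped ranking tables on that identifier.
    library(WikidataQueryServiceR)

    sparql <- '
    SELECT ?university ?universityLabel ?grid ?rankingId WHERE {
      ?university wdt:P2427 ?grid .       # GRID ID
      ?university wdt:P0000 ?rankingId .  # placeholder: replace with the ranking-ID property
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }'

    lookup <- query_wikidata(sparql)
    # merge(scraped_ranking, lookup, by.x = "ranking_id", by.y = "rankingId") then gives
    # each scraped record a GRID for matching universities across rankings and years.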
4. EXPLORATORY ANALYSIS
A comparison of the changes in universities' overall ranking positions is now presented. Two
distinct aspects are assessed: changes in the rankings over time and the dissimilarities of the
three rankings in the same year with respect to each other.
Three different measurements are used to evaluate these relationships. The first is the number
of overlapping universities (O) (the number of universities that are present in both rankings). The
second is the Spearman rank correlation coefficient (ρ), which measures the strength of
the association between overlapping universities (Gravetter & Wallnau, 2016). To assess the
relationship between rankings including nonoverlapping universities, a third test, the inverse
rank measure (M), as formulated by Bar-Ilan, Levene, and Lin (2007), is calculated. This test is
also used to compare rankings in the research of Aguillo et al. (2010). The M-measure assesses
ranking similarity while factoring in the effect of nonoverlapping universities. This is accom-
plished by assigning nonoverlapping elements to the lowest rank position + 1. In the case of
two URs with size k, if a university appears in ranking A but does not appear in ranking B, then
the university is assigned to rank k + 1 in ranking B. The M-measure subsequently calculates a
normalized difference between the two rankings (Aguillo et al., 2010). The resulting M-scores
should be interpreted as follows: Below 0.2 can be considered weak similarity, between 0.2 and
0.4 low similarity, between 0.4 and 0.7 medium similarity, between 0.7 and 0.9 high similarity
and above 0.9 very high similarity (Bar-Ilan et al., 2007). Some universities were assigned the
same position in the rankings because of a tie in their scores. These universities were assigned to
the mid position (i.e., two universities that are ranked fifth are both assigned to place 5.5).
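The short sketch below applies the three measures to two toy top-k lists. The overlap count, the rank k + 1 treatment of nonoverlapping universities, and the mid-rank handling of ties follow the description above; the normalization constant of the M-measure reflects our reading of Bar-Ilan et al. (2007) and should be treated as an assumption.

    # Toy illustration of the three similarity measures used in Section 4: overlap (O),
    # Spearman correlation (rho) over overlapping universities, and the inverse rank
    # measure (M) with nonoverlapping universities placed at rank k + 1.
    compare_rankings <- function(rank_a, rank_b) {
      # rank_a, rank_b: named vectors of rank positions (ties already given mid-ranks)
      k <- length(rank_a)
      overlap <- intersect(names(rank_a), names(rank_b))
      o <- length(overlap)
      rho <- cor(rank_a[overlap], rank_b[overlap], method = "spearman")

      all_unis <- union(names(rank_a), names(rank_b))
      ra <- ifelse(all_unis %in% names(rank_a), rank_a[all_unis], k + 1)
      rb <- ifelse(all_unis %in% names(rank_b), rank_b[all_unis], k + 1)
      # Assumed normalization: the maximum attainable sum of |1/ra - 1/rb| for two
      # completely disjoint lists of length k.
      max_diff <- 2 * sum(1 / (1:k) - 1 / (k + 1))
      m <- 1 - sum(abs(1 / ra - 1 / rb)) / max_diff

      c(O = o, rho = rho, M = m)
    }

    # Hypothetical top-5 lists, for illustration only
    a <- c(UniA = 1, UniB = 2, UniC = 3, UniD = 4, UniE = 5)
    b <- c(UniA = 2, UniB = 1, UniC = 3, UniF = 4, UniG = 5)
    compare_rankings(a, b)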
4.1. Longitudinal Ranking Stability
First, changes within rankings over the past 7 years are reviewed. In Table 3 the number of
overlapping institutions, Spearman correlation coefficients and the M-measure scores are
listed for the top 100 institutions in each ranking. This table shows, for each ranking, the years
from 2013 to 2018, as indicated in the left column. For each of these years, each ranking is
Table 3. Similarity between ranking years (O: Overlap; ρ: Spearman correlation coefficient; M: M-measure)

[Table 3 lists, for each ranking (ARWU, THE, QS) and each ranking year from 2013 to 2018, the overlap O, the Spearman correlation ρ, and the M-measure of the top 100 with respect to each earlier year from 2012 to 2017; the individual cell values are not reproduced here.]

Note: All Spearman correlations (ρ) were significant: p < .001.
compared to the data for the years 2012 to 2017 (first row) of the same ranking. Here a com-
parison of only the top 100 universities is presented because the ARWU ranking assigns a
singular rank only to universities in the top 100 of the ranking.
These analyses show that all three rankings are stable over time. A large portion of universities
are overlapping for every year. Furthermore, all Spearman correlation coefficients are significant,
with large effect sizes. This signifies that there is not much change in the ranking positions of
overlapping universities between years. This is also demonstrated in Figure 1. The M-measures
provide some more insights into the changes in the rankings over time. For the ARWU ranking this
measurement shows strong similarities. Even when comparing the ranking from 2012 with that
from 2018 the M-measure is very high, and only 15 universities do not overlap. The THE ranking
is more volatile. For example, similarities between the 2018 and 2017 rankings and those from
2012, 2013, and 2014 are less strong. This may be connected with the shift from using Web of
Science (WoS) to Scopus as a source of citation data between 2014 and 2015, and if so is indic-
ative of a sensitivity to data sources. However, when considering that the number of universities
ranked by the THE ranking is three times higher in 2018 than in the earlier years, the relationship
between them is still quite high. Also, the M-measure between consecutive years shows strong
similarities. The change that is present is thus subtle. The QS ranking is also very similar over the
years. Consecutive years show very high similarity. But the latest ranking also shows high simi-
larity with all previous years, with an M score of 0.77 indicating high similarity when comparing
the 2012 ranking with the one from 2018.
Overall, our conclusion is that the rankings are very stable over time. The top 100 institutions
of all rankings are significantly correlated between all years and the M-measure also shows very
strong similarities between most years. Of the three rankings, the THE showed the most (albeit
subtle) change over time; it is the only ranking in which the M-measure showed a medium sim-
ilarity between some years. From these results the conclusion can be drawn that universities in
the top 100 are largely fixed. There are not many new institutions that enter, and consequently
few institutions that drop out of, the top 100. Additionally, within the top 100 of each ranking
there is little change in position between years. A comparison where more institutions are taken
into account can be found in Section A of the supplementary material; see Selten (2020). In gen-
eral, the results of this analysis do indicate that rankings are stable beyond the top 100. However,
as is explained there, these results should be interpreted with care. This stability can be ex-
plained by the fact that the rankings use rolling averages to measure publications and citations.
Furthermore, the ARWU ranking includes prizes won since 1911. In all rankings, subsequent
ranking years are thus partly based on the same data. The fact that it is hard to move positions
Figure 1. Similarity between ranking years.
at the top of the rankings, despite the differences between them, is also consistent with the idea
that the rankings may be having a circular effect, reinforcing the positions that universities hold.
This effect is likely to be strongest in the THE and QS rankings because these use reputation
surveys, which are likely to be influenced by previous rankings. However, research by Safón
(2019) shows that previous rankings might also influence research performance, indicating there
might be an additional circular effect present in the ARWU ranking. These processes are further
elaborated on in Section 6.1.
4.2. Similarity Between Rankings
Next, we review the similarity between rankings. For each year, the three rankings are compared
to each other. The same three measurements are used to test these relationships. However, as
well as analyzing the top 100 universities, the similarities between the top 50 and 50 to 100
range of the rankings are independently examined. The results of this analysis are shown in
Table 4. Comparisons are given for each year analyzed between the THE and ARWU, and
between the QS and ARWU, and the QS and THE rankings. We observe no large discrepancies
between years in how similar the rankings are with respect to each other. This is as expected
because each ranking does not change much over time.
There is, however, a difference between the top 50 and positions 51 to 100. The overlap
measurement in the top 50 of each ranking shows that 60–70% of the universities overlap
between rankings. The ranks of these overlapping universities are also significantly correlated.
However, the M-measure shows medium similarity, caused by the relatively high number of
nonoverlapping universities. In the 50 to 100 range the similarity between the rankings is very
weak. Not even half of the universities overlap and the correlations between the rankings in
all years, except for the correlation between the ARWU and THE rankings in 2013, are not
significant. The M-measure also shows weak to very weak similarities between rankings in
this range.
In the top 100 the THE and ARWU rankings and THE and QS rankings overlap for more than
70 universities. Between the ARWU and QS there is a little less overlap. This also results in an
M-measure that is lower than that between the QS and THE and the ARWU and THE rankings.
However, all M-measures can be classified as being of medium strength. Furthermore, the
Spearman correlation is significant for all comparisons for the top 50 and top 100. The M-measure
indicates more similarity at the top 100 level than at the top 50 level. This is caused by the
fact that the M-measure assigns more importance to the top of the rankings, and when comparing
the top 100 range there are fewer nonoverlapping universities ( i.e., universities that are in the
top 50 of one ranking but not in the top 50 of the other ranking are likely to be in the top 100 of
the other ranking).
Generally, the top 50 and top 100 between all rankings are quite similar. The M-measure
points out medium relationships, but the correlations between the ranks of overlapping univer-
sities are strong and significant. The 50 to 100 range displays much more difference between
the rankings. Not even half of the universities are overlapping, the ranks of overlapping univer-
sities are not significantly correlated, and the M-measures show very weak similarity; this is also
visible in Figure 2. Finally, no two rankings were clearly more similar to each other than to one
other ranking. Comparing these two sets of plots clearly demonstrates that different years of the
same ranking are very similar. There is much more variance when comparing ranks of similar
universities in different rankings, especially amongst the higher ranking positions. In Section B of
the supplementary material (see Selten, 2020) we show a similar analysis for the top 400 insti-
tutions. These results need to be interpreted with care but show that in the top 400 there is also
Table 4. Similarity between different rankings (O: Overlap; ρ: Spearman correlation coefficient; M: M-measure)

[Table 4 lists, for each year from 2012 to 2018, the overlap O, the Spearman correlation ρ, and the M-measure for the ARWU–THE, ARWU–QS, and THE–QS comparisons, separately for the top 50, the 50–100 range, and the top 100; the individual cell values are not reproduced here.]

Note: * p < .05, ** p < .01, *** p < .001.
Figure 2. Similarity between different rankings.
quite strong similarity between rankings. At the same time, however, a rather large number of
nonoverlapping institutions is present, resulting in medium M-scores.
5. FACTOR EXTRACTION
The three rankings use overlapping concepts (teaching quality, research quality) but diverse in-
put variables to evaluate these concepts; see Table 1. The above findings show that the rankings
do not vary much over time but that the similarity between rankings is less and differs according
to the ranking position analyzed. We now take a more in-depth look at the input measures of the
rankings.
Previous research suggests that there are two latent factors underlying the ARWU ranking of
2008 and three underlying the THE ranking in 2013 (Dehon et al., 2009; Safón, 2013). To further
examine the similarities and differences between rankings, we analyze whether these factors are
stable in the rankings over time. This was done using two techniques: PCA, which has been
employed by Dehon et al. (2009) and Docampo (2011), and EFA as used by Safón (2013) and
Soh (2015). The studies of Dehon et al. (2009) and Safón (2013) only reviewed a subset of
the ranking data by studying the top 150 or a group of overlapping universities. We are inter-
ested in comparing the overall structure of the rankings over multiple years. Therefore, all
universities present in the rankings are analyzed. Only universities for which the rankings do
not provide information on all input measures are removed, because PCA and EFA cannot be
applied to missing values. The number of universities that were analyzed each year can thus
be seen in the lambda columns in Table 2. All input measures analyzed were scaled to have
unit variance.
Although PCA and EFA are related and often produce the same results, the application of both
techniques has two advantages. First, the university ranking data show multivariate outliers.
Results from both the PCA and EFA will be influenced by this. Therefore, for both analyses robust
techniques are implemented. By applying two methods we can have more confidence that the
extracted factors are genuine. Furthermore, PCA and EFA measure different relationships. PCA
describes how much of the total variance present in the data can be explained by the extracted
components. EFA tries to explain the correlations between variables and only considers shared
variance (Osborne, Costello, & Kellow, 2008). Therefore, when observing correlated variables
using EFA that together explain a substantial part of the variance as indicated by PCA, there is a
strong indication that the input measures are related to a latent concept.
5.1. Principal Component Analysis
PCA is implemented using a robust method as formulated by Hubert, Rousseeuw, and Vanden
Branden (2005) using the rrcov package for R (Todorov, 2012). All R and Python scripts used to
perform the analyses in this study can be found at Selten (2020). This method is robust against
the influence that outliers will have on the PCA results. The loadings of the PCA were obliquely
rotated, because the analyzed variables are expected to be correlated and to make the results
easier to interpret. To confirm that this method produces sensible results, the robust PCA method
was tested on the top 150 universities from the ARWU ranking of 2008 in an effort to reproduce
the results from the analysis of Dehon et al. (2009). Using this PCA method a comparable loading
structure to that of Dehon et al. was found, namely that the ARWU consists of two components
where the first component is loaded by the Alumni and Award variables and the second com-
ponent by the other three variables (NS, HiCi, and PUB). This confirms that the method we use
in this study is comparable to that used by Dehon et al. (2009).
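A condensed sketch of this step is shown below. PcaHubert() is the ROBPCA implementation in the rrcov package mentioned above; the oblimin rotation via the GPArotation package and the unit-variance scaling follow the description in this section, but the exact calls are our reading of the procedure rather than a copy of the published scripts (Selten, 2020).

    # Sketch of the robust PCA step: ROBPCA (Hubert et al., 2005) on the scaled ranking
    # indicators, followed by an oblique (oblimin) rotation of the retained loadings.
    library(rrcov)        # PcaHubert(): robust PCA
    library(GPArotation)  # oblimin(): oblique rotation

    # `indicators`: data frame of ranking variables for one year, with rows restricted
    # to universities that have complete data (the lambda counts in Table 2).
    robust_pca_loadings <- function(indicators, n_comp = 2) {
      x <- scale(indicators)               # unit variance, as described above
      pca <- PcaHubert(x, k = n_comp)      # robust principal components
      rotated <- oblimin(getLoadings(pca)) # oblique rotation for easier interpretation
      rotated$loadings
    }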
The same analysis was carried out for all years from 2012 to 2018 and for all three rankings.
For each ranking the results of the analysis on the 2018 rankings are described in depth.
Following this, the changes in structure in the other years with respect to the structure observed
for 2018 are discussed (the loading structures for these years can be found in Section C of the
supplementary material; see Selten, 2020). Because the components can change order we do
not refer to them numerically but name them component A, B, and C. Furthermore, for the
ARWU ranking we decided to remove the PCP variable from the analysis because of how this
variable is constructed: It does not measure a separate concept but is a composite of the other
ARWU variables and applies a correction for the size of a university. However, this correction is
only performed for universities in specific countries (see Academic Ranking of World
Universities (2018) for a list of these countries). For universities outside these countries, the
PCP variable measures a weighted score of the other five variables. Removing this variable
for these reasons is common because interpretation of the variable is not feasible; it does not
measure a single concept and has different meanings for universities in different countries
(Dehon et al., 2009; Docampo, 2011).
A rule often used in PCA is to keep only components with eigenvalues exceeding one, be-
cause this indicates that the component explains more variation than a single variable (Kaiser,
1960). Extracting eigenvalues for the ARWU ranking showed that this was only true for the first
principal component. However, this rule does not always provide a good estimate for the
number of components to retain (Cliff, 1988; Velicer & Jackson, 1990). Inspection of the scree
plots, prior research, and assessment of the results when keeping one and two components
justified extracting the first two components from this ranking (Dehon et al., 2009). For the
THE ranking, the first two components had an eigenvalue higher than one, and for the QS
ranking the first three components had an eigenvalue exceeding one. Scree plots and analysis
of the results confirmed that extracting two and three principal components respectively was
justified. The results of this analysis for the 2018 ranking data can be seen in Table 5.
These results show a clear structure in the ARWU ranking. The Alumni and Award variables
represent component B and the HiCi and PUB variables component A. The NS variable loads
on both. This structure is also observed in the years 2016 and 2017. In the years 2012, 2013,
2014, and 2015 the Alumni, Award, NS, and HiCi variables load on one component, while
only the PUB variable loads on the other component.
In the THE ranking we also observe two components. One input variable (Research) loads
on both components, while the other four variables load distinctively on one of the two
components. The Research variable loads on components A and B. Component A is also influ-
enced by the Citations and International Outlook variables. Component B is additionally
influenced by the Teaching and Industry Income variables. This structure is also observed
in the years 2016 and 2017. Before 2016 there is variability in the loading structure. In the
years 2012, 2013, and 2014 the Teaching and Research variables load strongly together on
component A and the International Outlook and Industry Income variables load on compo-
nent B. Citations load on both components. The year 2015 is divergent from the other years: In
this year the Citation and International Outlook variables influence component B, and Industry
Income explains a large proportion of the variance in the other component. Teaching and
Research in that year load on both components.
For the QS ranking, a clearer distinction between components can be observed. The
Academic and Employer Reputation variables represent component A. International Faculty
and Students represent component B. Finally, the Faculty Student and Citations variables form
component C. The QS ranking also showed the most stability over time. The first components
Table 5. Rotated PCA Loadings on Components 2018

Measure                      PC-A     PC-B     PC-C
ARWU
1. Alumni                     0.04     0.66
2. Award                     −0.03     0.48
3. HiCi                      −0.66    −0.02
4. NS                        −0.36     0.35
5. PUB                       −0.84    −0.01
THE
1. Teaching                  −0.28    −0.45
2. Research                  −0.42    −0.45
3. Citations                 −0.89    −0.12
4. Industry Income            0.07    −0.88
5. International Outlook     −0.96     0.12
QS
1. Academic reputation       −0.99    −0.05     0.03
2. Employer reputation       −0.92     0.04     0.10
3. Faculty Student           −0.15     0.04     0.92
4. International Faculty      0.01     0.94    −0.08
5. International Student      0.02     0.95     0.10
6. Citations                 −0.55     0.14    −0.58

Note: Loadings larger than .40 are in bold.
A and B are the same in all years analyzed. However, the Faculty student variable in 2016 also
loads on component A. The Citation variable is most volatile and loads differently across years.
For each of the three rankings, the robust PCA showed that it is possible to reveal structure
in the data. Some variables are stable and load on the same component in all years. However,
there are also variables that show more variation.
5.2. Exploratory Factor Analysis
To explore the factorial structure of the data further, an EFA using oblique rotations was
performed. First, for all three rankings in all years the Kaiser-Meyer-Olkin measure (KMO)
must be verified to test sampling adequacy, and Bartlett’s test of sphericity (χ2) needs to be
performed to analyze whether the correlation structure of the data is adequate for factor
analyses.
The tests indicate that all years of all rankings are adequate for factor analysis. For ARWU in
all years KMO > 0.80 and Bartlett’s χ2 test is significant ( p < 0.001). For THE in all years KMO >
0.55 and Bartlett’s χ2 test is significant ( p < 0.001). For all years of the QS KMO > 0.52 and χ2
test is significant ( p < 0.001). The KMO values for the THE and QS ranking are quite low. This
indicates the existence of relatively high partial correlations between the variables in these two
rankings (Field, 2013; Pett, Lackey, & Sullivan, 2003). This shows that there is less unique
variance in the THE and QS rankings compared with the ARWU. The existence of high partial
correlation in URs is to be expected. The ranking variables attempt to measure university per-
formance, so it is therefore not surprising that the ranking variables, at least partly, account for
common variance. Higher KMO values for the ARWU indicate that fewer partial correlations
exist in this ranking. The variables in the ARWU, thus, capture more unique variance. Here, it
should be noted that KMO assumes normally distributed data and the ranking data deviates from
this. It is useful to test the KMO statistic, but one should not place too much emphasis on this test.
That being said, given that this research is of an exploratory nature and in all years the KMO
values exceed a minimum value of 0.50, it is possible to perform factor analysis on the data
for all years of all three of the rankings (Field, 2013; Hair, Black, Babin, & Anderson, 2014;
Kaiser, 1974).
The principal axis factors (PAF) extraction method was used because the data deviate from
multivariate normality. PAF is the preferred extraction method in this situation (Osborne et al.,
2008). The noniterated version was used because the iterated solution yielded Heywood
cases, a common problem when using the iterated version of this method (Habing, 2003).
The same number of factors were extracted as the number of extracted components in the
PCA. Scree tests are also a viable strategy for determining the number of factors to retain in
factor analysis and a parallel analysis supported the number of factors to extract. The results of
this analysis for the 2018 ranking data can be found in Table 6; for the other years see Section
D in the supplementary material (Selten, 2020).
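A sketch of this EFA step using the psych package for R is given below; the package choice, and the use of max.iter = 1 as a stand-in for the noniterated principal axis solution, are our assumptions rather than a reproduction of the published scripts.

    # Sketch of the EFA step: check sampling adequacy (KMO) and sphericity (Bartlett),
    # then run principal axis factoring with an oblique (oblimin) rotation.
    library(psych)

    run_efa <- function(indicators, n_factors) {
      x <- scale(indicators)
      print(KMO(x))                                  # Kaiser-Meyer-Olkin sampling adequacy
      print(cortest.bartlett(cor(x), n = nrow(x)))   # Bartlett's test of sphericity
      fa(x, nfactors = n_factors, fm = "pa",         # principal axis factoring
         rotate = "oblimin", max.iter = 1)           # max.iter = 1: rough noniterated stand-in
    }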
These results generally follow those obtained with PCA, with the structure being clearer.
The ARWU consists of two distinct factors: Factor A is loaded by the Alumni and Award vari-
ables and factor B strongly by the HiCi, NS, and PUB variables. This structure is visible in all
years. The THE ranking is also made up of two factors. Factor A is loaded by the Teaching,
Research, and Industry Income variables, whereas factor B is constructed of the Citations
and International Outlook variables. This structure is visible in 2015, 2016, and 2017. In
2012, 2013, and 2014 factor A is not loaded by the Industry Income variable. Factor B in those
years is only loaded on by the Citations variable and not by International Outlook. The QS
ranking is made up of three factors. Factor A is loaded by the Academic and Employer
Reputation variables, factor B by the International Faculty and Students variables, and factor
C by the Faculty Student and Citations variables. The QS ranking shows more volatility than
the other two rankings. In all years analyzed, Factors A and B are respectively loaded on by
the reputation variables and the two variables measuring internationality, but there is variation
in how the Faculty Student and Citations variables load. In the years 2012, 2013, and 2014,
factor C was loaded on only by the Citations variable and the Faculty Student variable did
not load on any of the factors. In 2015 and 2016, both the Faculty Student and Citations
variables loaded on factor A together with the Academic and Employer Reputation variables.
In 2017, both Citations and Faculty Student did not load higher than .4 on any of the three
factors.
5.3. Explaining the Factors
The structure in the three rankings was evaluated using two different methods: robust PCA and
EFA. The first method is robust against the presence of outliers in the data, while the second is
resistant against the data being nonnormally distributed. We now examine whether the factors
Table 6. NIPA loadings on factors in 2018

Measure                      PA-A     PA-B     PA-C
ARWU
1. Alumni                     0.84     0.00
2. Award                      0.86     0.01
3. HiCi                       0.01     0.79
4. NS                         0.41     0.58
5. PUB                       −0.09     0.75
THE
1. Teaching                   0.92    −0.02
2. Research                   0.86     0.15
3. Citations                  0.16     0.66
4. Industry Income            0.63    −0.20
5. International Outlook     −0.04     0.73
QS
1. Academic reputation        0.88    −0.06     0.07
2. Employer reputation        0.83     0.08    −0.08
3. Faculty Student            0.24    −0.01    −0.40
4. International Faculty     −0.02     0.75     0.07
5. International Students     0.02     0.76    −0.06
6. Citations                  0.26     0.13     0.44

Note: Loadings larger than .40 are in bold.
that were empirically found by these two analyses are also theoretically explainable and what
underlying concepts these factors measure.
In the ARWU ranking, two distinct factors can be observed in the EFA, whereas the PCA
shows more volatility. Generally, however, it can be stated that the HiCi, PUB, and N&S
variables appear to form a factor together and the Alumni and Award variables form a second
factor. This structure was also found in the research of Soh (2015) and Dehon et al. (2009). The
first factor measures the number of citations and publications, which together are weighted 60% on
the ranking. The variables that form the second factor, Alumni and Award, measure the number
of Nobel Prizes and Fields Medals won by a university’s employees or alumni and are weighted
30% in the ARWU ranking. Safón (2013) came to a different conclusion, showing that all
ARWU variables load on the same factor. This study, however, used a specific subset of the
data, which had a significant effect on the extracted structure.
In the THE ranking, two distinct factors also are extracted in both the PCA and EFA. The first
factor is composed of the Teaching and Research variables. These two variables are measured
by multiple subvariables, as described in Section 2. Only in the years 2016 to 2018 do we see
this reflected in the results of the robust PCA. In these years the research variable loads on both
components. This may be a reflection that this variable, when correcting for the influence of
outliers, is derived from two quite different notional indicators. However, when assessing all
years and the EFA results, we, in accordance with the interpretation of Moed (2017), expect
that the Teaching and Research variables loading together is caused by the influence of the
surveys, and the other variables used to construct these variables have little impact because of
the low weights assigned to them. This component is therefore mainly a representation of a
university’s reputation and accounts for 60% of the ranking. The second component, when
considering all years, is influenced mainly by the Citations variable, which provides 30% of
the final ranking. There is quite some variation in how the Industry Income and International
Outlook variables load. These are not clearly related to a single factor, and both weigh only
5% on the ranking. The research of Safón (2013) and Soh (2015) shows comparable results.
However, in these studies the Citations variable loaded with the Research and Teaching var-
iables. Our results suggest that, when taking the whole ranking into account over multiple
years, the Citation measure is a separate factor in the THE ranking.
The QS ranking is the only ranking for which the extraction of three factors proved useful
according to scree plots and parallel analysis. However, when considering multiple years, only
two are consistent. The Academic and Employer Reputation variables load together in both PCA
and EFA. This suggests, as in the THE ranking, that they are a measure of the general reputation
of a university. This factor provides 50% of the ranking. Also, the International Faculty and
International Students variables form a construct together. This factor accounts for 15% of the
weight in the ranking. The last extracted factor in the QS ranking was not consistent. Both
Citations and Student to Staff ratio thus appear to be separate components in this ranking when
analyzing multiple years of the QS ranking. They each provide 20%. These results differ quite a
bit from those obtained by Soh (2015), which might be caused by the fact that that study only
extracted two factors.
Reviewing these results and assessing what the variables that form the concepts measure
shows that in all three rankings in all years there are two overlapping underlying concepts that
contribute substantially to the rankings: (a) reputation and (b) research performance.
In the ARWU ranking, we observed that the N&S, HiCi, and PUB variables often load together.
These variables are all proxies for the research performance of a university. The second compo-
nent is composed of the Alumni and Award variables. Both these variables measure the same
achievements but in different groups and, as indicated by the work of Altbach (2012), can be seen
as a proxy for, or an influencer of, a university’s reputation. In the THE ranking, reputation is mea-
sured by the Teaching and Research variables, while the Citations variable is measuring research
performance. In the QS ranking, Academic and Employer Reputation comprise the reputation
factor, whereas research performance is measured by the Citations variable.
Also, some nonoverlapping concepts were found. PCA and EFA showed that internationality
is a separate concept in both the THE and QS rankings, and in the ARWU this concept is not
represented. Also, in the QS ranking the student-to-staff ratio plays quite an important role. In
the other two rankings, this concept is not assigned much importance.
When taking the weights assigned to the variables into account, 90% of the ARWU ranking,
85% of the THE ranking and 70% of the QS ranking are accounted for by the two concepts.
Reputation and research performance are thus very influential in all three rankings. A final dif-
ference that can be observed in the rankings is that in the ARWU ranking indicators of research
performance are more important, while in the THE and QS rankings the indicators associated
with reputation are the most influential.
Table 7. Spearman-Brown scale reliability

Ranking   Scale   2012   2013   2014   2015   2016   2017   2018
ARWU      1       0.86   0.87   0.87   0.87   0.87   0.87   0.87
ARWU      2       0.88   0.88   0.89   0.88   0.84   0.83   0.83
THE       1       0.95   0.95   0.96   0.95   0.95   0.95   0.95
QS        1       0.82   0.83   0.83   0.88   0.84   0.84   0.89
5.4. Reliability of the Concepts
The analysis described above concluded that there are two overlapping concepts, (a) reputation
and (b) research performance, that represent most of the weight in each of the rankings. In the
ARWU ranking, both concepts are a combination of multiple variables. In the THE and QS
rankings, only the reputation measurement is a multi-item concept. To confirm that the vari-
ables that measure one concept together form a reliable scale, the internal consistency of the scales
was verified. The Spearman-Brown split-half reliability test was used for this because some con-
cepts are composed of two variables (Eisinga, Te Grotenhuis, & Pelzer, 2013). Spearman-Brown
reliability is calculated by splitting and correlating the items that form the scale. This reliability
can be interpreted similarly to Cronbach’s alpha: Scores closer to one demonstrate that the
scale is internally more reliable (Field, 2013, pp. 1044–1045). The results of these tests can
be found in Table 7. They confirm that in all years the scales are internally reliable. This sup-
ports the assertion that for all three rankings the factors that consist of multiple variables are
reliable scales measuring the same concept across multiple years. Furthermore, for the THE
and QS ranking it can be observed that these scales are more internally reliable when compared
to internal reliability for the whole ranking tested using Cronbach’s alpha (see supplementary
material Section E [Selten, 2020]), and in the ARWU the reliability of the scales is comparable
to the internal consistency of the complete ranking. This indicates that, while our analysis
shows the existence of two internally reliable scales in the ARWU ranking, these concepts are
more interrelated than is the case for the THE and QS rankings. This is consistent with the
finding that the ARWU ranking is mostly a one-dimensional scale assessing academic perfor-
mance, while the other two rankings are more multidimensional (Safón, 2013; Soh, 2015).
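For illustration, the Spearman-Brown reliabilities reported in Table 7 can be reproduced with the two-item formula 2r / (1 + r), where r is the Pearson correlation between the two items of a scale (Eisinga et al., 2013). The following Python sketch is illustrative only; the variable names are hypothetical and this is not the code used for the study.

import numpy as np

def spearman_brown(item_a, item_b):
    """Two-item Spearman-Brown reliability: 2r / (1 + r), with r the
    Pearson correlation between the two (standardized) items."""
    r = np.corrcoef(item_a, item_b)[0, 1]
    return 2 * r / (1 + r)

# Example: reliability of the ARWU reputation scale for a single year,
# given hypothetical arrays of Alumni and Award scores per university.
# reliability_2018 = spearman_brown(alumni_scores_2018, award_scores_2018)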
6. INVESTIGATING THE SCALES
Based on our analysis to this point, we conclude that two concepts underlie all three rankings.
To further investigate what these concepts measure, the variables of which they consist were
combined. For each ranking, this creates a two-dimensional representation of each ranking
describing the reputation and research performance of the universities.
6.1. Testing Scale Relationships
To assess the relationship between these concepts a Spearman correlation test for each year
was performed. Results can be found in Section F of the supplementary material (Selten, 2020).
These show that all concepts in all years are significantly correlated with each other. Across
years, the THE and QS reputation measurements seem to be correlated most
strongly, but the THE reputation and ARWU research performance concepts also show strong
correlation. These differences are, however, only minor: in general, the reputation
concept of each ranking is not evidently correlated more strongly with the reputation
concept of the other rankings than with the research performance concept of the
other rankings, and vice versa.
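As a minimal sketch (Python, scipy), the pairwise correlations between the six concept scales could be computed as follows; the column names and data layout are assumptions for illustration, not the study's actual code.

from itertools import combinations
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical column names: one reputation and one research scale per ranking.
SCALES = ["arwu_reputation", "arwu_research",
          "the_reputation", "the_research",
          "qs_reputation", "qs_research"]

def scale_correlations(df):
    """Spearman rho and p-value for every pair of concept scales, using the
    universities that have a value for both scales."""
    rows = []
    for a, b in combinations(SCALES, 2):
        rho, p = spearmanr(df[a], df[b], nan_policy="omit")
        rows.append({"scale_a": a, "scale_b": b, "rho": rho, "p": p})
    return pd.DataFrame(rows)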
We identify two potential explanations for the existence of a relation between these different
concepts. First, it is important to note that, while the rankings aim to quantify similar aspects of
a university’s performance, this performance is measured using different methodologies. For
example, as noted in Section 2, the THE and QS rankings correct the research performance
measurements for the size of universities, while the ARWU ranking does not normalize these
measurements for university size. Furthermore, while the THE and QS rankings use direct
measures to capture reputation, the reputation concept for the ARWU ranking is more ambiguous.
The number of Nobel Prizes and Fields Medals won was in this study interpreted as a measurement
of reputation, but these prizes are only an indirect indicator of reputation, as they also demonstrate
scientific accomplishments. Similarly, when a university’s staff often publish in high-impact
journals or are cited frequently, this can also be seen as a proxy for the reputation of a university. This
leads us to the second explanation: the existence of a circular effect in the rankings.
Safón (2013) demonstrates this effect by showing a reputation-survey-reputation
relation in the THE and QS rankings as well as in the ARWU ranking, even though the latter does not
include reputation surveys. However, Robinson-Garcia, Torres-Salinas, et al. (2019)
reverse this argument. They hypothesize that the answers people give on surveys are influenced by
publication and citation data.
Both interpretations can help explain the results found in this study. We identify the exis-
tence of two latent concepts in all rankings: reputation and research performance. However,
these two latent concepts might be influencing each other. In the next section the relationship
between the concepts is further investigated.
6.2. Plotting the Scales
The correlation coefficients themselves do not provide much insight into how the different
components and factors (scales for the rest of this discussion) relate to each other. However,
they are a two-dimensional reflection of the most important concepts in all three rankings. We
were interested in whether using these scales as coordinates to map the relationship of research
performance and reputations over time for each ranking would provide insight. In particular, we
are interested in the question of whether there are differences in the progress made by univer-
sities in different regions and by language spoken, whether this could provide evidence for or
against claims of bias in the rankings, and if it could provide evidence for or against the circular
reinforcement effects discussed above.
To ease interpretation of the plots, they were created using a subset of universities that are
present in all rankings. This results in a subset of 87 high-ranked universities. In addition, the
number of universities that are ranked differs per region, which would skew the com-
parison between regions. Because in the rankings the top institutions are most important, we
chose to only aggregate the results of the top five institutions in each region. Figures 3–5 show
the movement of (aggregates) of universities on the two scales: reputation and research perfor-
mance. Each arrow indicates the data point for a given year, starting in 2012 and ending in 2018.
The arrow direction shows the movement from year to year.
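The construction of these plots could be approximated along the following lines. The matplotlib sketch below assumes a prepared data frame holding the top-five averages per region and year; it is illustrative and not the visualization code used to produce the figures.

import matplotlib.pyplot as plt

def plot_trajectories(df, ranking_name):
    """df columns (assumed): 'region', 'year', 'reputation', 'research',
    holding the average of the top five institutions per region and year."""
    fig, ax = plt.subplots()
    for region, grp in df.groupby("region"):
        grp = grp.sort_values("year")
        ax.plot(grp["research"], grp["reputation"], marker="o", label=region)
        # Arrow from the first to the last year to emphasize the direction of movement.
        ax.annotate("",
                    xy=(grp["research"].iloc[-1], grp["reputation"].iloc[-1]),
                    xytext=(grp["research"].iloc[0], grp["reputation"].iloc[0]),
                    arrowprops=dict(arrowstyle="->", alpha=0.5))
    ax.set_xlabel("Research performance scale")
    ax.set_ylabel("Reputation scale")
    ax.set_title("%s: top-five regional aggregates, 2012-2018" % ranking_name)
    ax.legend(fontsize="small")
    return fig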
Figure 3. Longitudinal developments per geographical region.
Figure 3 shows how the three rankings behave on a regional level. There are differences in
the relative rankings of universities from different regions. We therefore plot the average of
the top five institutions from each region. North America, in the ARWU, is far ahead of all
other regions on both the reputation and research performance scales. South-Eastern Asia,
Eastern Asia, Western Europe, Australia, and New Zealand are all far behind. Northern
Europe appears right in the middle. The THE and QS rankings both also show that the top
institutions in North America perform best on both scales. However, the advantage with respect
to the other regions is much smaller. Northern Europe performs second best on both scales in
both rankings, but in the THE and especially QS ranking, Asian universities also perform very
well on the reputation measurement. Another interesting observation from this figure is that in
all rankings Asian universities are climbing fast on the research performance scale. Finally,
universities in Western Europe and Australia and New Zealand in the THE and QS rankings
seem to have quite a low reputation score when compared to their score on the research
performance scale, whereas in the ARWU ranking this is the case for institutions in Eastern
Asia; performance on the research scale for universities in this area rose quickly, but they
continue to lag behind on the reputation measurement. The ARWU shows strikingly lower
movement on the reputation scale than do the other rankings, indicating the slow accumulation
of prizes compared to the volatility or responsiveness of a survey-based measure.
Figure 4. Longitudinal developments per language region.
Figure 5. Longitudinal developments for a sample of universities.
In the second set of plots (Figure 4), an aggregate of the top five universities within certain
language regions is displayed. This shows in all three rankings that universities in English-
speaking countries are ahead on both the reputation and research performance scales. Of all three
rankings, the ARWU shows the biggest difference between English-speaking countries and the
other language regions. In the THE ranking on the research performance scale, universities in
Dutch-, French-, and German-speaking countries perform equally and are around 20 points behind
English-speaking countries. However, on the reputation scale they are substantially further behind.
For universities in China, the opposite is the case. They score well on the reputation scale, but are
behind on the research performance scale. The QS ranking shows that institutions in German-
speaking countries perform quite well on the reputation scale, whereas Dutch-, French-, and
Swedish-speaking countries lag behind on this measurement. Chinese institutions have increased
their performance substantially on the research performance scale over the years. There is, however,
no effect of this increase visible on the reputation scale, on which they already performed well.
In Figure 5, five universities from diverse countries that are all on average ranked in the 50 to
150 range are compared. When comparing the different plots against each other it can be ob-
served that LMU Munich, the University of Southern California, and KTH Stockholm perform
similarly in all rankings. An interesting case is a comparison of the Universities of Nanyang
and Liverpool. The first performs very well on the reputation scale when this is
measured using surveys, as in the THE and QS rankings. In the ARWU ranking Nanyang
performs poorly on this scale. This difference might be caused by the fact that this institution
was established in 1981 and hence has fewer alumni or university staff who have won a Nobel
Prize or Fields Medal. The University of Liverpool, in contrast, scores very well on the reputation
scale in the ARWU. However, seven out of the nine Nobel Prizes acquired by the University of
Liverpool were won before Nanyang University was founded. This shows how the use of Nobel
Prizes and Fields Medals by the ARWU ranking to measure reputation can favor older institutions.
Also, the behavior of Nanyang University on the research performance scale is noteworthy. In all
rankings in 2012 this institution is ranked as one of the lowest on this scale when compared to the
other four universities in this plot, but in seven years it climbs to be among the top performers. In
the QS ranking, where the reputation score of this university is also very good, Nanyang
University climbs from position 47 to 12 in this period. This shows that, while
the results in Section 4 indicate that the rankings are stable over the years, there are specific
universities that manage to climb rapidly to the top of the rankings.
The plots show that the ARWU ranking assigns high scores on both the research perfor-
mance and reputation scales to institutions in English-speaking countries and particularly in
the United States and United Kingdom. Asian universities in the ARWU ranking perform worst
on both scales. This is in contrast with the other two rankings, in which institutions in English-speaking countries are
also ranked highest but Asian universities are often among the best performers on the
reputation scale. Finally, the figures show that on the research performance scale the rankings
have more in common than on the reputation scale—there is more variation visible between
the plots when comparing the aggregates of universities on the reputation scale. Furthermore, we
see little correlation overall between reputation and performance scales for any of the groups in
any of the rankings. Substantial changes in the performance scale (both positive and negative)
are generally not correlated with similar movements in reputation, even with some delay. The
exception to this may be East Asian and Chinese-speaking universities for which there is some
correlation between increasing research performance and reputation, primarily in the THE
rankings. However, this may also be due to an unexpected confounder. Increasing publications
and visibility, and in particular the global discussion of the importance of the increase in
volume and quality of Chinese research performance, might lead to more researchers from
those universities being selected to take part in the survey. This is impossible to assess without
detailed longitudinal demographic data on the survey participants.
In general, however, these plots show little evidence of strong relationships between reputa-
tion and research performance. This could be consistent with circular reinforcement effects on
reputation, where proxy indicators for reputation are largely decoupled from research perfor-
mance. Overall, examining single universities or groups does not provide evidence for or against
circular reinforcement effects. As shown earlier in this paper, there is little change in the rank-
ings. Circular effects are therefore hard to observe, because for most universities performance on
the rankings is quite stable.
7. DISCUSSION
Accelerated by the increased demand for accountability and transparency, URs have started
to play a major role in the assessment of a university’s quality. There has been substantial
research criticizing these rankings, but only a few studies have performed a longitudinal, data-
driven comparison of URs. This research set out to take an in-depth look at the data of the
ARWU, THE, and QS rankings. Based on this analysis, we draw out five key findings.
7.1. Rankings Primarily Measure Reputation and Research Performance
Dehon et al. (2009), Safón (2013), and Soh (2015) showed that by using PCA and EFA on
university ranking data it is possible to reveal structures that underlie these rankings. In this
research, these techniques are applied to multiple years of the ARWU, THE, and QS rankings.
The results of these analyses provide empirical evidence that all three major URs are predom-
inantly composed of two concepts: reputation and research performance. Research perfor-
mance is measured by the rankings using the number of citations and, in the ARWU, also
the number of publications. Reputation is measured in the ARWU by counting the Nobel
Prizes and Fields Medals won by affiliated university employees and graduates. The THE and
QS rankings mainly measure reputation using surveys. The high weights placed on these two
concepts by the rankings are problematic. Surveys are a factor that a university has little to no
control over and the measurements used to assess research performance are often claimed to be
biased (Vernon et al., 2018).
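For illustration, a two-dimensional extraction of the kind reported above could be sketched as follows in Python; the package choices (scikit-learn and factor_analyzer) and all names are ours and not necessarily those of the original analysis, whose scripts are available at Selten (2020).

import numpy as np
from sklearn.decomposition import PCA
from factor_analyzer import FactorAnalyzer

def two_concept_loadings(X, var_names):
    """X: standardized (universities x indicators) matrix for one ranking year.
    Returns PCA component coefficients and EFA loadings for two dimensions."""
    pca = PCA(n_components=2)
    pca.fit(X)
    efa = FactorAnalyzer(n_factors=2, rotation="oblimin")  # oblique rotation
    efa.fit(X)
    return {
        "pca_coefficients": dict(zip(var_names, np.round(pca.components_.T, 2).tolist())),
        "efa_loadings": dict(zip(var_names, np.round(efa.loadings_, 2).tolist())),
        "pca_explained_variance": pca.explained_variance_ratio_,
    }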
Moed (2017) shows that individual citation and reputation indicators are strongly correlated.
Building upon this, we examined the correlation between the reputation and research performance
concepts across the rankings. This showed that all concepts are significantly correlated, but corre-
lations within the “same” concept across rankings are not stronger than with the divergent concept.
There are multiple explanations possible for this absence of a strong correlation between notionally
overlapping concepts. Previous studies have argued that reputation and research performance
might influence each other (Robinson-Garcia et al., 2019; Safón, 2019). A university having a high
number of citations can positively affect its reputation, and publications written by scholars working
at a prestigious university might get cited more often. This is a plausible assertion and the correla-
tions we identify between nonoverlapping concepts are consistent with this argument. However,
when we directly visualized the relationships between research performance and reputation scales
for a range of universities and groups we did not see evidence for this, as can be seen in Section 6.2.
It is nonetheless worthwhile to further explore this effect in future research to gain more insights into
the relation between a university’s reputation and research performance.
It is also interesting to explore a different explanation: that the underlying concepts in different
rankings are not measuring the same thing. Although these rankings are measuring overlapping
concepts, the way they measure these concepts might be more influential in the outcome of the
ranking order than the actual concepts that the rankings are attempting to measure. This notion
is further elaborated on in Section 7.5 of this discussion. Furthermore, the question arises as to what
information these rankings actually provide if similar concepts between them do not corre-
late. This uncertainty is problematic considering the influence that URs have on society (Billaut
et al., 2009; Marginson, 2014; Saisana et al., 2011). This also leads us into the next point: the
complications that arise when measuring reputation.
7.2. Reputation Is Difficult to Measure
Measuring reputation in itself is not unimportant, because graduating from or working at a pres-
tigious university can improve a student’s or researcher’s job prospects (Taylor & Braddock,
2007), even though the relevance of using surveys to rank universities is debated (Vernon
et al., 2018). The rankings should therefore look critically at the methodology used to measure
this concept. The THE and QS rankings both use two different surveys to measure reputation. The
results from the PCA and EFA showed that in both rankings these surveys are highly related. This
suggests that these surveys do not in practice provide information on the (separate) quality of
education and research, but actually measure a university’s general reputation. This then raises
the question of what people base their judgment on regarding the reputation of a university. It is
not unlikely that the rankings themselves play an important role in this, reinforcing the idea that
rankings become a self-fulfilling prophecy (Espeland & Sauder, 2007; Marginson, 2007). The use
of Nobel Prizes and Fields Medals as a substitute might appear more objective. However, we
have shown that this leads to favoring older universities because it includes alumni who
graduated since 1911 and prizewinners since 1921. This is seen in the example of Nanyang
University. Furthermore, these prizes are mostly science oriented. The Nobel Prizes in physics,
chemistry, and medicine are focused on the natural and medical sciences. The Nobel Prizes in
economics, peace, and literature are not specifically science oriented, but the latter two are only
counted in the staff variable (i.e., they are less influential) and are arguably quite decoupled from
the quality of the university that the winners attended to begin with. This measure, along with
many others, therefore favors science-oriented universities.
Another concern with reputation measurement in the ARWU and THE rankings is that these
rankings use the variables measuring this reputation concept as proxies for a university’s
education and research quality (Academic Ranking of World Universities, 2018; THE World
University Ranking, 2018). That these variables load together and form a reliable scale suggests
that it is doubtful whether they are a good representation of these distinct
qualities. Especially in the THE ranking case, it seems that the reputation surveys have such a big
influence on the variables that these are mainly a reputation measurement. For the QS ranking,
while the problem is the same, there is at least the merit that it is explicitly noted in the method-
ology that the ranking is measuring reputation directly (QS World University Ranking, 2018).
7.3. Universities in the United States and United Kingdom Dominate the Rankings
Section 6.2 shows that universities in English-speaking countries are ahead of universities in other
regions. This seems to support the critique that the ranking methodologies benefit Western, especially
English-speaking, universities (Pusser & Marginson, 2013; Van Raan, 2005; Vernon et al., 2018). For
all rankings, we see a substantial advantage for English-speaking universities on the research perfor-
mance scale, even though more and more universities in non-English-speaking countries publish
predominantly in English (Altbach, 2013; Curry & Lillis, 2004). However, despite the fact that both
the THE and QS rankings employ methodologies to account for the fact that non-English articles
receive fewer citations, universities in English-speaking countries still lead the rankings.
There is also a strong regional effect between the rankings on the reputation component.
Eastern Asian, especially Chinese, universities score highly on the reputation measurement in
the THE and QS ranking. Non-English-speaking European universities and institutions from
Australia and New Zealand perform substantially worse on this scale, even when the research
performance component is the same or higher. Reputation measures for Australian and New
Zealand universities appear particularly volatile in the QS ranking. This may indicate that the
THE and QS rankings' reputation measurements favor Asian universities. This could be due to
increasing profile and marketing, more effective gaming of the survey by top East Asian and
Chinese universities, or some other difference in the methodology. More research is thus needed
to draw definitive conclusions on this matter.
7.4. Rankings Are Stable over Time but Differ from Each Other
Our analysis shows that for all three rankings consecutive years of the same ranking are strongly
correlated and are very similar according to the M-measure. This is in accordance with results of
Aguillo et al. (2010). This means that it is hard to change position within a ranking. This year-to-
year similarity can be explained by different choices made by the ranking designers. All three
rankings use rolling averages to calculate research performance indicators. Also, 30% of the
ARWU ranking is constructed by variables that measure prizes won since 1921 and which
are therefore very stable. For the THE and QS, stability can be explained by the high weighting
assigned to reputation surveys. A university’s reputation is not likely to change substantially
within one year, or even a small number of years. Generally speaking, all URs employ conservative
methodologies, which results in the rankings being very stable. For another perspective on
stability, we refer readers to Gingras (2016). Circular effects between ranking years, as described
by Safón (2019) and Robinson-Garcia et al. (2019), could also result in the rankings being
stable. The plots created in this research did not indicate the existence of such effects, but the
correlation between reputation and research performance can be taken as evidence for the
claim made by Safón (2019) that research performance is also influenced by prior rankings.
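This stability check can be illustrated with the following sketch, which computes the Spearman correlation between a ranking's positions in two consecutive years for the universities present in both years; the data structure is an assumption for illustration only.

import pandas as pd
from scipy.stats import spearmanr

def consecutive_year_correlations(ranks):
    """ranks: dict mapping year -> pandas Series of rank positions indexed by
    university name. Returns the Spearman rho for each pair of consecutive
    years, computed on the universities ranked in both years."""
    years = sorted(ranks)
    out = {}
    for y0, y1 in zip(years, years[1:]):
        joined = pd.concat([ranks[y0], ranks[y1]], axis=1, join="inner").dropna()
        rho, _ = spearmanr(joined.iloc[:, 0], joined.iloc[:, 1])
        out["%d-%d" % (y0, y1)] = rho
    return pd.Series(out)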
The rankings were also compared to each other. These analyses showed that in the top 50
and top 100 the different rankings are correlated strongly; however, the M-measure indicated
only medium similarity, showing substantial variation between the rankings. These results are
in accordance with the findings of Aguillo et al. (2010) and the overlap measurements of Moed
(2017). Given this stability, one would expect that the rankings would be more similar; thus, it is
surprising that they are not. It is even more noteworthy that there is dramatically more similarity
in the top 50 than in positions 51–100. This is most likely caused by the fact that performance
differences are only minor between lower ranked universities. Designer choices are, as will be
shown next, influential for ranking order and become more influential as the differences in
performance become smaller.
7.5. Ranking Designers Influence Ranking Order
The relative absence of similarity between the three rankings is noteworthy, since this article has
established that they to a large extent measure similar concepts. Several reasons can be given
to explain the differences between the rankings. First, the rankings assign different weights to the
variables that compose these concepts, which, as has been shown in multiple studies, has a large effect
on a university’s ranking position (Dehon et al., 2009; Marginson, 2014; Saisana et al., 2011). It
should also be noted that there are nonoverlapping measurements that can explain differ-
ences between rankings; the most important are the substantial weight the QS assigns to the student-staff
ratio and the decision of the ARWU not to include internationality. Second,
rankings use different methods to normalize their data. The THE and QS correct their research
performance measurement for university size, while in the ARWU raw numbers are used.
Choices made by the ranking designers for specific weighting and normalization schemes are thus
important determinants of the final ranking order (Moed, 2017). Perhaps most importantly, our
paper, in agreement with previous work, shows that the majority of the ranking variables are
attempts to quantify two specific concepts of university performance. The difference between
the rankings is therefore not what they are trying to measure but how they seek to measure it.
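A toy example with hypothetical numbers of our own illustrates the point: with fixed indicator scores, changing only the weights can reverse the order of two institutions.

import pandas as pd

# Hypothetical scores for two fictional universities on the two concepts.
scores = pd.DataFrame(
    {"reputation": [90, 75], "research": [70, 95]},
    index=["University A", "University B"],
)

def rank_with(weights):
    """Composite score as a weighted sum of the indicator scores, sorted descending."""
    composite = sum(scores[name] * w for name, w in weights.items())
    return composite.sort_values(ascending=False)

print(rank_with({"reputation": 0.7, "research": 0.3}))  # University A ranks first
print(rank_with({"reputation": 0.3, "research": 0.7}))  # University B ranks first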
Two limitations of this research should be addressed. First, we concluded that the reputation
variables in the THE and QS ranking loading together is caused by the fact that these are both
measuring a general reputation concept. However, it is possible that these variables do actually
measure distinct reputation properties, but that teaching quality and research quality are extraor-
dinarily highly correlated. While there is a likely connection between teaching and research qual-
ity, we are skeptical that (a) this correlation would be so high and (b) that survey respondents are in
a position to distinguish between details of education and research provision, especially in a con-
text where they are being asked about both. Attempts to distinguish between teaching quality and
research quality, such as in the UK's Teaching and Research Excellence Frameworks, show low
correlation between highly evaluated institutions. It is thus reasonable to expect that their judg-
ment is, at least partially, caused by more general reputation attributes, for example the number of
Nobel Prizes and Fields Medals won (Altbach, 2012). More research is needed to identify what
influences survey respondents’ judgment of a university’s reputation and how the selection of
respondents and questions might influence that. This could be studied by reviewing the questions
used to measure the reputation variables and analysis of the raw data collected from these ques-
tionnaires. It may also be interesting to see how external data sources relate to these measure-
ments, for example, by measuring the impact of a university appearing in popular or social
media (Priem, Taraborelli, Groth, & Neylon, 2010). Our results might be seen as supportive of
the INORMS statement that surveys should not form the basis of rankings (INORMS Research
Evaluation Group, 2019). In any case, greater transparency on the sample selection and questions
posed (as well as how they may have changed) would be of value in probing this issue.
Second, some qualifications should be made when interpreting the extracted loading struc-
tures in the PCA and EFA. In the QS ranking in some years a number of universities had to be
removed from the analysis because of missing data elements. However, since the loadings in the
PCA and EFA for the QS were similar across the years, we are quite confident that a genuine
structure was extracted from this ranking. Nonetheless, the large number of missing values in
the QS makes it unclear how the overall ranking score for a wide range of universities was
constructed and makes it hard to study and verify the QS data. We would urge the QS ranking
to provide more transparency in this area. Furthermore, there is research that suggests that all
ARWU variables measure one concept (Safón, 2013). The results of our PCA also showed most
variables loading on one component in some years and the second component’s eigenvalue did
not exceed one. Of the three rankings, the ARWU therefore appears to be measuring the most
singular concept. This is most likely caused by the fact that there are a substantial number of
universities that score very low on the Alumni and Award variables, which in turn is a logical
result of how these variables are measured (see Section 2). For the institutions that score low on
these two variables the ARWU thus only measures academic performance. But, when reviewing
the EFA results and previous work by Dehon et al. (2009), we think it is reasonable to conclude
that these Alumni and Award variables are actually measuring a distinct factor.
This paper provided a longitudinal comparison between the three major URs. It showed that
rankings are stable over time but differ significantly from each other. Furthermore, it revealed
that the rankings all primarily measure two concepts—reputation and research performance—
but it is likely that these concepts are influencing each other. Last, it discussed these findings in
light of the critiques that have been raised about URs. This provides insights into what URs do and
do not measure. Our results also show that there is uncertainty surrounding what the rankings’
variables exactly quantify. One thing is certain, however: It is impossible to measure all aspects
of the complex concept that is university performance (Van Parijs, 2009). Despite this, univer-
sities are focusing on and restricting their activities to ranking criteria (Marginson, 2014). But
because it is unclear what the rankings quantify it is also unclear what exactly the universities
are conforming to. Universities aim to perform well on ambiguous and inconsistent ranking cri-
teria, which at the same time can hinder their performance on activities that are not measured by
the rankings. We conclude that universities should be extremely cautious in the use of rankings
and rankings data for internal assessment of performance and should not rely on rankings as a
measure to drive strategy. A ranking is simply a representation of the ranking data. It does not
cover all aspects of a university's performance, and it may also be a poor measure of the aspects it
is intended to cover.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their comments.
AUTHOR CONTRIBUTIONS
Friso Selten: Conceptualization, Methodology, Investigation, Software, Formal analysis, Data
curation, Visualization, Writing—original draft. Cameron Neylon: Conceptualization,
Methodology, Writing—review & editing, Resources. Chun-Kai Huang: Writing—review &
editing. Paul Groth: Conceptualization, Methodology, Writing—review & editing, Supervision.
COMPETING INTERESTS
PG is coscientific director of the ICAI Elsevier AI Lab—an artificial intelligence lab cofinanced
by Elsevier. Elsevier is a URs data provider.
FUNDING INFORMATION
CN and KW acknowledge the funding of the Curtin Open Knowledge Initiative through a strategic
initiative of the Research Office at Curtin and the Curtin Faculty of Humanities.
DATA AVAILABILITY
The data used in this paper were obtained from the ARWU, THE, and QS websites. All scripts
used to collect data, perform the analyses, and create visualizations are available at Selten
(2020). Due to a lack of clarity on the reuse rights for the data presented on the ranking websites,
it is not clear that we are permitted to publicly redistribute the data that have been
scraped from these websites. For all rankings, these data are publicly available on the ranking
websites.
REFERENCES
Academic Ranking of World Universities. (2018). Arwu2018 meth-
odology. Retrieved April 18, 2019 from http://shanghairanking.
com/ARWU-Methodology-2018.html
Aguillo, I., Bar-Ilan, J., Levene, M., & Ortega, J. (2010). Comparing
university rankings. Scientometrics, 85(1), 243–256.
Altbach, P. G. (2012). The globalization of college and university
rankings. Change: The Magazine of Higher Learning, 44(1), 26–31.
Altbach, P. G. (2013). The imperial tongue: English as the dominat-
ing academic language. In The international imperative in higher
education (pp. 1–6). Leiden: Brill Sense.
Altbach, P. G., & Knight, J. (2007). The internationalization of higher
education: Motivations and realities. Journal of Studies in
International Education, 11(3–4), 290–305.
Bar-Ilan, J., Levene, M., & Lin, A. (2007). Some measures for com-
paring citation databases. Journal of Informetrics, 1(1), 26–34.
Billaut, J.-C., Bouyssou, D., & Vincke, P. (2009). Should you believe
in the Shanghai ranking? An MCDM view. Scientometrics, 84(1),
237–263.
Cliff, N. (1988). The eigenvalues-greater-than-one rule and the re-
liability of components. Psychological Bulletin, 103(2), 276.
Curry, M. J., & Lillis, T. (2004). Multilingual scholars and the imper-
ative to publish in English: Negotiating interests, demands, and
rewards. TESOL Quarterly, 38(4), 663–688.
Dehon, C., McCathie, A., & Verardi, V. (2009). Uncovering excel-
lence in academic rankings: A closer look at the Shanghai ranking.
Scientometrics, 83(2), 515–524.
Digital-science. (2019). Grid release 2019-05-06. Retrieved from
https://digitalscience.figshare.com/articles/GRID_release_2019-
05-06/8137970/1
Docampo, D. (2011). On using the Shanghai ranking to assess the
research performance of university systems. Scientometrics, 86(1),
77–92.
Eisinga, R., Te Grotenhuis, M., & Pelzer, B. (2013). The reliability of
a two-item scale: Pearson, Cronbach, or Spearman-Brown?
International Journal of Public Health, 58(4), 637–642.
Espeland, W. N., & Sauder, M. (2007). Rankings and reactivity:
How public measures recreate social worlds. American Journal
of Sociology, 113(1), 1–40.
Field, A. (2013). Discovering statistics using IBM SPSS statistics. Sage.
Gingras, Y. (2016). Bibliometrics and research evaluation: Uses and
abuses. Cambridge, MA: MIT Press.
Gravetter, F. J., & Wallnau, L. B. (2016). Statistics for the behavioral
sciences. Independence, KY: Cengage Learning.
Habing, B. (2003). Exploratory factor analysis. University of South
Carolina, October, 15.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014).
Multivariate data analysis. Essex: Pearson Education Limited.
Hazelkorn, E. (2007). The impact of league tables and ranking sys-
tems on higher education decision making. Higher Education
Management and Policy, 19(2), 1–24.
Huang, M.-H. (2012). Opening the black box of QS World University
Rankings. Research Evaluation, 21(1), 71–78.
Hubert, M., Rousseeuw, P. J., & Vanden Branden, K. (2005).
ROBPCA: A new approach to robust principal component analysis.
Technometrics, 47(1), 64–79.
INORMS Research Evaluation Group. (2019). What makes a fair
and responsible university ranking? Draft criteria for comment
(Technical Report).
Kaiser, H. F. (1960). The application of electronic computers to
factor analysis. Educational and Psychological Measurement, 20(1),
141–151.
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika,
39(1), 31–36.
Marginson, S. (2007). Global university rankings: Implications in
general and for Australia. Journal of Higher Education Policy
and Management, 29(2), 131–142.
Marginson, S. (2014). University rankings and social science. European
Journal of Education, 49(1), 45–59.
Moed, H. F. (2017). A critical comparative analysis of five world
university rankings. Scientometrics, 110(2), 967–990.
Osborne, J. W., Costello, A. B., & Kellow, J. T. (2008). Best prac-
tices in exploratory factor analysis. Best Practices in Quantitative
Methods, 86–99.
Pett, M., Lackey, N., & Sullivan, J. (2003). Making sense of factor
analysis. Thousand Oaks, CA: Sage.
Priem, J., Taraborelli, D., Groth, P., & Neylon, C. (2010). Altmetrics:
A manifesto. Retrieved from http://altmetrics.org/manifesto/
Pusser, B., & Marginson, S. (2013). University rankings in critical
perspective. Journal of Higher Education, 84(4), 544–568.
QS World University Ranking. (2018). World university ranking meth-
odology. Retrieved April 18, 2019 from https://www.topuniversities.
com/qs-world-university-rankings/methodology
Robinson-Garcia, N., Torres-Salinas, D., Herrera-Viedma, E., &
Docampo, D. (2019). Mining university rankings: Publication
output and citation impact as their basis. Research Evaluation,
28(3), 232–240.
Romzek, B. S. (2000). Dynamics of public sector accountability in
an era of reform. International Review of Administrative Sciences,
66(1), 21–44.
Safón, V. (2013). What do global university rankings really measure?
The search for the X factor and the X entity. Scientometrics, 97(2),
223–244.
Safón, V. (2019). Inter-ranking reputational effects: an analysis of
the Academic Ranking of World Universities (ARWU) and the
Times Higher Education world university rankings (THE) reputa-
tional relationship. Scientometrics, 121(2), 897–915.
Saisana, M., d’Hombres, B., & Saltelli, A. (2011). Rickety numbers:
Volatility of university rankings and policy implications. Research
Policy, 40(1), 165–177.
Scott, P. (2013). Ranking higher education institutions: A critical
perspective. In Rankings and Accountability in Higher Education:
Uses and Misuses, (p. 113). Van Haren Publishing.
Selten, F. (2020). A Longitudinal Analysis of University Rankings.
Zenodo. Retrieved from https://doi.org/10.5281/zenodo.
3775251
Soh, K. (2015). What the Overall doesn’t tell about world university
rankings: Examples from ARWU, QSWUR, and THEWUR in
2013. Journal of Higher Education Policy and Management, 37(3),
295–307.
Stergiou, K. I., & Lessenich, S. (2014). On impact factors and uni-
versity rankings: From birth to boycott. Ethics in Science and
Environmental Politics, 13(2), 101–111.
Taylor, P., & Braddock, R. (2007). International university ranking
systems and the idea of university excellence. Journal of Higher
Education Policy and Management, 29(3), 245–260.
THE World University Ranking. (2018). World university rankings
2019: Methodology. Retrieved April 18, 2019 from https://
timeshighereducation.com/world-university-rankings/methodology-
world-university-rankings-2019
Todorov, V. (2012). Robust location and scatter estimation and ro-
bust multivariate analysis with high breakdown point. http://
www.cran.r-project.org/web/packages/rrcov
Usher, A., & Savino, M. (2006). A world of difference: a global survey
of university league tables. Canadian Education Report Series.
Online Submission.
Van Parijs, P. (2009). European higher education under the spell of
university rankings. Ethical Perspectives, 16(2), 189–206.
Van Raan, A. F. (2005). Fatal attraction: Conceptual and methodo-
logical problems in the ranking of universities by bibliometric
methods. Scientometrics, 62(1), 133–143.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis versus
common factor analysis: Some further observations. Multivariate
Behavioral Research, 25(1), 97–114.
Vernon, M. M., Balas, E. A., & Momani, S. (2018). Are university
rankings useful to improve research? A systematic review. PLOS
ONE, 13(3), e0193762.
Wikidata Contributors. (2019). Wikidata. Retrieved March 20, 2019
from https://www.wikidata.org