RESEARCH ARTICLE
Just an artifact? The concordance between peer
review and bibliometrics in economics and
statistics in the Italian research
assessment exercise
an open access journal
Alberto Baccini1
and Giuseppe De Nicolao2
1Department of Economics and Statistics, University of Siena, Italy
2Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
Citation: Baccini, A., & De Nicolao, G.
(2022). Just an artifact? The
concordance between peer review and
bibliometrics in economics and
statistics in the Italian research
assessment exercise. Quantitative
Science Studies, 3(1), 194–207. https://
doi.org/10.1162/qss_a_00172
DOI:
https://doi.org/10.1162/qss_a_00172
Peer Review:
https://publons.com/publon/10.1162
/qss_a_00172
Supporting Information:
https://doi.org/10.1162/qss_a_00172
Received: 26 May 2021
Accepted: 1 November 2021
Corresponding Author:
Alberto Baccini
alberto.baccini@unisi.it
Handling Editor:
Ludo Waltman
Copyright: © 2022 Alberto Baccini and
Giuseppe De Nicolao. Published under
a Creative Commons Attribution 4.0
International (CC BY 4.0) license.
The MIT Press
Keywords: bibliometrics, economics and statistics, Italy, peer review, replication study, research
assessment exercise
ABSTRACT
During the Italian research assessment exercise (2004–2010), the governmental agency
(ANVUR) in charge of its realization performed an experiment on the concordance between
peer review and bibliometrics at an individual article level. The computed concordances were
at most weak for science, technology, engineering, and mathematics. The only exception was
the moderate concordance found for the area of economics and statistics. In this paper, the
disclosed raw data of the experiment are used to shed light on the anomalous results obtained
for economics and statistics. In particular, the data permit us to document that the protocol of
the experiment adopted for economics and statistics was different from the one used in the
other areas. Indeed, in economics and statistics the same group of scholars developed the
bibliometric ranking of journals for evaluating articles, managing peer reviews and forming
the consensus groups for deciding the final scores of articles after having received the
referees' reports. This paper shows that the highest level of concordance in economics and
statistics was an artifact mainly due to the role played by consensus groups in boosting the
agreement between bibliometrics and peer review.
1. INTRODUCTION
During the research assessment exercise for the years 2004–2010, the Italian governmental
Agency for Evaluation of Universities and Research (ANVUR) performed an experiment on
the agreement between peer review and bibliometrics at an individual article level (for a
recent review of the literature see Baccini, Barabesi, and De Nicolao [2020]). The experiment
involved all the fields of science, technology, engineering, and mathematics, plus economics
and statistics. The design of the experiment was apparently straightforward: A stratified random
sample of about 10,000 journal articles was evaluated by applying bibliometric indicators
and by peer review; the degree of agreement between the scores obtained with the two sys-
tems of evaluation was then estimated by using weighted Cohen’s kappa. The overall results of
the experiment were published not only in the official reports (ANVUR, 2013) but also as jour-
nal articles authored by researchers affiliated to ANVUR or appointed to carry out the exper-
iment (Ancaiani, Anfossi et al., 2015). For the field of economics and statistics, the results of
the experiment and a big part of the official report were published by Research Policy as a
research paper authored by some of the members of the panel appointed by ANVUR to carry
out the research assessment in the field (Bertocchi, Gambardella et al., 2015).1
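The agreement statistic mentioned above can be reproduced with standard tools. The following minimal sketch (not the code used by ANVUR) maps the four letter scores to integers and computes a weighted Cohen's kappa with scikit-learn on toy data; the linear weighting shown is an assumption chosen only for illustration, since the official reports specify their own weighting choices.

```python
# Minimal sketch: weighted Cohen's kappa between two sets of letter scores.
# Assumptions: letters A-D mapped to 4..1; linear weights chosen only for illustration.
from sklearn.metrics import cohen_kappa_score

score_map = {"A": 4, "B": 3, "C": 2, "D": 1}

bibliometric = ["A", "B", "B", "C", "D", "A"]   # toy data, not the ANVUR sample
peer_review  = ["A", "B", "C", "C", "C", "B"]

y_bib = [score_map[s] for s in bibliometric]
y_rev = [score_map[s] for s in peer_review]

kappa = cohen_kappa_score(y_bib, y_rev, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.3f}")
```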
In a nutshell, the results of the experiment were generally presented in the official reports as
successful by stating that there is a “more than adequate concordance” between bibliometrics
and peer review (ANVUR, 2013). This "fundamental agreement" (Ancaiani et al., 2015) would
support the use of the so-called "dual system of evaluation" adopted in the research assess-
ment, consisting in the interchangeable use of bibliometrics and peer review for evaluating
journal articles. Economics and statistics presented the highest level of agreement between
bibliometrics and peer review. The results of the Italian experiment are cited as solid evidence
of the agreement between bibliometrics and peer review at an individual article level in sci-
entometric literature and in the discussion about the reliability of research assessment (among
others by Fassin [2021], Mittermaier [2020], Rousseau and Rousseau [2021], and Thomas,
Nedeva et al. [2020]).
Doubts about the reliability of the whole Italian experiment, especially for the field of eco-
nomics and statistics, were raised by Baccini and De Nicolao (2016a, 2016b, 2017a, 2017b)
on the basis of the official published data. In a first paper (Baccini & De Nicolao, 2016a), they
highlighted an anomalously high level of agreement for economics and statistics with
respect to all the other research areas. They argued that it was due to substantial modifications
of the protocol of the experiment in this field with respect to the other areas. They described
these modifications on the basis of ANVUR official documents. Bertocchi, Gambardella et al.
(2016) denied the existence of the modifications. Baccini and De Nicolao replied by confirm-
ing all their claims, but they were limited by the impossibility of verifying some conjectures on
the basis of the raw data (Baccini & De Nicolao, 2016b). Afterwards, Baccini and De Nicolao
(2017b) documented statistical problems in the experiment and factual errors in the way in
which it was reported by Ancaiani et al. (2015). In reply, Benedetto, Cicero et al. (2017) corrected
some errors in their paper and denied the relevance of the statistical problems.
All of these issues could have been easily resolved if ANVUR or the authors of the papers had
disclosed the raw data of the experiment.2
In March 2019, ANVUR decided to disclose the raw data of the experiment.3 This disclo-
sure permitted Baccini et al. (2020) to reconsider the experiment in full by providing the
correct design-based setting for it. They showed that “for each research area of science, tech-
nology, engineering and mathematics the degree of agreement between bibliometrics and peer
review is—at most—weak at an individual article level." They also confirmed the anomalously
high value of agreement for the area of economics and statistics.
On the basis of the raw data now available, this paper aims to finally establish (a) if the
protocol of the experiment adopted for economics and statistics was different from that
adopted in the other areas; (b) if this difference was responsible for the anomalous agreement
in economics and statistics; (c) if the description of the experiment published in Bertocchi et al.
(2015) is correct; and (d) if the claims about the experiment contained in Bertocchi et al.
(2016) are true or false.
1 Both in the case of the overall results and of economics and statistics, no indications are available that permit
us to distinguish between the official positions of ANVUR and the views expressed by the authors of the
published articles.
2 The authors of this paper requested the data from the President of ANVUR (at that time Professor Stefano
Fantoni [mail sent on February 10th 2014]). They never received a reply.
3 The mail from one of the authors to Professor Paolo Miccoli, President of ANVUR, containing the request is
dated from March 12, 2019. The decision to disclose the data was communicated by mail dated March 26,
2019; access to the data was granted on April 9, 2019. The data can be downloaded from https://doi.org/10
.5281/zenodo.3727460.
In Section 2 of the paper a short description of the experiment is provided. Section 3 illustrates
the interventions of the so-called consensus groups for scoring the articles of the experiment.
In Section 4 the effect on the agreement between peer review and bibliometrics of the different
protocol adopted in Area 13 is estimated. Finally, Section 5 discusses the results and the general
lessons that can be drawn for research assessment and research policy.
2. A SHORT DESCRIPTION OF THE PROTOCOL OF THE EXPERIMENT
The ANVUR experiment involved 10 research areas of science, technology, engineering, and
mathematics, plus economics and statistics. For each area, a panel managed the evaluation.
Each panel was composed of a number of scholars proportional to the size of the area. For
each area, ANVUR selected a random sample of journal articles. These articles were scored
both by bibliometrics and by peer review. There were four possible letter scores: A (Excellent),
B (Good), C (Acceptable), and D (Limited).
For all the areas, except economics and statistics (Area 13), the bibliometric scores were
attributed according to an algorithm. It combined the number of citations of an article and a
bibliometric indicator of the impact of the journal in which it was published. If the two indi-
cators were coherent (e.g., a high number of citations and a high impact factor) the articles
received a score. If the two were incoherent (e.g., a high number of citations and a low impact
factor) the algorithm returned an inconclusive score "IR." While in the research assessment
IR papers were scored by peer review, in the experiment, they were simply dropped from the
sample (for a discussion of the statistical problems induced by this procedure see Baccini et al.
[2020]).
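As a purely schematic illustration of the coherent/incoherent logic just described (the actual indicators, thresholds, and the combinations that produced an IR are defined in the official reports and are not reproduced here), a rule of this kind can be sketched as follows; all names and thresholds are hypothetical.

```python
# Schematic sketch, not the actual VQR algorithm: an article gets a definite score
# only when the citation indicator and the journal indicator point to the same class;
# otherwise the inconclusive label "IR" is returned. Thresholds are hypothetical.
def bibliometric_score(citation_percentile, journal_percentile):
    def band(p):
        if p >= 0.8:
            return "A"
        if p >= 0.6:
            return "B"
        if p >= 0.5:
            return "C"
        return "D"
    c, j = band(citation_percentile), band(journal_percentile)
    return c if c == j else "IR"

print(bibliometric_score(0.9, 0.85))  # 'A'  -> coherent indicators
print(bibliometric_score(0.9, 0.30))  # 'IR' -> incoherent indicators
```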
For Area 13 only, the bibliometric algorithm consisted in scoring a paper on the basis of the
journal in which it was published. To this end, the Area 13 panel directly developed a ranking
of journals organized in four classes from A to D.4 As a result, differently from the other
areas, in Area 13 there were no articles with inconclusive bibliometric scores and no papers
were dropped from the experiment. Columns 1 and 2 of Table 1 report, stratified by research
area, respectively the size of the experiment sample and the size of the subsample after the
removal of the IR papers.
As for the peer review, each article was assigned to two of the members of the area panel.
They formed a so-called Consensus Group (CG). In turn, each of the two members of the CG
selected a referee who evaluated the article by assigning a numerical score according to a
predefined format. The format required that the referee evaluate a paper according to three
criteria: relevance; originality/innovation; and internationalization. Each criterion received a
partial score; the sum of the three scores represented the final score assigned by a referee
to a paper. The two referees' reports were indicated as "P1" and "P2." In Areas 2, 3, 6, 7,
8a, and 13, referees were required to score each criterion on a scale from 1 to 9 points; hence
the total score assigned by a referee to a paper ranged from 3 to 27 points. In Areas 1, 4, 5, and
9, referees were required to score each criterion on a scale from 0 to 3 points; hence the total
score assigned by a referee to a paper ranged from 0 to 9 points.
4 The methodology adopted for the classification is available in Bertocchi et al. (2015). An Italian adminis-
trative court conclusively invalidated the procedure and methodology adopted for the journal ranking,
because of "failure to carry out an investigation, misinterpretation of facts and failure to state reasons" (Tri-
bunale Amministrativo del Lazio, 30/10/2017, n. 10805/2007; https://tinyurl.com/y6sqwo4p).
Table 1. Sample and subsample size, number of articles with a final score coincident with the score agreed by two referees (P = P1 = P2),
number of articles for which two referees indicated nonconcordant scores (P1 ≠ P2), number of articles for which the consensus groups
changed two concordant referees' scores (P ≠ P1 = P2), and total number and share of articles scored after an intervention of a consensus group

Scientific areas                                  Sample   Subsample   P = P1 = P2   P1 ≠ P2   P ≠ P1 = P2   Scored by CG   Scored by CG (%)
                                                     (1)         (2)           (3)       (4)           (5)    (6 = 4 + 5)        (7 = 6:2)
Area 1: Mathematics and Informatics                  631         438           207       230             1            231             52.7
Area 2: Physics                                    1,412       1,212           513       696             3            699             57.7
Area 3: Chemistry                                    927         778           339       438             1            439             56.4
Area 4: Earth Sciences                               458         377           149       228             0            228             60.6
Area 5: Biology                                    1,310       1,058           433       623             2            625             59.1
Area 6: Medicine                                   1,984       1,603           607       994             2            996             62.1
Area 7: Agricultural and Veterinary Sciences         532         425           134       290             1            291             68.5
Area 8a: Civil Engineering                           225         198            85       112             1            113             57.1
Area 9: Industrial and Information Engineering     1,130         919           378       540             1            541             58.9
Area 13: Economics and Statistics                    590         590           255       326             9            335             56.8
Areas 1–9                                          8,609       7,008         2,845     4,151            12          4,163             59.4
All areas                                          9,199       7,598         3,100     4,477            21          4,498             59.2

Source: Elaboration on ANVUR data.
CGs received the two referees' reports P1 and P2 and "synthesized [them] in a final evaluation" (P) (see ANVUR
[2013, Appendix B, p. 5]). For all the areas, except for Area 13, this final evaluation was based
on "algorithms specifically defined by each Area panel" (see ANVUR [2013, Appendix B,
p. 5]). It appears that the final evaluation P was simply the average of the two numerical scores
P1 and P2 assigned by the two referees. This average was then converted to one of the four
final scores P, according to the two "conversion grids" reported in Table 2.5
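For the areas other than Area 13, the algorithmic synthesis just described can be sketched as follows. The function below is a hypothetical illustration that averages the two referees' numerical totals and maps the average onto the 3–27 grid of Table 2; the handling of boundary values is an assumption, since the reports give only the intervals.

```python
# Sketch of the algorithmic synthesis outside Area 13 (3-27 grid of Table 2 assumed):
# average the two referees' numerical totals and convert the average to a letter score.
def final_peer_review_score(p1_points, p2_points):
    avg = (p1_points + p2_points) / 2
    if avg >= 23:
        return "A"
    if avg >= 18:
        return "B"
    if avg >= 15:
        return "C"
    return "D"

print(final_peer_review_score(24, 19))  # average 21.5 -> 'B'
print(final_peer_review_score(26, 25))  # average 25.5 -> 'A'
```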
Area 13 also adopted a conversion grid (see note 24 of Bertocchi et al. [2015]) but, at the
same time, it also adopted a more elaborate protocol for CGs' decisions, permitting a more
flexible treatment of the referees' reports. This protocol is described in the official report as
follows:
The opinion [sic] of the external referees was then summarized by the internal Consensus
Groups: in case of disagreement between P1 and P2, the P index is not simply the average of
P1 and P2, but also reflects the opinion of two (and occasionally three) members of the
GEV13 (see ANVUR [2013, Area 13 Report, Appendix A, p. 52]).
5 The descriptions of the procedures adopted by each Area panel are in ANVUR (2013) (see Appendix A of
each area report).
Table 2. The conversion grids adopted for transforming the numerical scores to the final letter
score P. Numerical scores are computed by averaging the scores P1 and P2 resulting from peer
review. The ranges of numerical scores are indicated as intervals

P    Areas 2, 3, 6, 7, 8a, and 13    Areas 1, 4, 5, and 9
     Score range                     Score range
A    [23–27]                         [8–9]
B    [18–23]                         [6–8]
C    [15–18]                         [5–6]
D    [3–15]                          [0–5]
The point was stressed in more than one part of the official reports:
The Consensus Groups will give an overall evaluation of the research product by using the
informed peer review method, by considering the evaluation of the two external referees,
the available indicators for quality and relevance of the research product, and the Consen-
sus Group competences (ANVUR, 2013).
And again:
The consensus groups in some cases evaluated also the competences of the two referees,
and gave “more importance to the most expert referee in the research field” (see ANVUR
[2013, Area Report, p. 15; translation from Italian by the authors]).
According to Baccini and De Nicolao (2016a), the main difference between the protocol of
the experiment for Area 13 and for the other areas consisted precisely in allowing the consen-
sus groups to consider so many elements in their final decisions.
Moreover, the information available to the members of the CGs was different in Area 13
with respect to the other areas: (a) the members of the CGs in Area 13 knew that the journal
articles for which they had to arrange a peer review were those selected for the experiment.
Indeed, all the articles submitted to the research assessment and published in journals listed in
the ranking developed by the area panel received an automatic score. This was not the case in
the other areas, where panels had to arrange peer reviews not only for the articles of the exper-
iment but also for those submitted to the research assessment and classified as IR (inconclusive
rating) by the bibliometric algorithm. (b) The CG members knew the final bibliometric score of
the articles, while in the other areas, the CGs might know only the bibliometric indicators infor-
ming the bibliometric algorithm. The information about the bibliometric score of each article
might have been used by the CGs when they chose the referees and when they decided the
final peer review score of each article.
Furthermore, there were also differences regarding the information available to the referees.
The referees of Area 13 were possibly aware that they were participating in the experiment, for
the same reason discussed above for the members of the panel. Indeed, in Area 13, all journal
articles submitted to the research assessment were automatically scored according to the jour-
nal rank. So if a referee received a journal article for evaluation, it was obviously one of the
sample extracted for the experiment. In the other areas, as anticipated, referees received many
journal articles because they had an inconclusive bibliometric rating. Hence, it was impossible
for referees in the other areas to know if an article was part of the sample of the experiment.
Differently again from the other areas, the referees of Area 13 also knew the bibliometric clas-
sification of the articles:
The referees were provided with the panel journal classification list and the actual or
imputed values of IF, IF5 [5 years impact factor] and AIS [Article influence score] (Bertocchi
et al., 2015).
By having access to the ANVUR raw data, it is possible to verify in detail how these mod-
ifications of the protocol impacted on the experiment conducted in Area 13. In particular, it is
possible to verify if and how the more active role of the consensus groups impacted on the
results of the experiment with respect to the other areas.
3. THE ROLE OF CONSENSUS GROUPS: HOW MANY PAPERS HAVE THEY EVALUATED?
The first question is how many papers required an intervention by the CGs. To answer this, the
total number of papers can be partitioned into three nonoverlapping subsets. The three sets are
reported in Table 1, stratified by scientific area, and are composed of the following:
1. Papers for which two referees indicated a concordant score that was also confirmed as
the final score (Table 1, Column 3: P = P1 = P2);
2. Papers for which two referees indicated discordant scores (Table 1, Column 4: P1 ≠ P2);
3. Papers for which the final score was different from the one agreed by the two referees
(Table 1, Column 5: P ≠ P1 = P2).
As noted, when the two referees’ reports did not coincide, the final peer review score of an
article required an intervention by the CGs. Then the total number of articles for which the
final score was obtained after a CG intervention can be obtained by summing up columns 4
and 5 of Table 1: The sum is reported as column 6 of Table 1. The expression "Scored by CG"
used in this paper simply indicates that the final score was decided after a CG intervention.
This intervention might have consisted in confirming the average between P1 and P2, as cal-
culated by the algorithm; or in deciding the final letter score by modifying the scores indicated
by the referees.
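The bookkeeping behind Table 1 can be sketched as follows, assuming a simple tabular layout with one row per article and columns P, P1, and P2 (hypothetical names; the actual structure of the disclosed files may differ).

```python
import pandas as pd

# Toy records: final score P and the two referees' letter scores P1, P2
# (hypothetical column names; the disclosed ANVUR files may be organized differently).
df = pd.DataFrame({
    "P":  ["A", "B", "A", "C", "A"],
    "P1": ["A", "B", "A", "B", "B"],
    "P2": ["A", "B", "B", "D", "B"],
})

confirmed  = df[(df.P1 == df.P2) & (df.P == df.P1)]   # P = P1 = P2  (column 3)
discordant = df[df.P1 != df.P2]                       # P1 != P2     (column 4)
overridden = df[(df.P1 == df.P2) & (df.P != df.P1)]   # P != P1 = P2 (column 5)

scored_by_cg = len(discordant) + len(overridden)      # column 6 = 4 + 5
share = 100 * scored_by_cg / len(df)                  # column 7 = 6 : 2
print(len(confirmed), len(discordant), len(overridden), scored_by_cg, round(share, 1))
```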
In the whole experiment, the share of papers finally requiring a CG intervention was
59.2% of the sample. In Area 13 this share was 56.8%, only a little lower than the average.
From this point of view, on the whole, in Area 13 CGs did not intervene more actively than
in the other areas. Nevertheless, Area 13 shows the highest number of articles for which CGs
changed a concordant score of the two referees. In Area 13 CGs changed a concordant
score of the two referees for nine articles out of 590, representing 1.52% of the total articles
in the area sample. In all the other areas, CGs changed just 16 articles out of 7,598, repre-
senting 0.16% of the total experiment sample. This may be considered a first clue that the
Area 13 panel was inclined to intervene in scoring papers more actively than the panels of
the other areas.
Table 1 finally shows that Baccini and De Nicolao (2016a) even underestimated the num-
ber of 326 papers scored after a CG intervention in Area 13. Bertocchi et al. (2016) had con-
tested this estimate by stating the following:
one could argue that at most 15 papers (not 326) were evaluated by the panel itself.
How is it possible to have such a large misalignment on a basic fact? Bertocchi et al. limited
their attention to 15 articles for which they argued that the CGs
effectively graded the paper. This occurred when (i) the two reports were so different that
one referee assigned the minimum score (D) and the other the maximum score (A), and (ii)
the CGs disagreed on the arithmetic average of the score (the default solution). (Bertocchi
et al., 2016)
This very restrictive claim about the intervention of the CGs is at odds with official reports
and Bertocchi et al.'s own description of the experiment (Bertocchi et al., 2015). Table 1 finally
shows that it is also strictly falsified by the data, because in addition to the 15 articles for which
the referees were in maximum disagreement, as we have seen, CGs “directly graded” nine
other papers for which the two referees indicated a concordant score6.
It remains to clarify how invasive the interventions of the consensus groups were in defining
the final score P. The most invasive CG intervention consisted obviously in changing a score
agreed by two reviewers. But CGs might have adopted a less invasive strategy by assigning a
final P score without applying rigidly the “conversion grid” reported in Table 2. If ANVUR had
disclosed the numerical scores of the referees' reports instead of P1 and P2 only, it would be
possible to trace precisely the intervention of the CGs by comparing the score P with the aver-
age of the numerical scores attributed by the two referees. In any case, the disclosed data permit
us to show that CGs, especially in Area 13, graded some articles outside the rules as defined in
the official reports.
Indeed, it is possible to roughly define lower and upper bounds for the final scores P within
which CG interventions respect the rules of the assessment by considering the conversion grid
adopted in each area. Lower bounds for CG decisions are computed as follows. The first step
consisted in calculating the minimum average associated with each possible combination of
P1 and P2. Consider, for example, that a first referee assigns a score P1 = A and a second
referee a score P2 = B. The minimum average is calculated under the hypothesis that both
reviewers assigned the minimum numerical score to the paper: The first referee judged the
paper as A by assigning a numerical score of 23; while Referee 2 judged the paper as B by
assigning the minimum numerical score of 18. Therefore, the minimum possible average for a
paper judged A by one referee and B by the other is (23 + 18)/2 = 20.5, corresponding in the
conversion grid to a final score B. Hence, to respect the rules of the assessment, the final score
P of a paper receiving P1 = A and P2 = B should be A or B. Upper bounds are analogously
computed. The maximum average can be calculated by considering that both referees assign
the maximum scores for A and B, respectively 27 and a bit less than 23. Therefore the max-
imum possible average for a paper judged A by one referee and B by the other is (27 + 23)/2 =
25, corresponding in the conversion grid to the letter score A. This is the upper bound for the
CG decision. Table A1 in the Supplementary material reports the upper and lower bounds
computed for the two conversion grids adopted in different areas. Note that in the case of
P1 = A and P2 = C the final letter score P is necessarily B because the minimum possible
average is (23 + 15)/2 = 19 and the maximum possible average is (27 + 18)/2 = 22.5, and both
numerical scores correspond to the letter score B; analogously, for P1 = A and P2 = D the upper
bound is P = B, because the maximum average score is (27 + 15)/2 = 21, corresponding to a letter score B.
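This bound construction can be sketched as follows for the 3–27 grid; the helper below is hypothetical, the thresholds are those of Table 2, and the treatment of values "a bit less than" a threshold is an assumption.

```python
# Sketch: admissible range of final letter scores for a pair of referee letter scores,
# under the 3-27 grid (A: [23-27], B: [18-23), C: [15-18), D: [3-15)). Hypothetical helper.
GRID_MIN_MAX = {"A": (23, 27), "B": (18, 22.99), "C": (15, 17.99), "D": (3, 14.99)}

def to_letter(avg):
    if avg >= 23:
        return "A"
    if avg >= 18:
        return "B"
    if avg >= 15:
        return "C"
    return "D"

def bounds(p1, p2):
    lo1, hi1 = GRID_MIN_MAX[p1]
    lo2, hi2 = GRID_MIN_MAX[p2]
    lower = to_letter((lo1 + lo2) / 2)   # both referees give their minimum points
    upper = to_letter((hi1 + hi2) / 2)   # both referees give their maximum points
    return upper, lower

print(bounds("A", "B"))  # ('A', 'B'): the final score should be A or B
print(bounds("A", "C"))  # ('B', 'B'): the final score is necessarily B
print(bounds("A", "D"))  # ('B', 'D'): upper bound B, lower bound D
```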
6 In particular: Four papers, concordantly scored B by two referees, were classified as A by the CGs; two
papers respectively scored C and D by two concordant referees were finally classified as B by the CGs;
and three papers concordantly scored D by referees were classified C by the CGs.
Table 3. Number of papers scored by consensus groups outside the bounds of the research
assessment, as reported in Table A1 of the Supplementary material. The percentage is calculated
over the total number of papers requiring the intervention of the consensus groups (Table 1, column
labeled "Scored by CG")

Final P score                                       A    B    C    D   Total      %
Area 1: Mathematics and Informatics                 0    0    1    0       1   0.43
Area 2: Physics                                     0    0    3    3       6   0.86
Area 3: Chemistry                                   0    1    0    0       1   0.23
Area 4: Earth Sciences                              0    0    0    1       1   0.44
Area 5: Biology                                     0    1    3    0       4   0.64
Area 6: Medicine                                    6    2    2    1      11   1.10
Area 7: Agricultural and Veterinary Sciences        2    0    0    0       2   0.69
Area 8a: Civil Engineering                          0    1    0    0       1   0.88
Area 9: Industrial and Information Engineering      0    1    0    0       1   0.18
Area 13: Economics and Statistics                  16    3    4    0      23   6.87
Areas 1–9                                           8    6    9    5      28   0.67
All areas                                          24    9   13    5      51   1.13

Source: Elaboration on ANVUR data.
Table 3 reports the number of papers scored by CGs outside the bounds of Table A1 (i.e., the
number of papers scored without respecting the declared rules of the assessment). The consensus
groups of Areas 1–9 did not respect the bounds for 28 out of 4,163 papers (0.67%). In Area 13
the consensus groups did not respect the bounds for 23 out of 335 papers (6.87%). The major-
ity of these papers received a final score P = A. This is a second clue indicating that in Area 13
consensus groups had a more active attitude in deciding the final letter score P with respect
to the other areas.
More generally, Figure 1 permits us to visually compare the interventions of CGs in decid-
ing the final score of a paper in the different areas of the experiment. The graph is organized in
40 facets representing 10 areas (columns) and the four final P-scores (rows). Each panel rep-
resents the scores P1 (x-axis) and P2 (y-axis) in a given area for a given final peer review score
P. The size of each point indicates the proportion of articles finally scored P in the given area.
Blue points indicate articles scored respecting the declared bounds of the assessment,
while red points indicate articles scored without respecting the bounds. Tables A5.1 and A5.2
of the Supplementary material report the data used to create Figure 1.
Consider the top left panel. In Area 1 (Mathematics and Informatics) most of the articles
with a final P-score A in the experiment were concordantly classified as A by the two reviewers
P1 and P2; a few of these articles were also scored A by one of the referees and B by the other.
No article scored less than B by one of the referees was finally scored A by the CG. Consider
now the top right panel. In Area 13, there were many papers scored less than B by one of the
two referees that were finally classified as A by the CGs.
Figure 1. Visual comparison of the interventions of CGs in deciding the final score of a paper in the different areas of the experiment. The
graph is faceted according to 10 disciplinary areas (columns) and the four final P scores (A, B, C, D). Each panel represents the scores attributed
by P1 (x-axis) and P2 (y-axis) in a given area for a given final peer review score P. The size of each point indicates the proportion of articles
finally scored P in the given area. Blue points indicate articles with a final score respecting the bounds of the assessment as reported in Table A1;
red points indicate articles for which CGs did not respect the bounds.
In particular, the CGs assigned a final score of A to some papers that the two referees had concordantly
scored B, and also to some papers for which no referee indicated a score A. From visual inspection
of Figure 1 it is apparent that the CGs of Area 13 behaved differently from those in the other areas,
adopting greater flexibility in the conversion of the referees' scores P1 and P2 to the final
score P.
In total, the consensus groups of Area 13 managed a share of papers similar to that of all the other
areas. The data document that they had a more active attitude both in modifying the scores
agreed by the referees and in scoring papers outside the bounds defined in the rules of the
research assessment. Moreover, they tended to interpret the rules for converting the referees'
reports to the final P score more flexibly than in the other areas.
4. HOW MUCH OF THE AGREEMENT BETWEEN PEER REVIEW AND BIBLIOMETRICS WAS
INDUCED BY CG DECISIONS?
On the basis of ANVUR data, it is now possible to shed light also on the central question about
the experiment: How much of the agreement between peer review and bibliometrics
depended on the decisions of the CGs (i.e., how much of the agreement was induced by
the scores defined by the members of the panel). From Table A1, it is evident that even while
respecting the bounds, CGs had a good margin of flexibility in deciding the final score P. For
instance, after having received two discordant peer review reports indicating P1 = B and P2 =
D, a CG could decide a final P score of B, C, or D, perhaps in accordance with the bibliometric
score.
To measure the role of CGs’ decisions in determining the agreement between peer review
and bibliometrics, it is possible to build two indicators.
Table 4. Percentage of papers scored by consensus groups in agreement with bibliometrics over
total papers scored by consensus groups, stratified according to final P scores

Area                                                 A      B      C      D   Total
Area 1: Mathematics and Informatics              73.3   14.7   20.0   21.3    20.8
Area 2: Physics                                   83.1   19.6   10.3   26.3    24.6
Area 3: Chemistry                                 75.6   22.6    5.3   25.6    24.1
Area 4: Earth Sciences                            71.4   18.6    8.3   41.3    23.2
Area 5: Biology                                   70.6   19.2    8.1   38.4    24.8
Area 6: Medicine                                  69.2   27.2    4.6   44.8    27.8
Area 7: Agricultural and Veterinary Sciences      86.4    9.2    0.0   62.8    20.6
Area 8a: Civil Engineering                        92.3    4.6    4.8   28.6    17.7
Area 9: Industrial and Information Engineering    89.7   13.4   11.3    6.6    17.6
Area 13: Economics and Statistics                 81.0   29.5   29.4   65.5    45.4
Areas 1–9                                         78.5   18.8    7.5   36.7    23.7
All areas                                         78.9   19.3    9.5   38.7    25.3

Source: Elaboration on ANVUR data.
The first indicator, reported in Table 4, is the percentage of CG decisions that produced a
final score in agreement with the bibliometric score. It is calculated as the ratio between the
number of papers scored by CGs in agreement with bibliometrics and the total number of
papers scored by CGs. In Area 13, CGs attributed a score in agreement with bibliometrics
for 45.4% of the papers scored by CGs (152 papers out of 335). In the other nine areas, the
share ranges from a minimum of 17.6% in Area 9 to a maximum of 27.9% in Area 6. In the
other nine areas taken together, CGs attributed a score in agreement with bibliometrics for
only 23.7% of the papers scored by CGs (986 concordant papers out of 4,163). The agreement
induced by CG decisions was anomalous with respect to the other areas mainly for the subset
of articles scored less than A. This indicates that in Area 13 consensus groups' interventions
boosted the agreement between peer review and bibliometrics for the set of papers receiving a
final score less than A. In particular, in Area 13 the share of concordant C papers was almost
three times that in the other areas, and the share of concordant D papers was a bit less than
double that in the other areas.
The second indicator, reported in Table 5, is the share of papers for which the agreement
between peer review and bibliometrics was due to the decisions of the CG. It is computed as
the ratio between the number of papers scored by CGs in agreement with biblio-
metrics and the total number of papers for which there is agreement between peer review
and bibliometrics. In Area 13 there were 311 papers for which peer review and bibliometrics
were in agreement; for 152 of these papers (i.e., for a share of 48.9%), the final peer review
score was decided by the CGs. In the other nine areas the share was only 39.5% (986
papers out of 2,857). The anomaly of Area 13 was concentrated in the group of papers scored
A: in Area 13 CGs directly scored more than half (51 out of 98, that is, 52%) of the concordant
papers, against one fifth (241 out of 1,160, that is, 20.8%) in the other areas.
Table 5. Percentage of papers scored by consensus groups in agreement with bibliometrics, over
total number of papers with concordant peer review and bibliometrics, stratified according to final P
scores

Area                                                 A      B       C      D   Total
Area 1: Mathematics and Informatics               9.6   61.3   100.0   38.5    26.8
Area 2: Physics                                   23.4   57.1    85.0   54.1    38.9
Area 3: Chemistry                                 19.1   56.6    83.3   47.6    35.9
Area 4: Earth Sciences                            16.1   42.1    85.7   53.1    42.4
Area 5: Biology                                   16.8   49.5    80.0   49.6    38.9
Area 6: Medicine                                  28.8   64.9    93.3   49.2    48.9
Area 7: Agricultural and Veterinary Sciences      34.5   50.0     0.0   61.4    47.2
Area 8a: Civil Engineering                        27.9   60.0   100.0   44.4    34.5
Area 9: Industrial and Information Engineering    17.2   59.7    80.0   26.7    31.0
Area 13: Economics and Statistics                 52.0   55.4    82.1   32.2    48.9
Areas 1–9                                         20.8   56.9    85.9   49.7    39.5
All areas                                         23.2   56.8    84.7   46.7    40.5

Source: Elaboration on ANVUR data.
This second indicator shows that more than half of the papers scored as excellent by both bibliometrics
and peer review received the final P score after an intervention of the members of the Area 13 panel.
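Both indicators are simple conditional shares; the following sketch computes them on toy data (column names are hypothetical, not those of the disclosed files).

```python
import pandas as pd

# Toy records: final peer review score P, bibliometric score F, and a flag marking
# whether the final P required a consensus group intervention (hypothetical names).
df = pd.DataFrame({
    "P":         ["A", "A", "B", "C", "D", "A"],
    "F":         ["A", "B", "B", "C", "C", "A"],
    "scored_cg": [True, True, False, True, True, False],
})

agree = df["P"] == df["F"]

# Indicator 1 (Table 4): share of CG-scored papers whose final score agrees with bibliometrics.
indicator_1 = 100 * (agree & df["scored_cg"]).sum() / df["scored_cg"].sum()

# Indicator 2 (Table 5): share of concordant papers whose final score was decided by a CG.
indicator_2 = 100 * (agree & df["scored_cg"]).sum() / agree.sum()

print(round(indicator_1, 1), round(indicator_2, 1))
```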
5. DISCUSSION AND CONCLUSION
The results of the experiment performed by ANVUR during the Italian research assessment
exercise VQR 2004–2010 have a central role in the ongoing discussion about the agreement
between peer review and bibliometrics. Indeed, it is probably the most extensive experiment
conducted so far for verifying the concordance between peer review and bibliometrics. Its
results were presented as indicating a “fundamental agreement” between peer review and bib-
liometrics in science, technology, engineering, mathematics, and especially in economics and
statistics. Despite the early criticism of the reliability of the whole experiment, the results are
currently cited (Fassin, 2021; Mittermaier, 2020; Rousseau & Rousseau, 2021; Thomas
et al., 2020) as indicating solid evidence of good agreement between peer review and biblio-
metrics at an individual article level. Actually, when the results of the experiment were rep-
licated in the correct inferential setting, they showed that for science, technology, engineering,
and mathematics the degree of agreement between bibliometrics and peer review was at most
weak at an individual article level (Baccini et al., 2020). The only exception was economics
and statistics, where the agreement was moderate.
This work aimed to finally test whether this anomalous result for economics and statistics
was due to a substantial modification of the protocol of the experiment with respect to that
adopted in the other areas, as suggested by Baccini and De Nicolao (2016a, 2016b, 2017a,
2017b).
The data eventually disclosed by ANVUR reveal that the official report published by
ANVUR, the text collated from it and published by Research Policy, as well as the “final”
description provided in Bertocchi et al. (2016), contain partial or even incorrect descriptions of
the protocol of the experiment conducted in Area 13.
In particular, in Area 13, the CGs decided the final score of 335 papers out of a total of 590.
These 335 included 326 papers for which the two referees were in disagreement, and nine
papers that CGs scored by modifying the concordant score suggested by two reviewers. There-
fore, the raw data directly and conclusively falsify the statement by Bertocchi et al. (2016) that
in Area 13 “at most 15 papers” were evaluated by CGs.
The raw data also show that for 6.87% of the papers, the Area 13 CGs did not respect the
upper and lower bounds for scoring articles stated in the official reports and in Bertocchi et al.
(2015, 2016). In the other nine areas, the share of scores not respecting the declared bounds
was just 0.67%.
Moreover, the ANVUR data show that CGs played a major role in boosting the agreement
reached in the experiment of Area 13. In Area 13, 45.4% of the scores given by the CGs
agreed with bibliometrics, against 23.7% in the other areas. In particular, among the papers
with a concordant A score between peer review and bibliometrics, as much as 52% were
scored by the CGs against 20.8% for the other areas.
Overall, the disclosed raw data of the experiment document that the moderate agreement
between bibliometrics and peer review in economics and statistics was an anomalous result
produced by the active intervention of the members of the consensus groups in charge of syn-
thesizing peer review reports (ANVUR, 2017).
This conclusion is corroborated by the results of a second experiment, conducted by
ANVUR during the national research assessment VQR 2011–2014. In this second experiment,
the protocol “excluded the intervention” of the consensus groups in the definition of the final
peer review P scores, which were instead computed by an algorithm in all the areas (see
ANVUR [2017, Appendix B, p. 8, note 4]). The replication of the results of this second exper-
iment in the correct inferential setting showed that “when an identical protocol was adopted
for all the areas, the agreement for Area 13 was only slightly larger, but still comparable with
the other areas" (Baccini et al., 2020). More specifically, in the second experiment, the degree
of agreement between bibliometrics and peer review is generally even lower than in the first
one, indicating that the agreement between peer review and bibliometrics at the level of
individual articles is at most weak in all the considered research areas (Baccini et al., 2020).
In a nutshell, in Area 13 a group of scholars was called to develop a bibliometric ranking of
journals for attributing a bibliometric score to articles published in these journals. This same
group of scholars was also called on to manage peer reviews for the papers published in these
journals. Finally, they formed the consensus groups for deciding the final peer review scores of
articles after having received the referee reports. For this task, the consensus groups had not
only flexible margins for their decisions but also the freedom to deviate from the rules of
the experiment fixing the bounds for scoring articles. Given all these premises, it is hardly sur-
prising that in economics and statistics, the agreement between bibliometric and peer review
reached a level not recorded in any other area considered in the Italian experiment.
Actually, the decisions of the panel for economics and statistics simply confirmed and
strengthened the bibliometric assessment methods it had developed. Recall that only for economics
and statistics was the bibliometric score of an article defined on the basis of the journal
ranking developed by the panel. As a result, a high level of agreement would indicate
that the ranking of journals developed by the panel was a good proxy of the quality of articles
as revealed by reviewers in their reports. In particular, the choices of the consensus groups
delimited the set of excellent documents. If the experiment shows that the articles rated as
excellent by peer review are those published in journals rated as excellent by the panel, a
clean and simple criterion of excellence could finally be established: Excellent articles are
those and only those hosted by a restricted set of supposedly excellent journals.
On a more general level, the evidence of good agreement would justify the use of journal
ranking instead of peer review for evaluating papers in economics and statistics. Indeed, the
results of the experiment were used to produce policy advice about research evaluation for an
international audience (see for example Bertocchi, Gambardella et al. [2014]). The good
agreement and the consequent policy advice were especially welcome in economics, a schol-
arly environment particularly fascinated by journal rankings (Heckman & Moktan, 2020). It is
well known that the use of journal rankings tends to reinforce existing hierarchies within
disciplines (Corsi, D'Ippoliti, & Zacchia, 2019; Heckman & Moktan, 2020; Stockhammer,
Dammerer, & Kapur, 2021) and, at the same time, to reduce pluralism in research. A growing
literature (Lee, Pham, & Gu, 2013; Corsi et al., 2019; D'Ippoliti, 2021) suggests that the reduc-
tion of pluralism in economics cannot be considered as just an unintended consequence of
research assessment (Rousseau & Rousseau, 2021).
In conclusion, in light of the raw data disclosed by ANVUR, the current interpretation of the
Italian experiment on the agreement between peer review and bibliometrics should be revised and
realigned with the available evidence. The first Italian experiment showed that peer review and
bibliometrics have less than weak agreement at an individual article level for the fields of sci-
ence, technology, engineering, and mathematics (Baccini et al., 2020). Moreover, the higher
level of agreement in economics and statistics appears to be simply an artifact of the experi-
ment protocol adopted by the group in charge of evaluating economics and statistics. Hence,
the results of the Italian experiment cannot be considered as solid evidence of a special agree-
ment between peer review and journal ranking, even for the fields of economics and statistics.
ACKNOWLEDGMENTS
Grants from the Institute for New Economic Thinking are gratefully acknowledged. Thanks to
the reviewers for their careful comments.
AUTHOR CONTRIBUTIONS
Alberto Baccini: Conceptualization, Methodology, Formal analysis, Writing—Original draft,
Writing—Review & editing. Giuseppe De Nicolao: Conceptualization, Methodology, Formal
analysis, Writing—Original draft, Writing—Review & editing.
COMPETING INTERESTS
The authors have no competing interests.
FUNDING INFORMATION
Alberto Baccini is the recipient of Institute for New Economic Thinking grants ID INO17-00015
and INO19-00023.
DATA AVAILABILITY
The raw data used in the article can be downloaded from https://doi.org/10.5281/zenodo
.3727460.
REFERENCES
Ancaiani, A., Anfossi, A. F., Barbara, A., Benedetto, S., Blasi, B., …
Sileoni, S. (2015). Evaluating scientific research in Italy: The 2004–10
research evaluation exercise. Research Evaluation, 24(3), 242–255.
https://doi.org/10.1093/reseval/rvv008
ANVUR. (2013). Rapporto finale. Valutazione della qualità della
ricerca 2004–2010 (Tech. Rep.).
ANVUR. (2017). Valutazione della qualità della ricerca 2011–2014.
Rapporto finale (Tech. Rep.).
Baccini, A., Barabesi, L., & De Nicolao, G. (2020). On the agree-
ment between bibliometrics and peer review: Evidence from the
Italian research assessment exercises. PLOS ONE, 15(11),
e0242520. https://doi.org/10.1371/journal.pone.0242520,
PubMed: 33206715
Baccini, A., & De Nicolao, G. (2016a). Do they agree? Bibliometric
evaluation versus informed peer review in the Italian research
assessment exercise. Scientometrics, 108(3), 1651–1671.
https://doi.org/10.1007/s11192-016-1929-y
Baccini, A., & De Nicolao, G. (2016b). Reply to the comment of
Bertocchi et al. Scientometrics, 108(3), 1675–1684. https://doi
.org/10.1007/s11192-016-2055-6
Baccini, A., & De Nicolao, G. (2017a). Errors and secret data in the
Italian research assessment exercise. A comment to a reply. RT. A
Journal on Research Policy and Evaluation, 5(1). https://doi.org
/10.13130/2282-5398/8872
Baccini, A., & De Nicolao, G. (2017b). A letter on Ancaiani et al.
‘Evaluating scientific research in Italy: The 2004–10 research
evaluation exercise’. Research Evaluation, 26(4), 353–357.
https://doi.org/10.1093/reseval/rvx013
Benedetto, S., Cicero, T., Malgarini, M., & Nappi, C. (2017). Reply
to the letter on Ancaiani et al. ‘Evaluating scientific research in
Italy: The 2004–10 research evaluation exercise'. Research Eval-
uation, 26(4), 358–360. https://doi.org/10.1093/reseval/rvx017
Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi,
F. (2014). Assessing Italian research quality: A comparison between
bibliometric evaluation and informed peer review. www.voxeu.org.
Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi,
F. (2015). Bibliometric evaluation vs. informed peer review: 埃维-
dence from Italy. Research Policy, 44(2), 451–466. https://doi.org
/10.1016/j.respol.2014.08.004
Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. A., & Peracchi,
F. (2016). Comment to: Do they agree? Bibliometric evaluation
versus in formed peer review in the Italian research assessment
exercise. Scientometrics, 108, 349–353. https://doi.org/10.1007
/s11192-016-1965-7
Corsi, M., D'Ippoliti, C., & Zacchia, G. (2019). Diversity of back-
grounds and ideas: The case of research evaluation in econom-
ics. Research Policy, 48(9), 103820. https://doi.org/10.1016/j
.respol.2019.103820
D’Ippoliti, C. (2021). Many-citedness: Citations measure more than
just scientific quality. Journal of Economic Surveys, 35(5), 1271–1301.
https://doi.org/10.1111/joes.12416
Fassin, Y. (2021). Does the Financial Times FT50 journal list
select the best management and economics journals? Sciento-
metrics, 126(7), 5911–5943. https://doi.org/10.1007/s11192
-021-03988-x
Heckman, J. J., & Moktan, S. (2020). Publishing and promotion
in economics: The tyranny of the top five. Journal of Eco-
nomic Literature, 58(2), 419–470. https://doi.org/10.1257/jel
.20191574
李, F. S。, Pham, X。, & Gu, G. (2013). The UK Research Assessment
Exercise and the narrowing of UK economics. Cambridge Journal
of Economics, 37(4), 693–717. https://doi.org/10.1093/cje
/bet031
Mittermaier, B. (2020). Peer review and bibliometrics. In R. Ball
(Ed.), Handbook bibliometrics (pp. 77–90). Berlin/Boston: De
Gruyter Saur. https://doi.org/10.1515/9783110646610-009
Rousseau, S。, & Rousseau, 右. (2021). Bibliometric techniques and
their use in business and economics research. Journal of Eco-
nomic Surveys, 35(5), 1428–1451. https://doi.org/10.1111/joes
.12415
Stockhammer, E., Dammerer, Q., & Kapur, S. (2021). The Research
Excellence Framework 2014, journal ratings and the marginalisa-
tion of heterodox economics. Cambridge Journal of Economics,
45(2), 243–269. https://doi.org/10.1093/cje/beaa054
托马斯, D. A。, Nedeva, M。, Tirado, 中号. M。, & 雅各布, 中号. (2020).
Changing research on research evaluation: A critical literature
review to revisit the agenda. Research Evaluation, 29(3), 275–288.
https://doi.org/10.1093/reseval/rvaa008