Book Review

Statistical Significance Testing for Natural Language Processing

Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart
(Technion Israel Institute of Technology)

Morgan & Claypool (Synthesis Lectures on Human Language Technologies, edited by
Graeme Hirst, volume 45), 2020, xx+98 pp; paperback, ISBN 978-1-68173-795-9, $49.95; ebook, ISBN 978-1-68173-796-6, $39.96; hardcover, $69.95;
DOI: 10.2200/S00994ED1V01Y202002HLT045

Reviewed by
Edwin D. Simpson
University of Bristol

Like any other science, research in natural language processing (NLP) depends on the
ability to draw correct conclusions from experiments. A key tool for this is statistical sig-
nificance testing: We use it to judge whether a result provides meaningful, generalizable
findings or should be taken with a pinch of salt. When comparing new methods against
others, performance metrics often differ by only small amounts, so researchers turn to
significance tests to show that improved models are genuinely better. Unfortunately,
this reasoning often fails because we choose inappropriate significance tests or carry
them out incorrectly, making their outcomes meaningless. Or, the test we use may
fail to indicate a significant result when a more appropriate test would find one. NLP
researchers must avoid these pitfalls to ensure that their evaluations are sound and
ultimately avoid wasting time and money through incorrect conclusions.

This book guides NLP researchers through the whole process of significance testing,
making it easy to select the right kind of test by matching canonical NLP tasks to specific
significance testing procedures. As well as being a handbook for researchers, the book
provides theoretical background on significance testing, includes new methods that
solve problems with significance tests in the world of deep learning and multidataset
benchmarks, and describes the open research problems of significance testing for NLP.
The book focuses on the task of comparing one algorithm with another. At the
core of this is the p-value, the probability that a difference at least as extreme as the
one we observed could occur by chance. If the p-value falls below a predetermined
threshold, the result is declared significant. Leaving aside the fundamental limitation
of turning the validity of results into a binary question with an arbitrary threshold, to
be a valid statistical significance test, the p-value must be computed in the right way.
The book describes the two crucial properties of an appropriate significance test: The
test must be both valid and powerful. Validity refers to the avoidance of type 1 errors, in
which the result is incorrectly declared significant. Common mistakes that lead to type 1
errors include deploying tests that make incorrect assumptions, such as independence
between data points. The power of a test refers to its ability to detect a significant result
and therefore to avoid type 2 errors. Here, knowledge of the data and experiment must
be used to choose a test that makes the correct assumptions. There is a trade-off between
validity and power, but for the most common NLP tasks (language modeling, sequence
labeling, translation, etc.), there are clear choices of tests that provide a good balance.
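
To make these quantities concrete, here is a minimal sketch of a paired sign-flip (approximate randomization) test, one of the nonparametric procedures the book covers later. The code is illustrative rather than taken from the book; the function name and the choice of per-example scores as inputs are my own.

```python
import numpy as np

rng = np.random.default_rng(42)

def paired_randomization_test(scores_a, scores_b, n_permutations=10_000):
    """Two-sided p-value for the mean difference between paired scores.

    Under the null hypothesis the two systems are interchangeable, so
    flipping the sign of each per-example difference is equally likely;
    the p-value estimates how often such a random relabeling yields a
    mean difference at least as extreme as the one observed.
    """
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Adding 1 to numerator and denominator keeps the estimate valid:
    # the p-value can never be exactly zero with finite permutations.
    return (np.sum(permuted >= observed) + 1) / (n_permutations + 1)
```

Calling this with, say, per-sentence accuracies of two taggers returns the probability of seeing so large a gap by chance; comparing it against the predetermined threshold gives the binary significance decision described above.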

https://doi.org/10.1162/coli_r_00388

© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license


Beginning with a detailed background on significance testing, the book then shows
the reader how to carry out tests for specific NLP tasks. There is a mix of styles, with
the first four chapters providing reference material that will be extremely useful to
both new and experienced researchers. Here, it is easy to find the material related to a
given NLP task. The next two chapters discuss more recent research into the application
of significance tests to deep neural networks and for testing across multiple datasets.
Alongside open research questions, these later chapters provide clear guidelines on
how to apply the proposed methods. It is this mix of background material and reference
guidelines that I believe makes this book so compelling and nicely self-contained.

The introduction in Chapter 1 motivates the need for a comprehensive textbook
and outlines challenges that the later chapters address more deeply. The theoretical
background material begins in Chapter 2, which introduces core concepts, including
hypothesis testing, type 1 and type 2 errors, validity and power, and p-values. The
reader does not need to have any prior knowledge of statistical significance tests to
follow this part. However, experienced readers could still benefit from reading this
chapter, as concepts such as p-values are widely misunderstood and misused (Amrhein,
Greenland, and McShane 2019).

The significance tests themselves are introduced in Chapter 3, categorized into
parametric and nonparametric tests. The chapter begins with the intuitively simple
paired z-test, then builds up to more commonly applied techniques, showing the con-
nections and assumptions that each test makes. Step-by-step algorithms help the reader
to implement each test. Although the chapter does cite uses of tests in NLP research, the
main purpose is to present the theory behind each test and point out their differences.

Chapter 4 provides perhaps the most handy part of the book for reference: a cor-
respondence between common NLP tasks and statistical tests. Each task is discussed
in terms of the evaluation metrics used, then a decision tree is introduced to guide the
reader toward a choice between a parametric test, bootstrap or randomization test, or
sampling-free nonparametric test. Section 4.3 then links each NLP evaluation measure
to a specific significance test, presenting a large table that helps readers identify which
test they need for a specific task. Particular considerations for each task are also pointed
out to provide more detail about the appropriate options. The final part of this chapter
describes the issue of p-hacking, in which dataset sizes are increased until a signifi-
cance threshold is reached without consideration for biases in the data (discussed, for
example, in Hofmann [2015]). The chapter proposes a simple solution to ensure robust
significance testing with large datasets.
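
As a rough illustration of the kind of logic such a decision tree encodes, the toy function below picks a test family from a paired sample of score differences. It uses standard criteria (a normality check, then the feasibility of resampling) and is far coarser than the book's actual tree; the function name and the resampling_is_cheap flag are my own inventions.

```python
import numpy as np
from scipy import stats

def suggest_test_family(diffs, alpha=0.05, resampling_is_cheap=True):
    """Suggest a significance test family for paired score differences.

    A toy decision rule, much simpler than the decision tree in Chapter 4:
    check normality first, then whether the evaluation measure can be
    recomputed cheaply on resampled data.
    """
    diffs = np.asarray(diffs)
    # Shapiro-Wilk: a small p-value casts doubt on the normality assumption.
    _, p_normal = stats.shapiro(diffs)
    if p_normal > alpha:
        return "parametric (e.g., paired Student's t-test)"
    if resampling_is_cheap:
        return "bootstrap or approximate randomization test"
    return "sampling-free nonparametric (e.g., Wilcoxon signed-rank test)"
```
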

Where Chapter 4 presents well-established methods, Chapter 5 introduces the
current research question of how best to apply statistical significance testing to deep
learning. Non-convex loss functions, stochastic optimization, random initialization, and
a multitude of hyperparameters limit the conclusions we can draw from a single test
run of a deep neural network (DNN). This chapter, which is based on the authors’ ACL
paper (Dror, Shlomov, and Reichart 2019), explains how the comparison process can
be overhauled to provide more meaningful evaluations. Beginning by explaining the
difficulties of evaluating DNNs, the chapter then introduces criteria for a comparison
framework, then discusses the limitations of current methods. Reimers and Gurevych
(2018) have previously tackled this problem, but their approach has limited power and
does not provide a confidence score. Other works, such as Clark et al. (2011), compare
DNNs using a collection of statistics, such as the mean or standard deviation of perfor-
mance across runs. This book shows how such an approach violates the assumptions of
the significance tests. The authors propose almost stochastic dominance as the basis for a
better alternative. The chapter explains how to use the proposed method, evaluates it in
an empirical case study, and finally analyzes the errors made by each testing approach.
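
To give a sense of the quantity involved, the sketch below computes a discretized version of the dominance-violation ratio that underlies almost stochastic dominance, comparing score distributions collected over many training runs. It omits the bootstrap-based correction and confidence level of the authors' full test, so treat it as an illustration under those simplifying assumptions, not their estimator.

```python
import numpy as np

def dominance_violation_ratio(scores_a, scores_b, grid_size=1000):
    """Discretized violation ratio behind almost stochastic dominance.

    Compares the empirical quantile functions of two score samples
    (e.g., accuracies over many random seeds). Returns 0.0 when A
    stochastically dominates B everywhere, values near 0.5 when neither
    system dominates, and values near 1.0 when B dominates A.
    """
    t = (np.arange(grid_size) + 0.5) / grid_size
    q_a = np.quantile(scores_a, t)  # empirical quantile function of A
    q_b = np.quantile(scores_b, t)
    gap = q_a - q_b
    total = np.sum(gap ** 2)
    if total == 0.0:
        return 0.5  # identical distributions: no dominance either way
    violation = np.sum(np.minimum(gap, 0.0) ** 2)  # where B beats A
    return violation / total
```
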
Large NLP models are often tested across a range of datasets, which presents
another problem for standard significance testing. Chapter 6 discusses the challenges
of assessing two questions: (1) On how many datasets does algorithm A outperform
algorithm B? (2) On which datasets does A outperform B? Applying standard sig-
nificance tests individually to each dataset and counting the number of significant
results is likely to overestimate the total number of significant results, as this chapter
explains. The authors then present a new framework for replicability analysis, based on
partial conjunction testing, and discuss two variants (Bonferroni and Fisher) for when the
datasets are independent or dependent. They introduce a method based on Benjamini
and Heller (2008) to count the number of datasets where one method outperforms
another, then show how to use the Holm procedure (Holm 1979) to identify which
datasets these are. Chapter 6 provides a lot of detailed background on the proposed
replicability analysis framework; the later sections again link the process to specific
NLP case studies, and step-by-step summaries help the reader to apply the methodol-
ogy. Extensive empirical results illustrate the very substantial differences in outcomes
between the proposed approach and standard procedures.
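
For that last step, identifying the significant datasets, the Holm procedure itself is short enough to sketch. This is the classic procedure only; the book's replicability framework adds the partial conjunction analysis around it.

```python
import numpy as np

def holm_rejections(p_values, alpha=0.05):
    """Holm's step-down procedure, controlling the family-wise error rate.

    Each p-value here would come from comparing the two algorithms on one
    dataset; the returned boolean mask marks the datasets on which the
    difference can be declared significant.
    """
    p_values = np.asarray(p_values)
    m = len(p_values)
    order = np.argsort(p_values)  # indices of p-values, smallest first
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return rejected
```
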

The final two chapters present open problems and conclude, showing that the topic
has many interesting research questions of its own, such as problems when performing
cross-validation, and the limited statistical power of replicability analysis.

Overall, I highly recommend this book to a wide range of NLP researchers, from
new students to seasoned experts who wish to ensure that they compare methods
effectively. The book is excellent as both an introduction to the topic of significance
testing and as a reference to use when evaluating your results. For anyone with further
interest in the topic, it also points the way to future work. If one could level any criticism
at this book at all, it is that it does not deeply discuss the basic flaws of significance
testing or what the alternatives might be. For now, though, significance testing is an
integral part of NLP research and this book provides a great resource for researchers
who wish to perform it correctly and painlessly.

References
Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. Scientists rise up against statistical significance. Nature, 567(7748):305–307. DOI: https://doi.org/10.1038/d41586-019-00857-9, PMID: 30894741.

Benjamini, Yoav and Ruth Heller. 2008. Screening for partial conjunction hypotheses. Biometrics, 64(4):1215–1222. DOI: https://doi.org/10.1111/j.1541-0420.2007.00984.x, PMID: 18261164.

Clark, Jonathan H., Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181, Portland, OR.

Dror, Rotem, Segev Shlomov, and Roi Reichart. 2019. Deep dominance—how to properly compare deep neural models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2773–2785, Florence. DOI: https://doi.org/10.18653/v1/P19-1266.

Hofmann, Marko A. 2015. Searching for effects in big data: Why p-values are not advised and what to use instead. In 2015 Winter Simulation Conference (WSC), pages 725–736, IEEE. DOI: https://doi.org/10.1109/WSC.2015.7408210.

Holm, Sture. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Reimers, Nils and Iryna Gurevych. 2018. Why comparing single performance scores does not allow to draw conclusions about machine learning approaches. arXiv preprint arXiv:1803.09578.


Edwin D. Simpson is a Lecturer in the Department of Computer Science, University of Bristol, UK.
His research focuses on natural language processing (NLP), with particular interest in
applying interactive machine learning and Bayesian techniques to NLP problems such as argu-
mentation, crowdsourced annotation, and text ranking. His e-mail address is edwin.simpson
@bristol.ac.uk.
