FORECASTING CONDITIONAL PROBABILITIES OF BINARY OUTCOMES
UNDER MISSPECIFICATION
Graham Elliott, Dalia Ghanem, and Fabian Krüger*
Abstract—We consider constructing probability forecasts from a parametric
binary choice model under a large family of loss functions (“scoring rules”).
Scoring rules are weighted averages over the utilities that heterogeneous
decision makers derive from a publicly announced forecast (Schervish,
1989). Using analytical and numerical examples, we illustrate how different
scoring rules yield asymptotically identical results if the model is correctly
specified. Under misspecification, the choice of scoring rule may be incon-
sequential under restrictive symmetry conditions on the data-generating
process. If these conditions are violated, typically the choice of a scoring
rule favors some decision makers over others.
I. Introduction
CONSIDER the problem of forecasting an as yet unob-
served outcome represented by the random variable Y ,
which takes on values {0, 1} with a vector of observables
X. The conditional probability that Y = 1 conditional on
X = x, denoted by p[X], is equivalent to the conditional
mean and distributional forecast for the binary outcome Y .
By contrast, a point forecast in this situation equals either 0
or 1. Point forecasting thus corresponds to choosing a binary
action, and it is natural if the forecaster and the decision
maker are one entity (Elliott & Lieli, 2013; Lieli & White,
2010). In situations where the forecaster and the decision
maker are separate entities, probability forecasts are often
provided since they allow decision makers to construct their
own point forecasts using their respective loss functions.
Examples of such “public forecasting” scenarios include
recession probabilities (e.g., forecasts by Hamilton & Chinn,
2014), sovereign default probabilities (e.g., Deutsche Bank,
2014), or probabilities of binary weather outcomes, such as
rain (e.g., Mass et al., 2009).
It is common in practice to estimate the conditional prob-
ability forecast using a parametric model, such as logit or
probit. To estimate this model, the forecaster must choose
a loss function. If the model is correctly specified, the con-
sistency and efficiency of the maximum likelihood estimator
(MLE) justify its choice as an estimation strategy. In practice,
Received for publication February 10, 2014. Revision accepted for
publication June 22, 2015. Editor: Mark W. Watson.
* Elliott: University of California, San Diego; Ghanem: University of Cal-
ifornia, Davis, and Giannini Foundation; Krüger: Heidelberg Institute for
Theoretical Studies (HITS).
We thank the editor and two anonymous referees, as well as seminar and
conference participants at the University of Konstanz (January 2013), Hei-
delberg University (May 2013), and IAAE (London, June 2014) for helpful
comments and suggestions. All numerical computations in this paper were
done using the R programming language (R Core Team, 2015), whereby the
ggplot2 package (Wickham, 2009) was used for some of the graphical illus-
trations. The third author thanks UCSD for its hospitality during a research
visit, as well as the Klaus Tschira Foundation for infrastructural support
at the Heidelberg Institute for Theoretical Studies (HITS). He gratefully
acknowledges funding from the Deutsche Forschungsgemeinschaft (grant
PO 375/13-1) and the European Union Seventh Framework Programme
(grant agreement no. 290976).
A supplemental appendix is available online at http://www.mitpress
journals.org/doi/suppl/10.1162/REST_a_00564.
the model is very likely to be misspecified, and the choice
of loss function will typically matter for the forecast, even
asymptotically. Loss functions for distributional forecasts,
such as predicted probabilities, are called “scoring rules”
(Gneiting & Raftery, 2007). The log score and the Brier
score, which give rise to the MLE and nonlinear least
squares, respectively, are widely used in practice. However,
there are many other scoring rules that may be of interest.
A first contribution of this paper is to review the scattered
theoretical literature on scoring rules for the binary case.
In the context of a double binary decision problem (two
outcomes, two possible actions), scoring rules are weighted
averages over the utility functions of heterogeneous deci-
sion makers (Shuford et al., 1966; Schervish, 1989). This
heterogeneity stems from the different costs that individ-
ual decision makers face under false positives versus false
negatives. For instance, in forecasting the probability of cur-
rency crises (Inoue & Rossi, 2008), currency traders’ costs of
false positives and false negatives will vary with their degree
of exposure. Another example is forecasting default proba-
bilities of federally insured student loans (Knapp & Seaks,
1992). Lenders will tend to prefer false positives to false
negatives, where the degree to which they prefer the former
over the latter depends on their exposure and other loans on
their menu. Student borrowers will prefer false negatives to
false positives at varying degrees depending on their needi-
ness, the loan amount, and how much they value their future
credit history.
Popular scoring rules are based on a certain symmetry in
these weighted averages. This symmetry may or may not
be appropriate for a given empirical problem. Asymmetric
rules relate to situations in which false negatives are much
more costly than false positives or vice versa. For example,
Lieli and Springborn (2013) analyze the environmental pol-
icy decision of whether to admit possibly invasive biological
imports. From the consumer’s point of view, it may be dev-
astating to mistakenly classify an import as safe; in contrast,
classifying a safe import as unsafe is typically less harmful.
From the importer’s perspective, Tuttavia, a false positive is
much more costly than a false negative. A symmetric scor-
ing rule, such as the log score, weighs the consumers’ and
importers’ utility functions equally, which may not be appro-
priate if the safety of the general public is at stake. There
are many other settings where symmetry may not be justi-
fied, such as forecasting natural disasters (say, wildfires or
earthquakes) and economic disasters (say, recessions).
Despite their empirical relevance, such asymmetries are
not reflected in common rules such as the log or Brier
score. We hence show how to construct proper asymmetric
scoring rules in the spirit of Buja, Stuetzle, and Shen (2005).
This involves prioritizing some decision makers over others,
and hence resembles the aggregation of utilities in a social
planner’s problem (Lieli & Nieto-Barthaburu, 2010). In
order to actually use a wide variety of scoring rules for
parameter estimation, convergence of the resulting estima-
tors is a key concern. We therefore provide conditions for
a weak law of large numbers, allowing for time series
dependence in the data.
Under misspecification, different choices of scoring rules
may lead to different estimates of the forecast model. Match-
ing the scoring rule used for parameter estimation with the
one used for forecast evaluation has been recommended in
the literature for this reason. Tuttavia, there has been less
work on examining the magnitude of these effects. In (non-
binary) MSE-based forecasting, Weiss and Andersen (1984)
make this point with respect to using autoregressions as fore-
cast models. Granger (1993) makes this suggestion without
elaboration, while Weiss (1996) makes the point more gen-
erally, giving results. Hand and Vinciotti (2003) make the
same suggestion in examining models for binary forecasting,
while Gneiting (2011) and Patton (2015) consider several
types of point forecasts.
We contribute to this literature by providing novel ana-
lytical and numerical evidence for the binary case. With the
aid of an analytical example, we illustrate how the choice of
scoring rule is inconsequential in the case of correct spec-
ification. In the case of misspecification, we characterize
the conditions under which the choice is inconsequential.
These conditions consist of specific symmetry conditions on
the data-generating process that are likely to be violated in
practice. A Monte Carlo study illustrates that if a subset of
the conditions is violated, the choice of scoring rule affects
parameter estimates, forecasts, and—most importantly—
decisions. While these effects are qualitatively robust, their
magnitude is necessarily case specific and depends on factors
such as the true DGP, the set of scoring rules being compared,
and the preferences of the decision maker. These preferences
determine whether the differences between two predicted
probabilities (obtained under scoring rules A and B, say) are
relevant in the sense of leading to different decisions.
The plan of this paper is as follows. Section II presents
some theoretical results. Section IIA reviews results that link
the scoring rule and the decision-maker loss functions. We
present results for the consistency of estimators using a wide
variety of scoring rules in section IIB. Section IIC uses an
analytical example to illustrate why matching loss functions
for estimation and evaluation may be important in the case of
misspecification. Section III provides a Monte Carlo demon-
stration of the theoretical points. Section IV concludes.
II. Scoring Rules, Estimation, and Evaluation
A. Characterization of Scoring Rules
Consider the problem of forecasting a binary random vari-
able Y with outcomes y ∈ {0, 1} given some predictors
denoted by the random variable(S) X with outcomes x ∈ X,
where X denotes the support of X.1 We denote by p[X] ∈ P
models of the conditional probability that Y = 1, Dove
p0[X] is the correctly specified model. We use brackets to
distinguish p[X] from the notation p(X). The latter notation
implies that p is a function defined on the support of X. For
p defined on the support of a linear index x′θ, p[x] = p(x′θ).
The brackets hence allow us to subsume θ. We do not assume
that p0[X] ∈ P. For notational brevity, we will often refer to
p and p0 instead of p[·] and p0[·]. A decision maker chooses
a function f (X) from the space of X to {0, 1}. The opti-
mal choice of this function depends on both the conditional
probability that Y = 1 and the utility function of the decision
maker. The decision maker's utility function has the form

U( y, f , c) =
    0           if f = 1 and y = 1
    −c          if f = 1 and y = 0
    −(1 − c)    if f = 0 and y = 1
    0           if f = 0 and y = 0,                          (1)
Dove 0 < c < 1. Now the utility function can be
rewritten as
U( y, f , c) = −c1( y − f = −1) − (1 − c)1( y − f = 1)
= −c(1 − y)f − (1 − c)y(1 − f ),
(2)
where 1(A) denotes the indicator function of the event A.
Note that the utility function is normalized such that a correct
decision yields 0 utility. The utilities for incorrect decisions
are normalized to sum to 1 in absolute value. This is without
loss of generality when U( y, f , c) depends only on these
two outcomes.2 Thus, c = 0.5 indicates a decision maker’s
indifference between false positives ( f = 1 and y = 0) and
false negatives ( f = 0 and y = 1).
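For concreteness, the payoff structure in equations (1) and (2) can be written out in a few lines of R. The snippet below is a minimal sketch for illustration only; the function name utility is ours and not from the paper.

# Utility of a decision maker with preference parameter c, given the
# outcome y and the point forecast f (both in {0, 1}); normalized as in
# equations (1)-(2): correct decisions yield 0, errors yield -c or -(1-c).
utility <- function(y, f, c) {
  -c * (1 - y) * f - (1 - c) * y * (1 - f)
}

utility(y = 0, f = 1, c = 0.25)   # false positive:  -c      = -0.25
utility(y = 1, f = 0, c = 0.25)   # false negative:  -(1-c)  = -0.75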
To motivate the problem, we use the example of forecast-
ing a storm at a coastal location. We consider two types of
decision makers, local restaurant owners and fishermen. Let
y denote whether a storm takes place or not. If f = 1, the
restaurant owners will allocate fewer staff members, since
they expect to be serving fewer customers. If f = 0, the
restaurants will hire their full staff. The fishermen will go
fishing only if f = 0. We expect restaurant owners to prefer
false negatives to false positives: c ≥ 0.5. In the case of a
false negative, tourists will be visiting the coastal location
expecting good weather. Since a storm occurs, they will be
spending more time at restaurants. Restaurant owners will
have hired their full staff, and hence are prepared to serve
many customers. In the case of a false positive, the restau-
rants do not hire additional staff, and fewer tourists will be
visiting the location. Thus, the restaurant owners’ profits will
1 We assume that outcomes from X are observable and that X does not
include lagged values of Y .
2 Elliott and Lieli (2013) consider the more general case in which utility
depends not only on the realization y and point forecast f , but also on other
state variables such as measured covariates x. In this case, the normaliza-
tions we use here are not available, and the utility function takes a more
complicated form.
be smaller. The fishermen are likely to prefer false positives
to false negatives: c ≤ 0.5. In the case of a false positive
(staying at home but no storm), even though they lose the
catch, they save on fuel. In the case of a false negative (going
fishing when there is a storm), they may lose their equipment
or even put their own life in danger. In reality, we have a
continuum of heterogeneous fishermen and restaurant own-
ers with different values of c. The exact value c ∈ [0, 0.5]
of a fisherman’s utility will depend on the value of his or
her equipment and the number of staff on his or her crew.
Similarly, a restaurant owner’s c ∈ [0.5, 1] will depend on
the restaurant size, menu, and how much additional staff he
or she hires.
Optimal forecasts for this problem are to set f (x) =
1( p0[x] > c) (see Schervish, 1989; Boyes, Hoffman, & Low,
1989; Granger & Pesaran, 2000). This result assumes the
knowledge of the true conditional probability. In practice,
the unknown true probability p0[X] is replaced by an esti-
mate p[X] that can be frequentist (as in our analysis below)
or Bayesian (Lieli & Springborn, 2013). The utility function
can now be written as

U( y, p, c) = −y(1 − c)1( p ≤ c) − (1 − y)c1( p > c).
(3)
It is important to note that the parameter c plays a dual
role in equation (3). In addition to determining the decision
maker’s preference over false positives and negatives, c is
part of the optimal forecasting rule and determines how the
decision maker interprets a probability forecast. If p ≤ c,
the decision maker will interpret it as f = 0; otherwise, the
decision maker will interpret it as f = 1. Coming back to our
example, consider a fisherman with c = 0.25 and a restaurant
owner with c = 0.75. In this case, the optimal forecasting
rule has the following implications. If p < 0.25, then nei-
ther the fisherman nor the restaurant owner will interpret
the probability forecast to indicate that a storm will occur.
If 0.25 ≤ p < 0.75, then only the fisherman will interpret
the probability forecast to indicate that f = 1 and will not
go fishing. If p ≥ 0.75, then both the restaurant owner and
the fisherman will interpret the probability forecast to indi-
cate that f = 1. This shows how the preference over false
positives and negatives informs the interpretation of the con-
ditional probability forecast under the optimal forecasting
rule.
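The dual role of c can be made concrete with a short R sketch (ours, purely illustrative); it applies the cutoff rule f = 1( p > c) to the fisherman (c = 0.25) and restaurant owner (c = 0.75) from the example above.

# Point forecast implied by a probability forecast p and a cutoff c
point_forecast <- function(p, c) as.numeric(p > c)

p_hat <- 0.6                        # announced probability of a storm
point_forecast(p_hat, c = 0.25)     # fisherman:        f = 1 (stays in port)
point_forecast(p_hat, c = 0.75)     # restaurant owner: f = 0 (hires full staff)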
To construct a forecast, we require an estimate of the true
conditional probability of Y = 1—an estimate of p0[x] or
a procedure that directly estimates 1( p0[x] > c). Manski
and Thompson (1989) and Elliott and Lieli (2013) examine
the latter approach and show how direct estimation lessens
the need for an exact understanding of the true conditional
probability. Essentially, the function 1( p0[X] > c) is easier
to specify correctly than p0[X] since the former is a step
function. Our example shows why we might estimate the
conditional probability instead, since it gives the individual
decision makers (cioè., the restaurant owners and fishermen)
the opportunity to interpret the forecast in a manner that is
optimal based on their own preferences. In general, when
there are users with a range of utility functions (cioè., values
for c), then provision of an estimate for p0[X] enables all
users to construct their own forecast rules (see Lieli & Nieto-
Barthaburu, 2010).
When constructing a model p for the conditional probabil-
ity, we require a scoring rule for estimation. By definition, a
proper scoring rule S( y, p) is a function for which E[S( y, p)]
is finite and maximized at p = p0. It is considered to be a
strictly proper scoring rule if this maximum is unique: the
rule is maximized only at the true value for the probabil-
ità (see Gneiting & Raftery, 2007). From an econometric
perspective, this means that the conditional probability is
identified by the scoring rule. For binary outcomes, all proper
scoring rules have the form

S( y, p) = yf1( p) + (1 − y)f2( p).
(4)
Schervish (1989, theorem 4.2) shows that proper scor-
ing rules can be seen as weighted averages of many utility
functions, where the weights are over different cutoff values
c. Denote by ν(c) a nonnegative weighting function over c
defined on [0, 1]. By integrating the utility for a single deci-
sion maker in equation (3), we obtain the weighted average
utility function
S( y, p) = −y ∫_0^1 (1 − c)1( p ≤ c)ν(c) dc
           − (1 − y) ∫_0^1 c 1( p > c)ν(c) dc.               (5)
ν(c) may be viewed as the density of the preference param-
eter c in a population of decision makers that the fore-
caster seeks to inform. Hence, it has an intuitive economic
interpretation.
Equating equations (4) E (5), we see that f1( P) =
(cid:10)
(cid:10)
1
1
0 (1 − c)1( p ≤ c)ν(C)dc and f2( P) = −
−
0 c1( p >
C)ν(C)dc. As Schervish (1989) shows, scoring rules with this
form are strictly proper if ν(C) gives a nonzero weighting
over all c ∈ [0, 1]. These results are useful in a number
of ways. Primo, through specification of ν(C), they provide
a constructive approach to designing scoring rules. Secondo,
for existing scoring rules, they show the implicit weight-
ing over decision makers’ utility functions that underlie the
construction of that particular scoring rule. Tavolo 1, Quale
extends table 1 of Gneiting and Raftery (2007), gives several
scoring rules and their implicit weights, ν(C). This includes
popular approaches. Notice that the log scoring rule is sim-
ply (pseudo) maximum likelihood for a parameterized model
of p[X]. This is the most common approach to parametri-
cally estimating models of the conditional probability, Dove
the models are typically either logit or probit. (See Lieli &
Nieto-Barthaburu, 2010, for an economic interpretation of
the weighting that underlies maximum likelihood.)
It is also possible, through defining f1( p), to provide a pos-
itive approach to constructing proper scoring rules.
Table 1.—Summary of the Scoring Rules We Consider

Name           f1( p)                          f2( p)                          ν(c)                       Source
Log            ln( p)                          ln(1 − p)                       [c(1 − c)]^(−1)            Good (1952)
(half) Brier   −(1/2)(1 − p)^2                 −(1/2) p^2                      1                          Brier (1950)
Spherical      p/(1 − 2p + 2p^2)^(1/2)         (1 − p)/(1 − 2p + 2p^2)^(1/2)   (1 − 2c + 2c^2)^(−3/2)     Toda (1963)
Boosting       −[(1 − p)/p]^(1/2)              −[ p/(1 − p)]^(1/2)             (1/2)[c(1 − c)]^(−3/2)     Buja et al. (2005)
As1            ln( p) − p + 1                  −p                              c^(−1)
As2            p − 1                           p + ln(1 − p)                   (1 − c)^(−1)               Gneiting and Raftery (2007)

See equations (4) and (5) for details.
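The implicit weights in the last column of table 1 are easy to inspect directly. The following base-R sketch (ours; the rescaling to a common maximum is an arbitrary display choice in the spirit of the note to figure 1) plots the six weighting functions ν(c).

# Weighting functions nu(c) from table 1, rescaled for comparability
nu <- list(
  log       = function(c) 1 / (c * (1 - c)),
  brier     = function(c) rep(1, length(c)),
  spherical = function(c) (1 - 2 * c + 2 * c^2)^(-3 / 2),
  boosting  = function(c) 0.5 * (c * (1 - c))^(-3 / 2),
  as1       = function(c) 1 / c,
  as2       = function(c) 1 / (1 - c)
)
cc   <- seq(0.01, 0.99, by = 0.01)
vals <- sapply(nu, function(f) { v <- f(cc); v / max(v) })
matplot(cc, vals, type = "l", lty = 1:6, col = 1:6,
        xlab = "c", ylab = "scaled weight nu(c)")
legend("top", legend = names(nu), lty = 1:6, col = 1:6, cex = 0.7)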
If f1( p) and f2( p) are once differentiable such that
∂f1( p)/∂p > 0 for p ∈ (0, 1) and

∂f2( p)/∂p = −[∂f1( p)/∂p][ p/(1 − p)],                      (6)

then S( y, p) in equation (4) is a proper scoring rule with

ν(c) = [∂f1(c)/∂c][1/(1 − c)].
This result was obtained by Shuford et al. (1966), restated
in theorem 4.1 of Schervish (1989). Notice that this relates
f1( p) to f2( p) through their slopes at any p. Hence, one
can construct a proper scoring rule by defining f1( p) and
constructing f2( p) using this restriction. For example, set
f1( p) = p, so ∂f1( p)/∂p = 1 > 0 for all p. Now ν(c) =
(1 − c)^(−1) > 0 for c ∈ (0, 1). Using this ν(c) results in a
proper scoring rule. To obtain f2( p), we solve
f2( p) = −∫_0^p c ν(c) dc = −∫_0^p [c/(1 − c)] dc.
Integrating, we obtain f2( p) = p + ln(1 − p) and hence
S( y, p) = yp + (1 − y)[ p + ln(1 − p)].
This is a strictly proper scoring rule. It is also worth men-
tioning that convex combinations of strictly proper scoring
rules are also strictly proper.3
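When the integral above has no convenient closed form, the same construction can be carried out numerically. The sketch below (ours, for illustration only) recovers f2 from f1( p) = p by integrating the slope restriction in equation (6) and checks the result against the closed form p + ln(1 − p).

# Equation (6) implies f2'(p) = -f1'(p) * p / (1 - p); with f1(p) = p we
# integrate this slope numerically from 0 to p to obtain f2(p).
f2_numeric <- function(p) {
  integrand <- function(u) -u / (1 - u)     # -f1'(u) * u / (1 - u), f1'(u) = 1
  integrate(integrand, lower = 0, upper = p)$value
}
p <- 0.7
f2_numeric(p)     # numerical value
p + log(1 - p)    # closed form from the text; both are about -0.504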
For either of these directions, it is an outcome of the pro-
cess that we obtain an understanding of ν(c), the weights
over the individual decision makers. The weighting functions
for various popular scoring rules given in table 1 are pictured
in figure 1.

3 To see this, consider the example of a convex combination of the log
score and As1 score, which we use as an example in section IIC. Each of
these scores is maximized in expectation by the true probability. Hence,
any convex combination of the two is also maximized in expectation by
the true probability and thus defines a strictly proper scoring rule itself.
Alternatively, note that a convex combination of the log and As1 scores
again satisfies the relationship in equation (6), and thus inherits strict
propriety.

Figure 1.—Weighting Functions ν(c) for the Scoring Rules in Table 1. All functions have been multiplied by a scaling factor for comparability.

We see that the log score and Boosting loss cor-
respond to U-shaped weighting functions, each placing very
similar weights over c. The weighting functions of the Brier
and spherical score are flat and bell shaped, respectively. All
of the popular rules are symmetric around c = 0.5, the point
of indifference between false negatives and false positives.
There is no obvious reason why this might be appropri-
ate in general for situations where distributional forecasts
are to be provided. In our simplified example of forecasting
storms, where we have the restaurant owners (c ≥ 0.5) E
the fishermen (c ≤ 0.5), the forecaster may prefer to weigh
the restaurant owners’ and fishermen’s utility functions, for
esempio, according to their proportion in the region’s popu-
lation or based on economic revenues. Thus, this weighting
may have a social, political, or economic motivation. This
paper is not concerned with the justification of a particular
weighting scheme over another. Our goal is to show that the
choice of the weighting function and thereby the scoring rule
may have consequences on conditional probability forecasts
and individual decision making in practice.
Buja et al. (2005) and Merkle and Steyvers (2013) use
the beta distribution to parameterize the weighting function
ν(c). This leads to a flexible two-parameter family of scoring
rules. A somewhat simpler approach is to directly choose a
given shape for ν(c). This is exemplified by the As1 and As2
scoring rules shown in figure 1. In the first of these rules, we
set ν(c) so that it heavily weights small values of c relative
to large values. This would be a situation where forecasters
that are extremely averse to losses from false negatives are
heavily weighted. The specification of As2 does the reverse
of this, heavily weighting forecasters who are heavily averse
to false positives.4 By specifying ν(c) directly according to
a reasonable weighting function, the results presented above
allow us to construct economically meaningful scoring rules
that are strictly proper. This situation, which draws on the
existence of the Schervish (1989) representation for scor-
ing rules, helps to bridge the gap between economic and
statistical forecast evaluation criteria.
B. Scoring Rules and Parameter Estimation
For all of the choices of proper scoring rules, one can
consider estimating p[x] using the scoring rule as a loss func-
tion. Consider linear index models: p[x] = p(x′θ). With data
{yt, xt}, t = 1, . . . , T, we can consider estimating the
parameters of the model from the maximization

ˆθ = arg max_{θ∈Θ} Σ_{t=1}^T S( yt, p(x′_tθ)).

For example, for the log scoring rule and p(x′_tθ) =
e^{x′_tθ}/(1 + e^{x′_tθ}), this would be maximum likelihood of the
logit model. For the same model with the half Brier score as
the scoring rule, this would be nonlinear least squares estimation of
the logit model. Various combinations of scoring rules and
models could be employed to obtain a parameter estimate
ˆθ, and from this, an estimate of the conditional probability
that the outcome is 1, p(x′ˆθ), for any possible x. Under fairly
general conditions, ˆθ p→ θ∗, where

θ∗ = arg max_{θ∈Θ} E[S( yt, p(x′_tθ))].
The following theorem provides a set of conditions for
achieving this consistency result. As detailed below, IL
theorem can be seen as a special case of M-estimation
(Wooldridge, 1994), adapted to the situation of using strictly
proper scoring rules and binary models.
Theorem 1. Assume:

a. S(.) is a strictly proper scoring rule.
b. θ ∈ Θ ⊂ R^k, where Θ is compact.
c. For each yt ∈ {0, 1}, xt ∈ X, f1( p), f2( p) are measur-
   able and differentiable in p and p(xt, θ) is measurable
   and differentiable in θ.
d. E|fi( p(x′_tθ))|^{r+δ} < Δ < ∞ for i = 1, 2, δ > 0, r ≥ 1,
   and sup_{xt∈X} sup_{θ∈Θ} |(1 − p(x′_tθ))^{−1} f′_1( p(x′_tθ)) p′(x′_tθ)| <
   K < ∞.
e. {yt, xt} are strictly stationary mixing processes with uni-
   form mixing size −r/(2r − 1) with r ≥ 1 or strong
   mixing of size −r/(r − 1), r > 1.
f. E‖xt‖^{r+λ} < Δ1 < ∞ for r ≥ 1 and λ > 0.

Then ˆθ p→ θ∗ where θ∗ = arg max_{θ∈Θ} E[S( yt, p(x′_tθ))].

4 Gneiting and Raftery (2007, example 5) show that the As2 rule is also a
member of the Buja et al. (2005) family of scoring rules.
Proof. The proof follows from using these conditions
to show that the conditions of theorems 4.2 and 4.3
of Wooldridge (1994) hold. First, via theorem 4.2, the
conditions are sufficient that max_{θ∈Θ} |T^{−1} Σ_{t=1}^T S( yt, p(x′_tθ))
− E[S( yt, p(x′_tθ))]| p→ 0, so the objective function converges
to its expected value uniformly in θ. Conditions b and c yield
the theorem's conditions i and ii. For theorem 4.2 part iv,
notice that for all θ1, θ2 ∈ Θ, the mean value theorem and
equation (6) yield that

|S( yt, p(x′_tθ1)) − S( yt, p(x′_tθ2))|
  = | f′_1( p(x′_t ˜θ)) p′(x′_t ˜θ) [( yt − p(x′_t ˜θ))/(1 − p(x′_t ˜θ))] x′_t(θ1 − θ2)|
  ≤ | f′_1( p(x′_t ˜θ)) p′(x′_t ˜θ) [( yt − p(x′_t ˜θ))/(1 − p(x′_t ˜θ))]| ‖xt‖ ‖θ1 − θ2‖
  = | f′_1( p(x′_t ˜θ)) p′(x′_t ˜θ) [( yt − p(x′_t ˜θ))/(1 − p(x′_t ˜θ))]| (x′_t xt)^{1/2} ‖θ1 − θ2‖
  ≤ K (x′_t xt)^{1/2} ‖θ1 − θ2‖,

where ˜θ is an intermediate value for θ. Parts iii and ivb
require that S( yt, p(x′_tθ)) and (x′_t xt)^{1/2} satisfy a WLLN point-
wise in θ. These results follow from assumptions d through
f, which are sufficient for a WLLN via corollary 3.48 of
White (2001). Notice that E[S( yt, p(x′_tθ))^q] for any integer
q is equal to E[ yt f1( p(x′_tθ))^q + (1 − yt) f2( p(x′_tθ))^q], which
is finite given d for q ≤ r + δ. A WLLN for (x′_t xt)^{1/2} fol-
lows directly from assumptions e and f. Consistency of the
estimate ˆθ follows from the conditions being sufficient for
theorem 4.3 of Wooldridge (1994). Conditions M1 and M2
follow directly from assumptions b and c, along with the
results presented for uniform convergence of the average of
the objective function. Condition M3 follows directly from
assumption a.
Assumption a ensures that the scoring rule is strictly
proper and thus admits the decomposition in equation (4).
Furthermore, it ensures that the objective function attains a
unique maximum at θ∗. If S(.) is relaxed to be a proper
scoring rule that is not necessarily strictly proper, then
a result similar to theorem 1 still holds. In this case,
E[S( yt, p(x′_t ˆθ))] p→ max_{θ∈Θ} E[S( yt, p(x′_tθ))], which is suf-
ficient to justify the procedure. Assumption b is the standard
requirement to ensure a maximum. Assumption c is a stan-
dard regularity condition. The conditions in d relate to the
scoring rule and model being employed, and are functions
of both of these choices. The first part of d ensures that
expected loss exists. The second part is employed as part
of the requirements for uniform consistency of the objec-
tive function. This assumption seems strong; however, it
holds widely since these objects are functions of p(x′_tθ),
which is bounded between 0 and 1 for all xt and θ. For
notational brevity, let st = x′_tθ. For example, consider the
half Brier score with a logit model for the conditional
probability that yt = 1. Then f′_1( p(st)) = 1 − p(st) and
p′(st) = p(st)(1 − p(st)), so |(1 − p(st))^{−1} f′_1( p(st)) p′(st)| =
|p(st)(1 − p(st))| ≤ 0.5; hence, the second part of assump-
tion d is satisfied. In this case, E|f1( p(st))|^{r+δ} =
E|−0.5(1 − p(st))^2|^{r+δ} ≤ 0.5^{r+δ}, and so is finite for all r, δ
finite. For the log score with a logit model for the con-
ditional probability, we have that f′_1( p(st)) p′(st) = (1 − p(st))
and so |(1 − p(st))^{−1} f′_1( p(st)) p′(st)| = 1. For the
spherical scoring rule, f′_1( p(st)) = (1 − p(st))/(( p(st))^2 +
(1 − p(st))^2)^{3/2}. Hence, |(1 − p(st))^{−1} f′_1( p(st)) p′(st)| =
|( p(st)(1 − p(st)))/(( p(st))^2 + (1 − p(st))^2)^{3/2}| ≤ 2^{−1/2},
and so this is also bounded.
The mixing assumptions in e impose a limit on the degree
of time series dependence in the data. The requirement of
strict stationarity gives meaning to the idea that we obtain
the true conditional probability, at least asymptotically, Quando
the model is correctly specified. If the data are not strictly
stationary, then we can still obtain consistency results; how-
ever, the interpretation of θ∗ changes to being a limiting value
that minimizes the average expected losses over time. È
worth noting that some strictly proper scoring rules, ad esempio
the half Brier and spherical scores considered in this paper,
are bounded. For bounded scoring rules, the assumptions of
theorem 1 can be relaxed. Primo, the technical requirements
of assumption d either become obsolete or are trivially sat-
isfied. Secondo, assumption f, which ensures a WLLN for
(x′_t xt)^{1/2}, is not required.5

5 To see this, note that if sup_{θ∈Θ} |S( yt, p(x′_tθ))| < C, it follows that
|S( yt, p(x′_tθ1)) − S( yt, p(x′_tθ2))| < 2C. This means that the stochastic upper
bound in the proof of theorem 1 can be replaced by a constant. Hence, a
WLLN for the upper bound is trivially satisfied, and assumption f becomes
obsolete. Furthermore, the second part of assumption d, which is used in the
general case to bound the term |S( yt, p(x′_tθ1)) − S( yt, p(x′_tθ2))|, is no longer
required. Finally, the first part of assumption d is automatically satisfied
since the fi( p(x′_tθ)), i = 1, 2, are bounded.

Conceptually, the main purpose of theorem 1 is to illus-
trate that the structure of scoring rules makes them well
suited for designing (consistent) parameter estimators in
the tradition of M-estimators (e.g., Hayashi, 2000; see also
Gneiting & Raftery, 2007). There are many possible sets
of assumptions (e.g., various ways of restricting time series
dependence) that could lead to the statement of theorem 1.
Our chosen set of assumptions aims to strike a balance
between generality and clarity of presentation, although
results under alternative trade-offs between conditions are
possible. Furthermore, results under more primitive condi-
tions are available in more specialized settings. For example,
de Jong and Woutersen (2011) analyze consistency in the
important special case of a correctly specified probit model
with lagged dependent variables, estimated via maximum
likelihood. See their theorems 1 and 2 for low-level con-
ditions that guarantee limited dependence properties of the
data-generating process and their theorem 3 on consistency.
In terms of understanding the results for forecasting binary
outcomes, two results follow directly. First, if the model is
correctly specified, that is, p0[xt] = p(x′_tθ0), then for all
strictly proper scoring rules, θ∗ = θ0. Hence, all strictly
proper scoring rules will give the same true conditional prob-
ability asymptotically. This follows directly from the fact that
strictly proper scoring rules are uniquely maximized by the
true conditional probability. Thus, if the model is correctly
specified, the choice of the best scoring rule to use depends
not on the reasonableness of θ∗ but instead on the efficiency
of the estimator ˆθ obtained by maximizing a particular scor-
ing rule. The popularity of the MLE derives from the fact
that it is an efficient parameter estimator under correct spec-
ification. However, it should be understood that the latter
strong assumption is crucial in establishing the optimality
of the MLE.
When the model is not correctly specified, there is no
reason that θ∗ should be the same over different scoring
rules. In practice, they will differ, and then so will the esti-
mated conditional probability even asymptotically. Hence,
decisions made for any particular loss function for the deci-
sion maker (value for c) will also differ across scoring rules.
Scoring rules placing more weight on high values of c will
provide probability forecasts, which are most useful for deci-
sion makers with high values of c, and vice versa. In order
to attain (asymptotic) optimality, the scoring rule chosen for
estimating the parameters of the model should match the
scoring rule used to evaluate the probability forecast. Under
the conditions above, the magnitude of this effect depends
on how θ∗ varies with the choice of scoring rule. Since both
scoring rules and models tend to be very nonlinear, this rela-
tionship will generally be complex. The analytical example
in section IIC and the Monte Carlo results in section III
provide evidence on this issue.
In choosing between scoring rules, a forecaster needs to
trade off the loss from using a scoring rule other than the
log score under correct specification with the gains this
approach brings when the model is misspecified. The first
consideration then is how plausible it would be to assume
that the model is correctly specified. In most applications,
especially those unmotivated by any underlying economic
or scientific theory, this would be a difficult assumption
to make. Nonetheless, it would generally be considered,
and the answer is specific to the forecasting problem.
The second consideration is how large the gains are from
using the matching strategy under misspecification of the
model.
C. Misspecification: An Analytical Example
In order to illustrate how the choice of scoring rule mat-
ters, we will give an analytical example where we examine
the effect of trading off between two specific scoring rules
on θ∗. The scoring rules we consider are given in table 1:
the log and As1 scores. The former is the log likelihood;
the latter is an asymmetric scoring rule that emphasizes a
better fit for smaller probabilities versus larger probabilities.
We can write a composite scoring rule indexed by λ ∈ [0, 1]
that nests both of them:
Sλ( y, p(x′θ)) = y ln( p(x′θ)) + yλ(1 − p(x′θ))
              + (1 − y)(1 − λ) ln(1 − p(x′θ))
              + (1 − y)(−λ)p(x′θ).                            (7)

For λ = 0, S0( y, p(x′θ)) gives the log score. For λ = 1,
S1( y, p(x′θ)) is the As1 score. For all λ ∈ [0, 1], Sλ( y, p(x′θ))
is a proper scoring rule, since propriety carries over to convex
combinations of two proper scoring rules.
Let θ∗ denote the maximizer of the objective function
defined by the scoring rule, with

θ∗ = arg max_{θ∈Θ} E_{X,Y}[Sλ(Y , p(X′θ))]
   = arg max_{θ∈Θ} E_X[E_{Y|X}[Sλ(Y , p(X′θ))]].

We can write the conditional expectation in the above
objective function as follows:

E_{Y|X}[Sλ(Y , p(X′θ))] = p0[X] ln( p(X′θ)) + λp0[X]
                        + (1 − p0[X])(1 − λ) ln(1 − p(X′θ)) − λp(X′θ).

For simplicity, we assume in the following that X is scalar. As
demonstrated in section A of the online appendix, extending
the example to include an intercept is possible but appears
to complicate the analysis without a compensating gain in
insight.
The probit and logit link functions p(.) are by far the most
common choices in the literature.6 Koenker and Yoon (2009)
survey several other choices. Here, we do not make spe-
cific assumptions about p, except that it does not depend on
estimands other than θ. Assuming sufficient regularity con-
ditions to apply the dominated convergence theorem (DCT),
the first-order condition for a maximum is

∫_X [∂E_{Y|X}[Sλ(Y , p(X′θ))]/∂θ]|_{X=x} fX(x) dx = 0,

where fX(.) is the unconditional probability density func-
tion (pdf) of X. Computing the expression explicitly and
subsuming x, we obtain

∫_X [ p0 (1/p − λ) − (1 − p0)((1 − λ)/(1 − p) + λ)] p′ x fX(x) dx = 0,    (8)

where p′ ≡ ∂p(z)/∂z |_{z=xθ∗}.
The first-order condition in equation (8) gives a highly
nonlinear implicit characterization of the limiting parameter
estimate θ∗. However, our setting (involving a single param-
eter λ to characterize the employed scoring rule) allows us
to nevertheless analyze how θ∗ varies across scoring rules.
Again assuming applicability of the DCT and using implicit
differentiation, we obtain

∂θ∗/∂λ = [ ∫_X (( p0 − p)pp′/( p(1 − p))) x fX(x) dx ]
       / [ ∫_X (( p0 − p)(1 − λp)p″/( p(1 − p))
             − ( p0(1 − p)^2 + (1 − p0)(1 − λ)p^2)p′^2/( p^2(1 − p)^2)) x^2 fX(x) dx ].    (9)

Equation (9) makes explicit how a change in the scoring rule,
which is expressed here by differentiating with respect to λ,
affects the probability limit θ∗ of the parameter estimator.
The denominator of the above expression is always negative
because it is the second derivative of the objective function,
E_{X,Y}[Sλ(Y , p(X′θ))], evaluated at θ∗. However, the sign of
the numerator may change depending on the truth, the model,
and the density of X.
Under correct specification ( p0[X] = p(X′θ0)), the numer-
ator in equation (9) becomes 0, and we get ∂θ∗/∂λ = 0. This
result holds irrespective of the link function p and the dis-
result holds irrespective of the link function p and the dis-
tribution fX(x). It mirrors the fact that the composite scoring
rule in equation (7) is strictly proper for any λ ∈ [0, 1]. This
implies that θ∗ = θ0 regardless of the value of λ. Hence, the
choice of scoring rule is irrelevant under correct specifica-
tion, at least in terms of the probability limit of the parameter
estimator.
Under misspecification, it still holds that the denominator
of equation (9) is always negative. Hence, the numerator
will determine the sign of ∂θ∗/∂λ. We first examine when
the sign is 0 (i.e., the choice of scoring rule does not matter)
in the following theorem.
Theorem 2. Assume

a. p0[x] = 1 − p0[−x].
b. p[x] = 1 − p[−x], p is differentiable in x.
c. fX(x) = fX(−x), fX ≥ 0, dim(X) = dim(θ) = 1, E[X] = 0.

In addition, assume conditions a–f in theorem 1 hold, and
define X+ = X ∩ [0, ∞) and X− = X ∩ (−∞, 0].

6 In the probit case, p(·) is the cumulative distribution function (cdf) of a
standard normal variable. For the logit, p(z) = [1 + exp(−z)]^{−1} is the cdf
of a standard logistic variable.
Then

∂θ∗/∂λ = 0

if and only if

∫_{X+} ( p0[x] − p[x])^+ [ p′[x]x/( p[x](1 − p[x]))] fX(x) dx
  = ∫_{X+} ( p0[x] − p[x])^− [ p′[x]x/( p[x](1 − p[x]))] fX(x) dx.      (10)
By symmetry, equation (10) also holds on X−.
Proof. Relating to the numerator of equation (9), we define

g[x] = [( p0[x] − p[x]) p[x] p′[x]/( p[x](1 − p[x]))] x.

Assumptions a and b from the statement of the theorem
imply that

p0[x] − p[x] = −( p0[−x] − p[−x]),
p[x](1 − p[x]) = p[−x](1 − p[−x]),
p′[x] = p′[−x],

assuming that p is differentiable in x. Note that the first equal-
ity means that the approximation error due to the chosen
model is point symmetric about the origin. The latter equal-
ities depend on the approximation model only and hold for
commonly used specifications such as probit and logit. These
relationships, together with calculations detailed in section
A of the online appendix, imply that

∫_X g[x] fX(x) dx = ∫_{X+} ( p0[x] − p[x]) [ p′[x]x/( p[x](1 − p[x]))] fX(x) dx.

Note that all quantities on the right-hand side of the last
equality are nonnegative, except for p0[x] − p[x]. Thus, the
integral equals 0 if and only if

∫_{X+} ( p0[x] − p[x])^+ [ p′[x]x/( p[x](1 − p[x]))] fX(x) dx
  = ∫_{X+} ( p0[x] − p[x])^− [ p′[x]x/( p[x](1 − p[x]))] fX(x) dx,

where (h[x])^+ = |h[x]|1{sign(h[x]) = +1} and (h[x])^− =
|h[x]|1{sign(h[x]) = −1}. By symmetry, the same holds for
the integral of the same term over X−.
In theorem 2, assumptions a and b are point-symmetry
conditions on p0 and p, respectively, and c ensures that X is
symmetric about the origin. Together, assumptions a and b
imply the point symmetry of p0 − p about the origin.
Theorem 2 gives precise conditions under which the
choice of scoring rule, and thus the weighting over decision
makers, has no impact on the limiting parameter estimate θ∗.
The intuition here is that on each part of the support of X, the
model overestimates and underestimates the true conditional
probability. In addition, the upward and downward predic-
tion error averaged as in the above equation are equal on X+
and X−, respectively. This excludes a situation where the
model tends to overestimate the true conditional probability
on X+ and underestimate it on X−, or vice versa. In practice,
the symmetry requirements imposed on the model (assump-
tion b in theorem 2) hold for the widely used logit and
probit specifications. The symmetry requirements imposed
on the true conditional probability, as well as the density of
X (assumptions a and c in theorem 2), may be unrealistic
for many applications. We next provide Monte Carlo evi-
dence on the role of scoring rules under various scenarios
that either do or do not satisfy the conditions of theorem 2.
III. Numerical Demonstration of the Results
This section illustrates the results in theorem 2 for both
the asymptotic problem and finite samples. We consider four
data-generating processes (DGPs) given in table 2, as well
as the scoring rules in table 1. In DGP #1, the symmetry
conditions on the conditional probability under the true and
misspecified models as well as on the marginal distribution
of X, assumptions a to c in theorem 2, are fulfilled. DGP #2
and #3 are variants of DGP #1 where the symmetry condition
on the true conditional probability (a) and on the distribu-
tion of X (c), respectively, are violated. DGP #4 presents an
example where both conditions a and c are violated.7
A. Asymptotic Problem
The necessary and sufficient conditions in equation (10)
of theorem 2 give us an insight into how certain symmetry
conditions jointly determine whether the choice over scoring
rules has an effect on the estimated parameter and, thereby,
the conditional probability. In this section, we seek to illus-
trate this insight for the asymptotic problem. To do so, for
each scoring rule j, we compute θ∗_j for the misspecified logit
model given in table 2 numerically, based on a sample of size
1,000,000. The parameters are reported in table 3. Since all
other elements of the forecast are identical, the difference
between the conditional probability under various scoring
rules is due to the difference in θ∗_j. Figure 2 plots the con-
ditional probability for all scoring rules under DGPs #1 to
#4.8 The plot for DGP #1 gives a case where the choice
7 In fact, the true conditional probability p0(X) is not even defined for
X < 0.
8 For better display, figures 2, 3, and 5 omit the results for the Brier and
Boosting scores. The results for Brier are similar to the ones for the spherical
score, and the results for Boosting are similar to the ones for the log score.
The online appendix contains color versions of the figures, covering all six
scoring rules.
Table 2.—DGPs Used in Simulation

DGP #     p0(X)                  fX              Correctly Specified Model ( p0)    Misspecified Model ( p)
1         F(−0.5X + 0.2X^3)      U(−2.5, 2.5)    F(θ1X + θ2X^3)                     F(θX)
2^b       F(−0.5X + 0.2X^2)      U(−2.5, 2.5)    F(θ1X + θ2X^3)                     F(θX)
3^a       F(−0.5X + 0.2X^3)      U(−1, 4)        F(θ1X + θ2X^3)                     F(θX)
4^{a,b}   F(√X)                  U(0, 10)        F(θ√X)                             F(θX)

fX denotes the pdf of X, F(s) = exp(s)/(1 + exp(s)) denotes the cdf of the logistic distribution, and U(a, b) denotes the uniform distribution with limits a and b. DGP #1 is taken from Elliott and Lieli (2013) and fulfills conditions a–c of theorem 2.
a Indicates that a DGP violates condition c of theorem 2.
b Indicates that a DGP violates condition a of theorem 2.
Table 3.—Asymptotic Parameter Estimates for Various Scoring Rules j, Based on 1,000,000 Observations

θ∗_j         DGP #1    DGP #2    DGP #3    DGP #4
Log          0.22      −0.44     0.6       0.43
Brier        0.22      −0.44     0.51      0.57
Spherical    0.21      −0.44     0.45      0.65
Boosting     0.22      −0.43     0.68      0.39
As1          0.22      −0.33     0.46      0.61
As2          0.22      −0.64     0.66      0.41
of the scoring rule has no effect on the conditional proba-
bility. This is true not only for the choice between the log
and As1 scores, as shown in theorem 2, but also holds for
all other scoring rules we examine. The plot clearly shows
other implications of theorem 2: the prediction error is point-
symmetric about the origin, the prediction error changes its
sign on X+ and X−, and a weighted average of the positive
and negative prediction error on X+ as well as X− would
be equal as indicated by equation (10). For all the other
DGPs, where either conditions a or c or both are violated, our
numerical results clearly show that the choice of scoring rule
has an effect on the conditional probability approximation.
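A computation along these lines can be sketched in a few lines of R (ours, not the authors' code); it reuses the fit_score() helper and the f1, f2 pairs from the sketch in section IIB, draws a large sample from DGP #2, and fits the misspecified logit model under the log and As1 scores (table 3 reports −0.44 and −0.33 for these two cases).

# Approximate theta*_j for DGP #2 under two scoring rules (cf. table 3)
as1_f1 <- function(p) log(p) - p + 1
as1_f2 <- function(p) -p

set.seed(2)
n <- 1e6
x <- runif(n, -2.5, 2.5)
y <- rbinom(n, 1, plogis(-0.5 * x + 0.2 * x^2))   # DGP #2 of table 2
X <- cbind(x)
fit_score(y, X, log_f1, log_f2)   # log score
fit_score(y, X, as1_f1, as1_f2)   # As1 score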
Figure 2.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the predicted conditional probability, whereby the parameter θ∗_j is computed using a sample of 1,000,000 draws.
Figure 3.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the classification curves of the conditional probability given x = 2 (θ∗_j is computed using a sample of size 1,000,000).
For DGP #2, where only the symmetry condition on the true
conditional probability is violated, the only scoring rules that
result in different predicted conditional probabilities than
the log score are the asymmetric scoring rules. For DGPs
#3 and #4, we observe differences in the predicted prob-
abilities for all pairs of scoring rules, even if both scoring
rules under comparison are symmetric (such as the log versus
Brier score).
As discussed earlier, the binary action of an individual
decision maker (such as a fisherman or restaurant owner in
our example) is determined by whether the predicted prob-
ability, F(θ∗_j X), exceeds the threshold c. For a given value
of the regressor X, the chosen action may thus depend on
the scoring rule j used for parameter estimation. Figure 3
illustrates this point for x = 2. It shows that the scoring
rules generally yield different classification curves. Rather
than looking at a single design point (such as x = 2 above),
we next consider a broader summary measure of differences
between scoring rules,
P[ sign{F(θ∗_j X) − c} ≠ sign{F(θ∗_{log score} X) − c} ],              (11)
which is the probability (computed over the distribution of
X) that scoring rule j implies a different binary action than
the log score. Figure 4 shows how the choice of scoring
rules under DGP #1 is inconsequential, in the sense of almost
always leading to identical decisions. For DGP #2, the choice
between the log and the two asymmetric rules is the only one
that leads to different classifications. For c = 0.6 and c =
0.7, the probability of different classifications is 0.05 and
0.12, respectively. For DGPs #3 and #4, the choice between
the log and any other scoring rule leads to different classifica-
tions. What is particularly interesting here is that even though
choosing between the log and other symmetrical rules, such
as Brier, may be relatively inconsequential for c = 0.6, it can
lead to a 0.15 and 0.175 probability of different classifica-
tions at greater values of c for DGP #3 and #4, respectively.
Thus, whether the differences across scoring rules matter
depends on a decision maker’s preferences as embodied
in c.
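The probability in equation (11) is straightforward to approximate by simulation. In the sketch below (ours, for illustration), the limiting parameters are taken from table 3 for DGP #3, X is drawn from its U(−1, 4) distribution, and the comparison is between the Brier- and log-score-based forecasts.

# P(classification differs from the log score), equation (11), for DGP #3
disagree <- function(theta_j, theta_log, c, n = 1e6) {
  x <- runif(n, -1, 4)                                   # f_X for DGP #3
  mean(sign(plogis(theta_j * x) - c) != sign(plogis(theta_log * x) - c))
}
disagree(theta_j = 0.51, theta_log = 0.6, c = 0.7)       # Brier vs. log score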
As a further check, figure 5 provides evidence on the prob-
ability that the estimator defined by scoring rule j delivers a
correct classification. This probability is given by

P[ sign{F(θ∗_j X) − c} = sign{ p0(X) − c} ].                          (12)
Figure 4.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the unconditional probability of different classifications using the log score as opposed to other scoring rules; see equation (11). θ∗_j is computed using a sample of size 1,000,000.
Put slightly differently, this number represents the probabil-
ity that a decision maker with preference parameter c makes
a correct decision when using a prediction model fitted via
scoring rule j. Figure 5 shows how the ranking of the scor-
ing rules differs across thresholds c. For example, consider
the comparison of As1 and As2 in the figure for DGP #2
(upper-right panel): While As1 performs better for values of
c between 0.2 and 0.4, the reverse is true for c lying between
0.6 and 0.8. This result is closely in line with the fact that,
when used as an estimation criterion, As1 places an emphasis
on fitting small thresholds c correctly, whereas As2 focuses
on high values of c (see section IIA). This analysis demon-
strates how forecasters who are willing to favor a certain
clientele (say, decision makers characterized by small values
c, such as the fishermen in our example) can achieve this goal
by issuing predictions based on an appropriate scoring rule
(in this case, As1).
B. Finite-Sample Results
All of our results until now are for the case in which
the limiting parameter values θ∗
j are known. We now briefly
turn to the effects of sampling uncertainty. Specifically, we
consider a rolling window estimation scheme for θ∗_j, which
is popular in practice (see the discussion by Giacomini &
White, 2006, p. 1548), using a window length of 120. Fur-
thermore, we consider a forecast evaluation period of 100
periods.9 In each Monte Carlo iteration, we thus simulate
120 + 100 observations. For the first rolling window, we
use observations 1 to 120 to estimate the parameter θ and
make a forecast for observation 121. For the second rolling
9 These sample sizes are typical in forecasting studies using quarterly
macroeconomic data, for example, when using an estimation sample from
1960 to 1989 (30 × 4 = 120 observations) and an evaluation sample from
1990 to 2014 (25 × 4 = 100 observations).
Figure 5.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the probability of correct classification (see equation [12]) for various thresholds c. θ∗_j is computed using a sample of size 1,000,000.
window, we use observations 2 to 121 for estimation and
make a forecast for observation 122, and so forth.
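A bare-bones version of this rolling-window exercise is sketched below (ours; it again relies on the fit_score() helper and the log-score pair from section IIB, with window length 120 and 100 out-of-sample periods as in the text).

# Rolling-window estimation and one-step-ahead probability forecasts
roll_forecast <- function(y, x, f1, f2, window = 120) {
  n     <- length(y)
  p_hat <- rep(NA_real_, n)
  for (t in (window + 1):n) {
    idx      <- (t - window):(t - 1)
    theta    <- fit_score(y[idx], cbind(x[idx]), f1, f2)
    p_hat[t] <- plogis(theta * x[t])
  }
  p_hat
}

set.seed(3)
x <- runif(220, -2.5, 2.5)
y <- rbinom(220, 1, plogis(-0.5 * x + 0.2 * x^2))      # DGP #2, 120 + 100 obs
p_log <- roll_forecast(y, x, log_f1, log_f2)
oos   <- 121:220
mean(y[oos] * log(p_log[oos]) + (1 - y[oos]) * log(1 - p_log[oos]))  # avg. out-of-sample log score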
The online appendix reports variants of figures 2 and 3 for
the rolling window case, which we construct by averaging
the probability and classification curves for each estimate
of θ. These figures show that on average, the rolling win-
dow parameter estimates are very similar to their asymptotic
counterparts.
Theorem 1 implies that in the asymptotic case, it is gener-
ally optimal to use the same scoring rule for estimation and
evaluation. We next analyze to what extent this statement
carries over to the rolling window scenario. To this end,
table 4 summarizes the parameter estimates and predictive
performance obtained under each scoring rule. The median
estimates for each scoring rule (upper panel of table 4) are
very close to their asymptotic limits in table 3. This is in
line with the similarity of the prediction and classification
curves noted. Estimators defined by alternative scoring rules
clearly differ in their sampling variability, as measured by
interdecile ranges (middle panel of table 4), especially for
DGPs #3 and 4.10 That said, there is no simple relationship
between the choice of scoring rule and the variability of the
estimator it defines. For example, the spherical score defines
the most precise estimator for DGP #1, whereas it defines
the (by far) least precise estimator for DGP #4.
The bottom panel of table 4 compares the forecast perfor-
mance of two strategies: (a) using the same scoring rule for
estimation and evaluation (“matching”) and (b) simply using
MLE for parameter estimation while using a different scor-
ing rule for evaluation. In order to compare the performance
of the two, we simply report the share of Monte Carlo itera-
tions for which the first strategy performs better, whereby we
average over an out-of-sample period of 100 observations in
each Monte Carlo iteration. This share has a natural scal-
ing between 0 and 1. It is therefore easily interpretable and
comparable across scoring rules.11
10 We use interdecile ranges, rather than variances, to eliminate the effect of outliers, which are not surprising given the scale of our Monte Carlo experiment (for each scoring rule and DGP, we compute 100 times 10,000 rolling window estimates). Measuring the estimators' variability by the median absolute deviation from the median (instead of the interdecile range) leads to the same qualitative interpretations.

11 By contrast, the scoring rules themselves have no natural scaling, which makes it hard to judge whether differences of a given magnitude are practically relevant, and also impedes comparisons of effect sizes across scoring rules.
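As a small illustration of the two variability measures mentioned in footnote 10, the R snippet below computes the interdecile range and the unscaled median absolute deviation from the median for a vector of Monte Carlo estimates; the vector theta_draws is invented for illustration.

## Hypothetical vector of rolling-window Monte Carlo estimates (invented for illustration)
theta_draws <- rnorm(10000, mean = 0.2, sd = 0.3)

interdecile <- diff(quantile(theta_draws, probs = c(0.1, 0.9)))  # 90th minus 10th percentile
mad_median  <- mad(theta_draws, constant = 1)   # median absolute deviation from the median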
Table 4.—Summary Results for Rolling Windows (Window Length 120)

                                   DGP #1   DGP #2   DGP #3   DGP #4
Median estimate θ̂
  Log                               0.22    −0.44     0.60     0.43
  Brier                             0.21    −0.44     0.51     0.58
  Spherical                         0.21    −0.45     0.46     0.67
  Boosting                          0.22    −0.44     0.68     0.40
  As1                               0.22    −0.34     0.46     0.62
  As2                               0.22    −0.65     0.66     0.42
Interdecile range of estimates θ̂
  Log                               0.32     0.36     0.26     0.23
  Brier                             0.31     0.37     0.20     0.83
  Spherical                         0.29     0.38     0.19     1.58
  Boosting                          0.33     0.36     0.32     0.22
  As1                               0.34     0.30     0.20     1.10
  As2                               0.33     0.57     0.27     0.22
Share of MC iterations for which matching scores better than MLEᵃ
  Brier                             0.76     0.44     0.75     0.47
  Spherical                         0.75     0.42     0.82     0.52
  Boosting                          0.23     0.52     0.53     0.50
  As1                               0.43     0.73     0.80     0.52
  As2                               0.45     0.57     0.59     0.52

ᵃ In each MC iteration, we compute the average score over an out-of-sample period of 100 observations. “Matching” means using the same scoring rule for estimation and evaluation.
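The comparison reported in the bottom panel can be sketched in R as follows, here using the Brier score as the evaluation rule: in each Monte Carlo iteration we estimate the model both by minimizing the Brier score ("matching") and by logit maximum likelihood, average each strategy's out-of-sample Brier score over the 100 evaluation periods, and record which strategy scores better. The data-generating process, the single-index model, and the number of iterations are placeholders rather than the paper's settings.

set.seed(3)
R <- 120     # rolling window length
P <- 100     # out-of-sample evaluation periods
n_mc <- 200  # placeholder number of Monte Carlo iterations

forecast  <- function(theta, x) plogis(theta * x)                  # logit-link forecast model
brier_obj <- function(theta, y, x) mean((y - forecast(theta, x))^2)

avg_scores <- t(replicate(n_mc, {
  x <- rnorm(R + P)
  y <- rbinom(R + P, size = 1, prob = pnorm(0.5 * x + 0.3 * x^2))  # placeholder DGP
  s_match <- s_mle <- numeric(P)
  for (i in seq_len(P)) {
    win <- i:(i + R - 1)
    th_match <- optimize(brier_obj, c(-10, 10), y = y[win], x = x[win])$minimum
    th_mle   <- coef(glm(y[win] ~ x[win] - 1, family = binomial))  # logit MLE (no intercept)
    s_match[i] <- (y[i + R] - forecast(th_match, x[i + R]))^2      # out-of-sample Brier scores
    s_mle[i]   <- (y[i + R] - forecast(th_mle, x[i + R]))^2
  }
  c(match = mean(s_match), mle = mean(s_mle))
}))

## Share of iterations in which "matching" attains the lower average Brier score
mean(avg_scores[, "match"] < avg_scores[, "mle"])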
In thirteen of the twenty cases, shares in the bottom panel
of table 4 are strictly above 0.5, indicating better perfor-
mance of matching compared to MLE. For some of these
cases, we find that matching leads to a substantially differ-
ent median estimate as well as lower variability as measured
by the interdecile range. This holds true for As1 under DGP
#2, as well as Brier, Spherical, and As1 under DGP #3. There
are also some cases where the matching estimator is more
variable than MLE but nevertheless performs better out of
sample. This happens for As2 under DGP #2, Boosting and
As2 under DGP #3, as well as Spherical under DGP #4. In
these cases, it seems that the relative gain from using an esti-
mator that converges to the maximand of the scoring rule in
question outweighs the relative loss in precision. For another
subset of the cases where matching performs better, such as
Brier and Spherical under DGP #1, Boosting under DGP #2,
and As2 under DGP #4, the medians and interdecile ranges
of the MLE and matching estimator are practically indistin-
guishable. Our conjecture is that in these cases, the matching
strategy’s improvement over MLE is marginal. Along the
same lines, the six cases in which MLE does better than
matching appear very close, with both strategies attaining
similar medians and interdecile ranges.
To summarize, our results show that the “correct location” of the matching estimator puts it at an advantage over the MLE, which under misspecification generally does not converge to the maximand of the scoring rule in question. To compensate for this, the MLE must be more precise (smaller interdecile range) in order to outperform matching in terms of out-of-sample scores.
IV. Conclusion
This paper explores the nuances in forecasting conditional
probabilities under misspecification. The natural choice
under correct specification, regardless of the scoring rule
used for out-of-sample evaluation, is indeed MLE. It is not
only consistent for the maximand of the scoring rule in ques-
tion but also efficient. Under misspecification, however, there
is no clear natural choice. The MLE is neither consistent for
the maximand of the scoring rule in question nor necessarily
“efficient” in the sense of attaining lower sampling variabil-
ity than other estimators. The paper shows in an analytical
example that under certain symmetry conditions, the choice
of scoring rule is inconsequential for parameter estimation.
With the aid of numerical results for the asymptotic prob-
lem, we then illustrate how the violation of these conditions
can lead to different probability limits of the parameter esti-
mators and different conditional probability forecasts. We
also show how these different forecasts would lead to dif-
ferent interpretations by heterogeneous decision makers. In
finite samples, we find an interesting relationship between
the sampling distribution of the parameter estimators and the
relative performance of the MLE (compared to the estimator
that maximizes the scoring rule considered for evaluation).
Finally, our analysis has conceptual implications pertain-
ing to the literature on distributional forecasting. It has been
argued (Geweke & Amisano, 2011) that the provision of
distributional forecasts is superior to the provision of point
forecasts because distributional forecasts can be employed
to construct point forecasts for any loss function. While this
argument seems valid in many situations (see section I), it
should not be misunderstood as saying that distributional
forecasts are “loss function independent.” Specifically, this
paper illustrates that probability forecasts—which are clearly
distributional—are not loss function independent. A loss
function is required for estimation, and this choice entails explicit trade-offs regarding which aspects of the data to fit well, at the cost of neglecting other aspects.
REFERENCES
Boyes, William J., Dennis L. Hoffman, and Stuart A. Low, “An Econo-
metric Analysis of the Bank Credit Scoring Problem,” Journal of
Econometrics 40 (1989), 3–14.
Brier, Glenn W., “Verification of Forecasts Expressed in Terms of Proba-
bility,” Monthly Weather Review 78 (1950), 1–3.
Buja, Andreas, Werner Stuetzle, and Yi Shen, “Loss Functions for Binary
Class Probability Estimation and Classification: Structure and
Applications,” unpublished manuscript, Duke University (2005).
de Jong, Robert M., and Tiemen Woutersen, “Dynamic Time Series Binary
Choice,” Econometric Theory 27 (2011), 673–702.
Deutsche Bank, “Sovereign Default Probabilities Online” (2014), http://www.dbresearch.com/servlet/reweb2.ReWEB?rwnode=DBR_INTERNET_EN-PROD$NAVIGATION&rwobj=CDS.calias&rwsite=DBR_INTERNET_EN-PROD.
Elliott, Graham, and Robert P. Lieli, “Predicting Binary Outcomes,”
Journal of Econometrics 174 (2013), 15–26.
Geweke, John W., and Gianni Amisano, “Optimal Prediction Pools,”
Journal of Econometrics 164 (2011), 130–141.
Giacomini, Raffaella, and Halbert White, “Tests of Conditional Predictive
Ability,” Econometrica 74 (2006), 1545–1578.
Gneiting, Tilmann, “Making and Evaluating Point Forecasts,” Journal of
the American Statistical Association 106 (2011), 746–762.
Gneiting, Tilmann, and Adrian E. Raftery, “Strictly Proper Scoring Rules,
Prediction, and Estimation,” Journal of the American Statistical
Association 102 (2007), 359–378.
Good, Irving J., “Rational Decisions,” Journal of the Royal Statistical
Society, Series B 14 (1952), 107–114.
Granger, Clive W. J., “On the Limitations of Comparing Mean Square
Forecast Errors: Comment,” Journal of Forecasting 12 (1993),
651–652.
Granger, Clive W. J., and M. Hashem Pesaran, “Economic and Statistical
Measures of Forecast Accuracy,” Journal of Forecasting 19 (2000),
537–560.
Hamilton, James D., and Menzie D. Chinn, “Econbrowser—Analysis of Current Economic Conditions and Policy” (2014), http://www.econbrowser.com.
Hand, David J., and Veronica Vinciotti, “Local versus Global Models
for Classification Problems: Fitting Models Where It Matters,”
American Statistician 57 (2003), 124–131.
Hayashi, Fumio, Econometrics (Princeton, NJ: Princeton University Press,
2000).
Inoue, Atsushi, and Barbara Rossi, “Monitoring and Forecasting Currency
Crises,” Journal of Money, Credit and Banking 40 (2008), 523–534.
Knapp, Laura G., and Terry G. Seaks, “An Analysis of the Probability of
Default on Federally Guaranteed Student Loans,” this review 74
(1992), 404–411.
Koenker, Roger, and Jungmo Yoon, “Parametric Links for Binary Choice
Models: A Fisherian–Bayesian Colloquy,” Journal of Econometrics
152 (2009), 120–130.
Lieli, Robert P., and Augusto Nieto-Barthaburu, “Optimal Binary Predic-
tion for Group Decision Making,” Journal of Business and Economic
Statistics 28 (2010), 308–319.
Lieli, Robert P., and Michael Springborn, “Closing the Gap between
Risk Estimation and Decision Making: Efficient Management of
Trade-Related Invasive Species Risk,” this review 95 (2013),
632–645.
Lieli, Robert P., and Halbert White, “The Construction of Empirical
Credit Scoring Rules Based on Maximization Principles,” Journal
of Econometrics 157 (2010), 110–119.
Manski, Charles F., and T. Scott Thompson, “Estimation of Best Predictors
of Binary Response,” Journal of Econometrics 40 (1989), 97–123.
Mass, Clifford, Jeff Baars, Susan Joslyn, John Pyle, Patrick Tew-
son, David Jones, Tilmann Gneiting, Adrian E. Raftery,
J. McLean Sloughter, and Chris Fraley, “PROBCAST: A Web-
Based Portal to Mesoscale Probabilistic Forecasts,” Bulletin of the
American Meteorological Society 90 (2009), 1009–1014.
Merkle, Edgar C., and Mark Steyvers, “Choosing a Strictly Proper Scoring
Rule,” Decision Analysis 10 (2013), 292–304.
Patton, Andrew J., “Comparing Possibly Misspecified Forecasts,” unpub-
lished manuscript, Duke University (2015).
R Core Team, “R: A Language and Environment for Statistical Com-
puting” (Vienna, Austria: R Foundation for Statistical Computing,
2015).
Schervish, Mark J., “A General Method for Comparing Probability Asses-
sors,” Annals of Statistics 17 (1989), 1856–1879.
Shuford, Emir H., Arthur Albert, and H. Edward Massengill, “Admissible
Probability Measurement Procedures,” Psychometrika 31 (1966),
125–145.
Toda, Masanao, “Measurement of Subjective Probability Distribution,”
Report 3 (State College, PA: Institute for Research, Division of
Mathematical Psychology, 1963).
Weiss, Andrew A., “Estimating Time Series Models Using the Relevant
Cost Function,” Journal of Applied Econometrics 11 (1996), 539–
560.
Weiss, Andrew A., and Allan P. Andersen, “Estimating Time Series Models
Using the Relevant Forecast Evaluation Criterion,” Journal of the
Royal Statistical Society, Series A 147 (1984), 484–487.
White, Halbert, Asymptotic Theory for Econometricians, 2nd ed. (San
Diego, CA: Academic Press, 2001).
Wickham, Hadley, ggplot2: Elegant Graphics for Data Analysis (New York: Springer, 2009).
Wooldridge, Jeffrey M., “Estimation and Inference for Dependent Pro-
cesses” in Robert F. Engle and Daniel L. McFadden, eds., Handbook
of Econometrics, vol. 4 (Amsterdam: Elsevier Science, 1994).