FORECASTING CONDITIONAL PROBABILITIES OF BINARY OUTCOMES
UNDER MISSPECIFICATION

Graham Elliott, Dalia Ghanem, and Fabian Krüger*

Abstract—We consider constructing probability forecasts from a parametric
binary choice model under a large family of loss functions (“scoring rules”).
Scoring rules are weighted averages over the utilities that heterogeneous
decision makers derive from a publicly announced forecast (Schervish,
1989). Using analytical and numerical examples, we illustrate how different
scoring rules yield asymptotically identical results if the model is correctly
specified. Under misspecification, the choice of scoring rule may be incon-
sequential under restrictive symmetry conditions on the data-generating
process. If these conditions are violated, typically the choice of a scoring
rule favors some decision makers over others.

I. Introduction

CONSIDER the problem of forecasting an as yet unobserved
outcome represented by the random variable Y,
which takes on values {0, 1} with a vector of observables
X. The conditional probability that Y = 1 conditional on
X = x, denoted by p[x], is equivalent to the conditional
mean and distributional forecast for the binary outcome Y.
By contrast, a point forecast in this situation equals either 0
or 1. Point forecasting thus corresponds to choosing a binary
action, and it is natural if the forecaster and the decision
maker are one entity (Elliott & Lieli, 2013; Lieli & White,
2010). In situations where the forecaster and the decision
maker are separate entities, probability forecasts are often
provided since they allow decision makers to construct their
own point forecasts using their respective loss functions.
Examples of such “public forecasting” scenarios include
recession probabilities (e.g., forecasts by Hamilton & Chinn,
2014), sovereign default probabilities (e.g., Deutsche Bank,
2014), or probabilities of binary weather outcomes, such as
rain (e.g., Mass et al., 2009).

It is common in practice to estimate the conditional prob-
ability forecast using a parametric model, such as logit or
probit. To estimate this model, the forecaster must choose
a loss function. If the model is correctly specified, the con-
sistency and efficiency of the maximum likelihood estimator
(MLE) justify its choice as an estimation strategy. In practice,

Received for publication February 10, 2014. Revision accepted for
publication June 22, 2015. Editor: Mark W. Watson.

* Elliott: University of California, San Diego; Ghanem: University of Cal-
ifornia, Davis, and Giannini Foundation; Krüger: Heidelberg Institute for
Theoretical Studies (HITS).

We thank the editor and two anonymous referees, as well as seminar and
conference participants at the University of Konstanz (January 2013), Heidelberg
University (May 2013), and IAAE (London, June 2014) for helpful
comments and suggestions. All numerical computations in this paper were
done using the R programming language (R Core Team, 2015), whereby the
ggplot2 package (Wickham, 2009) was used for some of the graphical illus-
trations. The third author thanks UCSD for its hospitality during a research
visit, as well as the Klaus Tschira Foundation for infrastructural support
at the Heidelberg Institute for Theoretical Studies (HITS). He gratefully
acknowledges funding from the Deutsche Forschungsgemeinschaft (grant
PO 375/13-1) and the European Union Seventh Framework Programme
(grant agreement no. 290976).

A supplemental appendix is available online at
http://www.mitpressjournals.org/doi/suppl/10.1162/REST_a_00564.

the model is very likely to be misspecified, and the choice
of loss function will typically matter for the forecast, even
asymptotically. Loss functions for distributional forecasts,
such as predicted probabilities, are called “scoring rules”
(Gneiting & Raftery, 2007). The log score and the Brier
score, which give rise to the MLE and nonlinear least
squares, respectively, are widely used in practice. However,
there are many other scoring rules that may be of interest.

A first contribution of this paper is to review the scattered
theoretical literature on scoring rules for the binary case.
In the context of a double binary decision problem (two
outcomes, two possible actions), scoring rules are weighted
averages over the utility functions of heterogeneous decision
makers (Shuford et al., 1966; Schervish, 1989). This
heterogeneity stems from the different costs that individ-
ual decision makers face under false positives versus false
negatives. For instance, in forecasting the probability of cur-
rency crises (Inoue & Rossi, 2008), currency traders’ costs of
false positives and false negatives will vary with their degree
of exposure. Another example is forecasting default proba-
bilities of federally insured student loans (Knapp & Seaks,
1992). Lenders will tend to prefer false positives to false
negatives, where the degree to which they prefer the former
over the latter depends on their exposure and other loans on
their menu. Student borrowers will prefer false negatives to
false positives at varying degrees depending on their needi-
ness, the loan amount, and how much they value their future
credit history.

Popular scoring rules are based on a certain symmetry in
these weighted averages. This symmetry may or may not
be appropriate for a given empirical problem. Asymmetric
rules relate to situations in which false negatives are much
more costly than false positives or vice versa. For example,
Lieli and Springborn (2013) analyze the environmental pol-
icy decision of whether to admit possibly invasive biological
imports. From the consumer’s point of view, it may be dev-
astating to mistakenly classify an import as safe; in contrast,
classifying a safe import as unsafe is typically less harmful.
From the importer’s perspective, however, a false positive is
much more costly than a false negative. A symmetric scor-
ing rule, such as the log score, weighs the consumers’ and
importers’ utility functions equally, which may not be appro-
priate if the safety of the general public is at stake. There
are many other settings where symmetry may not be justified,
such as forecasting natural disasters (say, wildfires or
earthquakes) and economic disasters (say, recessions).

Despite their empirical relevance, such asymmetries are
not reflected in common rules such as the log or Brier
score. We hence show how to construct proper asymmetric
scoring rules in the spirit of Buja, Stuetzle, and Shen (2005).
This involves prioritizing some decision makers over others,

The Review of Economics and Statistics, October 2016, 98(4): 742–755
© 2016 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. Published under a Creative Commons Attribution 3.0
Unported (CC BY 3.0) license.
doi:10.1162/REST_a_00564

and hence resembles the aggregation of utilities in a social
planner’s problem (Lieli & Nieto-Barthaburu, 2010). In
order to actually use a wide variety of scoring rules for
parameter estimation, convergence of the resulting estima-
tors is a key concern. We therefore provide conditions for
a weak law of large numbers, allowing for time series
dependence in the data.

Under misspecification, different choices of scoring rules
may lead to different estimates of the forecast model. Match-
ing the scoring rule used for parameter estimation with the
one used for forecast evaluation has been recommended in
the literature for this reason. However, there has been less
work on examining the magnitude of these effects. In (non-
binary) MSE-based forecasting, Weiss and Andersen (1984)
make this point with respect to using autoregressions as fore-
cast models. Granger (1993) makes this suggestion without
elaboration, while Weiss (1996) makes the point more gen-
erally, giving results. Hand and Vinciotti (2003) make the
same suggestion in examining models for binary forecasting,
while Gneiting (2011) and Patton (2015) consider several
types of point forecasts.

We contribute to this literature by providing novel ana-
lytical and numerical evidence for the binary case. With the
aid of an analytical example, we illustrate how the choice of
scoring rule is inconsequential in the case of correct spec-
ification. In the case of misspecification, we characterize
the conditions under which the choice is inconsequential.
These conditions consist of specific symmetry conditions on
the data-generating process that are likely to be violated in
practice. A Monte Carlo study illustrates that if a subset of
the conditions is violated, the choice of scoring rule affects
parameter estimates, forecasts, and—most importantly—
decisions. While these effects are qualitatively robust, their
magnitude is necessarily case specific and depends on factors
such as the true DGP, the set of scoring rules being compared,
and the preferences of the decision maker. These preferences
determine whether the differences between two predicted
probabilities (obtained under scoring rules A and B, say) are
relevant in the sense of leading to different decisions.

The plan of this paper is as follows. Section II presents
some theoretical results. Section IIA reviews results that link
the scoring rule and the decision-maker loss functions. We
present results for the consistency of estimators using a wide
variety of scoring rules in section IIB. Section IIC uses an
analytical example to illustrate why matching loss functions
for estimation and evaluation may be important in the case of
misspecification. Section III provides a Monte Carlo demon-
stration of the theoretical points. Section IV concludes.

II. Scoring Rules, Estimation, and Evaluation

A. Characterization of Scoring Rules

Consider the problem of forecasting a binary random variable
Y with outcomes y ∈ {0, 1} given some predictors
denoted by the random variable(s) X with outcomes x ∈ 𝒳,
where 𝒳 denotes the support of X.¹ We denote by p[x] ∈ 𝒫
models of the conditional probability that Y = 1, where
p0[x] is the correctly specified model. We use brackets to
distinguish p[x] from the notation p(x). The latter notation
implies that p is a function defined on the support of X. For
p defined on the support of a linear index x′θ, p[x] = p(x′θ).
The brackets hence allow us to subsume θ. We do not assume
that p0[x] ∈ 𝒫. For notational brevity, we will often refer to
p and p0 instead of p[·] and p0[·]. A decision maker chooses
a function f(x) from the space of X to {0, 1}. The optimal
choice of this function depends on both the conditional
probability that Y = 1 and the utility function of the decision
maker. The decision maker’s utility function has the form

\[
U(y, f, c) =
\begin{cases}
0 & \text{if } f = 1 \text{ and } y = 1 \\
-c & \text{if } f = 1 \text{ and } y = 0 \\
-(1 - c) & \text{if } f = 0 \text{ and } y = 1 \\
0 & \text{if } f = 0 \text{ and } y = 0
\end{cases},
\tag{1}
\]

where 0 < c < 1. Now the utility function can be rewritten as

\[
U(y, f, c) = -c\,\mathbf{1}(y - f = -1) - (1 - c)\,\mathbf{1}(y - f = 1) = -c(1 - y)f - (1 - c)y(1 - f), \tag{2}
\]

where $\mathbf{1}(A)$ denotes the indicator function of the event A. Note
that the utility function is normalized such that a correct
decision yields 0 utility. The utilities for incorrect decisions
are normalized to sum to 1 in absolute value. This is without
loss of generality when U(y, f, c) depends only on these two
outcomes.² Thus, c = 0.5 indicates a decision maker’s
indifference between false positives (f = 1 and y = 0) and
false negatives (f = 0 and y = 1).

1 We assume that outcomes from X are observable and that X does not
include lagged values of Y.

2 Elliott and Lieli (2013) consider the more general case in which utility
depends not only on the realization y and point forecast f, but also on other
state variables such as measured covariates x. In this case, the normalizations
we use here are not available, and the utility function takes a more
complicated form.

To motivate the problem, we use the example of forecasting
a storm at a coastal location. We consider two types of
decision makers, local restaurant owners and fishermen. Let
y denote whether a storm takes place or not. If f = 1, the
restaurant owners will allocate fewer staff members, since
they expect to be serving fewer customers. If f = 0, the
restaurants will hire their full staff. The fishermen will go
fishing only if f = 0. We expect restaurant owners to prefer
false negatives to false positives: c ≥ 0.5. In the case of a
false negative, tourists will be visiting the coastal location
expecting good weather. Since a storm occurs, they will be
spending more time at restaurants. Restaurant owners will
have hired their full staff, and hence are prepared to serve
many customers. In the case of a false positive, the restaurants
do not hire additional staff, and fewer tourists will be
visiting the location. Thus, the restaurant owners’ profits will
be smaller. The fishermen are likely to prefer false positives
to false negatives: c ≤ 0.5. In the case of a false positive
(staying at home but no storm), even though they lose the
catch, they save on fuel. In the case of a false negative (going
fishing when there is a storm), they may lose their equipment
or even put their own life in danger. In reality, we have a
continuum of heterogeneous fishermen and restaurant owners
with different values of c. The exact value c ∈ [0, 0.5] of
a fisherman’s utility will depend on the value of his or her
equipment and the number of staff on his or her crew.
Similarly, a restaurant owner’s c ∈ [0.5, 1] will depend on the
restaurant size, menu, and how much additional staff he or
she hires.

Optimal forecasts for this problem are to set f(x) =
1(p0[x] > c) (see Schervish, 1989; Boyes, Hoffman, & Low,
1989; Granger & Pesaran, 2000). This result assumes the
knowledge of the true conditional probability. In practice,
the unknown true probability p0[x] is replaced by an estimate
p[x] that can be frequentist (as in our analysis below)
or Bayesian (Lieli & Springborn, 2013). The utility function
can now be written as

\[
U(y, p, c) = -y(1 - c)\,\mathbf{1}(p \le c) - (1 - y)\,c\,\mathbf{1}(p > c). \tag{3}
\]

It is important to note that the parameter c plays a dual
role in equation (3). In addition to determining the decision
maker’s preference over false positives and negatives, c is
part of the optimal forecasting rule and determines how the
decision maker interprets a probability forecast. If p ≤ c,
the decision maker will interpret it as f = 0; otherwise, the
decision maker will interpret it as f = 1. Coming back to our
example, consider a fisherman with c = 0.25 and a restaurant
owner with c = 0.75. In this case, the optimal forecasting
rule has the following implications. If p ≤ 0.25, then neither
the fisherman nor the restaurant owner will interpret the
probability forecast to indicate that a storm will occur. If
0.25 < p ≤ 0.75, then only the fisherman will interpret the
probability forecast to indicate that f = 1 and will not go
fishing. If p > 0.75, then both the restaurant owner and the
fisherman will interpret the probability forecast to indicate
that f = 1. This shows how the preference over false
positives and negatives informs the interpretation of the
conditional probability forecast under the optimal forecasting
rule.

To construct a forecast, we require an estimate of the true
conditional probability of Y = 1: either an estimate of p0[x]
or a procedure that directly estimates 1(p0[x] > c). Manski
and Thompson (1989) and Elliott and Lieli (2013) examine
the latter approach and show how direct estimation lessens
the need for an exact understanding of the true conditional
probability. Essentially, the function 1(p0[x] > c) is easier
to specify correctly than p0[x] since the former is a step
function. Our example shows why we might estimate the
conditional probability instead, since it gives the individual
decision makers (i.e., the restaurant owners and fishermen)
the opportunity to interpret the forecast in a manner that is
optimal based on their own preferences. In general, when
there are users with a range of utility functions (i.e., values
for c), then provision of an estimate for p0[x] enables all
users to construct their own forecast rules (see Lieli & Nieto-
Barthaburu, 2010).
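
The threshold rule f = 1(p > c) and the utility in equations (1) and (2) can be sketched in a few lines. This is only an illustration in Python (the paper's own computations use R); the function names `utility`, `utility_closed_form`, `decide`, and `expected_utility` are hypothetical, and the c values 0.25 and 0.75 are those of the fisherman and restaurant owner in the example above.

```python
def utility(y, f, c):
    """Piecewise utility of equation (1): 0 for a correct decision,
    -c for a false positive, -(1 - c) for a false negative."""
    if f == 1 and y == 0:
        return -c
    if f == 0 and y == 1:
        return -(1.0 - c)
    return 0.0


def utility_closed_form(y, f, c):
    """Closed form of equation (2)."""
    return -c * (1 - y) * f - (1.0 - c) * y * (1 - f)


def decide(p, c):
    """Threshold rule: act as if a storm will occur (f = 1) when p > c."""
    return 1 if p > c else 0


def expected_utility(f, p0, c):
    """Expected utility of action f when the true storm probability is p0."""
    return p0 * utility(1, f, c) + (1.0 - p0) * utility(0, f, c)


# Hypothetical labels; c values taken from the storm example in the text.
decision_makers = {"fisherman": 0.25, "restaurant owner": 0.75}

for p in (0.10, 0.50, 0.90):
    actions = {name: decide(p, c) for name, c in decision_makers.items()}
    print(f"p = {p:.2f}: {actions}")
```

The same probability forecast is turned into different actions by different decision makers, which is the reason for publishing p rather than a single point forecast.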

When constructing a model p for the conditional probability,
we require a scoring rule for estimation. By definition, a
proper scoring rule S(y, p) is a function for which E[S(y, p)]
is finite and maximized at p = p0. It is considered to be a
strictly proper scoring rule if this maximum is unique: the
rule is maximized only at the true value for the probability
(see Gneiting & Raftery, 2007). From an econometric
perspective, this means that the conditional probability is
identified by the scoring rule. For binary outcomes, all proper
scoring rules have the form

\[
S(y, p) = y f_1(p) + (1 - y) f_2(p). \tag{4}
\]

Schervish (1989, theorem 4.2) shows that proper scoring
rules can be seen as weighted averages of many utility
functions, where the weights are over different cutoff values
c. Denote by ν(c) a nonnegative weighting function over c
defined on [0, 1]. By integrating the utility for a single decision
maker in equation (3), we obtain the weighted average
utility function

\[
S(y, p) = -y \int_0^1 (1 - c)\,\mathbf{1}(p \le c)\,\nu(c)\,dc - (1 - y) \int_0^1 c\,\mathbf{1}(p > c)\,\nu(c)\,dc. \tag{5}
\]

ν(c) may be viewed as the density of the preference parameter
c in a population of decision makers that the forecaster
seeks to inform. Hence, it has an intuitive economic
interpretation.
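
As a numerical illustration of equation (5), the integral can be approximated for a given weighting function and compared with a known closed form. The sketch below is an assumption-laden illustration in Python: `schervish_score`, `nu_uniform`, and `nu_log` are hypothetical names, the midpoint rule and the grid size are arbitrary choices, and the two weighting functions are the ones associated with the half Brier score (ν(c) = 1) and the log score (ν(c) = [c(1 − c)]^(−1)).

```python
import math


def schervish_score(y, p, nu, n=20000):
    """Approximate equation (5) with a midpoint rule on [0, 1].

    y  : realized outcome (0 or 1)
    p  : probability forecast in (0, 1)
    nu : weighting function over the cutoff c
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        c = (i + 0.5) * h  # midpoint of the i-th cell, never exactly 0 or 1
        if y == 1 and p <= c:
            total -= (1.0 - c) * nu(c) * h
        elif y == 0 and p > c:
            total -= c * nu(c) * h
    return total


def nu_uniform(c):
    """Flat weighting, associated with the half Brier score."""
    return 1.0


def nu_log(c):
    """U-shaped weighting, associated with the log score."""
    return 1.0 / (c * (1.0 - c))


p = 0.3
print("uniform, y=1:", schervish_score(1, p, nu_uniform), "vs", -0.5 * (1 - p) ** 2)
print("uniform, y=0:", schervish_score(0, p, nu_uniform), "vs", -0.5 * p ** 2)
print("log,     y=1:", schervish_score(1, p, nu_log), "vs", math.log(p))
print("log,     y=0:", schervish_score(0, p, nu_log), "vs", math.log(1 - p))
```

Each printed pair should agree up to the discretization error of the midpoint rule.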

Equating equations (4) and (5), we see that
$f_1(p) = -\int_0^1 (1 - c)\,\mathbf{1}(p \le c)\,\nu(c)\,dc$ and
$f_2(p) = -\int_0^1 c\,\mathbf{1}(p > c)\,\nu(c)\,dc$.
As Schervish (1989) shows, scoring rules with this
form are strictly proper if ν(c) gives a nonzero weighting
over all c ∈ [0, 1]. These results are useful in a number
of ways. First, through specification of ν(c), they provide
a constructive approach to designing scoring rules. Second,
for existing scoring rules, they show the implicit weighting
over decision makers’ utility functions that underlies the
construction of that particular scoring rule. Table 1, which
extends table 1 of Gneiting and Raftery (2007), gives several
scoring rules and their implicit weights, ν(c). This includes
popular approaches. Notice that the log scoring rule is simply
(pseudo) maximum likelihood for a parameterized model
of p[x]. This is the most common approach to parametrically
estimating models of the conditional probability, where
the models are typically either logit or probit. (See Lieli &
Nieto-Barthaburu, 2010, for an economic interpretation of
the weighting that underlies maximum likelihood.)

Table 1. Summary of the Scoring Rules We Consider

Name           f1(p)                  f2(p)                      ν(c)                      Source
Log            ln(p)                  ln(1 − p)                  [c(1 − c)]^(−1)           Good (1952)
(half) Brier   −(1/2)(1 − p)^2        −(1/2)p^2                  1                         Brier (1950)
Spherical      p/√(1 − 2p + 2p^2)     (1 − p)/√(1 − 2p + 2p^2)   (1 − 2c + 2c^2)^(−3/2)    Toda (1963)
Boosting       −√((1 − p)/p)          −√(p/(1 − p))              (1/2)[c(1 − c)]^(−3/2)    Buja et al. (2005)
As1            ln(p) − p + 1          −p                         c^(−1)
As2            p − 1                  p + ln(1 − p)              (1 − c)^(−1)              Gneiting and Raftery (2007)

See equations (4) and (5) for details.

Figure 1. Weighting Functions ν(c) for the Scoring Rules in Table 1

All functions have been multiplied by a scaling factor for comparability.

It is also possible through defining f1(p) to provide a positive
approach to constructing proper scoring rules. If f1(p)
and f2(p) are once differentiable such that ∂f1(p)/∂p > 0
for p ∈ (0, 1) and

\[
\frac{\partial f_2(p)}{\partial p} = -\frac{\partial f_1(p)}{\partial p} \left( \frac{p}{1 - p} \right), \tag{6}
\]

then S(y, p) in equation (4) is a proper scoring rule with

\[
\nu(c) = \frac{\partial f_1(c)}{\partial c} \left( \frac{1}{1 - c} \right).
\]

This result was obtained by Shuford et al. (1966), restated
in theorem 4.1 of Schervish (1989). Notice that this relates
f1(p) to f2(p) through their slopes at any p. Hence, one
can construct a proper scoring rule by defining f1(p) and
constructing f2(p) using this restriction. For example, set
f1(p) = p, so ∂f1(p)/∂p = 1 > 0 for all p. Now ν(c) =
(1 − c)^(−1) > 0 for c ∈ (0, 1). Using this ν(c) results in a
proper scoring rule. To obtain f2(p), we solve

\[
f_2(p) = -\int_0^p c\,\nu(c)\,dc = -\int_0^p \frac{c}{1 - c}\,dc.
\]

Integrating, we obtain f2(p) = p + ln(1 − p) and hence

\[
S(y, p) = y p + (1 - y)\,[\,p + \ln(1 - p)\,].
\]

This is a strictly proper scoring rule. It is also worth mentioning
that convex combinations of strictly proper scoring
rules are also strictly proper.³
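
The propriety of the rule constructed from f1(p) = p can be checked numerically by maximizing the expected score over a grid of forecasts. The sketch below is illustrative only: the names `constructed`, `log_rule`, `mixture`, `expected_score`, and `best_forecast` are hypothetical, and the grid resolution and the equal mixture weights are arbitrary assumptions.

```python
import math

# Each rule is a pair (f1, f2) as in equation (4).
constructed = (lambda p: p, lambda p: p + math.log(1.0 - p))
log_rule = (lambda p: math.log(p), lambda p: math.log(1.0 - p))

# Convex combination with hypothetical weights 0.5 and 0.5.
mixture = (
    lambda p: 0.5 * constructed[0](p) + 0.5 * log_rule[0](p),
    lambda p: 0.5 * constructed[1](p) + 0.5 * log_rule[1](p),
)

GRID = [i / 1000 for i in range(1, 1000)]  # candidate forecasts in (0, 1)


def expected_score(rule, p, p0):
    """E[S(y, p)] when y = 1 with true probability p0."""
    f1, f2 = rule
    return p0 * f1(p) + (1.0 - p0) * f2(p)


def best_forecast(rule, p0):
    """Forecast on the grid that maximizes the expected score."""
    return max(GRID, key=lambda p: expected_score(rule, p, p0))


for p0 in (0.2, 0.5, 0.8):
    print(p0, best_forecast(constructed, p0), best_forecast(mixture, p0))
```

In each case the maximizing forecast should coincide with the true probability, for the constructed rule and for its convex combination with the log score.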

For either of these directions, it is an outcome of the pro-
cess that we obtain an understanding of ν(C), the weights
over the individual decision makers. The weighting functions
for various popular scoring rules given in table 1 are pictured

3 To see this, consider the example of a convex combination of the log
score and As1 score, which we use as an example in section IIC. Each of
these scores is maximized in expectation by the true probability. Hence,
any convex combination of the two is also maximized in expectation by
the true probability and thus defines a strictly proper scoring rule itself.
Alternatively, note that a convex combination of the log and As1 scores
again satisfies the relationship in equation (6), and thus inherits strict
propriety.

in figure 1. We see that the log score and Boosting loss cor-
respond to U-shaped weighting functions, each placing very
similar weights over c. The weighting functions of the Brier
and spherical score are flat and bell shaped, respectively. All
of the popular rules are symmetric around c = 0.5, the point
of indifference between false negatives and false positives.
There is no obvious reason why this might be appropriate
in general for situations where distributional forecasts
are to be provided. In our simplified example of forecasting
storms, where we have the restaurant owners (c ≥ 0.5) and
the fishermen (c ≤ 0.5), the forecaster may prefer to weigh
the restaurant owners’ and fishermen’s utility functions, for
example, according to their proportion in the region’s population
or based on economic revenues. Thus, this weighting
may have a social, political, or economic motivation. This
paper is not concerned with the justification of a particular
weighting scheme over another. Our goal is to show that the
choice of the weighting function and thereby the scoring rule

may have consequences on conditional probability forecasts
and individual decision making in practice.

Buja et al. (2005) and Merkle and Steyvers (2013) use
the beta distribution to parameterize the weighting function
ν(c). This leads to a flexible two-parameter family of scoring
rules. A somewhat simpler approach is to directly choose a
given shape for ν(c). This is exemplified by the As1 and As2
scoring rules shown in figure 1. In the first of these rules, we
set ν(c) so that it heavily weights small values of c relative
to large values. This would be a situation where forecasters
that are extremely averse to losses from false negatives are
heavily weighted. The specification of As2 does the reverse
of this, heavily weighting forecasters who are heavily averse
to false positives.⁴ By specifying ν(c) directly according to
a reasonable weighting function, the results presented above
allow us to construct economically meaningful scoring rules
that are strictly proper. This situation, which draws on the
existence of the Schervish (1989) representation for scor-
ing rules, helps to bridge the gap between economic and
statistical forecast evaluation criteria.
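
The symmetry or asymmetry of a weighting function can be inspected by recovering ν(c) from f1 through the relation ν(c) = f1′(c)/(1 − c) given above. The following sketch is a hedged illustration: the dictionary `F1` encodes the f1 functions as we read them from Table 1 (scores oriented so that larger is better), and the helper `weight` with its finite-difference step is a hypothetical choice, not part of the paper.

```python
import math

# f1(p) for each rule, as read from Table 1 (assumed orientation: larger is better).
F1 = {
    "log": lambda p: math.log(p),
    "half_brier": lambda p: -0.5 * (1.0 - p) ** 2,
    "spherical": lambda p: p / math.sqrt(1.0 - 2.0 * p + 2.0 * p * p),
    "boosting": lambda p: -math.sqrt((1.0 - p) / p),
    "as1": lambda p: math.log(p) - p + 1.0,
    "as2": lambda p: p - 1.0,
}


def weight(name, c, eps=1e-6):
    """nu(c) = f1'(c) / (1 - c), with f1' from a central finite difference."""
    f1 = F1[name]
    slope = (f1(c + eps) - f1(c - eps)) / (2.0 * eps)
    return slope / (1.0 - c)


for name in F1:
    low, high = weight(name, 0.2), weight(name, 0.8)
    print(f"{name:10s} nu(0.2) = {low:8.4f}   nu(0.8) = {high:8.4f}")
```

Rules whose two printed weights agree are symmetric around c = 0.5 at these points; As1 puts more weight on the low cutoff and As2 on the high cutoff.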

B. Scoring Rules and Parameter Estimation

For all of the choices of proper scoring rules, one can
consider estimating p[x] using the scoring rule as a loss function.
Consider linear index models: $p[x] = p(x'\theta)$. With data
$\{y_t, x_t\}_{t=1}^{T}$, we can consider estimating the parameters of the
model from the maximization

\[
\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{t=1}^{T} S(y_t, p(x_t'\theta)).
\]

For example, for the log scoring rule and
$p(x_t'\theta) = e^{x_t'\theta}/(1 + e^{x_t'\theta})$,
this would be maximum likelihood of the logit model.
For the same model with the half Brier score as the scoring
rule, this would be nonlinear least squares estimation of
the logit model. Various combinations of scoring rules and
models could be employed to obtain a parameter estimate
$\hat{\theta}$, and from this, an estimate of the conditional probability
that the outcome is 1, $p(x'\hat{\theta})$, for any possible x. Under fairly
general conditions, $\hat{\theta} \xrightarrow{p} \theta^*$, where

\[
\theta^* = \arg\max_{\theta \in \Theta} E[S(y_t, p(x_t'\theta))].
\]
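
The maximization above can be sketched for a one-parameter logit model estimated under two scoring rules. Everything in this example is an illustrative assumption rather than the paper's procedure: the data are simulated from a correctly specified logit with a hypothetical true slope of 1, the maximization is a crude grid search instead of a numerical optimizer, and the names `logistic`, `log_score`, `half_brier`, and `estimate` are made up for the sketch.

```python
import math
import random


def logistic(s):
    """Logit link p(s) = e^s / (1 + e^s)."""
    return 1.0 / (1.0 + math.exp(-s))


def log_score(y, p):
    """Log score: maximizing its sum gives the logit MLE."""
    return math.log(p) if y == 1 else math.log(1.0 - p)


def half_brier(y, p):
    """Half Brier score: maximizing its sum gives nonlinear least squares."""
    return -0.5 * (1.0 - p) ** 2 if y == 1 else -0.5 * p ** 2


def estimate(data, score, grid):
    """Grid-search version of theta_hat = argmax sum_t S(y_t, p(x_t * theta))."""
    return max(grid, key=lambda th: sum(score(y, logistic(x * th)) for x, y in data))


random.seed(0)
THETA0 = 1.0  # hypothetical true slope of the simulated DGP
data = []
for _ in range(1000):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < logistic(THETA0 * x) else 0
    data.append((x, y))

grid = [i / 20 for i in range(-40, 61)]  # theta in [-2, 3], step 0.05
theta_log = estimate(data, log_score, grid)
theta_brier = estimate(data, half_brier, grid)
print("log score estimate:       ", theta_log)
print("half Brier score estimate:", theta_brier)
```

Because the simulated model is correctly specified, both estimates should land near the true slope, differing only through sampling noise and the grid resolution.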

The following theorem provides a set of conditions for
achieving this consistency result. As detailed below, the
theorem can be seen as a special case of M-estimation
(Wooldridge, 1994), adapted to the situation of using strictly
proper scoring rules and binary models.

Theorem 1. Assume:

a. S(.) is a strictly proper scoring rule.
b. $\theta \in \Theta \subset \mathbb{R}^k$, where Θ is compact.
c. For each $y_t \in \{0, 1\}$, $x_t \in \mathcal{X}$, $f_1(p)$, $f_2(p)$ are measurable
and differentiable in p and $p(x_t, \theta)$ is measurable
and differentiable in θ.
d. $E|f_i(p(x_t'\theta))|^{r+\delta} < \Delta < \infty$ for $i = 1, 2$, $\delta > 0$, $r \ge 1$
and $\sup_{x_t \in \mathcal{X}} \sup_{\theta \in \Theta} \left| (1 - p(x_t'\theta))^{-1} f_1'(p(x_t'\theta))\, p'(x_t'\theta) \right| < K < \infty$.
e. $\{y_t, x_t\}$ are strictly stationary mixing processes with uniform
mixing size $-r/(2r - 1)$ with $r \ge 1$ or strong mixing
of size $-r/(r - 1)$, $r > 1$.
f. $E\|x_t\|^{r+\lambda} < \Delta_1 < \infty$ for $r \ge 1$ and $\lambda > 0$.

Then $\hat{\theta} \xrightarrow{p} \theta^*$, where $\theta^* = \arg\max_{\theta \in \Theta} E[S(y_t, p(x_t'\theta))]$.

4 Gneiting and Raftery (2007, example 5) show that the As2 rule is also a
member of the Buja et al. (2005) family of scoring rules.

Proof. The proof follows from using these conditions
to show that the conditions of theorems 4.2 and 4.3
of Wooldridge (1994) hold. First, via theorem 4.2, the
conditions are sufficient that
$\max_{\theta \in \Theta} \left| T^{-1} \sum_{t=1}^{T} S(y_t, p(x_t'\theta)) - E[S(y_t, p(x_t'\theta))] \right| \xrightarrow{p} 0$,
so the objective function converges
to its expected value uniformly in θ. Conditions b and c yield
the theorem’s conditions i and ii. For theorem 4.2 part iv,
notice that for all $\theta_1, \theta_2 \in \Theta$, the mean value theorem and
equation (6) yield that

\[
\begin{aligned}
\left| S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2)) \right|
&= \left| f_1'(p(x_t'\tilde{\theta}))\, p'(x_t'\tilde{\theta}) \left( \frac{y_t - p(x_t'\tilde{\theta})}{1 - p(x_t'\tilde{\theta})} \right) x_t'(\theta_1 - \theta_2) \right| \\
&\le \left\| f_1'(p(x_t'\tilde{\theta}))\, p'(x_t'\tilde{\theta}) \left( \frac{y_t - p(x_t'\tilde{\theta})}{1 - p(x_t'\tilde{\theta})} \right) x_t \right\| \, \|\theta_1 - \theta_2\| \\
&= \left| f_1'(p(x_t'\tilde{\theta}))\, p'(x_t'\tilde{\theta}) \left( \frac{y_t - p(x_t'\tilde{\theta})}{1 - p(x_t'\tilde{\theta})} \right) \right| \, \|x_t\| \, \|\theta_1 - \theta_2\| \\
&\le \left| f_1'(p(x_t'\tilde{\theta}))\, p'(x_t'\tilde{\theta}) \left( \frac{1}{1 - p(x_t'\tilde{\theta})} \right) \right| \sqrt{x_t'x_t}\; \|\theta_1 - \theta_2\| \\
&\le K \sqrt{x_t'x_t}\; \|\theta_1 - \theta_2\|,
\end{aligned}
\]

where $\tilde{\theta}$ is an intermediate value for θ. Parts iii and ivb
require that $S(y_t, p(x_t'\theta))$ and $\sqrt{x_t'x_t}$ satisfy a WLLN pointwise
in θ. These results follow from assumptions d through
f, which are sufficient for a WLLN via corollary 3.48 of
White (2001). Notice that $E[S(y_t, p(x_t'\theta))^q]$ for any integer
q is equal to $E[y_t f_1(p(x_t'\theta))^q + (1 - y_t) f_2(p(x_t'\theta))^q]$, which
is finite given d for $q \le r + \delta$. A WLLN for $\sqrt{x_t'x_t}$ follows
directly from assumptions e and f. Consistency of the
estimate $\hat{\theta}$ follows from the conditions being sufficient for
theorem 4.3 of Wooldridge (1994). Conditions M1 and M2
follow directly from assumptions b and c, along with the
results presented for uniform convergence of the average of
the objective function. Condition M3 follows directly from
assumption a.

Assumption a ensures that the scoring rule is strictly
proper and thus admits the decomposition in equation (4).
Moreover, it ensures that the objective function attains a
unique maximum at θ∗. If S(.) is relaxed to be a proper
scoring rule that is not necessarily strictly proper, then
a result similar to theorem 1 still holds. In this case,
$E[S(y_t, p(x_t'\hat{\theta}))] \xrightarrow{p} \max_{\theta \in \Theta} E[S(y_t, p(x_t'\theta))]$, which is sufficient
to justify the procedure. Assumption b is the standard
requirement to ensure a maximum. Assumption c is a standard
regularity condition. The conditions in d relate to the
scoring rule and model being employed, and are functions
of both of these choices. The first part of d ensures that
expected loss exists. The second part is employed as part
of the requirements for uniform consistency of the objective
function. This assumption seems strong; however, it
holds widely since these objects are functions of $p(x_t'\theta)$,
which is bounded between 0 and 1 for all $x_t$ and θ. For
notational brevity, let $s_t = x_t'\theta$. For example, consider the
half Brier score with a logit model for the conditional
probability that $y_t = 1$. Then $f_1'(p(s_t)) = 1 - p(s_t)$ and
$p'(s_t) = p(s_t)(1 - p(s_t))$, so
$\left| (1 - p(s_t))^{-1} f_1'(p(s_t))\, p'(s_t) \right| = |p(s_t)(1 - p(s_t))| \le 0.5$;
hence, the second part of assumption d is satisfied. In this case,
$E|f_1(p(s_t))|^{r+\delta} = E|0.5(1 - p(s_t))^2|^{r+\delta} \le 0.5^{r+\delta}$, and so is finite for all r, δ
finite. For the log score with a logit model for the conditional
probability, we have that $f_1'(p(s_t))\, p'(s_t) = (1 - p(s_t))$
and so $\left| (1 - p(s_t))^{-1} f_1'(p(s_t))\, p'(s_t) \right| = 1$. For the
spherical scoring rule,
$f_1'(p(s_t)) = (1 - p(s_t)) / ((p(s_t))^2 + (1 - p(s_t))^2)^{3/2}$. Hence,
$\left| (1 - p(s_t))^{-1} f_1'(p(s_t))\, p'(s_t) \right| = \left| p(s_t)(1 - p(s_t)) / ((p(s_t))^2 + (1 - p(s_t))^2)^{3/2} \right| \le 2^{-1/2}$,
and so this is also bounded.
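
The bounds discussed for the second part of assumption d can be checked numerically for a logit link. The sketch below is an illustration under stated assumptions: the index grid on [−10, 10] is an arbitrary choice, the derivative formulas in `F1_PRIME` are those given in the text for the half Brier, log, and spherical rules, and the helper names `logistic` and `bound_term` are hypothetical.

```python
import math


def logistic(s):
    """Logit link p(s) = e^s / (1 + e^s), with p'(s) = p(s)(1 - p(s))."""
    return 1.0 / (1.0 + math.exp(-s))


# f1'(p) for three scoring rules, as stated in the text.
F1_PRIME = {
    "half_brier": lambda p: 1.0 - p,
    "log": lambda p: 1.0 / p,
    "spherical": lambda p: (1.0 - p) / (p * p + (1.0 - p) ** 2) ** 1.5,
}


def bound_term(f1_prime, s):
    """|(1 - p(s))^{-1} f1'(p(s)) p'(s)| for a logit model."""
    p = logistic(s)
    p_prime = p * (1.0 - p)
    return abs(f1_prime(p) * p_prime / (1.0 - p))


index_grid = [i / 100 for i in range(-1000, 1001)]  # s in [-10, 10]
sup = {
    name: max(bound_term(fp, s) for s in index_grid)
    for name, fp in F1_PRIME.items()
}
for name, value in sup.items():
    print(f"{name:10s} sup over grid = {value:.6f}")
```

For the half Brier rule the term reduces to p(1 − p), which peaks at 0.25 and therefore sits inside the bound of 0.5 quoted in the text; the log and spherical cases should reproduce the values 1 and 2^(−1/2).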

The mixing assumptions in e impose a limit on the degree
of time series dependence in the data. The requirement of
strict stationarity gives meaning to the idea that we obtain
the true conditional probability, at least asymptotically, when
the model is correctly specified. If the data are not strictly
stationary, then we can still obtain consistency results; however,
the interpretation of θ∗ changes to being a limiting value
that minimizes the average expected losses over time. It is
worth noting that some strictly proper scoring rules, such as
the half Brier and spherical scores considered in this paper,
are bounded. For bounded scoring rules, the assumptions of
theorem 1 can be relaxed. First, the technical requirements
of assumption d either become obsolete or are trivially satisfied.
Second, assumption f, which ensures a WLLN for
$\sqrt{x_t'x_t}$, is not required.⁵
Conceptually, the main purpose of theorem 1 is to illus-
trate that the structure of scoring rules makes them well
suited for designing (consistent) parameter estimators in
the tradition of M-estimators (e.g., Hayashi, 2000; see also
Gneiting & Raftery, 2007). There are many possible sets
of assumptions (e.g., various ways of restricting time series
dependence) that could lead to the statement of theorem 1.
Our chosen set of assumptions aims to strike a balance between generality and clarity of presentation, although results under alternative trade-offs between conditions are possible. Furthermore, results under more primitive conditions are available in more specialized settings. For example, de Jong and Woutersen (2011) analyze consistency in the important special case of a correctly specified probit model with lagged dependent variables, estimated via maximum likelihood. See their theorems 1 and 2 for low-level conditions that guarantee limited dependence properties of the data-generating process and their theorem 3 on consistency.

5 To see this, note that if $\sup_{\theta \in \Theta} |S(y_t, p(x_t'\theta))| < C$, it follows that $|S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2))| < 2C$. This means that the stochastic upper bound in the proof of theorem 1 can be replaced by a constant. Hence, a WLLN for the upper bound is trivially satisfied, and assumption f becomes obsolete. Furthermore, the second part of assumption d, which is used in the general case to bound the term $|S(y_t, p(x_t'\theta_1)) - S(y_t, p(x_t'\theta_2))|$, is no longer required. Finally, the first part of assumption d is automatically satisfied since the $f_i(p(x_t'\theta))$, $i = 1, 2$, are bounded.

In terms of understanding the results for forecasting binary outcomes, two results follow directly. First, if the model is correctly specified, that is, $p_0[x_t] = p(x_t'\theta_0)$, then for all strictly proper scoring rules, $\theta^* = \theta_0$. Hence, all strictly proper scoring rules will give the same true conditional probability asymptotically. This follows directly from the fact that strictly proper scoring rules are uniquely maximized by the true conditional probability.
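The defining property used here, that a strictly proper scoring rule has its expected value uniquely maximized at the true probability, can be illustrated with a small grid search. This is only a sketch: the $y = 1$ branches follow the expressions discussed for assumption d, while the $y = 0$ branches are taken to be their mirror images $f_0(p) = f_1(1 - p)$, which is an assumption here because table 1 is not reproduced in this section. The function names and the grid are hypothetical.

```python
import math

# (f1, f0): score when y = 1 and when y = 0, oriented so that larger is better.
# The y = 0 branches are assumed to mirror the y = 1 branches.
SCORES = {
    "log": (lambda p: math.log(p), lambda p: math.log(1.0 - p)),
    "half_brier": (lambda p: -0.5 * (1.0 - p) ** 2, lambda p: -0.5 * p ** 2),
    "spherical": (
        lambda p: p / math.sqrt(p ** 2 + (1.0 - p) ** 2),
        lambda p: (1.0 - p) / math.sqrt(p ** 2 + (1.0 - p) ** 2),
    ),
}

def expected_score(name, p, p0):
    """Expected score of forecast p when the true probability is p0."""
    f1, f0 = SCORES[name]
    return p0 * f1(p) + (1.0 - p0) * f0(p)

def best_forecast(name, p0, n=999):
    """Grid maximizer of the expected score over p in {0.001, ..., 0.999}."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda p: expected_score(name, p, p0))

for p0 in (0.1, 0.3, 0.75):
    print(p0, {name: best_forecast(name, p0) for name in SCORES})
```

For each true probability, all three rules select the grid point at the truth, which is the pointwise counterpart of $\theta^* = \theta_0$ under correct specification.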
If the model is correctly specified, the choice of the best scoring rule to use therefore depends not on the reasonableness of $\theta^*$ but on the efficiency of the estimator $\hat\theta$ obtained by maximizing a particular scoring rule. The popularity of the MLE derives from the fact that it is an efficient parameter estimator under correct specification. However, it should be understood that the latter strong assumption is crucial in establishing the optimality of the MLE.

When the model is not correctly specified, there is no reason that $\theta^*$ should be the same over different scoring rules. In practice, they will differ, and then so will the estimated conditional probability, even asymptotically. Hence, decisions made for any particular loss function for the decision maker (value for c) will also differ across scoring rules. Scoring rules placing more weight on high values of c will provide probability forecasts that are most useful for decision makers with high values of c, and vice versa. In order to attain (asymptotic) optimality, the scoring rule chosen for estimating the parameters of the model should match the scoring rule used to evaluate the probability forecast. Under the conditions above, the magnitude of this effect depends on how $\theta^*$ varies with the choice of scoring rule. Since both scoring rules and models tend to be very nonlinear, this relationship will generally be complex. The analytical example in section IIC and the Monte Carlo results in section III provide evidence on this issue.

In choosing between scoring rules, a forecaster needs to trade off the loss from using a scoring rule other than the log score under correct specification with the gains this approach brings when the model is misspecified. The first consideration then is how plausible it would be to assume that the model is correctly specified.
In most applications, especially those unmotivated by any underlying economic or scientific theory, this would be a difficult assumption to make. Nonetheless, it would generally be considered, and the answer is specific to the forecasting problem. The second consideration is how large the gains are from using the matching strategy under misspecification of the model.

C. Misspecification: An Analytical Example

In order to illustrate how the choice of scoring rule matters, we give an analytical example in which we examine the effect on $\theta^*$ of trading off between two specific scoring rules. The scoring rules we consider, the log and As1 scores, are given in table 1. The former is the log likelihood; the latter is an asymmetric scoring rule that emphasizes a better fit for smaller probabilities versus larger probabilities. We can write a composite scoring rule indexed by $\lambda \in [0, 1]$ that nests both of them:

$$S_\lambda(y, p(x'\theta)) = y \ln(p(x'\theta)) + y\lambda(1 - p(x'\theta)) + (1 - y)(1 - \lambda)\ln(1 - p(x'\theta)) + (1 - y)(-\lambda)\,p(x'\theta). \tag{7}$$

For $\lambda = 0$, $S_0(y, p(x'\theta))$ gives the log score. For $\lambda = 1$, $S_1(y, p(x'\theta))$ is the As1 score. For all $\lambda \in [0, 1]$, $S_\lambda(y, p(x'\theta))$ is a proper scoring rule, since propriety carries over to convex combinations of two proper scoring rules. Let $\theta^*$ denote the maximizer of the objective function defined by the scoring rule, with

$$\theta^* = \arg\max_{\theta \in \Theta} E_{X,Y}\left[S_\lambda(Y, p(X'\theta))\right] = \arg\max_{\theta \in \Theta} E_X\left[E_{Y|X}\left[S_\lambda(Y, p(X'\theta))\right]\right].$$

We can write the conditional expectation in the above objective function as follows:

$$E_{Y|X}\left[S_\lambda(Y, p(X'\theta))\right] = p_0[X]\ln(p(X'\theta)) + \lambda p_0[X] + (1 - p_0[X])(1 - \lambda)\ln(1 - p(X'\theta)) - \lambda p(X'\theta).$$

For simplicity, we assume in the following that X is scalar. As demonstrated in section A of the online appendix, extending the example to include an intercept is possible but appears to complicate the analysis without a compensating gain in insight.

The probit and logit link functions $p(\cdot)$ are by far the most common choices in the literature.6 Koenker and Yoon (2009) survey several other choices. Here, we do not make specific assumptions about p, except that it does not depend on estimands other than $\theta$. Assuming sufficient regularity conditions to apply the dominated convergence theorem (DCT), the first-order condition for a maximum is

$$\int_{\mathcal{X}} \left.\frac{\partial E_{Y|X}\left[S_\lambda(Y, p(X'\theta))\right]}{\partial\theta}\right|_{X=x} f_X(x)\,dx = 0,$$

where $f_X(\cdot)$ is the unconditional probability density function (pdf) of X. Computing the expression explicitly and subsuming x, we obtain

$$\int_{\mathcal{X}} \left\{ p_0\left(\frac{1}{p} - \lambda\right) - (1 - p_0)\left(\frac{1 - \lambda}{1 - p} + \lambda\right) \right\} p'\,x\,f_X(x)\,dx = 0, \tag{8}$$

where $p' \equiv \partial p(z)/\partial z\,|_{z = x\theta^*}$. The first-order condition in equation (8) gives a highly nonlinear implicit characterization of the limiting parameter estimate $\theta^*$. However, our setting (involving a single parameter $\lambda$ to characterize the employed scoring rule) allows us to nevertheless analyze how $\theta^*$ varies across scoring rules. Again assuming applicability of the DCT and using implicit differentiation, we obtain

$$\frac{\partial\theta^*}{\partial\lambda} = \frac{\displaystyle\int_{\mathcal{X}} \frac{(p_0 - p)\,p\,p'}{p(1 - p)}\,x\,f_X(x)\,dx}{\displaystyle\int_{\mathcal{X}} \left[\frac{(p_0 - p)(1 - \lambda p)\,p''}{p(1 - p)} - \frac{\left(p_0(1 - p)^2 + (1 - p_0)(1 - \lambda)p^2\right)p'^2}{p^2(1 - p)^2}\right] x^2 f_X(x)\,dx}. \tag{9}$$

Equation (9) makes explicit how a change in the scoring rule, which is expressed here by differentiating with respect to $\lambda$, affects the probability limit $\theta^*$ of the parameter estimator. The denominator of the above expression is always negative because it is the second derivative of the objective function, $E_{X,Y}\left[S_\lambda(Y, p(X'\theta))\right]$, evaluated at $\theta^*$. However, the sign of the numerator may change depending on the truth, the model, and the density of X.

Under correct specification ($p_0[X] = p(X'\theta_0)$), the numerator in equation (9) becomes 0, and we get $\partial\theta^*/\partial\lambda = 0$. This result holds irrespective of the link function p and the distribution $f_X(x)$. It mirrors the fact that the composite scoring rule in equation (7) is strictly proper for any $\lambda \in [0, 1]$. This implies that $\theta^* = \theta_0$ regardless of the value of $\lambda$. Hence, the choice of scoring rule is irrelevant under correct specification, at least in terms of the probability limit of the parameter estimator.

Under misspecification, it still holds that the denominator of equation (9) is always negative. Hence, the numerator will determine the sign of $\partial\theta^*/\partial\lambda$. We first examine when this derivative equals 0 (i.e., the choice of scoring rule does not matter) in the following theorem.

6 In the probit case, $p(\cdot)$ is the cumulative distribution function (cdf) of a standard normal variable. For the logit, $p(z) = [1 + \exp(-z)]^{-1}$ is the cdf of a standard logistic variable.

Theorem 2. Assume
a. $p_0[x] = 1 - p_0[-x]$.
b. $p[x] = 1 - p[-x]$, p is differentiable in x.
c. $f_X(x) = f_X(-x)$, $f_X \ge 0$, $\dim(X) = \dim(\theta) = 1$, $E[X] = 0$.
In addition, assume conditions a–f in theorem 1 hold, and define $\mathcal{X}_+ = \mathcal{X} \cap [0, \infty)$ and $\mathcal{X}_- = \mathcal{X} \cap (-\infty, 0]$. Then $\partial\theta^*/\partial\lambda = 0$ if and only if

$$\int_{\mathcal{X}_+} (p_0[x] - p[x])_+\,\frac{p'[x]\,x}{p[x](1 - p[x])}\,f_X(x)\,dx = \int_{\mathcal{X}_+} (p_0[x] - p[x])_-\,\frac{p'[x]\,x}{p[x](1 - p[x])}\,f_X(x)\,dx. \tag{10}$$

By symmetry, equation (10) also holds on $\mathcal{X}_-$.

Proof. Relating to the numerator of equation (9), we define

$$g[x] = \frac{(p_0[x] - p[x])\,p[x]\,p'[x]}{p[x](1 - p[x])}\,x.$$

Assumptions a and b from the statement of the theorem imply that

$$p_0[x] - p[x] = -(p_0[-x] - p[-x]), \qquad p[x](1 - p[x]) = p[-x](1 - p[-x]), \qquad p'[x] = p'[-x],$$

assuming that p is differentiable in x. Note that the first equality means that the approximation error due to the chosen model is point symmetric about the origin. The latter equalities depend on the approximation model only and hold for commonly used specifications such as probit and logit. These relationships, together with calculations detailed in section A of the online appendix, imply that

$$\int_{\mathcal{X}} g[x]\,f_X(x)\,dx = \int_{\mathcal{X}_+} (p_0[x] - p[x])\,\frac{p'[x]\,x}{p[x](1 - p[x])}\,f_X(x)\,dx.$$

Note that all quantities on the right-hand side of the last equality are nonnegative, except for $p_0[x] - p[x]$. Thus, the integral equals 0 if and only if

$$\int_{\mathcal{X}_+} (p_0[x] - p[x])_+\,\frac{p'[x]\,x}{p[x](1 - p[x])}\,f_X(x)\,dx = \int_{\mathcal{X}_+} (p_0[x] - p[x])_-\,\frac{p'[x]\,x}{p[x](1 - p[x])}\,f_X(x)\,dx,$$

where $(h[x])_+ = |h[x]|\,1\{\mathrm{sign}(h[x]) = +1\}$ and $(h[x])_- = |h[x]|\,1\{\mathrm{sign}(h[x]) = -1\}$. By symmetry, the same holds for the integral of the same term over $\mathcal{X}_-$.

In theorem 2, assumptions a and b are point-symmetry conditions on $p_0$ and p, respectively, and c ensures that X is symmetric about the origin. Together, assumptions a and b imply the point symmetry of $p_0 - p$ about the origin. Theorem 2 gives precise conditions under which the choice of scoring rule, and thus the weighting over decision makers, has no impact on the limiting parameter estimate $\theta^*$.
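The algebra behind equations (8) and (9) can be sanity-checked numerically. The sketch below treats the forecast p as a free variable and compares three expressions for the derivative of the conditional expected composite score with respect to p: the bracketed term of equation (8), a factored form $(p_0 - p)(1 - \lambda p)/(p(1 - p))$ (the same factor that multiplies $p''$ in the denominator of equation (9)), and a central finite difference. The function names, the step size, and the test points are illustrative assumptions.

```python
import math

def cond_expected_score(p, p0, lam):
    """E_{Y|X}[S_lambda] as a function of the forecast p, given the truth p0."""
    return (p0 * math.log(p) + lam * p0
            + (1.0 - p0) * (1.0 - lam) * math.log(1.0 - p) - lam * p)

def bracket_eq8(p, p0, lam):
    """Bracketed term of equation (8): derivative with respect to p."""
    return p0 * (1.0 / p - lam) - (1.0 - p0) * ((1.0 - lam) / (1.0 - p) + lam)

def factored(p, p0, lam):
    """Factored form of the same derivative."""
    return (p0 - p) * (1.0 - lam * p) / (p * (1.0 - p))

def num_derivative(p, p0, lam, h=1e-6):
    """Central finite-difference derivative of the conditional expected score."""
    return (cond_expected_score(p + h, p0, lam)
            - cond_expected_score(p - h, p0, lam)) / (2.0 * h)

# Largest discrepancies over a few illustrative points.
points = [(p, p0, lam) for p in (0.2, 0.5, 0.8)
          for p0 in (0.1, 0.6) for lam in (0.0, 0.5, 1.0)]
gap_algebra = max(abs(bracket_eq8(*pt) - factored(*pt)) for pt in points)
gap_numeric = max(abs(num_derivative(*pt) - factored(*pt)) for pt in points)
print(gap_algebra, gap_numeric)
```

Because $1 - \lambda p > 0$ for $p < 1$, the factored form changes sign only at $p = p_0$, which is one way to see why the composite rule is strictly proper for every $\lambda \in [0, 1]$. The first-order condition (8) multiplies this term by $p'x$ and integrates over x.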
The intuition behind theorem 2 is that on each part of the support of X, the model both overestimates and underestimates the true conditional probability. In addition, the upward and downward prediction errors, averaged as in equation (10), are equal on $\mathcal{X}_+$ and $\mathcal{X}_-$, respectively. This excludes a situation where the model tends to overestimate the true conditional probability on $\mathcal{X}_+$ and underestimate it on $\mathcal{X}_-$, or vice versa. In practice, the symmetry requirements imposed on the model (assumption b in theorem 2) hold for the widely used logit and probit specifications. The symmetry requirements imposed on the true conditional probability, as well as the density of X (assumptions a and c in theorem 2), may be unrealistic for many applications. We next provide Monte Carlo evidence on the role of scoring rules under various scenarios that either do or do not satisfy the conditions of theorem 2.

III. Numerical Demonstration of the Results

This section illustrates the results in theorem 2 for both the asymptotic problem and finite samples. We consider four data-generating processes (DGPs) given in table 2, as well as the scoring rules in table 1. In DGP #1, the symmetry conditions on the conditional probability under the true and misspecified models as well as on the marginal distribution of X, assumptions a to c in theorem 2, are fulfilled. DGPs #2 and #3 are variants of DGP #1 where the symmetry condition on the true conditional probability (a) and on the distribution of X (c), respectively, are violated. DGP #4 presents an example where both conditions a and c are violated.7

A. Asymptotic Problem

The necessary and sufficient conditions in equation (10) of theorem 2 give us an insight into how certain symmetry conditions jointly determine whether the choice over scoring rules has an effect on the estimated parameter and, thereby, the conditional probability. In this section, we seek to illustrate this insight for the asymptotic problem.
To do so, for each scoring rule j, we compute $\theta^*_j$ for the misspecified logit model given in table 2 numerically, based on a sample of size 1,000,000. The parameters are reported in table 3. Since all other elements of the forecast are identical, the difference between the conditional probability under various scoring rules is due to the difference in $\theta^*_j$. Figure 2 plots the conditional probability for all scoring rules under DGPs #1 to #4.8 The plot for DGP #1 gives a case where the choice of the scoring rule has no effect on the conditional probability. This is true not only for the choice between the log and As1 scores, as shown in theorem 2, but also holds for all other scoring rules we examine. The plot clearly shows other implications of theorem 2: the prediction error is point-symmetric about the origin, the prediction error changes its sign on $\mathcal{X}_+$ and $\mathcal{X}_-$, and a weighted average of the positive and negative prediction error on $\mathcal{X}_+$ as well as $\mathcal{X}_-$ would be equal as indicated by equation (10). For all the other DGPs, where either conditions a or c or both are violated, our numerical results clearly show that the choice of scoring rule has an effect on the conditional probability approximation.

7 In fact, the true conditional probability $p_0(X)$ is not even defined for X < 0.

8 For better display, figures 2, 3, and 5 omit the results for the Brier and Boosting scores. The results for Brier are similar to the ones for the spherical score, and the results for Boosting are similar to the ones for the log score. The online appendix contains color versions of the figures, covering all six scoring rules.

Table 2.—DGPs Used in Simulation

DGP #    f_X             p_0(X)                Correctly Specified Model (p_0)    Misspecified Model (p)
1        U(−2.5, 2.5)    F(−0.5X + 0.2X^3)     F(θ_1 X + θ_2 X^3)                 F(θX)
2^b      U(−2.5, 2.5)    F(−0.5X + 0.2X^2)     F(θ_1 X + θ_2 X^3)                 F(θX)
3^a      U(−1, 4)        F(−0.5X + 0.2X^3)     F(θ_1 X + θ_2 X^3)                 F(θX)
4^a,b    U(0, 10)        F(√X)                 F(θ√X)                             F(θX)

f_X denotes the pdf of X, F(s) = exp(s)/(1 + exp(s)) denotes the cdf of the logistic distribution, and U(a, b) denotes the uniform distribution with limits a and b. DGP #1 is taken from Elliott and Lieli (2013) and fulfills conditions a–c of theorem 2.
a Indicates that a DGP violates condition c of theorem 2.
b Indicates that a DGP violates condition a of theorem 2.

Table 3.—Asymptotic Parameter Estimates for Various Scoring Rules j, Based on 1,000,000 Observations

θ*_j         DGP #1    DGP #2    DGP #3    DGP #4
Log          0.22      −0.44     0.6       0.43
Brier        0.22      −0.44     0.51      0.57
Spherical    0.21      −0.44     0.45      0.65
Boosting     0.22      −0.43     0.68      0.39
As1          0.22      −0.33     0.46      0.61
As2          0.22      −0.64     0.66      0.41

Figure 2.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the predicted conditional probability, whereby the parameter $\theta^*_j$ is computed using a sample of 1,000,000 draws.

Figure 3.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the classification curves of the conditional probability given x = 2 ($\theta^*_j$ is computed using a sample of size 1,000,000).

For DGP #2, where only the symmetry condition on the true conditional probability is violated, the only scoring rules that result in different predicted conditional probabilities than the log score are the asymmetric scoring rules. For DGPs #3 and #4, we observe differences in the predicted probabilities for all pairs of scoring rules, even if both scoring rules under comparison are symmetric (such as the log versus Brier score).

As discussed earlier, the binary action of an individual decision maker (such as a fisherman or restaurant owner in our example) is determined by whether the predicted probability, $F(\theta^*_j X)$, exceeds the threshold c. For a given value of the regressor X, the chosen action may thus depend on the scoring rule j used for parameter estimation. Figure 3 illustrates this point for x = 2. It shows that the scoring rules generally yield different classification curves.

Rather than looking at a single design point (such as x = 2 above), we next consider a broader summary measure of differences between scoring rules,

$$P\left[\mathrm{sign}\left\{F(\theta^*_j X) - c\right\} \ne \mathrm{sign}\left\{F(\theta^*_{\text{log score}} X) - c\right\}\right], \tag{11}$$

which is the probability (computed over the distribution of X) that scoring rule j implies a different binary action than the log score. Figure 4 shows how the choice of scoring rules under DGP #1 is inconsequential, in the sense of almost always leading to identical decisions. For DGP #2, the choice between the log and the two asymmetric rules is the only one that leads to different classifications. For c = 0.6 and c = 0.7, the probability of different classifications is 0.05 and 0.12, respectively. For DGPs #3 and #4, the choice between the log and any other scoring rule leads to different classifications. What is particularly interesting here is that even though choosing between the log and other symmetrical rules, such as Brier, may be relatively inconsequential for c = 0.6, it can lead to probabilities of different classifications of 0.15 and 0.175 at greater values of c for DGPs #3 and #4, respectively. Thus, whether the differences across scoring rules matter depends on a decision maker’s preferences as embodied in c.

Figure 4.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the unconditional probability of different classifications using the log score as opposed to other scoring rules; see equation (11). $\theta^*_j$ is computed using a sample of size 1,000,000.

As a further check, figure 5 provides evidence on the probability that the estimator defined by scoring rule j delivers a correct classification. This probability is given by

$$P\left[\mathrm{sign}\left\{F(\theta^*_j X) - c\right\} = \mathrm{sign}\left\{p_0(X) - c\right\}\right]. \tag{12}$$

Put slightly differently, this number represents the probability that a decision maker with preference parameter c makes a correct decision when using a prediction model fitted via scoring rule j. Figure 5 shows how the ranking of the scoring rules differs across thresholds c. For example, consider the comparison of As1 and As2 in the figure for DGP #2 (upper-right panel): While As1 performs better for values of c between 0.2 and 0.4, the reverse is true for c lying between 0.6 and 0.8. This result is closely in line with the fact that, when used as an estimation criterion, As1 places an emphasis on fitting small thresholds c correctly, whereas As2 focuses on high values of c (see section IIA).
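The asymptotic computation behind table 3 can be sketched for the two rules nested by the composite score in equation (7). The code below is an illustration rather than the authors' procedure: it replaces the sample of 1,000,000 draws with a midpoint-rule integral over the uniform support, and it maximizes the expected composite score by a plain grid search over $\theta$. The grid limits, the step size, and all names are assumptions.

```python
import math

def F(s):
    """Logistic cdf, written to avoid overflow for large |s|."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def p0_cubic(x):
    """True conditional probability of DGPs #1 and #3 in table 2."""
    return F(-0.5 * x + 0.2 * x ** 3)

def expected_score(theta, lam, lo, hi, n=200):
    """Midpoint-rule approximation of E[S_lambda] for X ~ U(lo, hi)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p0, p = p0_cubic(x), F(theta * x)
        total += (p0 * math.log(p) + lam * p0
                  + (1.0 - p0) * (1.0 - lam) * math.log(1.0 - p) - lam * p)
    return total / n

def theta_star(lam, lo, hi):
    """Grid maximizer over theta in [-1.5, 1.5] with step 0.005 (assumed range)."""
    grid = [k / 200.0 for k in range(-300, 301)]
    return max(grid, key=lambda th: expected_score(th, lam, lo, hi))

SUPPORTS = {"DGP1": (-2.5, 2.5), "DGP3": (-1.0, 4.0)}
results = {
    (name, lam): theta_star(lam, lo, hi)
    for name, (lo, hi) in SUPPORTS.items()
    for lam in (0.0, 1.0)  # 0.0 = log score, 1.0 = As1 score
}
for key in sorted(results):
    print(key, results[key])
```

Up to grid and quadrature error, this should reproduce the pattern in table 3: the log and As1 limits coincide under the symmetric DGP #1, whereas under DGP #3 the As1 limit is visibly smaller than the log-score limit.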
The analysis of figure 5 demonstrates how forecasters who are willing to favor a certain clientele (say, decision makers characterized by small values of c, such as the fishermen in our example) can achieve this goal by issuing predictions based on an appropriate scoring rule (in this case, As1).

Figure 5.—Asymptotic Problem under Misspecification for DGPs #1 (Upper Left), #2 (Upper Right), #3 (Lower Left), and #4 (Lower Right)
The plots show the probability of correct classification (see equation [12]) for various thresholds c. $\theta^*_j$ is computed using a sample of size 1,000,000.

B. Finite-Sample Results

All of our results until now are for the case in which the limiting parameter values $\theta^*_j$ are known. We now briefly turn to the effects of sampling uncertainty. Specifically, we consider a rolling window estimation scheme for $\theta^*_j$, which is popular in practice (see the discussion by Giacomini & White, 2006, p. 1548), using a window length of 120. Furthermore, we consider a forecast evaluation period of 100 periods.9 In each Monte Carlo iteration, we thus simulate 120 + 100 observations. For the first rolling window, we use observations 1 to 120 to estimate the parameter $\theta$ and make a forecast for observation 121. For the second rolling window, we use observations 2 to 121 for estimation and make a forecast for observation 122, and so forth.

9 These sample sizes are typical in forecasting studies using quarterly macroeconomic data, for example, when using an estimation sample from 1960 to 1989 (30 × 4 = 120 observations) and an evaluation sample from 1990 to 2014 (25 × 4 = 100 observations).
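The rolling window scheme just described can be written down as an index schedule. This is a minimal sketch with a hypothetical helper name; it only enumerates which observations enter each estimation window and which observation is forecast, using the 1-based numbering of the text.

```python
def rolling_windows(window=120, n_forecasts=100):
    """Return (first_obs, last_obs, target_obs) triples, numbered from 1."""
    return [(k, k + window - 1, k + window) for k in range(1, n_forecasts + 1)]

schedule = rolling_windows()
print(schedule[0])    # first window and its forecast target
print(schedule[1])    # second window and its forecast target
print(schedule[-1])   # last window and its forecast target
print(len(schedule))  # number of out-of-sample forecasts
```

The last window ends at observation 219 and targets observation 220, so each Monte Carlo iteration uses exactly the 120 + 100 simulated observations.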
The online appendix reports variants of figures 2 and 3 for the rolling window case, which we construct by averaging the probability and classification curves for each estimate of $\theta$. These figures show that on average, the rolling window parameter estimates are very similar to their asymptotic counterparts.

Theorem 1 implies that in the asymptotic case, it is generally optimal to use the same scoring rule for estimation and evaluation. We next analyze to what extent this statement carries over to the rolling window scenario. To this end, table 4 summarizes the parameter estimates and predictive performance obtained under each scoring rule. The median estimates for each scoring rule (upper panel of table 4) are very close to their asymptotic limits in table 3. This is in line with the similarity of the prediction and classification curves noted. Estimators defined by alternative scoring rules clearly differ in their sampling variability, as measured by interdecile ranges (middle panel of table 4), especially for DGPs #3 and #4.10 That said, there is no simple relationship between the choice of scoring rule and the variability of the estimator it defines. For example, the spherical score defines the most precise estimator for DGP #1, whereas it defines the (by far) least precise estimator for DGP #4.

The bottom panel of table 4 compares the forecast performance of two strategies: (a) using the same scoring rule for estimation and evaluation (“matching”) and (b) simply using MLE for parameter estimation while using a different scoring rule for evaluation. In order to compare the performance of the two, we simply report the share of Monte Carlo iterations for which the first strategy performs better, whereby we average over an out-of-sample period of 100 observations in each Monte Carlo iteration. This share has a natural scaling between 0 and 1. It is therefore easily interpretable and comparable across scoring rules.11

10 We use interdecile ranges, rather than variances, to eliminate the effect of outliers, which are not surprising given the scale of our Monte Carlo experiment (for each scoring rule and DGP, we compute 100 times 10,000 rolling window estimates). Measuring the estimators’ variability by the median absolute deviation from the median (instead of the interdecile range) leads to the same qualitative interpretations.

11 By contrast, the scoring rules themselves have no natural scaling, which makes it hard to judge whether differences of a given magnitude are practically relevant, and also impedes comparisons of effect sizes across scoring rules.

Table 4.—Summary Results for Rolling Windows (Window Length 120)

                 DGP #1    DGP #2    DGP #3    DGP #4
Median estimate θ̂
Log              0.22      −0.44     0.6       0.43
Brier            0.21      −0.44     0.51      0.58
Spherical        0.21      −0.45     0.46      0.67
Boosting         0.22      −0.44     0.68      0.4
As1              0.22      −0.34     0.46      0.62
As2              0.22      −0.65     0.66      0.42
Interdecile range of estimates θ̂
Log              0.32      0.36      0.26      0.23
Brier            0.31      0.37      0.2       0.83
Spherical        0.29      0.38      0.19      1.58
Boosting         0.33      0.36      0.32      0.22
As1              0.34      0.3       0.2       1.1
As2              0.33      0.57      0.27      0.22
Share of MC iterations for which matching scores better than MLE^a
Brier            0.76      0.44      0.75      0.47
Spherical        0.75      0.42      0.82      0.52
Boosting         0.23      0.52      0.53      0.53
As1              0.43      0.73      0.8       0.5
As2              0.45      0.57      0.59      0.52

a In each MC iteration, we compute the average score over an out-of-sample period of 100 observations. “Matching” means using the same scoring rule for estimation and evaluation.

In thirteen of the twenty cases, shares in the bottom panel of table 4 are strictly above 0.5, indicating better performance of matching compared to MLE. For some of these cases, we find that matching leads to a substantially different median estimate as well as lower variability as measured by the interdecile range. This holds true for As1 under DGP #2, as well as Brier, Spherical, and As1 under DGP #3.
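The two summary statistics reported in table 4 are simple to compute once the Monte Carlo output is available. The sketch below is illustrative only: the helper names are hypothetical, the toy inputs are made up and are not results from the paper, and scores are assumed to be oriented so that larger is better.

```python
import statistics

def interdecile_range(estimates):
    """Difference between the 90th and 10th percentiles of the estimates."""
    deciles = statistics.quantiles(estimates, n=10)  # nine cut points
    return deciles[-1] - deciles[0]

def share_matching_better(matching_scores, mle_scores):
    """Share of iterations in which matching attains the strictly higher score."""
    wins = sum(m > b for m, b in zip(matching_scores, mle_scores))
    return wins / len(matching_scores)

# Made-up inputs, for illustration only (not results from the paper).
toy_estimates = list(range(1, 100))
toy_matching = [0.3, 0.5, 0.2, 0.9]  # average out-of-sample score per iteration
toy_mle = [0.2, 0.6, 0.1, 0.8]

print(interdecile_range(toy_estimates))
print(share_matching_better(toy_matching, toy_mle))
```

Because the share is bounded between 0 and 1 regardless of the scale of the underlying scoring rule, it can be compared across rows of the bottom panel in a way that raw score differences cannot.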
There are also some cases where the matching estimator is more variable than MLE but nevertheless performs better out of sample. This happens for As2 under DGP #2, Boosting and As2 under DGP #3, as well as Spherical under DGP #4. In these cases, it seems that the relative gain from using an estimator that converges to the maximizer of the scoring rule in question outweighs the relative loss in precision. For another subset of the cases where matching performs better, such as Brier and Spherical under DGP #1, Boosting under DGP #2, and As2 under DGP #4, the medians and interdecile ranges of the MLE and matching estimator are practically indistinguishable. Our conjecture is that in these cases, the matching strategy’s improvement over MLE is marginal. Along the same lines, the six cases in which MLE does better than matching appear very close, with both strategies attaining similar medians and interdecile ranges.

To summarize, our results show that under misspecification, the “correct location” of the matching estimator puts it at an advantage over the MLE, which generally does not converge to the maximizer of the scoring rule in question. To compensate for this, the MLE must be more precise (smaller interdecile range) in order to outperform matching in terms of out-of-sample scores.

IV. Conclusion

This paper explores the nuances in forecasting conditional probabilities under misspecification. The natural choice under correct specification, regardless of the scoring rule used for out-of-sample evaluation, is indeed MLE. It is not only consistent for the maximizer of the scoring rule in question but also efficient. Under misspecification, however, there is no clear natural choice. The MLE is neither consistent for the maximizer of the scoring rule in question nor necessarily “efficient” in the sense of attaining lower sampling variability than other estimators.
The paper shows in an analytical example that under certain symmetry conditions, the choice of scoring rule is inconsequential for parameter estimation. With the aid of numerical results for the asymptotic problem, we then illustrate how the violation of these conditions can lead to different probability limits of the parameter estimators and different conditional probability forecasts. We also show how these different forecasts would lead to different interpretations by heterogeneous decision makers. In finite samples, we find an interesting relationship between the sampling distribution of the parameter estimators and the relative performance of the MLE (compared to the estimator that maximizes the scoring rule considered for evaluation).

Finally, our analysis has conceptual implications pertaining to the literature on distributional forecasting. It has been argued (Geweke & Amisano, 2011) that the provision of distributional forecasts is superior to the provision of point forecasts because distributional forecasts can be employed to construct point forecasts for any loss function. While this argument seems valid in many situations (see section I), it should not be misunderstood as saying that distributional forecasts are “loss function independent.” Specifically, this paper illustrates that probability forecasts, which are clearly distributional, are not loss function independent. A loss function is required for estimation, and this choice makes explicit trade-offs regarding which aspects of the data to fit correctly, at the cost of neglecting other aspects.

REFERENCES

Boyes, William J., Dennis L. Hoffman, and Stuart A. Low, “An Econometric Analysis of the Bank Credit Scoring Problem,” Journal of Econometrics 40 (1989), 3–14.
Brier, Glenn W., “Verification of Forecasts Expressed in Terms of Probability,” Monthly Weather Review 78 (1950), 1–3.
Buja, Andreas, Werner Stuetzle, and Yi Shen, “Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications,” unpublished manuscript, Duke University (2005).
de Jong, Robert M., and Tiemen Woutersen, “Dynamic Time Series Binary Choice,” Econometric Theory 27 (2011), 673–702.
Deutsche Bank (2014), “Sovereign Default Probabilities Online,” http://www.dbresearch.com/servlet/reweb2.ReWEB?rwnode=DBR_INTERNET_EN-PROD$NAVIGATION&rwobj=CDS.calias&rwsite=DBR_INTERNET_EN-PROD.
Elliott, Graham, and Robert P. Lieli, “Predicting Binary Outcomes,” Journal of Econometrics 174 (2013), 15–26.
Geweke, John W., and Gianni Amisano, “Optimal Prediction Pools,” Journal of Econometrics 164 (2011), 130–141.
Giacomini, Raffaella, and Halbert White, “Tests of Conditional Predictive Ability,” Econometrica 74 (2006), 1545–1578.
Gneiting, Tilmann, “Making and Evaluating Point Forecasts,” Journal of the American Statistical Association 106 (2011), 746–762.
Gneiting, Tilmann, and Adrian E. Raftery, “Strictly Proper Scoring Rules, Prediction, and Estimation,” Journal of the American Statistical Association 102 (2007), 359–378.
Good, Irving J., “Rational Decisions,” Journal of the Royal Statistical Society, Series B 14 (1952), 107–114.
Granger, Clive W. J., “On the Limitations of Comparing Mean Square Forecast Errors: Comment,” Journal of Forecasting 12 (1993), 651–652.
Granger, Clive W. J., and M. Hashem Pesaran, “Economic and Statistical Measures of Forecast Accuracy,” Journal of Forecasting 19 (2000), 537–560.
Hamilton, James D., and Menzie D. Chinn, “Econbrowser—Analysis of Current Economic Conditions and Policy” (2014), http://www.econbrowser.com.
Hand, David J., and Veronica Vinciotti, “Local versus Global Models for Classification Problems: Fitting Models Where It Matters,” American Statistician 57 (2003), 124–131.
Hayashi, Fumio, Econometrics (Princeton, NJ: Princeton University Press, 2000).
Inoue, Atsushi, and Barbara Rossi, “Monitoring and Forecasting Currency Crises,” Journal of Money, Credit and Banking 40 (2008), 523–534.
Knapp, Laura G., and Terry G. Seaks, “An Analysis of the Probability of Default on Federally Guaranteed Student Loans,” this review 74 (1992), 404–411.
Koenker, Roger, and Jungmo Yoon, “Parametric Links for Binary Choice Models: A Fisherian–Bayesian Colloquy,” Journal of Econometrics 152 (2009), 120–130.
Lieli, Robert P., and Augusto Nieto-Barthaburu, “Optimal Binary Prediction for Group Decision Making,” Journal of Business and Economic Statistics 28 (2010), 308–319.
Lieli, Robert P., and Michael Springborn, “Closing the Gap between Risk Estimation and Decision Making: Efficient Management of Trade-Related Invasive Species Risk,” this review 95 (2013), 632–645.
Lieli, Robert P., and Halbert White, “The Construction of Empirical Credit Scoring Rules Based on Maximization Principles,” Journal of Econometrics 157 (2010), 110–119.
Manski, Charles F., and T. Scott Thompson, “Estimation of Best Predictors of Binary Response,” Journal of Econometrics 40 (1989), 97–123.
Mass, Clifford, Jeff Baars, Susan Joslyn, John Pyle, Patrick Tewson, David Jones, Tilmann Gneiting, Adrian E. Raftery, J. McLean Sloughter, and Chris Fraley, “PROBCAST: A Web-Based Portal to Mesoscale Probabilistic Forecasts,” Bulletin of the American Meteorological Society 90 (2009), 1009–1014.
Merkle, Edgar C., and Mark Steyvers, “Choosing a Strictly Proper Scoring Rule,” Decision Analysis 10 (2013), 292–304.
Patton, Andrew J., “Comparing Possibly Misspecified Forecasts,” unpublished manuscript, Duke University (2015).
R Core Team, “R: A Language and Environment for Statistical Computing” (Vienna, Austria: R Foundation for Statistical Computing, 2015).
Schervish, Mark J., “A General Method for Comparing Probability Assessors,” Annals of Statistics 17 (1989), 1856–1879.
Shuford, Emir H., Arthur Albert, and H. Edward Massengill, “Admissible Probability Measurement Procedures,” Psychometrika 31 (1966), 125–145.
Toda, Masanao, “Measurement of Subjective Probability Distribution,” Report 3 (State College, PA: Institute for Research, Division of Mathematical Psychology, 1963).
Weiss, Andrew A., “Estimating Time Series Models Using the Relevant Cost Function,” Journal of Applied Econometrics 11 (1996), 539–560.
Weiss, Andrew A., and Allan P. Andersen, “Estimating Time Series Models Using the Relevant Forecast Evaluation Criterion,” Journal of the Royal Statistical Society, Series A 147 (1984), 484–487.
White, Halbert, Asymptotic Theory for Econometricians, 2nd ed. (San Diego, CA: Academic Press, 2001).
Wickham, Hadley, ggplot2: Elegant Graphics for Data Analysis (New York: Springer, 2009).
Wooldridge, Jeffrey M., “Estimation and Inference for Dependent Processes,” in Robert F. Engle and Daniel L. McFadden, eds., Handbook of Econometrics, vol. 4 (Amsterdam: Elsevier Science, 1994).