INTERPRETING OLS ESTIMANDS WHEN TREATMENT EFFECTS ARE
HETEROGENEOUS: SMALLER GROUPS GET LARGER WEIGHTS

Tymon Słoczyński*

Abstract—Applied work often studies the effect of a binary variable (“treatment”) using linear models with additive effects. I study the interpretation of the OLS estimands in such models when treatment effects are heterogeneous. I show that the treatment coefficient is a convex combination of two parameters, which under certain conditions can be interpreted as the average treatment effects on the treated and untreated. The weights on these parameters are inversely related to the proportion of observations in each group. Reliance on these implicit weights can have serious consequences for applied work, as I illustrate with two well-known applications. I develop simple diagnostic tools that empirical researchers can use to avoid potential biases. Software for implementing these methods is available in R and Stata. In an important special case, my diagnostics require only the knowledge of the proportion of treated units.

I. Introduction

Many applied researchers study the effect of a binary variable (“treatment”) on the expected value of an outcome of interest, holding fixed a vector of control variables. As noted by Imbens (2015), despite the availability of a large number of semi- and nonparametric estimators for average treatment effects, applied researchers often continue to use conventional regression methods. In particular, numerous studies use ordinary least squares (OLS) to estimate

\[ y = \alpha + \tau d + X\beta + u, \tag{1} \]

where y denotes the outcome, d denotes the treatment, and X denotes the row vector of control variables, \((x_1, \ldots, x_K)\). Usually τ is interpreted as the average treatment effect (ATE).

Received for publication December 10, 2018. Revision accepted for publication July 9, 2020. Editor: Bryan S. Graham.

*Słoczyński: Brandeis University.

This paper is based on portions of my previous working paper (Słoczyński, 2018). I thank the editor and two anonymous referees for their helpful comments. I am very grateful to Alberto Abadie, Max Kasy, Pedro Sant’Anna, and Jeff Wooldridge for many comments and discussions. I also thank Arun Advani, Isaiah Andrews, Josh Angrist, Orley Ashenfelter, Richard Blundell, Stéphane Bonhomme, Carol Caetano, Marco Caliendo, Matias Cattaneo, Gary Chamberlain, Todd Elder, Alfonso Flores-Lagunes, Brigham Frandsen, Josh Goodman, Florian Gunsilius, Andreas Hagemann, James Heckman, Kei Hirano, Peter Hull, Macartan Humphreys, Guido Imbens, Krzysztof Karbownik, Shakeeb Khan, Toru Kitagawa, Pat Kline, Paweł Królikowski, Nicholas Longford, James MacKinnon, Łukasz Marć, Doug Miller, Michał Myck, Mateusz Myśliwski, Gary Solon, Jann Spiess, Michela Tincani, Alex Torgovitsky, Joanna Tyrowicz, Takuya Ura, and Rudolf Winter-Ebmer; seminar participants at BC, Brandeis, Harvard-MIT, Holy Cross, IHS Vienna, Lehigh, MSU, Potsdam, SDU Odense, SGH, Temple, UCL, Upjohn, and WZB Berlin; and many conference participants for useful feedback. I thank Mark McAvoy for his excellent assistance in developing the R package hettreatreg that implements the results in this paper. I also thank David Card, Jochen Kluve, and Andrea Weber for providing me with supplementary data on the articles surveyed in Card, Kluve, and Weber (2018). I acknowledge financial support from the National Science Centre (grant DEC-2012/05/N/HS4/00395), the Foundation for Polish Science (a “Start” scholarship), the “Weź stypendium—dla rozwoju” scholarship program, and the Theodore and Jane Norman Fund.

A supplemental appendix is available online at https://doi.org/10.1162/rest_a_00953.

This estimation strategy is used in many influential papers in economics (Voigtländer & Voth, 2012; Alesina, Giuliano, & Nunn, 2013; Aizer et al., 2016), as well as in other disciplines. The great appeal of the model in equation (1) comes from its simplicity (Angrist & Pischke, 2009). At the same time, however, a large body of evidence demonstrates the importance of heterogeneity in effects (Heckman, 2001; Bitler, Gelbach, & Hoynes, 2006), which is explicitly ruled out by this same model. In this paper, I contribute to the recent literature on interpreting τ, the OLS estimand, when treatment effects are heterogeneous (Angrist, 1998; Humphreys, 2009; Aronow & Samii, 2016). I demonstrate that τ is a convex combination of two parameters, which under certain conditions can be interpreted as the average treatment effects on the treated (ATT) and untreated (ATU). Surprisingly, the weight that is placed by OLS on the average effect for each group is inversely related to the proportion of observations in this group. The more units are treated, the less weight is placed on ATT. One interpretation of this result is that OLS estimation of the model in equation (1) is generally inappropriate when treatment effects are heterogeneous.

It is also possible, however, to present a more pragmatic view of my main result. I derive a number of corollaries of this result that suggest several diagnostic methods that I recommend to applied researchers. These diagnostics are applicable whenever the researcher is (a) studying the effects of a binary treatment, (b) using OLS, and (c) unwilling to maintain that ATT is exactly equal to ATU. Typically, such a homogeneity assumption would be undesirably strong because those choosing or chosen for treatment may have unusually high or low returns from that treatment, which would directly contradict the equality of ATT and ATU.

In deriving my diagnostics, I assume that the researcher is ultimately interested in ATE, ATT, or both and that she wishes to estimate the model in equation (1) using OLS but is concerned about treatment effect heterogeneity. In this case, my diagnostics are able to detect deviations of the OLS weights from the pattern that would be necessary to consistently estimate a given parameter. These diagnostics are easy to implement and interpret; they are bounded between 0 and 1 in absolute value, and they give the proportion of the difference between ATU and ATT (or between ATT and ATU) that contributes to bias. Thus, if a given diagnostic is close to 0, OLS is likely a reasonable choice, but if a diagnostic is far from 0, other methods should be used.

In an important special case, these diagnostics become particularly simple and immediate to report. If we wish to estimate ATT, this rule-of-thumb variant of my diagnostic is equal to the proportion of treated units, P(d = 1); if our goal is to estimate ATE, the diagnostic is equal to
The Review of Economics and Statistics, May 2022, 104(3): 501–509
© 2020 The President and Fellows of Harvard College and the Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
https://doi.org/10.1162/rest_a_00953



2 × P(d = 1) − 1, twice the deviation of P(d = 1) from 50%. In short, OLS is expected to provide a reasonable approximation to ATE if both groups, treated and untreated, are of similar size. If we wish to estimate ATT, it is necessary that the proportion of treated units is very small.

It follows that OLS might often be substantially biased for ATE, ATT, or both. How common are these biases in practice? In a subset of 37 estimates from Card, Kluve, and Weber (2018), a survey of evaluations of active labor market programs, the mean proportion of treated units is 17.7%.¹ Using the rule-of-thumb variants of my diagnostics, I establish that on average the difference between the OLS estimand and ATE is expected to correspond to 64.6% of the difference between ATT and ATU. Similarly, the expected difference between OLS and ATT is on average equal to 17.7% of the difference between ATU and ATT. In other words, these biases might often be large.
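
The two percentages above are consistent with the rule-of-thumb formulas stated earlier: the diagnostic for ATT equals P(d = 1), and the diagnostic for ATE equals 2 × P(d = 1) − 1. The following is a minimal Python sketch of that arithmetic. The function name `rule_of_thumb_diagnostics` is a hypothetical label chosen for illustration; it is not part of the hettreatreg packages.

```python
def rule_of_thumb_diagnostics(share_treated):
    """Rule-of-thumb diagnostics from the proportion of treated units.

    Returns (diag_att, diag_ate), where
      diag_att = P(d = 1)          (relevant when the target is ATT)
      diag_ate = 2 * P(d = 1) - 1  (relevant when the target is ATE)
    The name and interface are illustrative assumptions.
    """
    if not 0.0 < share_treated < 1.0:
        raise ValueError("share_treated must lie strictly between 0 and 1")
    diag_att = share_treated
    diag_ate = 2.0 * share_treated - 1.0
    return diag_att, diag_ate


# Mean proportion of treated units reported in the text: 17.7%.
diag_att, diag_ate = rule_of_thumb_diagnostics(0.177)
print(f"ATT diagnostic: {diag_att:.1%}")                    # 17.7%
print(f"ATE diagnostic (absolute value): {abs(diag_ate):.1%}")  # 64.6%
```

The ATE diagnostic is negative here because fewer than half of the units are treated; its absolute value is the share of the ATT–ATU gap that is expected to show up as bias.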

The remainder of the paper is organized as follows. Section II presents a leading example and the main theoretical results. Section III discusses two empirical applications. In a study of the effects of a training program (LaLonde, 1986), OLS estimates are very similar to \(\widehat{ATT}\). Yet in a study of the effects of cash transfers (Aizer et al., 2016), OLS estimates are similar to \(\widehat{ATU}\). Section IV concludes. Proofs and several extensions are provided in the online appendixes. The main results are implemented in newly developed R and Stata packages, hettreatreg.

II. A Weighted Average Interpretation of OLS

A. Leading Example

To illustrate the problem with OLS weights, consider the classic example of the National Supported Work (NSW) program. Because this program originally involved a social experiment, the difference in mean outcomes between the treated and control units provides an unbiased estimate of the effect of treatment. LaLonde (1986) studies the performance of various estimators at reproducing this experimental benchmark when the experimental controls are replaced by an artificial comparison group from the Current Population Survey (CPS) or the Panel Study of Income Dynamics (PSID). Angrist and Pischke (2009) reanalyze the NSW–CPS data and conclude that OLS estimates of the effect of the NSW program on earnings in 1978 are similar to the experimental benchmark of $1,794.² In particular, their richest specification delivers an estimate of $794. As I will show, this conclusion is driven by the small proportion of treated units in these data.

¹ This sample is restricted to studies that Card et al. (2018) coded as “selection on observables” and “regression.”

² Subsequent to LaLonde (1986), these data were studied by Dehejia and Wahba (1999), Smith and Todd (2005), and many others. Angrist and Pischke (2009) analyze the subsample of the experimental treated units constructed by Dehejia and Wahba (1999), combined with CPS-1 or CPS-3, two of the nonexperimental comparison groups from CPS, constructed by LaLonde (1986). In this replication, I focus on CPS-1.

In this example, ATT and ATU are likely to be substantially different. This is because the treated group, unlike the CPS comparison (untreated) group, was highly economically disadvantaged. It is plausible that ATU might be 0 or, due to the opportunity cost of program participation, even negative. Also, only 1.1% of the sample was treated, so ATE and ATU will be similar.
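
The last claim can be made explicit with the decomposition of ATE into group-specific effects, which the paper states later as τ_ATE = ρ × τ_ATT + (1 − ρ) × τ_ATU, with ρ = P(d = 1). Rearranging gives a short worked step:

```latex
\[
\tau_{ATE} = \rho\,\tau_{ATT} + (1-\rho)\,\tau_{ATU}
\quad\Longrightarrow\quad
\tau_{ATE} - \tau_{ATU} = \rho\,\bigl(\tau_{ATT} - \tau_{ATU}\bigr).
\]
```

With ρ = 0.011, the gap between ATE and ATU is therefore only 1.1% of the gap between ATT and ATU, however large the latter may be.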

To demonstrate this, I modify the model in equation (1) to include all interactions between d and X. Estimation of this expanded model, again using OLS, allows us to separately compute \(\widehat{ATE}\), \(\widehat{ATT}\), and \(\widehat{ATU}\). This method is usually referred to as “regression adjustment” (Wooldridge, 2010) or “Oaxaca–Blinder” (Kline, 2011; Graham & Pinto, 2022). Using the control variables that deliver the estimate of $794, we obtain \(\widehat{ATE}\) = −$4,930, \(\widehat{ATT}\) = $796, and \(\widehat{ATU}\) = −$4,996. It turns out that since \(\widehat{ATE}\) and \(\widehat{ATU}\) are indeed negative, the OLS estimate and \(\widehat{ATE}\) have different signs. Moreover, if we represent the OLS estimate as a weighted average of \(\widehat{ATT}\) and \(\widehat{ATU}\) with weights that sum to unity, we can write

\[ \$794 = \hat{w}_{ATT} \times \$796 + (1 - \hat{w}_{ATT}) \times (-\$4{,}996), \]

where \(\hat{w}_{ATT}\) is the weight on \(\widehat{ATT}\). Solving for \(\hat{w}_{ATT}\) yields \(\hat{w}_{ATT}\) = 99.96%. In other words, the hypothetical OLS weight on the effect on the treated is similar to the proportion of untreated units, 98.9%.

This “weight reversal” is not a coincidence. As I demonstrate below, the intuition from this example holds more generally, even though the OLS estimand is not necessarily a convex combination of two parameters from a procedure that controls for the full vector X.

B. Main Result

This section presents my main result, which focuses on the algebra of OLS and descriptive estimands that I define below. A causal interpretation of OLS also requires introducing the notion of potential outcomes as well as certain conditions that I discuss in section IIC, including an ignorability assumption. However, this is not needed for my main result.

If L(· | ·) denotes the linear projection, we are interested in the interpretation of τ in the linear projection of y on d and X,

\[ L(y \mid 1, d, X) = \alpha + \tau d + X\beta, \tag{2} \]

when this linear projection does not correspond to the (structural) conditional mean. Let

\[ \rho = P(d = 1) \tag{3} \]

be the unconditional probability of treatment and let

\[ p(X) = L(d \mid 1, X) = \alpha_p + X\beta_p \tag{4} \]

be the propensity score from the linear probability model or, equivalently, the best linear approximation to the true propensity score. Generally the specification in equations (2) and (4) can be arbitrarily flexible, so this approximation can be made very accurate; in fact, we can think of equation (2) as partially linear, where we may include powers and cross-products of original control variables.

After defining p(X), it is helpful to introduce two linear projections of y on p(X), separately for d = 1 and d = 0, namely,

\[ L[y \mid 1, p(X), d = 1] = \alpha_1 + \gamma_1 \times p(X) \tag{5} \]

and

\[ L[y \mid 1, p(X), d = 0] = \alpha_0 + \gamma_0 \times p(X). \tag{6} \]

Note that equations (4), (5), and (6) are definitional. It is sufficient for my main result that the linear projections introduced so far exist and are unique.

Assumption 1. (i) \(E(y^2)\) and \(E(\lVert X \rVert^2)\) are finite. (ii) The covariance matrix of (d, X) is nonsingular.

Assumption 2. V[p(X) | d = 1] and V[p(X) | d = 0] are nonzero, where V(· | ·) denotes the conditional variance (with respect to E[p(X) | d = j], j = 0, 1).

Assumption 1 guarantees the existence and uniqueness of the linear projections in equations (2) and (4). Similarly, assumption 2 ensures that the linear projections in equations (5) and (6) exist and are unique.³

The next step is to use the linear projections in equations (5) and (6) to define the average partial linear effect of d as

\[ \tau_{APLE} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times E[p(X)], \tag{7} \]

as well as the average partial linear effect of d on group j (j = 0, 1) as

\[ \tau_{APLE,j} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times E[p(X) \mid d = j]. \tag{8} \]

These estimands are well defined under assumptions 1 and 2 and have a causal interpretation under additional assumptions, as discussed in section IIC.⁴ When the linear projections in equations (5) and (6) represent the conditional mean of y, the average partial linear effects of d overlap with its average partial effects. It should be stressed, however, that theorem 1, the main result of this paper, is more general and requires only assumptions 1 and 2.

Theorem 1 (Weighted Average Interpretation of OLS). Under assumptions 1 and 2,

\[ \tau = w_1 \times \tau_{APLE,1} + w_0 \times \tau_{APLE,0}, \]

where

\[ w_1 = \frac{(1-\rho) \times V[p(X) \mid d=0]}{\rho \times V[p(X) \mid d=1] + (1-\rho) \times V[p(X) \mid d=0]} \quad \text{and} \quad w_0 = 1 - w_1 = \frac{\rho \times V[p(X) \mid d=1]}{\rho \times V[p(X) \mid d=1] + (1-\rho) \times V[p(X) \mid d=0]}. \]

Proof. See online appendix A.

Theorem 1 shows that τ, the OLS estimand, is a convex combination of \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\). The definition of \(\tau_{APLE,j}\) makes it clear that τ is equivalent to the outcome of a particular three-step procedure. In the first step, we obtain p(X), the propensity score. Next, in the second step, we obtain \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\), as in equation (8), from two linear projections of y on p(X), separately for d = 1 and d = 0. This is analogous to the regression adjustment procedure in section IIA, although now we control for p(X) rather than the full vector X. Finally, in the third step, we calculate a weighted average of \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\). The weight on \(\tau_{APLE,1}\), \(w_1\), is decreasing in \(\frac{V[p(X) \mid d=1]}{V[p(X) \mid d=0]}\) and ρ, and the weight on \(\tau_{APLE,0}\), \(w_0\), is increasing in \(\frac{V[p(X) \mid d=1]}{V[p(X) \mid d=0]}\) and ρ.⁵ This is clearly undesirable, since \(\tau_{APLE} = \rho \times \tau_{APLE,1} + (1 - \rho) \times \tau_{APLE,0}\).

This weighting scheme is also surprising: the more units belong to group j, the less weight is placed on \(\tau_{APLE,j}\), the effect for this group. There are several ways to provide intuition for this result. One is provided in the next section. Another follows from an alternative proof of theorem 1, which is provided with discussion in online appendix B2. It parallels the intuition in Angrist (1998) and Angrist and Pischke (2009) that OLS gives more weight to treatment effects that are better estimated in finite samples.⁶

³ Both assumptions are generally innocuous, although assumption 2 rules out a small number of interesting applications, such as regression adjustments in Bernoulli trials and completely randomized experiments. In these cases, however, OLS is consistent for the average treatment effect under general conditions (Imbens & Rubin, 2015).

⁴ Moreover, \(\tau_{APLE}\) is similar to the “average regression coefficient” or “average slope coefficient” in Graham and Pinto (2022), which is also a descriptive estimand in the sense of Abadie et al. (2020).

⁵ A formal proof that the relationship between ρ and \(w_1\) (\(w_0\)) is indeed always negative (positive) is provided in online appendix B1. This proof additionally assumes that the conditional mean of d is linear in X.

⁶ This proof uses a result from Deaton (1997) and Solon, Haider, and Wooldridge (2015) as a lemma. The main proof of theorem 1 uses a result on decomposition methods from Elder, Goddeeris, and Haider (2010). See online appendix A for more details.

C. Causal Interpretation

The fact that theorem 1 requires only the existence and uniqueness of several linear projections makes this result very general. However, one concern about this result might be that \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\) do not necessarily correspond to the usual (causal) objects of interest. To define these objects, we need two potential outcomes, y(1) and y(0), only one of which is observed for each unit, y = y(d) = y(1) × d + y(0) × (1 − d). The parameters of interest, ATE, ATT, and ATU, are defined as \(\tau_{ATE} = E[y(1) - y(0)]\), \(\tau_{ATT} = E[y(1) - y(0) \mid d = 1]\), and \(\tau_{ATU} = E[y(1) - y(0) \mid d = 0]\). A causal interpretation of OLS also entails the following assumptions.

Assumption 3 (Ignorability in Mean). (i) E[y(1) | X, d] = E[y(1) | X]; and (ii) E[y(0) | X, d] = E[y(0) | X].

Assumption 4. (i) \(E[y(1) \mid X] = \alpha_1 + \gamma_1 \times p(X)\); and (ii) \(E[y(0) \mid X] = \alpha_0 + \gamma_0 \times p(X)\).

Assumptions 3 and 4 ensure that τ admits a causal interpretation. Assumption 3 is standard in the program evaluation literature (Wooldridge, 2010). Assumption 4 is not commonly used. Sufficient for this assumption, but not necessary, is that the conditional mean of d is linear in X and the conditional means of y(1) and y(0) are linear in the true propensity score, which is now equal to p(X). Linearity of E(d | X) is assumed in Aronow and Samii (2016) and Abadie et al. (2020). This assumption is not necessarily strong, since X might include powers and cross-products of original control variables. It is also satisfied automatically in saturated models, as in Angrist (1998) and Humphreys (2009).
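
The three-step procedure behind theorem 1 in section IIB can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the hettreatreg implementation: the data are simulated, there is a single control variable, all function names are hypothetical, and the OLS coefficient on d is computed through the Frisch-Waugh-Lovell residual regression rather than a full matrix solve. Variances use the population-style formula (division by the group size) so that the sample quantities mirror the linear-projection algebra.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    # Population-style covariance (divide by n), matching projection algebra.
    mean_a, mean_b = mean(a), mean(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)


def project(y, x):
    """Intercept and slope of the linear projection of y on (1, x)."""
    slope = cov(y, x) / cov(x, x)
    return mean(y) - slope * mean(x), slope


def check_theorem1(n=2000, seed=0):
    # Simulated data (purely illustrative): one control x, treatment more
    # likely when x is high, and a heterogeneous treatment effect 1 + 2x.
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 1.0) for _ in range(n)]
    d = [1.0 if rng.random() < 0.1 + 0.5 * xi else 0.0 for xi in x]
    y = [xi + di * (1.0 + 2.0 * xi) + rng.gauss(0.0, 1.0)
         for xi, di in zip(x, d)]

    # Step 1: p(X), the linear projection of d on (1, x), as in equation (4).
    a_p, b_p = project(d, x)
    p = [a_p + b_p * xi for xi in x]

    # OLS coefficient on d in the regression of y on (1, d, x), obtained by
    # regressing y on the residual d - p(X) (Frisch-Waugh-Lovell).
    resid = [di - pi for di, pi in zip(d, p)]
    tau_ols = cov(y, resid) / cov(resid, resid)

    # Step 2: project y on p(X) separately by group, equations (5) and (6),
    # then form the group-specific effects in equation (8).
    y1 = [yi for yi, di in zip(y, d) if di == 1.0]
    p1 = [pi for pi, di in zip(p, d) if di == 1.0]
    y0 = [yi for yi, di in zip(y, d) if di == 0.0]
    p0 = [pi for pi, di in zip(p, d) if di == 0.0]
    a1, g1 = project(y1, p1)
    a0, g0 = project(y0, p0)
    tau_aple_1 = (a1 - a0) + (g1 - g0) * mean(p1)
    tau_aple_0 = (a1 - a0) + (g1 - g0) * mean(p0)

    # Step 3: the weights from theorem 1.
    rho = mean(d)
    v1, v0 = cov(p1, p1), cov(p0, p0)
    denom = rho * v1 + (1.0 - rho) * v0
    w1 = (1.0 - rho) * v0 / denom
    w0 = rho * v1 / denom
    return tau_ols, w1 * tau_aple_1 + w0 * tau_aple_0, w1, w0


tau_ols, tau_weighted, w1, w0 = check_theorem1()
# The two numbers should agree up to floating-point error.
print(tau_ols, tau_weighted, w1, w0)
```

Because theorem 1 is an algebraic statement about linear projections, the agreement holds in the sample itself rather than only in large samples, which is why the check does not depend on the particular data-generating process chosen here.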
The linearity assumption for E[y(1) | p(X)] and E[y(0) | p(X)] dates back to Rosenbaum and Rubin (1983) but is restrictive. See also Imbens and Wooldridge (2009) and Wooldridge (2010) for a discussion.

Corollary 1 (Causal Interpretation of OLS). Under assumptions 1, 2, 3, and 4,

\[ \tau = w_1 \times \tau_{ATT} + w_0 \times \tau_{ATU}. \]

Proof. Assumption 3 implies that E[y(1) − y(0) | X] = E(y | X, d = 1) − E(y | X, d = 0). Then, assumption 4 implies that \(E[y(1) - y(0) \mid X] = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times p(X)\), which in turn implies that \(\tau_{ATT} = \tau_{APLE,1}\) and \(\tau_{ATU} = \tau_{APLE,0}\). This, together with theorem 1, completes the proof.

Corollary 1 states that under assumptions 1, 2, 3, and 4, the OLS weights from theorem 1 apply to the causal objects of interest, \(\tau_{ATT}\) and \(\tau_{ATU}\). Thus, τ has a causal interpretation. The greater the proportion of treated units, the smaller is the OLS weight on \(\tau_{ATT}\). Again, this is undesirable since \(\tau_{ATE} = \rho \times \tau_{ATT} + (1 - \rho) \times \tau_{ATU}\).

To aid intuition for this surprising result, recall that an important motivation for using the model in equation (1) and OLS is that the linear projection of y on d and X provides the best linear predictor of y given d and X (Angrist & Pischke, 2009). However, if our goal is to conduct causal inference, then this is not, in fact, a good reason to use this method. Ordinary least squares is “best” in predicting actual outcomes, but causal inference is about predicting missing outcomes, defined as \(y_m = y(1) \times (1 - d) + y(0) \times d\). In other words, the OLS weights are optimal for predicting “what is.” Instead, we are interested in predicting “what would be” if treatment were assigned differently.

Intuition suggests that if our goal were to predict “what is” and, without loss of generality, group 1 were substantially larger than group 0, we would like to place a large weight on the linear projection coefficients of group 1 (\(\alpha_1\) and \(\gamma_1\)), because these coefficients can be used to predict actual outcomes of this group. As noted by Deaton (1997) and Solon et al. (2015), the OLS weights are consistent with this idea. Indeed, theorem 1 also implies that

\[ \tau = [E(y \mid d = 1) - E(y \mid d = 0)] - (w_0 \gamma_1 + w_1 \gamma_0) \times \{E[p(X) \mid d = 1] - E[p(X) \mid d = 0]\}. \tag{9} \]

That is, the OLS estimand is equal to the simple difference in means of y plus an adjustment term that depends on the difference in means of p(X) and a weighted average of \(\gamma_1\) and \(\gamma_0\). When group 1 is large, \(w_0\), the weight on \(\gamma_1\), is large as well. Conversely, if group 1 is large but our goal is to predict missing outcomes, we need to place a large weight on \(\alpha_0\) and \(\gamma_0\) because these coefficients can be used to predict counterfactual outcomes of group 1. To see this point, note that it follows from the discussion in Imbens and Wooldridge (2009) that when the conditional means of y(1) and y(0) are linear in X, we can write

\[ \tau_{ATE} = [E(y \mid d = 1) - E(y \mid d = 0)] - [(1 - \rho)\beta_1 + \rho\beta_0] \times [E(X \mid d = 1) - E(X \mid d = 0)], \tag{10} \]

where \(\beta_1\) and \(\beta_0\) are the coefficients on X in the conditional means of y(1) and y(0), respectively. Equations (9) and (10) reiterate the point of corollary 1 that τ and \(\tau_{ATE}\) have a very similar structure but differ substantially in how they assign weights. Indeed, in the case of \(\tau_{ATE}\), when group 1 is large, the weight on \(\beta_1\) is small, the opposite of what we have seen for OLS.⁷

D. Implications of Theorem 1

There are several practical implications of my main result. Throughout this section, I assume that the researcher is interested in estimating \(\tau_{ATE}\), \(\tau_{ATT}\), or both, and wishes to use OLS to estimate the model in equation (1) but is concerned about the implications of theorem 1 and corollary 1.
In corollaries 2 and 3, I show how to decompose the difference between τ and \(\tau_{ATE}\) or τ and \(\tau_{ATT}\) into components attributable to (a) the difference between \(\tau_{APLE,1}\) and \(\tau_{ATT}\), (b) the difference between \(\tau_{APLE,0}\) and \(\tau_{ATU}\) (jointly referred to as “bias from nonlinearity”), and (c) the OLS weights on \(\tau_{ATT}\) and \(\tau_{ATU}\) (“bias from heterogeneity”).⁸ Because this paper generally focuses on what I now term “bias from heterogeneity,” my discussion below is restricted to this source of bias, which is equivalent to implicitly making assumptions 3 and 4.

Corollary 2. Under assumptions 1 and 2,

\[ \tau - \tau_{ATE} = \underbrace{w_0 \times (\tau_{APLE,0} - \tau_{ATU}) + w_1 \times (\tau_{APLE,1} - \tau_{ATT})}_{\text{bias from nonlinearity}} + \underbrace{\delta \times (\tau_{ATU} - \tau_{ATT})}_{\text{bias from heterogeneity}}, \]

where

\[ \delta = \rho - w_1 = \frac{\rho^2 \times V[p(X) \mid d=1] - (1-\rho)^2 \times V[p(X) \mid d=0]}{\rho \times V[p(X) \mid d=1] + (1-\rho) \times V[p(X) \mid d=0]}. \]

Also, under assumptions 1, 2, 3, and 4,

\[ \tau - \tau_{ATE} = \delta \times (\tau_{ATU} - \tau_{ATT}). \]

Corollary 3. Under assumptions 1 and 2,

\[ \tau - \tau_{ATT} = \underbrace{w_0 \times (\tau_{APLE,0} - \tau_{ATU}) + w_1 \times (\tau_{APLE,1} - \tau_{ATT})}_{\text{bias from nonlinearity}} + \underbrace{w_0 \times (\tau_{ATU} - \tau_{ATT})}_{\text{bias from heterogeneity}}. \]

Also, under assumptions 1, 2, 3, and 4,

\[ \tau - \tau_{ATT} = w_0 \times (\tau_{ATU} - \tau_{ATT}). \]

The proofs of corollaries 2 and 3 follow from simple algebra and are omitted. These results show that regardless of whether we focus on \(\tau_{ATE}\) or \(\tau_{ATT}\), the bias from heterogeneity is equal to the product of a particular measure of heterogeneity, namely, the difference between \(\tau_{ATU}\) and \(\tau_{ATT}\), and an additional parameter that is easy to estimate, δ for \(\tau_{ATE}\) and \(w_0\) for \(\tau_{ATT}\). While \(w_0\) is guaranteed to be positive under assumptions 1 and 2, δ may be positive or negative. Both \(w_0\) and δ, however, are bounded between 0 and 1 in absolute value. Thus, \(w_0\) and |δ| can be interpreted as the percentage of our measure of heterogeneity, \(\tau_{ATU} - \tau_{ATT}\), which contributes to bias.⁹ It might be useful to report estimates of \(w_0\) and δ in studies that use OLS to estimate the model in equation (1).

As an example, consider the empirical application in section IIA. In this case, \(\hat{w}_0 = 0.017\) and \(\hat{\delta} = -0.971\). Interpretation of these estimates is as follows: if our goal is to estimate \(\tau_{ATT}\), using the model in equation (1) and OLS is expected to bias our estimates by only 1.7% of the difference between \(\tau_{ATU}\) and \(\tau_{ATT}\). If instead we wanted to interpret τ as \(\tau_{ATE}\), our estimates would be biased by an estimated 97.1% of the difference between \(\tau_{ATT}\) and \(\tau_{ATU}\). Thus, in this application, it might perhaps be acceptable to interpret τ as \(\tau_{ATT}\) but clearly not as \(\tau_{ATE}\).

Assumption 5. V[p(X) | d = 1] = V[p(X) | d = 0].

The calculation of δ and \(w_0\) is further simplified under assumption 5. If we use \(\delta^*\) and \(w_0^*\) to denote the values of δ and \(w_0\) in this special case, we can write

\[ \delta^* = 2\rho - 1 \quad \text{and} \quad w_0^* = \rho. \]

In this setting, the knowledge of δ and \(w_0\) requires only information on ρ, the proportion of units with d = 1. Of course, the special case where V[p(X) | d = 1] = V[p(X) | d = 0] is hardly to be expected in practice. Still, \(\delta^* = 2\rho - 1\) and \(w_0^* = \rho\) can potentially serve as a rule of thumb.

The practical implications of assumption 5 are particularly clear when ρ is close to 0%, 50%, or 100%. When few units are treated, \(\tau \simeq \tau_{ATT}\). When most of the units are treated, \(\tau \simeq \tau_{ATU}\). Finally, when both groups are of similar size, \(\tau \simeq \tau_{ATE}\). This can also be seen from corollary 4:

Corollary 4. Under assumptions 1, 2, and 5,

\[ \tau = (1 - \rho) \times \tau_{APLE,1} + \rho \times \tau_{APLE,0}. \]

Also, under assumptions 1, 2, 3, 4, and 5,

\[ \tau = (1 - \rho) \times \tau_{ATT} + \rho \times \tau_{ATU}. \]

The proof follows immediately from simple algebra. Corollary 4 provides conditions under which OLS reverses the natural weights on \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\) (or \(\tau_{ATT}\) and \(\tau_{ATU}\)). Indeed, under assumption 5, τ is a convex combination of group-specific average effects, with reversed weights attached to these parameters. Namely, the proportion of units with d = 1 is used to weight the average effect of d on group 0, and vice versa.

The results in this section allow empirical researchers to interpret the OLS estimand when treatment effects are heterogeneous. Alternatively, it might be sensible to use any of the standard estimators for average treatment effects under ignorability, such as regression adjustment (see section IIA), weighting, matching, and various combinations of these approaches.¹⁰ It might also help to estimate a model with homogeneous effects using weighted least squares (WLS). Indeed, in online appendix B3, I demonstrate that when we regress y on d and p(X), with weights of \(\frac{1-\rho}{w_0}\) for units with d = 1 and \(\frac{\rho}{w_1}\) for units with d = 0, the WLS estimand is equal to \(\tau_{APLE}\). In practice, of course, \(\tau_{APLE}\) can also be obtained directly from equation (7).

⁷ Note that the (infeasible) linear projection of the missing outcome, \(y_m\), on d and X would solve our problem of weight reversal. The weights on \(\tau_{ATT}\) and \(\tau_{ATU}\) would still be different from ρ and 1 − ρ if V[p(X) | d = 1] and V[p(X) | d = 0] were different; but at least the weight on \(\tau_{ATT}\) (\(\tau_{ATU}\)) would be increasing (decreasing) in ρ.

⁸ Because bias from nonlinearity arises when assumptions 3 and/or 4 are violated, it might be more accurate to refer to this component as “bias from endogeneity and nonlinearity.” I use the former term for brevity.

⁹ To be precise, |δ| can be interpreted as the percentage of sgn(δ) × (\(\tau_{ATU} - \tau_{ATT}\)) that contributes to bias when focusing on \(\tau_{ATE}\). Both δ and \(w_0\) also have an intuitive interpretation as the difference between the weight that we should place on \(\tau_{ATT}\) when focusing on \(\tau_{ATE}\) or \(\tau_{ATT}\) and the weight that OLS actually places on this parameter. Indeed, δ is equal to the difference between ρ and \(w_1\). Similarly, \(w_0 = 1 - w_1\).

¹⁰ For recent reviews, see Imbens and Wooldridge (2009), Wooldridge (2010), and Abadie and Cattaneo (2018).

E. Related Work

This section discusses the relationship between my main result and those in Angrist (1998) and Humphreys (2009). These papers focus on saturated models with discrete covariates, in which the estimating equation includes an indicator for each combination of covariate values (“stratum”). In particular, Angrist (1998) provides a representation of \(\tau_n\) in

\[ L(y \mid d, x_1, \ldots, x_S) = \tau_n d + \sum_{s=1}^{S} \beta_{n,s} x_s, \tag{11} \]

where \(x_1, \ldots, x_S\) are stratum indicators. More precisely, Angrist (1998) demonstrates that

\[ \tau_n = \sum_{s=1}^{S} \frac{P(x_s = 1) \times V(d \mid x_s = 1)}{\sum_{t=1}^{S} P(x_t = 1) \times V(d \mid x_t = 1)} \times \tau_s, \tag{12} \]

where \(\tau_s = E(y \mid d = 1, x_s = 1) - E(y \mid d = 0, x_s = 1)\). In online appendix B4, I demonstrate that this result follows from corollary 1 when the model for y is saturated.¹¹ At the same time, the interpretation of OLS in Angrist (1998) is different from theorem 1 and corollary 1. On the one hand, unlike corollary 1 and Humphreys (2009), Angrist (1998) does not restrict the relationship between \(\tau_s\) and \(P(d = 1 \mid x_s = 1)\) in any way. On the other hand, theorem 1 and corollary 1 make it arguably easier to identify whether in a given application the OLS estimand will be close to any of the parameters of interest (cf. corollaries 2 to 4). In particular, Angrist (1998) does not recover a pattern of weight reversal, which is discussed in detail in this paper.

Unlike Angrist (1998), Humphreys (2009) does not derive a new representation of \(\tau_n\), instead presenting further analysis of the result in equation (12). In particular, Humphreys (2009) notes that \(\tau_n\) can take any value between \(\min(\tau_s)\) and \(\max(\tau_s)\).
Then he demonstrates that \(\tau_n\) is also bounded by \(\tau_{ATT}\) and \(\tau_{ATU}\) if we restrict the relationship between \(\tau_s\) and \(P(d = 1 \mid x_s = 1)\) to be monotonic. According to corollary 1, τ is a convex combination of \(\tau_{ATT}\) and \(\tau_{ATU}\) if, among other things, both potential outcomes are linear in p(X), which also implies a linear relationship between \(\tau_s\) and \(P(d = 1 \mid x_s = 1)\) when the model for y is saturated. Of course, this linearity assumption is stronger than the monotonicity assumption in Humphreys (2009). However, in return, we are able to derive a closed-form expression for τ in terms of \(\tau_{ATT}\) and \(\tau_{ATU}\), a major advantage over the earlier literature, such as Angrist (1998) and Humphreys (2009).¹²

III. Empirical Applications

This section discusses two empirical illustrations of theorem 1 and its corollaries.¹³ In online appendixes C and D, I discuss the implementation of these results in Stata and R. Throughout this section, \(\tau_{APLE}\), \(\tau_{APLE,1}\), and \(\tau_{APLE,0}\) are implicitly treated as equivalent to \(\tau_{ATE}\), \(\tau_{ATT}\), and \(\tau_{ATU}\), respectively. Although this might be restrictive, I also demonstrate that in both applications sample analogs of \(\tau_{APLE}\), \(\tau_{APLE,1}\), and \(\tau_{APLE,0}\), reported in the body of the paper, are similar to other estimates of \(\tau_{ATE}\), \(\tau_{ATT}\), and \(\tau_{ATU}\), reported in online appendix E.

¹¹ Also, note that Aronow and Samii (2016) show that this result in Angrist (1998) is not specific to saturated models; instead, it is sufficient to assume that the model for d is linear in X. My analysis in online appendix B4 covers the results in both Angrist (1998) and Aronow and Samii (2016).

¹² Humphreys (2009) also provides a brief informal remark that the OLS estimand, as represented in Angrist (1998), is similar to \(\tau_{ATT}\) (\(\tau_{ATU}\)) if propensity scores are small (large) in every stratum. This is a special case of the rule of thumb derived from corollaries 3 and 4. My rule of thumb does not impose any such restrictions on the propensity score other than the requirement that the unconditional probability of treatment is close to 0 or 1.

¹³ In a follow-up paper, I apply these results in the study of racial gaps in test scores and wages (Słoczyński, 2020).

A. The Effects of a Training Program on Earnings

I first consider the example from section IIA in more detail. This replication of the study of the effects of the NSW program in Angrist and Pischke (2009) constitutes an optimistic scenario for OLS. In this application, as I explained in section IIA, the effect for the treated group (ATT) is likely to be substantially larger than the effect for the CPS comparison group (ATU). Moreover, since the experimental benchmark of $1,794 corresponds to \(\widehat{ATT}\) and not to \(\widehat{ATU}\), the researcher should also focus on ATT. It turns out that my diagnostic for estimating ATT, \(\hat{w}_0\), indicates that this parameter should approximately be recovered by OLS, even if treatment effects are heterogeneous.¹⁴

The top and middle panels of table 1 reproduce the estimates from Angrist and
Pischke (2009) and report my diagnostics. The specification in column 4 was
discussed in section IIA. It turns out that \(\hat{w}_0\) is between 0.1% and
1.9% for all specifications; similarly, the rule-of-thumb value of this
diagnostic, \(\hat{w}_0^*\), is, as always, equal to the proportion of treated
units (only 1.1% in this sample). These results are very simple to interpret.
As in section IID, we estimate that the difference between the OLS estimand and
ATT is less than 2% of the difference between ATU and ATT. In this case, it
might indeed be sensible to rely on the OLS estimates of the effect of
treatment.
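The rule-of-thumb diagnostics require nothing beyond the share of treated units, so they are easy to compute by hand. The sketch below assumes the relations shown in the row labels of table 1, \(\hat{w}_0^* = \hat{\rho}\) and \(\hat{\delta}^* = 2\hat{\rho} - 1\), together with the fact that the two weights sum to one; the function name `rule_of_thumb` is hypothetical.

```python
def rule_of_thumb(rho):
    """Rule-of-thumb diagnostics that need only the share of treated units."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    w0_star = rho               # approximate OLS weight on the ATU component
    w1_star = 1.0 - rho         # approximate OLS weight on the ATT component
    delta_star = 2.0 * rho - 1.0
    return w0_star, w1_star, delta_star

# Share of treated units in the NSW-CPS sample, as reported in table 1.
w0_star, w1_star, delta_star = rule_of_thumb(0.011)
print(w0_star, w1_star, delta_star)
```

With the rounded value of 0.011 this gives a \(\hat{\delta}^*\) of about −0.978, while table 1 reports −0.977; the small gap is presumably due to rounding of \(\hat{\rho}\).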

The bottom panel of table 1 provides an application of corollary 1 to these
results. In other words, the estimates from Angrist and Pischke (2009) are now
decomposed into two components, \(\widehat{ATT}\) and \(\widehat{ATU}\). The
difference between these estimates is substantial. In column 4, while the
estimate of ATT is $928, ATU is estimated to be −$6,840. In other words, the OLS
estimate of $794, reported in Angrist and Pischke (2009) and discussed in
section IIA, is actually a weighted average of these two estimates. The fact
that it is close to $928, and not to −$6,840, is a consequence of the small
proportion of treated units in this sample, 1.1%. The weight on $928,
\(\hat{w}_1\), is 98.3%, and the weight on −$6,840, \(\hat{w}_0\), is only 1.7%.
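The weighted-average representation can be checked against the numbers reported in table 1. The sketch below simply recomputes \(\hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}\) for each column and compares it with the OLS estimate; the dictionary layout and the helper `implied_ols` are illustrative choices, and the inputs are the rounded values from the table.

```python
# (OLS, ATT, w1, ATU, w0) as reported in table 1, columns 1 to 4.
table1 = {
    1: (-3437, -3373, 0.981, -6753, 0.019),
    2: (-78, -69, 0.999, -6289, 0.001),
    3: (623, 754, 0.983, -6841, 0.017),
    4: (794, 928, 0.983, -6840, 0.017),
}

def implied_ols(att, w1, atu, w0):
    """Weighted average of the two components, as in corollary 1."""
    return w1 * att + w0 * atu

gaps = {}
for col, (ols, att, w1, atu, w0) in table1.items():
    gaps[col] = implied_ols(att, w1, atu, w0) - ols
    print(col, round(implied_ols(att, w1, atu, w0), 1), ols)
```

Since the weights are rounded to three decimals, the implied values match the reported OLS estimates only up to a few dollars.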

We might expect that if the proportion of treated units were larger, the weight
on \(\widehat{ATT}\) would be smaller and the performance of OLS in replicating
the experimental benchmark would deteriorate. I confirm this conjecture in
online appendix E1 by quasi-discarding random subsamples of untreated units
over a range of sample sizes. In particular, I reestimate the model in
equation (1) using WLS, with weights of 1 for treated and \(1/k\) for untreated
units. Figures E1.1 to E1.4

14 It is well known that in the NSW–CPS data, there is limited overlap in
terms of covariate values between the treated and untreated units (Dehejia
& Wahba, 1999; Smith & Todd, 2005). Thus, it is important to note that my
theoretical results in section II do not impose the overlap assumption.

TABLE 1.—THE EFFECTS OF A TRAINING PROGRAM ON EARNINGS

                                        (1)          (2)          (3)          (4)

Original estimates
OLS                                     −3,437***    −78          623          794
                                        (612)        (596)        (610)        (619)

Diagnostics
\(\hat{w}_0\)                           0.019        0.001        0.017        0.017
\(\hat{w}_0^* = \hat{\rho}\)            0.011        0.011        0.011        0.011
\(\hat{\delta}\)                        −0.970       −0.987       −0.971       −0.971
\(\hat{\delta}^* = 2\hat{\rho} - 1\)    −0.977       −0.977       −0.977       −0.977

Decomposition
\(\widehat{ATT}\)                       −3,373***    −69          754          928
                                        (620)        (595)        (619)        (630)
\(\hat{w}_1\)                           0.981        0.999        0.983        0.983
\(\widehat{ATU}\)                       −6,753***    −6,289**     −6,841***    −6,840***
                                        (1,219)      (2,807)      (1,294)      (1,319)
\(\hat{w}_0\)                           0.019        0.001        0.017        0.017
\(\widehat{ATE}\)                       −6,714***    −6,218**     −6,754***    −6,751***
                                        (1,206)      (2,777)      (1,281)      (1,305)

Demographic controls                    √                         √            √
Earnings in 1974                                                               √
Earnings in 1975                                     √            √            √

\(\hat{\rho} = \hat{P}(d = 1)\)         0.011        0.011        0.011        0.011
Observations                            16,177       16,177       16,177       16,177

The estimates in the top panel correspond to column 2 in table 3.3.3 in Angrist and Pischke (2009, p. 89). The dependent variable is earnings in 1978. Demographic controls include age, age squared, years of
schooling, and indicators for married, high school dropout, Black, and Hispanic. For treated individuals, earnings in 1974 correspond to real earnings in months 13 to 24 prior to randomization, which overlaps with
calendar year 1974 for a number of individuals. Formulas for \(w_0\), \(w_1\), and \(\delta\) are given in theorem 1 and corollary 2. Following these results, OLS = \(\hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}\). Estimates of ATE, ATT, and ATU are
sample analogs of \(\tau_{APLE}\), \(\tau_{APLE,1}\), and \(\tau_{APLE,0}\), respectively. Also, \(\widehat{ATE} = \hat{\rho} \times \widehat{ATT} + (1 - \hat{\rho}) \times \widehat{ATU}\). Huber–White standard errors (OLS) and bootstrap standard errors (\(\widehat{ATE}\), \(\widehat{ATT}\), and \(\widehat{ATU}\)) are in parentheses.
Statistically significant at *10%, **5%, and ***1%.

show that in this application WLS estimates become more
negative as k increases. This is because larger values of k
correspond to greater proportions of untreated units being
"discarded," and hence larger weights on \(\widehat{ATU}\), which is
substantially more negative than \(\widehat{ATT}\).
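The quasi-discarding exercise can be mimicked on simulated data. The following sketch is not a replication of online appendix E1: the data-generating process, the sample size, and the function `wls_treatment_coef` are all assumptions made for illustration. It fits the model in equation (1) by WLS with weights of 1 for treated and \(1/k\) for untreated units, in a design where the true effect is larger for treated than for untreated units.

```python
import numpy as np

def wls_treatment_coef(y, d, x, k):
    """Coefficient on d from WLS of y on (1, d, x).

    Treated units get weight 1 and untreated units get weight 1/k.
    """
    w = np.where(d == 1, 1.0, 1.0 / k)
    Z = np.column_stack([np.ones_like(x), d, x])
    sw = np.sqrt(w)
    # WLS is OLS on rows rescaled by the square root of the weights.
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return coef[1]

# Simulated data (hypothetical design, not the NSW-CPS sample).
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = (x + rng.normal(size=n) > 2.0).astype(float)  # only a few treated units
effect = 1.0 + 2.0 * x                            # heterogeneous treatment effect
y = x + d * effect + rng.normal(size=n)

att = effect[d == 1].mean()
atu = effect[d == 0].mean()
estimates = {k: wls_treatment_coef(y, d, x, k) for k in (1, 2, 5, 10, 50)}
for k, est in estimates.items():
    print(k, round(est, 3))
print("ATT", round(att, 3), "ATU", round(atu, 3))
```

In this design the average effect is larger for treated than for untreated units, so downweighting the untreated group moves the estimate away from the effect on the treated and toward the effect on the untreated, which mirrors the pattern described for figures E1.1 to E1.4.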

Additional extensions of my analysis are also presented
in online appendix E1. For each specification in table 1, I
provide both a linear and a nonparametric estimate of the
conditional mean of the outcome given \(p(X)\), separately for
treated and untreated units (figures E1.5 to E1.8). A visual
comparison of both estimates provides an informal test of
assumption 4, which is necessary for a causal interpretation of
\(\tau_{APLE}\), \(\tau_{APLE,1}\), and \(\tau_{APLE,0}\). The linearity
assumption appears to be approximately satisfied for the treated but
usually not for the untreated units.

Thus, as a robustness check, I also report a number of alternative
estimates of the effects of the NSW program in table E1.1.
I consider regression adjustment, as in section IIA, as well
as matching on \(p(X)\) and on the logit propensity score.15 In
each case, I separately estimate ATE, ATT, and ATU. These
estimates are consistent with the claim that the general pattern
of results in table 1 is driven by the OLS weights. The
estimates of ATE and ATU are always negative and large
in magnitude; the estimates of ATT are much closer to the
experimental benchmark.

15 In particular, the estimates discussed in section IIA are reported in
column 4 of the bottom panel of table E1.1.

Finally, I repeat the following exercise from section IIA.
When we match the OLS estimates in table 1 with the corresponding
estimates of ATT and ATU in table E1.1, we can write
\(\hat{\tau} = \hat{w}_{ATT} \times \hat{\tau}_{ATT} + (1 - \hat{w}_{ATT}) \times \hat{\tau}_{ATU}\).
Unless \(\hat{\tau}_{ATT}\) and \(\hat{\tau}_{ATU}\) are sample analogs of
\(\tau_{APLE,1}\) and \(\tau_{APLE,0}\), \(\hat{w}_{ATT}\)
does not need to be bounded between 0 and 1. Yet we can
solve for \(\hat{w}_{ATT}\) for each set of estimates. The mean of
\(\hat{w}_{ATT}\) across all sets of estimates in table E1.1 is 98.3%, which is
nearly identical to the sample proportion of untreated units,
98.9%. This is reassuring for my claims.
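Solving for the hypothetical weight is a one-line rearrangement: \(\hat{w}_{ATT} = (\hat{\tau} - \hat{\tau}_{ATU}) / (\hat{\tau}_{ATT} - \hat{\tau}_{ATU})\). The sketch below applies it to column 4 of table 1 rather than to table E1.1, whose entries are not reproduced in this section; the function name `implied_att_weight` is hypothetical.

```python
def implied_att_weight(tau_ols, tau_att, tau_atu):
    """Solve tau_ols = w * tau_att + (1 - w) * tau_atu for w."""
    if tau_att == tau_atu:
        raise ValueError("the weight is not identified when ATT equals ATU")
    return (tau_ols - tau_atu) / (tau_att - tau_atu)

# Table 1, column 4: OLS = 794, ATT = 928, ATU = -6,840.
w_att = implied_att_weight(794.0, 928.0, -6840.0)
print(round(w_att, 4))
```

Because the inputs here are sample analogs of \(\tau_{APLE,1}\) and \(\tau_{APLE,0}\), the recovered weight should reproduce \(\hat{w}_1\) from table 1 up to rounding. With other estimators of ATT and ATU, nothing forces the result to lie between 0 and 1.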

B. The Effects of Cash Transfers on Longevity

In my second application, I replicate a recent paper by
Aizer et al. (2016) and study the effects of cash transfers on
longevity of the children of their beneficiaries, measured
by their log age at death. In particular, Aizer et al. (2016)
analyze the administrative records of applicants to the Mothers'
Pension (MP) program, which supported poor mothers with
dependent children in pre–World War II United States. In this
study, the untreated group consists only of children of mothers
who applied for a transfer and were initially deemed eligible

TABLE 2.—THE EFFECTS OF CASH TRANSFERS ON LONGEVITY

                                        (1)          (2)          (3)          (4)

Original estimates
OLS                                     0.0157***    0.0158***    0.0182***    0.0167***
                                        (0.0058)     (0.0059)     (0.0062)     (0.0061)

Diagnostics
\(\hat{w}_0\)                           0.861        0.870        0.784        0.784
\(\hat{w}_0^* = \hat{\rho}\)            0.875        0.875        0.875        0.875
\(\hat{\delta}\)                        0.736        0.745        0.659        0.659
\(\hat{\delta}^* = 2\hat{\rho} - 1\)    0.750        0.750        0.750        0.750

Decomposition
\(\widehat{ATT}\)                       0.0129**     0.0149**     0.0097       0.0089
                                        (0.0064)     (0.0071)     (0.0078)     (0.0079)
\(\hat{w}_1\)                           0.139        0.130        0.216        0.216
\(\widehat{ATU}\)                       0.0162***    0.0160***    0.0206***    0.0188***
                                        (0.0057)     (0.0059)     (0.0063)     (0.0064)
\(\hat{w}_0\)                           0.861        0.870        0.784        0.784
\(\widehat{ATE}\)                       0.0133**     0.0150**     0.0110       0.0102
                                        (0.0063)     (0.0068)     (0.0073)     (0.0074)

State fixed effects                     √
County fixed effects                                              √            √
Cohort fixed effects                    √            √            √            √
State characteristics                                √            √            √
County characteristics                               √
Individual characteristics                           √            √            √

\(\hat{\rho} = \hat{P}(d = 1)\)         0.875        0.875        0.875        0.875
Observations                            7,860        7,859        7,859        7,857

The estimates in the top panel correspond to columns 1 to 4 in panel A of table 4 in Aizer et al. (2016, p. 952). The dependent variable is log age at death, as reported in the MP records (columns 1 to 3) or on
the death certificate (column 4). State, county, and individual characteristics are listed in table E2.1 in online appendix E2. Formulas for \(w_0\), \(w_1\), and \(\delta\) are given in theorem 1 and corollary 2. Following these results,
OLS = \(\hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}\). Estimates of ATE, ATT, and ATU are sample analogs of \(\tau_{APLE}\), \(\tau_{APLE,1}\), and \(\tau_{APLE,0}\), respectively. Also, \(\widehat{ATE} = \hat{\rho} \times \widehat{ATT} + (1 - \hat{\rho}) \times \widehat{ATU}\). Huber–White standard errors (OLS)
and bootstrap standard errors (\(\widehat{ATE}\), \(\widehat{ATT}\), and \(\widehat{ATU}\)) are in parentheses. Statistically significant at *10%, **5%, and ***1%.

but were ultimately rejected. This strategy is used to ensure
that treated and untreated individuals are broadly comparable,
and hence an ignorability assumption might be plausible.
Nevertheless, rejected mothers were slightly older and came
from slightly smaller and richer families than accepted mothers.
Thus, as before, there is no reason to believe that ATT
and ATU are equal, although it is perhaps less clear a priori
which is larger. Unlike in section IIIA, it seems plausible that
the researcher might be interested in either the average effect
of cash transfers, ATE, or in their average effect for accepted
applicants, ATT.

The top and middle panels of table 2 reproduce the baseline
estimates from Aizer et al. (2016) and report my diagnostics.
While the OLS estimates are positive and statistically significant,
my diagnostics indicate that these results should be
approached with caution. Namely, treated units constitute the
vast majority (or 87.5%) of the sample. It follows that OLS is
expected to place a disproportionately large weight on \(\widehat{ATU}\),
in which case the OLS estimates might be very biased for both
ATE and ATT (see corollaries 2 and 3). Indeed, my estimates
of \(\delta\) suggest that the difference between the OLS estimand
and ATE is equal to 65.9% to 74.5% of the difference between
ATU and ATT. Also, the estimates of \(w_0\) suggest that
the difference between OLS and ATT corresponds to 78.4%
to 87.0% of this measure of heterogeneity. The estimates of
\(\delta^*\) and \(w_0^*\) are similar. It turns out that in this application the
OLS estimates might be substantially biased for both of our
parameters of interest. This would be a pessimistic scenario
for OLS.
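The two statements about bias can be verified from the entries of table 2. The sketch below checks, column by column, that the gap between OLS and \(\widehat{ATT}\) is approximately \(\hat{w}_0 \times (\widehat{ATU} - \widehat{ATT})\) and that the gap between OLS and \(\widehat{ATE}\) is approximately \(\hat{\delta} \times (\widehat{ATU} - \widehat{ATT})\). The inputs are the rounded values from the table, and the variable names are illustrative.

```python
# (OLS, ATT, ATU, ATE, w0, delta) as reported in table 2, columns 1 to 4.
table2 = {
    1: (0.0157, 0.0129, 0.0162, 0.0133, 0.861, 0.736),
    2: (0.0158, 0.0149, 0.0160, 0.0150, 0.870, 0.745),
    3: (0.0182, 0.0097, 0.0206, 0.0110, 0.784, 0.659),
    4: (0.0167, 0.0089, 0.0188, 0.0102, 0.784, 0.659),
}

rows = []
for col, (ols, att, atu, ate, w0, delta) in table2.items():
    het = atu - att  # the measure of heterogeneity used in the text
    # (column, OLS - ATT, w0 * het, OLS - ATE, delta * het)
    rows.append((col, ols - att, w0 * het, ols - ate, delta * het))

for col, gap_att, pred_att, gap_ate, pred_ate in rows:
    print(col, round(gap_att, 4), round(pred_att, 4),
          round(gap_ate, 4), round(pred_ate, 4))
```

The reported and implied gaps agree up to the rounding of the published estimates.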

The results in the bottom panel of table 2 suggest that these
biases are indeed substantial. In this panel, following corollary 1,
each OLS estimate from Aizer et al. (2016) is represented
as a weighted average of estimates of two effects, on
accepted (ATT) and rejected (ATU) applicants. Estimates
of ATU are consistently larger than those of ATT. Thus, OLS
overestimates both ATE (since \(\hat{\delta} > 0\)) and ATT. While the
implicit OLS estimates of these parameters remain statistically
significant in columns 1 and 2, this is no longer the case
in columns 3 and 4, following the inclusion of county fixed
effects. Perhaps more importantly, these estimates of ATT are
about half the size of the corresponding OLS estimates. Clearly,
this difference is economically quite meaningful.
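As a worked illustration, the arithmetic behind column 3 of table 2 can be written out using the rounded values reported there, so the equalities hold only approximately:

```latex
\begin{align*}
\text{OLS} &= \hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}
  \approx 0.216 \times 0.0097 + 0.784 \times 0.0206 \approx 0.0182, \\
\widehat{ATE} &= \hat{\rho} \times \widehat{ATT} + (1 - \hat{\rho}) \times \widehat{ATU}
  \approx 0.875 \times 0.0097 + 0.125 \times 0.0206 \approx 0.011.
\end{align*}
```

The two lines use the same pair of component estimates and differ only in the weights. OLS puts 78.4% of the weight on the effect for rejected applicants, who make up 12.5% of the sample.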

To assess the robustness of these findings, I present several
extensions of my analysis in online appendix E2. The
informal test of assumption 4, as discussed in section IIIA,
appears to suggest that the conditional mean of the outcome
given \(p(X)\) is approximately linear for both the treated and
untreated units (see figures E2.5 to E2.8). I also report a

number of alternative estimates of the effects of cash transfers
in table E2.1. These additional results support my conclusion.
Only one in twelve estimates of ATT is statistically different
from 0, and four of the insignificant estimates are negative.
While it is possible that cash transfers increase longevity, the
OLS estimates reported in Aizer et al. (2016) are almost certainly
too large. Interestingly, this bias appears to be driven
by the implicit OLS weights on ATT and ATU, the focus of
this paper.16

IV. Conclusion

This paper proposes a new interpretation of the OLS estimand
for the effect of a binary treatment in the standard linear
model with additive effects. According to the main result of
this paper, the OLS estimand is a convex combination of two
parameters, which under certain conditions are equivalent to
the average treatment effects on the treated (ATT) and untreated
(ATU). Surprisingly, the weights on these parameters
are inversely related to the proportion of observations in each
group, which can lead to substantial biases when interpreting
the OLS estimand as ATE or ATT.

One lesson from this result is that it might be preferable, as
suggested by a body of work in econometrics, to use any of the
standard estimators of average treatment effects under ignorability,
such as regression adjustment, weighting, matching,
and various combinations of these approaches. Empirical researchers
with a preference for OLS might instead want to use
the diagnostic tools that this paper also provides. These diagnostics,
which are implemented in the hettreatreg package
in R and Stata, are applicable whenever the researcher
is studying the effects of a binary treatment, using OLS, and
unwilling to maintain that ATT is exactly equal to ATU. In
an important special case, these diagnostics require only the
knowledge of the proportion of treated units.
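To make the diagnostics concrete, the sketch below computes them on simulated data. It is not the hettreatreg implementation. Theorem 1 is not restated in this section, so the sketch assumes the following forms: \(p(X)\) is the linear projection of d on X, \(w_1 = (1 - \rho) V(p(X) \mid d = 0) / [\rho V(p(X) \mid d = 1) + (1 - \rho) V(p(X) \mid d = 0)]\), \(w_0 = 1 - w_1\), and \(\delta = \rho - w_1\), with the two components obtained from separate linear fits of y on \(p(X)\) among treated and untreated units. The function names and the data-generating process are hypothetical.

```python
import numpy as np

def ols(Z, y):
    """Least-squares coefficients of y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

def ols_weight_diagnostics(y, d, X):
    """Sample analogs of w0, w1, delta and the two components of the OLS coefficient."""
    ones = np.ones(len(y))
    # Linear "propensity score": fitted values from regressing d on (1, X).
    Zx = np.column_stack([ones, X])
    p = Zx @ ols(Zx, d)
    t, c = d == 1, d == 0
    rho = t.mean()
    v1, v0 = p[t].var(), p[c].var()   # within-group variances of p(X)
    denom = rho * v1 + (1 - rho) * v0
    w1 = (1 - rho) * v0 / denom       # weight on the ATT-type component
    w0 = rho * v1 / denom             # weight on the ATU-type component
    # Group-specific linear fits of y on p(X).
    a1, g1 = ols(np.column_stack([ones[t], p[t]]), y[t])
    a0, g0 = ols(np.column_stack([ones[c], p[c]]), y[c])
    tau1 = (a1 - a0) + (g1 - g0) * p[t].mean()
    tau0 = (a1 - a0) + (g1 - g0) * p[c].mean()
    # Coefficient on d in the regression of y on (1, d, X), as in equation (1).
    tau_ols = ols(np.column_stack([ones, d, X]), y)[1]
    return {"tau_ols": tau_ols, "w1": w1, "w0": w0, "delta": rho - w1,
            "tau1": tau1, "tau0": tau0, "rho": rho}

# Simulated data (hypothetical design).
rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
d = (X[:, 0] + rng.normal(size=n) > 1.0).astype(float)
y = X @ np.array([1.0, -0.5]) + d * (1.0 + X[:, 0]) + rng.normal(size=n)

out = ols_weight_diagnostics(y, d, X)
print({name: round(float(value), 4) for name, value in out.items()})
```

Under these assumed forms, the OLS coefficient equals the weighted average `w1 * tau1 + w0 * tau0` exactly in the sample, which provides a convenient sanity check on any implementation. The rule-of-thumb value of `w0` is simply `rho`, which is exact when the two within-group variances of \(p(X)\) coincide.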

16 I also repeat two further exercises from section IIIA. First, after I
reestimate the model in equation (1) using WLS, with weights of 1 for
treated and \(1/k\) for untreated units, I demonstrate in figures E2.1 to E2.4
that these estimates become more positive as k increases. As before, larger
values of k translate into larger weights on \(\widehat{ATU}\), which is now greater than
\(\widehat{ATT}\). Second, when I use the estimates of ATT and ATU in table E2.1 to
recover the hypothetical OLS weights, I obtain 22.8% as the mean of \(\hat{w}_{ATT}\).
This is reasonably similar to the proportion of untreated units, 12.5%.

References

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M.
Wooldridge, “Sampling-Based versus Design-Based Uncertainty in
Regression Analysis,” Econometrica 88 (2020), 265–296.
10.3982/ECTA12675

Abadie, Alberto, and Matias D. Cattaneo, “Econometric Methods for
Program Evaluation,” Annual Review of Economics 10 (2018), 465–503.
10.1146/annurev-economics-080217-053402

Aizer, Anna, Shari Eli, Joseph Ferrie, and Adriana Lleras-Muney, “The
Long-Run Impact of Cash Transfers to Poor Families,” American
Economic Review 106 (2016), 935–971. 10.1257/aer.20140529

Alesina, Alberto, Paola Giuliano, and Nathan Nunn, “On the Origins of
Gender Roles: Women and the Plough,” Quarterly Journal of
Economics 128 (2013), 469–530. 10.1093/qje/qjt005

Angrist, Joshua D., “Estimating the Labor Market Impact of Voluntary
Military Service Using Social Security Data on Military Applicants,”
Econometrica 66 (1998), 249–288. 10.2307/2998558

Angrist, Joshua D., and Jörn-Steffen Pischke, Mostly Harmless
Econometrics: An Empiricist’s Companion (Princeton, NJ: Princeton
University Press, 2009).

Aronow, Peter M., and Cyrus Samii, “Does Regression Produce
Representative Estimates of Causal Effects?” American Journal of Political
Science 60 (2016), 250–267. 10.1111/ajps.12185

Bitler, Marianne P., Jonah B. Gelbach, and Hilary W. Hoynes, “What
Mean Impacts Miss: Distributional Effects of Welfare Reform
Experiments,” American Economic Review 96 (2006), 988–1012.
10.1257/aer.96.4.988

Card, David, Jochen Kluve, and Andrea Weber, “What Works? A Meta
Analysis of Recent Active Labor Market Program Evaluations,”
Journal of the European Economic Association 16 (2018), 894–931.
10.1093/jeea/jvx028

Deaton, Angus, The Analysis of Household Surveys: A Microeconometric
Approach to Development Policy (Baltimore, MD: Johns Hopkins
University Press, 1997).

Dehejia, Rajeev H., and Sadek Wahba, “Causal Effects in Nonexperimental
Studies: Reevaluating the Evaluation of Training Programs,”
Journal of the American Statistical Association 94 (1999), 1053–1062.
10.1080/01621459.1999.10473858

Elder, Todd E., John H. Goddeeris, and Steven J. Haider, “Unexplained
Gaps and Oaxaca–Blinder Decompositions,” Labour Economics 17
(2010), 284–290. 10.1016/j.labeco.2009.11.002

Graham, Bryan S., and Cristine Campos de Xavier Pinto,
“Semiparametrically Efficient Estimation of the Average Linear Regression
Function,” Journal of Econometrics 226 (2022), 115–138.
10.1016/j.jeconom.2021.07.008

Heckman, James J., “Micro Data, Heterogeneity, and the Evaluation of
Public Policy: Nobel Lecture,” Journal of Political Economy 109
(2001), 673–748. 10.1086/322086

Humphreys, Macartan, “Bounds on Least Squares Estimates of Causal
Effects in the Presence of Heterogeneous Assignment Probabilities,”
unpublished paper (2009).

Imbens, Guido W., “Matching Methods in Practice: Three Examples,”
Journal of Human Resources 50 (2015), 373–419. 10.3368/jhr.50.2.373

Imbens, Guido W., and Donald B. Rubin, Causal Inference for Statistics,
Social, and Biomedical Sciences: An Introduction (New York:
Cambridge University Press, 2015).

Imbens, Guido W., and Jeffrey M. Wooldridge, “Recent Developments in
the Econometrics of Program Evaluation,” Journal of Economic
Literature 47 (2009), 5–86. 10.1257/jel.47.1.5

Kline, Patrick, “Oaxaca–Blinder as a Reweighting Estimator,” American
Economic Review: Papers and Proceedings 101 (2011), 532–537.
10.1257/aer.101.3.532

LaLonde, Robert J., “Evaluating the Econometric Evaluations of Training
Programs with Experimental Data,” American Economic Review 76
(1986), 604–620. https://www.jstor.org/stable/1806062

Rosenbaum, Paul R., and Donald B. Rubin, “The Central Role of the
Propensity Score in Observational Studies for Causal Effects,” Biometrika
70 (1983), 41–55. 10.1093/biomet/70.1.41

Słoczyński, Tymon, “A General Weighted Average Representation of the
Ordinary and Two-Stage Least Squares Estimands,” IZA discussion
paper 11866 (2018).

——— “Average Gaps and Oaxaca–Blinder Decompositions: A Cautionary
Tale about Regression Estimates of Racial Differences in Labor
Market Outcomes,” Industrial and Labor Relations Review 73 (2020),
705–729. 10.1177/0019793919874063

Smith, Jeffrey A., and Petra E. Todd, “Does Matching Overcome LaLonde’s
Critique of Nonexperimental Estimators?” Journal of Econometrics
125 (2005), 305–353. 10.1016/j.jeconom.2004.04.011

Solon, Gary, Steven J. Haider, and Jeffrey M. Wooldridge, “What Are We
Weighting For?” Journal of Human Resources 50 (2015), 301–316.
10.3368/jhr.50.2.301

Voigtländer, Nico, and Hans-Joachim Voth, “Persecution Perpetuated: The
Medieval Origins of Anti-Semitic Violence in Nazi Germany,”
Quarterly Journal of Economics 127 (2012), 1339–1392.
10.1093/qje/qjs019

Wooldridge, Jeffrey M., Econometric Analysis of Cross Section and Panel
Data, 2nd ed. (Cambridge, MA: MIT Press, 2010).
