options(digits=2)
# R Imports Boilerplate
rm(list=ls())
library(LalRUtils)
load_or_install(c('tidyverse','data.table', 'stargazer','lfe','boot', 'sandwich','lmtest',
'magrittr', 'rio', 'cobalt', 'Matching', 'doParallel', 'foreach'))
theme_set(theme_bw())
exp_type = 'text'
root = '/home/alal/Dropbox/0_GradSchool/0_classes/450b/applied-econometrics-notes/'
data = paste0(root, 'data/')
nsw_exper = import(paste0(data, 'nsw_exper.dta'))
nsw_psid = import(paste0(data,'nsw_psid_withtreated.dta'))
Variance of Y = Variance of CEF (Explained Variance) + Variance of residual
where $\tilde{x_{ki}}$ is the residual from a regression of $x_{ki}$ on all other covariates.
Justifications for Linear Regression:
Conditional on $X_i$, $s_i$ can be said to be 'as good as randomly assigned'. Not Testable
Where the LHS is a vector of coefficients from the 'short' regression (regressing Y on $O_i$), $O_i$ is a vector of observed predictors (included in the regression), and $\rho$ is the corresponding vector of coefficients from regressing Y on $O_i$ and $U_i$ (long regression estimates). $U_i$ is a vector of unobserved predictors, and $\gamma$ is the corresponding set of coefficients from the long regression, and $\delta_{UO}$ is the vector of coefficients from regressing $U_i$ on $O_i$. In words, the OVB formula says:
Coefficient in Short Regression = Coefficient in long regression + effect of omitted $\times$ regression of omitted on included
ate_diffmeans(nsw_exper, re78, nsw)
basespec = as.formula(re78 ~ nsw)
fullspec = as.formula(re78 ~ nsw + age + educ + black + hisp + married + re74 + re75 +
u74 + u75)
m1 = robustify(felm(basespec, nsw_exper))
m2 = robustify(felm(fullspec, nsw_exper))
stargazer(m1, m2,
column.labels = c('Basic', 'Controls'),
dep.var.labels.include = FALSE,
header = F,
single.row = T,
type = exp_type
)
$(Y_0 , Y_1) \coprod D | X$ - unconfoundedness / Selection on Observables / Ignorability
$0 < Pr(D=1|X) < 1$ - common support
Weakening assumptions to uncover ATT only
$(Y_0) \coprod D | X$ - unconfoundedness for controls
$Pr(D=1|X) < 1$ - weak overlap
Given unconfoundedness, we can write : $$ \begin{aligned} \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } | X \right] & = \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } | X , D = 1 \right] \\ & = \mathbf { E } [ Y | X , D = 1 ] - \mathbf { E } [ Y | X , D = 0 ] \end{aligned} $$
Estimands:
$$ \begin{aligned} \tau _ { A T E } & = \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } \right] = \int \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } | X \right] d P ( X ) \\ & = \int ( \mathbb { E } [ Y | D = 1 , X ] - \mathbb { E } [ Y | D = 0 , X ] ) d P ( X ) \end{aligned} $$$$ \begin{aligned} \tau _ { A T T } & = \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } | D = 1 \right] \\ & = \int ( \mathbf { E } [ Y | X , D = 1 ] - \mathbf { E } [ Y | X , D = 0 ] ) d P ( X | D = 1 ) \end{aligned} $$Discrete Case ($x$ has finite values indexed by $1, \dots, m$
$$ \tau_{ATE} = \sum_{k = 1}^m ( \mathbb { E } [ Y | D = 1 , X = x_k ] - \mathbb { E } [ Y | D = 0 , X = k_k ] ) Pr(X = x_k) $$$$ \tau_{ATE} = \sum_{k = 1}^m ( \mathbb { E } [ Y | D = 1 , X = x_k ] - \mathbb { E } [ Y | D = 0 , X = k_k ] ) Pr(X = x_k | D = 1) $$att_subclass = function(df, treat_var, outcome_var, ...){
treat_var = enquo(treat_var)
outcome_var = enquo(outcome_var)
subclass = quos(...) # arbitrarily large sets of sublasses
df %>%
# count number of treated
mutate(N_treat = sum(!! treat_var)) %>%
# select and rename variables
dplyr::select(N_treat, treatment = !! treat_var,
outcome = !! outcome_var, !!! subclass) %>%
# group by subclassification groups
group_by(!!! subclass) %>%
# count treatments, build weights by subclass
mutate(n_treat = sum(treatment), wgt = n_treat/N_treat) %>%
# group by subclass + treatment
group_by(!!! subclass, treatment) %>%
# subclass X treatment means
mutate(ybar = mean(outcome)) %>%
# take first row from each subclass X treatment combination
slice(1) %>%
# subset to relevant columns
dplyr::select(n_treat, N_treat, ybar, wgt, treatment, !!! subclass) %>%
# reshape (long to wide) so each subclass is a row
spread(treatment, ybar, sep = '_') %>%
# difference in means for each subclass + weighted mean
mutate(ydiff = treatment_1 - treatment_0, wtydiff = wgt * ydiff) -> groupsums
# overall ATT , drop missing subclasses
att = sum(groupsums$wtydiff, na.rm = T)
return(list(groupsums, att))
}
att_subclass(nsw_psid, nsw, re78, married)[[2]]
nsw_psid %<>% mutate(terc_75 = ntile(re75, 3))
att_subclass(nsw_psid, nsw, re78, married, hisp, black, u75, terc_75)[[2]]
Where $\Sigma$ is the variance-covariance matrix.
Bias-corrected (Abadie-Imbens) $$ \hat { \tau } _ { A T T } = \frac { 1 } { N _ { 1 } } \sum _ { D = 1 } \left( Y _ { i } - Y _ { j ( i ) } \right) - \left( \hat { \mu } _ { 0 } \left( X _ { i } \right) - \hat { \mu } _ { 0 } \left( X _ { j ( i ) } \right) \right) $$
Where $\mu _ { 0 } ( x ) = E [ Y | X = x , D = 0 ]$ is the population regression function under the control.
m1 = Match(Y = nsw_psid$re78, Tr = nsw_psid$nsw,
X = nsw_psid %>% dplyr::select(age, educ, black, hisp,
married, re74, re75, u74, u75),
BiasAdjust = T, estimand = 'ATT'
)
summary(m1)
Under unconfoundedness,
$$ \mathbf { E } [ Y | D = 1 , \pi ( X ) = \overline { \pi } ] - \mathbf { E } [ Y | D = 0 , \pi ( X ) = \overline { \pi } ] = \mathbf { E } \left[ Y _ { 1 } - Y _ { 0 } | \pi ( X ) = \overline { \pi } \right] $$Balancing property of propensity score $$ \operatorname { Pr } ( X | D = 1 , \pi ( X ) ) = \operatorname { Pr } ( X | D = 0 , \pi ( X ) ) $$
pscore_fml = as.formula(nsw ~ age + educ + black + hisp +
married + re74 + re75 + u74 + u75 )
pscore_fitter <- function(df=nsw_psid, fmla = pscore_fml){
pscore1 = glm(fmla, df, family=binomial(link='logit'))
data.frame(pscore = pscore1$fitted.values,
treat = as.factor(df$nsw))
}
pscores2 = pscore_fitter(nsw_exper)
p0 = ggplot(pscores2, aes(x = pscore, colour = treat)) +
geom_density() +
scale_colour_viridis_d() +
labs(title = 'Density of Propensity Score by Treatment Status')
p1 = ggplot(pscores2, aes(x = pscore, fill = treat, colour = treat)) +
geom_density(position='fill') +
scale_colour_viridis_d() +
scale_fill_viridis_d() +
labs(title = 'Distribution of P(Treatment Status | Pscore = p)',
caption = 'Experimental Sample'
)
multiplot(p0, p1)
pscore = pscore_fitter()$pscore
m4 = Match(Y = nsw_psid$re78, Tr = nsw_psid$nsw,
X = pscore,
BiasAdjust = T, estimand = 'ATT'
)
summary(m4)
where $\hat{\pi}(X_i)$ is the propensity score.
Additional Assumptions
1) Constant treatment effects
2) Outcomes linear in X
(2) fails - $\tau_{OLS}$ is Best Linear Approximation.
(1) fails - $\tau_{OLS}$ is conditional variance weighted average of underlying $\tau$s.
where
$$ w_k = \frac{Var(D|X=x_k) Pr(X=x_k)}{\sum_{r=1}^{m} Var(D|X=x_r) Pr(X = x_r) } $$$Var(D|X=x_k)$ is maximal when $Pr(D=1|X=X_k) = 1/2$ and is decreasing as this probability moves towards 0 and 1. In other words, $\tau_{OLS}$ weighs up groups where the size of the treated and untreated population are roughly equal, and weighs down groups with large imbalances in the size of these two groups.
$\tau_{OLS}$ is true effect IFF constant treatment effects holds.
regform = as.formula(re78 ~ nsw + age + educ + black + hisp +
married + re74 + re75 + u74 + u75 )
stargazer(lm(regform, nsw_psid), lm(regform, nsw_exper), type = exp_type)
Doubly Robust estimator:
$$ \widehat { \tau } _ { \mathrm { ATE } } = \frac { 1 } { n } \sum _ { i = 1 } ^ { n } \left( \frac { W _ { i } \left( Y _ { i } - \widehat { \mu } _ { 1 } \left( X _ { i } \right) \right) } { \widehat { \pi } \left( X _ { i } \right) } + \widehat { \mu } _ { 1 } \left( X _ { i } \right) \right) - \left( \frac { \left( 1 - W _ { i } \right) \left( Y _ { i } - \widehat { \mu } _ { 0 } \left( X _ { i } \right) \right) } { 1 - \widehat { \pi } \left( X _ { i } \right) } + \widehat { \mu } _ { 0 } \left( X _ { i } \right) \right) $$Consistent estimation of $\pi(.)$ or $\mu_1(.)$ and $\mu_0(.)$ required for validity.
Change of notation - $w$ is treatment, $\hat{e}(x)$ is the propensity score; standard errors are almost always bootstrapped
data = read_table(paste0(data, 'lalonde_exper.txt'), col_names = c('t', 're78', 'education', 're74', 're75'))
boot.extract <- function(obj){
point_est = obj[[1]]
se = sd(obj[[2]])
return(c(point_est, se))
}
# Regression function for boot
lm.boot <- function(formula, data, indices){
fit = lm(formula, data=data[indices,])
return(coef(fit))
}
lm.r <- function(data, formula){
tmp = lm(formula, data)
out = coeftest(tmp, vcov = vcovHC(tmp, "HC1"))
return(out)
}
# 1. difference in means
diff_fit = lm.r(data, re78 ~ t)
diff_boot_output = boot(data, lm.boot, 1000, formula=re78~t)
diff = c(diff_fit[2,1], diff_fit[2,2])
diff_boot = c(diff_boot_output[[1]][2], sd(diff_boot_output[[2]][,2]))
diff_boot
# Covariate adjustment function for boot
cov.adj.boot <- function(data, treat, outcome, cov, indices) {
d = data[indices,]
formula = formula(paste0(outcome, "~", cov))
mu0 = d[d[,treat]==0,] %>%
lm(formula, .) %>%
predict(d)
mu1 = d[d[,treat]==1,] %>%
lm(formula, .) %>%
predict(d)
return(mean(mu1-mu0))
}
# 2 Estimate the effect using covariate adjustment
boot(data, cov.adj.boot, 1000,
treat="t", outcome="re78", cov="re75") %>%
boot.extract()
# Inverse propensity score weighting function for boot
ipw.boot <- function(data, outcome, treat, cov, indices) {
d = as.data.frame(data[indices,])
formula = formula(paste0(treat, "~", cov))
fit_wt = glm(formula, binomial(link="logit"), d)
prob_wt = predict(fit_wt, type="response")
w = d[, treat]
y = d[, outcome]
out = mean(w*y/prob_wt - (1-w)*y/(1-prob_wt))
return(out)
}
# 3 Estimate the effect using inverse propensity score weighting
boot(data, ipw.boot, 1000,
outcome="re78", treat="t", cov="re75") %>%
boot.extract()
# Doubly-robust estimator function for boot
dr.boot <- function(data, outcome, treat, cov, indices) {
d = as.data.frame(data[indices,])
formula_y = formula(paste0(outcome, "~", cov))
formula_w = formula(paste0(treat, "~", cov))
mu0 = d[d[,treat]==0,] %>%
lm(formula_y, .) %>%
predict(d)
mu1 = d[d[,treat]==1,] %>%
lm(formula_y, .) %>%
predict(d)
fit_wt = glm(formula_w, binomial(link="logit"), d)
prob_wt = predict(fit_wt, type="response")
w = d[, treat]
y = d[, outcome]
out = mean(w*(y-mu1)/prob_wt - (1-w)*(y-mu0)/(1-prob_wt) +
mu1 - mu0)
return(out)
}
# 4 Estimate the effect using a doubly-robust estimator
(dr = boot(data, dr.boot, 1000,
outcome="re78", treat="t", cov="re75") %>%
boot.extract())
Score function:
$$ \psi ( y , w , x , \tau ) = \frac { w y } { e ( x ) } - \frac { ( 1 - w ) y } { 1 - e ( x ) } - \tau - \left( \frac { \mu _ { 1 } ( x ) } { e ( x ) } + \frac { \mu _ { 0 } ( x ) } { 1 - e ( x ) } \right) ( w - e ( x ) ) $$Efficiency Bound:
$$ \mathbb { V } \geq \mathbb { E } \left[ \psi ( Y , W , X , \tau ) ^ { 2 } \right] $$Where RHS is $$ = \mathbb { E } \left( \frac { \sigma _ { 1 } ^ { 2 } ( x ) } { e \left( X _ { i } \right) } + \frac { \sigma _ { 0 } ^ { 2 } ( x ) } { 1 - e \left( X _ { i } \right) } + \left( \mu _ { 1 } \left( x _ { i } \right) + \mu _ { 0 } \left( x _ { i } \right) - \tau \right) ^ { 2 } \right) $$
# A function for estimating the efficiency bound
estimate.eff.bound <- function(data, outcome, treat, cov, tau){
formula_y = formula(paste0(outcome, "~", cov))
formula_w = formula(paste0(treat, "~", cov))
fit0 = data[data[,treat]==0,] %>% lm(formula_y, .)
fit1 = data[data[,treat]==1,] %>% lm(formula_y, .)
mu0 = predict(fit0, data)
mu1 = predict(fit1, data)
var0 = (summary(fit0)$sigma)^2
var1 = (summary(fit1)$sigma)^2
prob_wt = predict(glm(formula_w, binomial(link="logit"), data), type="response")
est_eff_bound = sqrt(mean(var1/prob_wt + var0/(1-prob_wt) + (mu1 - mu0 - tau)^2)/nrow(data))
return(est_eff_bound)
}
estimate.eff.bound(data, "re78", "t", "re75", dr[1])
Assume $D$ is realised at $t = 0$ and is ignorable conditional on a set of $T$ lags of the outcome.
$$ Y _ { 1 } , Y _ { 0 } \perp D | Y _ { t = - 1 } , Y _ { t = - 2 } , \ldots , Y _ { t = - T } , X $$This implies we should have ignorability conditional on all lags but one
$$ Y _ { t = - 1 } \perp D | Y _ { t = - 2 } , \ldots , Y _ { t = - T } , X $$Estimate treatment effect on $Y_{t=-1}$, nonzero treatment effect would imply that unconfoundedness is implausible
$U$ is a nuisance parameter.
$$ Y_1, Y_0 \perp D|X, U $$Where $U \sim \mathcal{B}(\pi = 0.5)$, and $U \perp X$. $P(U=1) = P(U=0) = 0.5$.
Propensity score is Logistic:
$$ P(D = 1|X, U) = \frac{exp(X\theta + \gamma U)}{1 + exp(X\theta + \gamma U)} $$$\gamma$ indicates strength of relationship between $U$ and $D|X$.
Y is conditionally normal $$ Y | X, U \sim \mathcal{N}(\alpha D + X\beta + \delta U , \sigma^2) $$
$\delta$ indicates strength of relationship between $U$ and $Y|X$.
MLE setup
Construct grid of $(\gamma, \delta)$ and calculate the MLE for $\hat{\alpha}(\gamma, \delta)$ by maximising $\mathcal{l}(\alpha, \beta, \theta, \gamma, \delta)$ over $(\gamma, \delta)$.
Use 2 partial $R^2$s:
$R^2_{Y, par} (\delta)$: Residual variation in outcome explained by $U$ (after partialling out $X$).
$R^2_{D, par} (\gamma)$: Residual variation in treatment assignment explained by $U$ (after partialling out $X$).
Draw threshold contours, should expect most covariates to be clustered around origin.
Tuning parameter $\Gamma \geq 1$ that measures departure from zero hidden bias.
For any two observations $i$ and $j$ with identical covariate values $X_i = X_j$, under unconfoundedness, probability of assignment into treatment should be identical $\pi(X_i) = \pi(X_j)$
Treatment assignment probability may differ due to unobserved binary confounder $U$. We can bound this by the ratio:
$$ \frac{1}{\Gamma} \leq \frac{e_i (1-e_j)}{(1-e_i) e_j} \leq \Gamma $$$\gamma = 1 \implies $ No bias. $\Gamma =2 \implies$ $i$ is twice as likely to be treated than $j$ despite identical $x$.
How much does treatment effect move when controls are added?
Estimate model with and without controls:
$Y_i = \alpha^F D_i + X \beta + \epsilon$
$Y_i = \alpha^R D_i + \epsilon$
AET ratio:
$$ \rho = \frac{\hat{\alpha}^F}{\hat{\alpha}^R - \hat{\alpha}^F} $$Want $\rho$ to be as big as possible (i.e. $\hat{\alpha}^R - \hat{\alpha}^F} \to 0$ under unconfoundedness).