\(\DeclareMathOperator*{\argmin}{argmin}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\epsi}{\varepsilon}\) \(\newcommand{\phii}{\varphi}\) \(\newcommand{\tra}{^{\top}}\) \(\newcommand{\sumin}{\sum_{i=1}^n}\) \(\newcommand{\sumiN}{\sum_{i=1}^n}\) \(\newcommand{\norm}[1]{\left\Vert{#1} \right\Vert}\) \(\newcommand\Bigpar[1]{\left( #1 \right )}\) \(\newcommand\Bigbr[1]{\left[ #1 \right ]}\) \(\newcommand\Bigcr[1]{\left\{ #1 \right \}}\) \(\newcommand\SetB[1]{\left\{ #1 \right\}}\) \(\newcommand\Sett[1]{\mathcal{#1}}\) \(\newcommand{\Data}{\mathcal{D}}\) \(\newcommand{\Ubr}[2]{\underbrace{#1}_{\text{#2}}}\) \(\newcommand{\Obr}[2]{ \overbrace{#1}^{\text{#2}}}\) \(\newcommand{\sumiN}{\sum_{i=1}^N}\) \(\newcommand{\dydx}[2]{\frac{\partial #1}{\partial #2}}\) \(\newcommand\Indic[1]{\mathds{1}_{#1}}\) \(\newcommand{\Realm}[1]{\mathbb{R}^{#1}}\) \(\newcommand{\Exp}[1]{\mathbb{E}\left[#1\right]}\) \(\newcommand{\Expt}[2]{\mathbb{E}_{#1}\left[#2\right]}\) \(\newcommand{\Var}[1]{\mathbb{V}\left[#1\right]}\) \(\newcommand{\Covar}[1]{\text{Cov}\left[#1\right]}\) \(\newcommand{\Prob}[1]{\mathbf{Pr}\left(#1\right)}\) \(\newcommand{\Supp}[1]{\text{Supp}\left[#1\right]}\) \(\newcommand{\doyx}{\Prob{Y \, |\,\mathsf{do} (X = x)}}\) \(\newcommand{\doo}[1]{\Prob{Y |\,\mathsf{do} (#1) }}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\Z}{\mathbb{Z}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\wb}[1]{\overline{#1}}\) \(\newcommand\Ol[1]{\overline{#1}}\) \(\newcommand\Ul[1]{\underline{#1}}\) \(\newcommand\Str[1]{#1^{*}}\) \(\newcommand{\F}{\mathsf{F}}\) \(\newcommand{\ff}{\mathsf{f}}\) \(\newcommand{\Cdf}[1]{\mathbb{F}\left(#1\right)}\) \(\newcommand{\Cdff}[2]{\mathbb{F}_{#1}\left(#2\right)}\) \(\newcommand{\Pdf}[1]{\mathsf{f}\left(#1\right)}\) \(\newcommand{\Pdff}[2]{\mathsf{f}_{#1}\left(#2\right)}\) \(\newcommand{\dd}{\mathsf{d}}\) \(\newcommand\Normal[1]{\mathcal{N} \left( #1 \right )}\) \(\newcommand\Unif[1]{\mathsf{U} \left[ #1 \right ]}\) \(\newcommand\Bern[1]{\mathsf{Bernoulli} \left( #1 \right )}\) \(\newcommand\Binom[1]{\mathsf{Bin} \left( #1 \right )}\) \(\newcommand\Pois[1]{\mathsf{Poi} \left( #1 \right )}\) \(\newcommand\BetaD[1]{\mathsf{Beta} \left( #1 \right )}\) \(\newcommand\Diri[1]{\mathsf{Dir} \left( #1 \right )}\) \(\newcommand\Gdist[1]{\mathsf{Gamma} \left( #1 \right )}\) \(\def\mbf#1{\mathbf{#1}}\) \(\def\mrm#1{\mathrm{#1}}\) \(\def\mbi#1{\boldsymbol{#1}}\) \(\def\ve#1{\mbi{#1}}\) \(\def\vee#1{\mathbf{#1}}\) \(\newcommand{\Mat}[1]{\mathbf{#1}}\) \(\newcommand{\eucN}[1]{\norm{#1}}\) \(\newcommand{\lzero}[1]{\norm{#1}_0}\) \(\newcommand{\lone}[1]{\norm{#1}_1}\) \(\newcommand{\ltwo}[1]{\norm{#1}_2}\) \(\newcommand{\pnorm}[1]{\norm{#1}_p}\)
This note extends the simulation study in Chang (2020) and re-implements the estimators considered there in efficient R code built on the glmnet library.
Throughout, we write \(Y^d(t)\) for the potential outcome under treatment status \(d \in \{0, 1\}\) at time \(t \in \{0, 1\}\), and \(Y(t)\) for the realised outcome. The estimand is the ATT in the second period, \(\Exp{Y^1(1) - Y^0(1) \mid D = 1}\).
The conditional parallel trends assumption is written as
\[ E\left[Y^{0}(1)-Y^{0}(0) \mid X, D=1\right]=E\left[Y^{0}(1)-Y^{0}(0) \mid X, D=0\right] \]
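Combined with no anticipation (so \(Y(0) = Y^{0}(0)\) for everyone) and the fact that untreated units reveal \(Y^{0}(1)\) in the second period, this identifies the missing counterfactual mean for the treated:
\[ E\left[Y^{0}(1) \mid X, D=1\right] = E\left[Y(0) \mid X, D=1\right] + \Bigpar{E\left[Y(1) \mid X, D=0\right]-E\left[Y(0) \mid X, D=0\right]} \]
so the conditional ATT is a difference in differences of conditional means, and averaging over the covariate distribution of the treated yields the ATT.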
Under (c)PT, the most well-known difference-in-differences estimator can be written, conditional on \(X\), as
\[\begin{aligned} \hat{\tau}^{\text{OM}} =\{ & \wh{\mathbb{E}}[Y(1) \mid X, D=1]-\wh{\mathbb{E}}[Y(1) \mid X, D=0]\} \\ - \{ & \wh{\mathbb{E}}[Y(0) \mid X, D=1]-\wh{\mathbb{E}}[Y(0) \mid X, D=0]\} \end{aligned}\]where the four conditional expectations on the RHS are estimated flexibly (here, by LASSO regressions in the relevant subsamples). This is implemented in omDID.
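Writing \(\hat{m}_{t,d}(\cdot)\) for the fitted LASSO regression of \(Y(t)\) on \(X\) in the \(D=d\) subsample (notation introduced here for exposition), omDID in the appendix averages the fitted differences over the sample:
\[ \hat{\tau}^{\text{OM}} = \frac{1}{N}\sum_{i=1}^{N} \Bigcr{ \Bigbr{\hat{m}_{1,1}(X_i) - \hat{m}_{1,0}(X_i)} - \Bigbr{\hat{m}_{0,1}(X_i) - \hat{m}_{0,0}(X_i)} } \]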
The Abadie (2005) semiparametric IPW estimator for the 2-period ATT is
\[ \hat{\tau}^{\text{IPW}} = \frac{1}{N} \sum_i \Bigpar{Y_i(1) - Y_i(0)} \frac{D_i - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}} \]
and is implemented in ipwDID.
This scheme down-weights \(Y(1) - Y(0)\) for untreated units with covariate values that are over-represented among the untreated (that is, with low \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\)) and, conversely, up-weights \(Y(1) - Y(0)\) for untreated units with covariate values that are under-represented among the untreated (that is, with high \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\)). In this way the same distribution of covariates is imposed on the treated and untreated groups.
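To see this, plug \(D_i = 0\) into the weight: an untreated observation enters with
\[ \frac{0 - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}} = -\frac{1}{P(D=1)} \cdot \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)} \]
so its (negatively signed) contribution is scaled by the odds \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\): small when its covariates are common among the untreated, large when they are rare. Treated observations all receive the constant weight \(1/P(D=1)\).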
The Chang (2020) double-robust AIPW estimator is
\[ \hat{\tau}^{\text{AIPW}} = \frac{1}{N} \sum_i \Bigpar{Y_i(1) - Y_i(0) - \hat{\ell}(X_i)} \cdot \Bigbr{\frac{D_i - \hat{e}(X_i)}{P(D = 1)\, \Bigpar{1- \hat{e}(X_i)}}} \]
where \(\hat{e}(\cdot)\) is the propensity score and \(\hat{\ell}(\cdot)\) is an outcome model for \(Y(1) - Y(0)\) regressed on covariates in the untreated subsample (\(D = 0\)). Relative to the IPW estimator, the AIPW estimator contains the additional ‘debiasing’ term \(\hat{\ell}(\cdot)\), which subtracts out the projected change in the outcome for the untreated group (it plays the role of the second difference in the conventional DiD estimator).
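Expanding the product makes the relation to the IPW estimator explicit:
\[ \hat{\tau}^{\text{AIPW}} = \hat{\tau}^{\text{IPW}} - \frac{1}{N} \sum_i \frac{D_i - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}}\, \hat{\ell}(X_i) \]
The correction term has mean zero in the population when the propensity score is correctly specified, while the outcome model protects against misspecification of \(\hat{e}(\cdot)\); this is the usual double-robustness property.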
This is implemented in aipwDID.
The nuisance functions \(\hat{e}(\cdot)\) and \(\hat{\ell}(\cdot)\) are estimated with the LASSO (glmnet), with optional internal cross-fitting.
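The cross-fitting does not refit the LASSO fold by fold; it reuses the prevalidated (out-of-fold) fits that cv.glmnet stores when called with keep = TRUE, which fitGet in the appendix extracts. A minimal illustration on toy data (the variable names here are purely illustrative):
# %% illustration: prevalidated fits from cv.glmnet
library(glmnet)
set.seed(1)
n = 200; p = 50
X = matrix(rnorm(n * p), n, p)
y = X[, 1] + rnorm(n)
# keep = TRUE stores fit.preval: for each lambda on the path, the fitted value
# for observation i is computed with i's fold held out
cvfit = glmnet::cv.glmnet(X, y, alpha = 1, keep = TRUE)
dim(cvfit$fit.preval)  # N rows, one column per lambda on the path
# fitGet() below extracts the column matching lambda.min
# (after dropping lambdas that some folds did not reach)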
In the simulated DGP (following Chang, 2020), \(N=200\) is the sample size and \(p=100\) is the dimension of the control variables, with \(X_{i} \sim N\left(0, I_{p \times p}\right)\).
Let \(\gamma_{0}=(1,1 / 2,1 / 3,1 / 4, 1 / 5,0, \ldots, 0) \in \mathbb{R}^{p}\), and let \(D_{i}\) be generated by the propensity score \(e := P(D=1 \mid X)=\frac{1}{1+\exp \left(-X^{\prime} \gamma_{0}\right)}\). The propensity score is sparse: only the first five coefficients are nonzero.
Also, let the potential outcomes be \(Y_{i}^{0}(0)=X_{i}^{\prime} \beta_{0}+\varepsilon_{1}\), \(Y_{i}^{0}(1)=Y_{i}^{0}(0)+1+ \rho\, e + \varepsilon_{2}\), and \(Y_{i}^{1}(1)= \theta_{0}+Y_{i}^{0}(1)+\varepsilon_{3}\), where \(\beta_{0}=\gamma_{0}+0.5\), \(\theta_{0}=3\), and all error terms are independent \(N(0,0.1)\) draws. Since \(e\) enters the process for the untreated potential outcome in the second period \(Y_{i}^{0}(1)\), unconditional parallel trends is violated whenever \(\rho \neq 0\); conditional on \(X\), however, the trend \(1 + \rho\, e(X)\) is the same for treated and untreated units, so conditional parallel trends continues to hold and the covariate-adjusted estimators remain consistent.
Researchers observe \(\left\{Y_{i}(0), Y_{i}(1), D_{i}, X_{i}\right\}\) for \(i=1, \ldots, N\), where \(Y_{i}(0)=Y_{i}^{0}(0)\) and \(Y_{i}(1)= Y_{i}^{0}(1)\left(1-D_{i}\right)+Y_{i}^{1}(1) D_{i}\).
# %% # DGP
simdgp = \(N=200,
p=100, s=5, # covariate dimension and signal strength
theta=3, # true effect
ρ = 0 # no parallel trends violation by default
){
# sparse signal in pscore
γ=c(s:1,rep(0,(p-s)))/s
X = array(rnorm(N*p), dim=c(N,p))
e = expit(X %*% γ)
D = rbinom(N,1,e)
beta1 = γ+0.5
ε1 = rnorm(N,0,0.1); ε2 = rnorm(N,0,0.1); ε3 = rnorm(N,0,0.1) # independent errors, sd = 0.1
# potential outcomes
Y10=Y11=Y01=Y00=rep(NA, N)
Y00 = X %*% beta1 + ε1
Y10 = Y00 # no anticipation: treatment has no effect in period 1
Y01 = Y00 + 1 + ρ * e + ε2 # pscore enters outcome model
Y11 = theta + Y01 +ε3
# outcomes at two time periods
Y0 = Y00
Y1 = Y01*(1-D) + Y11*D
data.frame(Y1, Y0, D, X)
}
estDIDs = \(...){
df = simdgp(...)
res = c(
# naive difference in post
naive = with(df, mean(Y1[D==1]) - mean(Y1[D==0])),
# diff in diff
did = DID(df$D, df$Y1, df$Y0),
# outcome modeling
om = omDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# outcome modeling XF
omXF = omDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T),
# IPW
ipw = ipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# IPW XF
ipwXF = ipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T),
# aIPW
aipw = aipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# aIPW XF
aipwXF= aipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T)
)
return(res)
}
# trial run
estDIDs()
## naive did om omXF ipw ipwXF aipw aipwXF
## 5.932 3.001 3.009 2.717 3.160 3.032 2.982 2.978
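The replication code behind the next two sets of results is not reproduced here; a minimal sketch (the seed and number of replications are assumptions):
# %% Monte Carlo under parallel trends (ρ = 0, the default)
set.seed(42)
nsim = 500                         # assumed number of replications
ests = replicate(nsim, estDIDs())  # 8 x nsim matrix of estimates
rowMeans(ests) - 3                 # bias relative to the true ATT θ = 3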
With parallel trends satisfied (\(\rho = 0\)), difference in differences is clearly unbiased and has low MSE, as does AIPW. IPW has small bias when fit using sample splitting, but large bias otherwise.
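The second experiment switches the violation on; the value of \(\rho\) below is an assumption (any \(\rho \neq 0\) breaks unconditional parallel trends):
# %% Monte Carlo with parallel trends violated
ests2 = replicate(nsim, estDIDs(ρ = 2))  # assumed ρ = 2
rowMeans(ests2) - 3                      # bias relative to the true ATT θ = 3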
Now, with parallel trends violated, difference in differences is clearly biased. IPW has smaller bias when fit using sample splitting, but is still quite biased. AIPW remains unbiased and has low MSE.
par(mfrow = c(4, 2), mar = c(5,0,4,0))
distPlot(ests2[1,], "naive", lb = 2, ub = 4)
distPlot(ests2[2,], "did", lb = 2, ub = 4)
distPlot(ests2[3,], "om", lb = 2, ub = 4)
distPlot(ests2[4,], "omXF", lb = 2, ub = 4)
distPlot(ests2[5,], "ipw", lb = 2, ub = 4)
distPlot(ests2[6,], "ipwXF", lb = 2, ub = 4)
distPlot(ests2[7,], "aipw", lb = 2, ub = 4)
distPlot(ests2[8,], "aipwXF", lb = 2, ub = 4)
Abadie, A. (2005): “Semiparametric Difference-in-Differences Estimators,” The Review of Economic Studies, 72, 1–19.
Chang, N.-C. (2020): “Double/Debiased Machine Learning for Difference-in-Differences Models,” The Econometrics Journal, 23, 177–191.
To be wrapped into an R library at some future date
# %%
DID = function(D, Y1, Y0){
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
(mean(Y1[D==1]) - mean(Y0[D==1])) - (mean(Y1[D==0]) - mean(Y0[D==0]))
}
# %%
omDID = function(X, D, Y1, Y0, xfit = T){
#' outcome model for the ATT fit using LASSO
#' @param X NxK matrix of covariates for the outcome regressions
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Heckman et al (1997), Abadie(2005)
#' @export
d0 = which(D==0); d1 = which(D==1);
m1 = glmnet::cv.glmnet(y = Y1[d1], x = X[d1,], alpha=1, keep = TRUE)
m2 = glmnet::cv.glmnet(y = Y1[d0], x = X[d0,], alpha=1, keep = TRUE)
m3 = glmnet::cv.glmnet(y = Y0[d1], x = X[d1,], alpha=1, keep = TRUE)
m4 = glmnet::cv.glmnet(y = Y0[d0], x = X[d0,], alpha=1, keep = TRUE)
m1hat = predict(m1, newx = X, s = m1$lambda.min)
m2hat = predict(m2, newx = X, s = m2$lambda.min)
m3hat = predict(m3, newx = X, s = m3$lambda.min)
m4hat = predict(m4, newx = X, s = m4$lambda.min)
if(xfit) { # cross fit predictions for observations used to train model
m1hat[d1] = fitGet(m1)
m2hat[d0] = fitGet(m2)
m3hat[d1] = fitGet(m3)
m4hat[d0] = fitGet(m4)
}
mean( (m1hat - m2hat) - (m3hat - m4hat) )
}
# %%
ipwDID = function(X, D, Y1, Y0, xfit = T){
#' Abadie (2005) semiparametric IPW DiD estimator for the
#' ATT fit using LASSO
#' @param X NxK matrix of covariates for propensity score
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Abadie(2005)
#' @export
psfit = glmnet::cv.glmnet(X, D, family="binomial", alpha=1, keep = TRUE)
if(!xfit){
ehat = predict(psfit, newx = X, type = "response", s = psfit$lambda.min)
} else{ # cross fit predictions
ehat = expit(fitGet(psfit))
}
φ = (1/mean(D)) * (D-ehat)/(1-ehat)
# IPW estimate
mean( φ * (Y1-Y0) )
}
aipwDID = function(X, D, Y1, Y0, xfit = T){
#' Chang (2020) Double-Robust semiparametric difference in differences
#' estimator for the ATT fit using LASSO
#' @param X NxK matrix of covariates for propensity score
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Abadie(2005), Chang(2020), Zhao and Sant'Anna (2021)
#' @export
# ps model
psfit = glmnet::cv.glmnet(X, D, family="binomial", alpha=1, keep = TRUE)
# trend model
index = which(D==0)
y = Y1[index]-Y0[index]
ofit = glmnet::cv.glmnet(X[index, ], y, alpha=1, keep = TRUE)
if(!xfit){
ehat = predict(psfit, newx = X, type = "response", s = psfit$lambda.min)
mhat = predict(ofit, newx = X, type = "response", s = ofit$lambda.min)
} else{ # cross fit predictions
ehat = expit(fitGet(psfit))
mhat = predict(ofit, newx = X, type = "response", s = ofit$lambda.min)
# fill in obs that model was trained on using oob predictions
mhat[index] = fitGet(ofit)
}
mean(
(Y1-Y0)/mean(D) * (D-ehat)/(1-ehat) - (D-ehat)/mean(D)/(1-ehat) * mhat
)
}
# %% internals
expit = \(x) exp(x)/(1+exp(x))
fitGet = \(m){
#' Get cross-fit (prevalidated) predictions from a cv.glmnet fit without refitting:
#' keep only lambdas computed in every fold, then pick the column for lambda.min
ok = !is.na(colSums(m$fit.preval))
m$fit.preval[, ok][, m$lambda[ok] == m$lambda.min]
}