\(\DeclareMathOperator*{\argmin}{argmin}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\epsi}{\varepsilon}\) \(\newcommand{\phii}{\varphi}\) \(\newcommand{\tra}{^{\top}}\) \(\newcommand{\sumin}{\sum_{i=1}^n}\) \(\newcommand{\sumiN}{\sum_{i=1}^n}\) \(\newcommand{\norm}[1]{\left\Vert{#1} \right\Vert}\) \(\newcommand\Bigpar[1]{\left( #1 \right )}\) \(\newcommand\Bigbr[1]{\left[ #1 \right ]}\) \(\newcommand\Bigcr[1]{\left\{ #1 \right \}}\) \(\newcommand\SetB[1]{\left\{ #1 \right\}}\) \(\newcommand\Sett[1]{\mathcal{#1}}\) \(\newcommand{\Data}{\mathcal{D}}\) \(\newcommand{\Ubr}[2]{\underbrace{#1}_{\text{#2}}}\) \(\newcommand{\Obr}[2]{ \overbrace{#1}^{\text{#2}}}\) \(\newcommand{\sumiN}{\sum_{i=1}^N}\) \(\newcommand{\dydx}[2]{\frac{\partial #1}{\partial #2}}\) \(\newcommand\Indic[1]{\mathds{1}_{#1}}\) \(\newcommand{\Realm}[1]{\mathbb{R}^{#1}}\) \(\newcommand{\Exp}[1]{\mathbb{E}\left[#1\right]}\) \(\newcommand{\Expt}[2]{\mathbb{E}_{#1}\left[#2\right]}\) \(\newcommand{\Var}[1]{\mathbb{V}\left[#1\right]}\) \(\newcommand{\Covar}[1]{\text{Cov}\left[#1\right]}\) \(\newcommand{\Prob}[1]{\mathbf{Pr}\left(#1\right)}\) \(\newcommand{\Supp}[1]{\text{Supp}\left[#1\right]}\) \(\newcommand{\doyx}{\Prob{Y \, |\,\mathsf{do} (X = x)}}\) \(\newcommand{\doo}[1]{\Prob{Y |\,\mathsf{do} (#1) }}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\Z}{\mathbb{Z}}\) \(\newcommand{\wh}[1]{\widehat{#1}}\) \(\newcommand{\wt}[1]{\widetilde{#1}}\) \(\newcommand{\wb}[1]{\overline{#1}}\) \(\newcommand\Ol[1]{\overline{#1}}\) \(\newcommand\Ul[1]{\underline{#1}}\) \(\newcommand\Str[1]{#1^{*}}\) \(\newcommand{\F}{\mathsf{F}}\) \(\newcommand{\ff}{\mathsf{f}}\) \(\newcommand{\Cdf}[1]{\mathbb{F}\left(#1\right)}\) \(\newcommand{\Cdff}[2]{\mathbb{F}_{#1}\left(#2\right)}\) \(\newcommand{\Pdf}[1]{\mathsf{f}\left(#1\right)}\) \(\newcommand{\Pdff}[2]{\mathsf{f}_{#1}\left(#2\right)}\) \(\newcommand{\dd}{\mathsf{d}}\) \(\newcommand\Normal[1]{\mathcal{N} \left( #1 \right )}\) \(\newcommand\Unif[1]{\mathsf{U} \left[ #1 \right ]}\) \(\newcommand\Bern[1]{\mathsf{Bernoulli} \left( #1 \right )}\) \(\newcommand\Binom[1]{\mathsf{Bin} \left( #1 \right )}\) \(\newcommand\Pois[1]{\mathsf{Poi} \left( #1 \right )}\) \(\newcommand\BetaD[1]{\mathsf{Beta} \left( #1 \right )}\) \(\newcommand\Diri[1]{\mathsf{Dir} \left( #1 \right )}\) \(\newcommand\Gdist[1]{\mathsf{Gamma} \left( #1 \right )}\) \(\def\mbf#1{\mathbf{#1}}\) \(\def\mrm#1{\mathrm{#1}}\) \(\def\mbi#1{\boldsymbol{#1}}\) \(\def\ve#1{\mbi{#1}}\) \(\def\vee#1{\mathbf{#1}}\) \(\newcommand{\Mat}[1]{\mathbf{#1}}\) \(\newcommand{\eucN}[1]{\norm{#1}}\) \(\newcommand{\lzero}[1]{\norm{#1}_0}\) \(\newcommand{\lone}[1]{\norm{#1}_1}\) \(\newcommand{\ltwo}[1]{\norm{#1}_2}\) \(\newcommand{\pnorm}[1]{\norm{#1}_p}\)
This note extends the simulation study in Chang (2020) and re-implements the estimators considered there in efficient R code built on the glmnet library.
Throughout, we write \(Y^d(t)\) for the potential outcome under treatment status \(d \in \{0, 1\}\) at time \(t \in \{0, 1\}\), and \(Y(t)\) for the realised outcome. The estimand is the ATT in the second period, \(\Exp{Y^1(1) - Y^0(1) \mid D = 1}\).
The conditional parallel trends assumption is written as
\[ E\left[Y^{0}(1)-Y^{0}(0) \mid X, D=1\right]=E\left[Y^{0}(1)-Y^{0}(0) \mid X, D=0\right] \]
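Combined with no anticipation (so \(Y(0) = Y^{0}(0)\) for everyone) and the fact that untreated units reveal \(Y^{0}(1)\) in the second period, this identifies the missing counterfactual mean for the treated:
\[ E\left[Y^{0}(1) \mid X, D=1\right] = E\left[Y(0) \mid X, D=1\right] + \Bigpar{E\left[Y(1) \mid X, D=0\right]-E\left[Y(0) \mid X, D=0\right]} \]
so the conditional ATT is a difference in differences of conditional means, and averaging over the covariate distribution of the treated yields the ATT.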
Under (c)PT, the most well-known difference-in-differences estimator can be written, conditional on \(X\), as
\[\begin{aligned} \hat{\tau}^{\text{OM}} =\{ & \wh{\mathbb{E}}[Y(1) \mid X, D=1]-\wh{\mathbb{E}}[Y(1) \mid X, D=0]\} \\ - \{ & \wh{\mathbb{E}}[Y(0) \mid X, D=1]-\wh{\mathbb{E}}[Y(0) \mid X, D=0]\} \end{aligned}\]where the four conditional expectations on the RHS are estimated flexibly (here, by LASSO regressions in the relevant subsamples). This is implemented in omDID.
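Writing \(\hat{m}_{t,d}(\cdot)\) for the fitted LASSO regression of \(Y(t)\) on \(X\) in the \(D=d\) subsample (notation introduced here for exposition), omDID in the appendix averages the fitted differences over the sample:
\[ \hat{\tau}^{\text{OM}} = \frac{1}{N}\sum_{i=1}^{N} \Bigcr{ \Bigbr{\hat{m}_{1,1}(X_i) - \hat{m}_{1,0}(X_i)} - \Bigbr{\hat{m}_{0,1}(X_i) - \hat{m}_{0,0}(X_i)} } \]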
The Abadie (2005) semiparametric IPW estimator for the 2-period ATT is
\[ \hat{\tau}^{\text{IPW}} = \frac{1}{N} \sum_i \Bigpar{Y_i(1) - Y_i(0)} \frac{D_i - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}} \]
and is implemented in ipwDID.
This scheme down-weights \(Y(1) - Y(0)\) for untreated units with covariate values that are over-represented among the untreated (that is, with low \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\)) and, conversely, up-weights \(Y(1) - Y(0)\) for untreated units with covariate values that are under-represented among the untreated (that is, with high \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\)). In this way the same distribution of covariates is imposed on the treated and untreated groups.
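To see this, plug \(D_i = 0\) into the weight: an untreated observation enters with
\[ \frac{0 - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}} = -\frac{1}{P(D=1)} \cdot \frac{\hat{e}(X_i)}{1 - \hat{e}(X_i)} \]
so its (negatively signed) contribution is scaled by the odds \(\hat{e}(X_i)/(1 - \hat{e}(X_i))\): small when its covariates are common among the untreated, large when they are rare. Treated observations all receive the constant weight \(1/P(D=1)\).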
The Chang (2020) double-robust AIPW estimator is
\[ \hat{\tau}^{\text{AIPW}} = \frac{1}{N} \sum_i \Bigpar{Y_i(1) - Y_i(0) - \hat{\ell}(X_i)} \cdot \Bigbr{\frac{D_i - \hat{e}(X_i)}{P(D = 1)\, \Bigpar{1- \hat{e}(X_i)}}} \]
where \(\hat{e}(\cdot)\) is the propensity score and \(\hat{\ell}(\cdot)\) is an outcome model for \(Y(1) - Y(0)\) regressed on covariates in the untreated subsample (\(D = 0\)). Relative to the IPW estimator, the AIPW estimator contains the additional ‘debiasing’ term \(\hat{\ell}(\cdot)\), which subtracts out the projected change in the outcome for the untreated group (it plays the role of the second difference in the conventional DiD estimator).
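Expanding the product makes the relation to the IPW estimator explicit:
\[ \hat{\tau}^{\text{AIPW}} = \hat{\tau}^{\text{IPW}} - \frac{1}{N} \sum_i \frac{D_i - \hat{e}(X_i)}{P(D=1)\, \Bigpar{1 - \hat{e}(X_i)}}\, \hat{\ell}(X_i) \]
The correction term has mean zero in the population when the propensity score is correctly specified, while the outcome model protects against misspecification of \(\hat{e}(\cdot)\); this is the usual double-robustness property.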
This is implemented in aipwDID.
The nuisance functions \(\hat{e}(\cdot)\) and \(\hat{\ell}(\cdot)\) are estimated with the LASSO (glmnet), with optional internal cross-fitting.
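The cross-fitting does not refit the LASSO fold by fold; it reuses the prevalidated (out-of-fold) fits that cv.glmnet stores when called with keep = TRUE, which fitGet in the appendix extracts. A minimal illustration on toy data (the variable names here are purely illustrative):
# %% illustration: prevalidated fits from cv.glmnet
library(glmnet)
set.seed(1)
n = 200; p = 50
X = matrix(rnorm(n * p), n, p)
y = X[, 1] + rnorm(n)
# keep = TRUE stores fit.preval: for each lambda on the path, the fitted value
# for observation i is computed with i's fold held out
cvfit = glmnet::cv.glmnet(X, y, alpha = 1, keep = TRUE)
dim(cvfit$fit.preval)  # N rows, one column per lambda on the path
# fitGet() below extracts the column matching lambda.min
# (after dropping lambdas that some folds did not reach)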
In the simulated DGP (following Chang, 2020), \(N=200\) is the sample size and \(p=100\) is the dimension of the control variables, with \(X_{i} \sim N\left(0, I_{p \times p}\right)\).
Let \(\gamma_{0}=(1,1 / 2,1 / 3,1 / 4, 1 / 5,0, \ldots, 0) \in \mathbb{R}^{p}\), and let \(D_{i}\) be generated by the propensity score \(e := P(D=1 \mid X)=\frac{1}{1+\exp \left(-X^{\prime} \gamma_{0}\right)}\). The propensity score is sparse: only the first five coefficients are nonzero.
Also, let the potential outcomes be \(Y_{i}^{0}(0)=X_{i}^{\prime} \beta_{0}+\varepsilon_{1}\), \(Y_{i}^{0}(1)=Y_{i}^{0}(0)+1+ \rho\, e + \varepsilon_{2}\), and \(Y_{i}^{1}(1)= \theta_{0}+Y_{i}^{0}(1)+\varepsilon_{3}\), where \(\beta_{0}=\gamma_{0}+0.5\), \(\theta_{0}=3\), and all error terms are independent \(N(0,0.1)\) draws. Since \(e\) enters the process for the untreated potential outcome in the second period \(Y_{i}^{0}(1)\), unconditional parallel trends is violated whenever \(\rho \neq 0\); conditional on \(X\), however, the trend \(1 + \rho\, e(X)\) is the same for treated and untreated units, so conditional parallel trends continues to hold and the covariate-adjusted estimators remain consistent.
Researchers observe \(\left\{Y_{i}(0), Y_{i}(1), D_{i}, X_{i}\right\}\) for \(i=1, \ldots, N\), where \(Y_{i}(0)=Y_{i}^{0}(0)\) and \(Y_{i}(1)= Y_{i}^{0}(1)\left(1-D_{i}\right)+Y_{i}^{1}(1) D_{i}\).
# %% # DGP
simdgp = \(N=200,
p=100, s=5, # covariate dimension and signal strength
theta=3, # true effect
ρ = 0 # no parallel trends violation by default
){
# sparse signal in pscore
γ=c(s:1,rep(0,(p-s)))/s
X = array(rnorm(N*p), dim=c(N,p))
e = expit(X %*% γ)
D = rbinom(N,1,e)
beta1 = γ+0.5
ε1 = rnorm(N,0,0.1); ε2 = rnorm(N,0,0.1); ε3 = rnorm(N,0,0.1) # independent errors, sd = 0.1
# potential outcomes
Y10=Y11=Y01=Y00=rep(NA, N)
Y00 = X %*% beta1 + ε1
Y10 = Y00 # no anticipation: treatment has no effect in period 1
Y01 = Y00 + 1 + ρ * e + ε2 # pscore enters outcome model
Y11 = theta + Y01 +ε3
# outcomes at two time periods
Y0 = Y00
Y1 = Y01*(1-D) + Y11*D
data.frame(Y1, Y0, D, X)
}
estDIDs = \(...){
df = simdgp(...)
res = c(
# naive difference in post
naive = with(df, mean(Y1[D==1]) - mean(Y1[D==0])),
# diff in diff
did = DID(df$D, df$Y1, df$Y0),
# outcome modeling
om = omDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# outcome modeling XF
omXF = omDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T),
# IPW
ipw = ipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# IPW XF
ipwXF = ipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T),
# aIPW
aipw = aipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = F),
# aIPW XF
aipwXF= aipwDID(df[,-(1:3)]|>as.matrix(), df$D, df$Y1, df$Y0, xfit = T)
)
return(res)
}
# trial run
estDIDs()
## naive did om omXF ipw ipwXF aipw aipwXF
## 5.932 3.001 3.009 2.717 3.160 3.032 2.982 2.978
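The replication code behind the next two sets of results is not reproduced here; a minimal sketch (the seed and number of replications are assumptions):
# %% Monte Carlo under parallel trends (ρ = 0, the default)
set.seed(42)
nsim = 500                         # assumed number of replications
ests = replicate(nsim, estDIDs())  # 8 x nsim matrix of estimates
rowMeans(ests) - 3                 # bias relative to the true ATT θ = 3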
With parallel trends satisfied (\(\rho = 0\)), difference in differences is clearly unbiased and has low MSE, as does AIPW. IPW has small bias when fit using sample splitting, but large bias otherwise.
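The second experiment switches the violation on; the value of \(\rho\) below is an assumption (any \(\rho \neq 0\) breaks unconditional parallel trends):
# %% Monte Carlo with parallel trends violated
ests2 = replicate(nsim, estDIDs(ρ = 2))  # assumed ρ = 2
rowMeans(ests2) - 3                      # bias relative to the true ATT θ = 3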
Now, with parallel trends violated, difference in differences is clearly biased. IPW has smaller bias when fit using sample splitting, but is still quite biased. AIPW remains unbiased and has low MSE.
par(mfrow = c(4, 2), mar = c(5,0,4,0))
distPlot(ests2[1,], "naive", lb = 2, ub = 4)
distPlot(ests2[2,], "did", lb = 2, ub = 4)
distPlot(ests2[3,], "om", lb = 2, ub = 4)
distPlot(ests2[4,], "omXF", lb = 2, ub = 4)
distPlot(ests2[5,], "ipw", lb = 2, ub = 4)
distPlot(ests2[6,], "ipwXF", lb = 2, ub = 4)
distPlot(ests2[7,], "aipw", lb = 2, ub = 4)
distPlot(ests2[8,], "aipwXF", lb = 2, ub = 4)
Abadie, A. (2005): “Semiparametric Difference-in-Differences Estimators,” The Review of Economic Studies, 72, 1–19.
Chang, N.-C. (2020): “Double/Debiased Machine Learning for Difference-in-Differences Models,” The Econometrics Journal, 23, 177–191.
To be wrapped into an R library at some future date
# %%
DID = function(D, Y1, Y0){
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
(mean(Y1[D==1]) - mean(Y0[D==1])) - (mean(Y1[D==0]) - mean(Y0[D==0]))
}
# %%
omDID = function(X, D, Y1, Y0, xfit = T){
#' outcome model for the ATT fit using LASSO
#' @param X NxK matrix of covariates for the outcome regressions
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Heckman et al (1997), Abadie(2005)
#' @export
d0 = which(D==0); d1 = which(D==1);
m1 = glmnet::cv.glmnet(y = Y1[d1], x = X[d1,], alpha=1, keep = TRUE)
m2 = glmnet::cv.glmnet(y = Y1[d0], x = X[d0,], alpha=1, keep = TRUE)
m3 = glmnet::cv.glmnet(y = Y0[d1], x = X[d1,], alpha=1, keep = TRUE)
m4 = glmnet::cv.glmnet(y = Y0[d0], x = X[d0,], alpha=1, keep = TRUE)
m1hat = predict(m1, newx = X, s = m1$lambda.min)
m2hat = predict(m2, newx = X, s = m2$lambda.min)
m3hat = predict(m3, newx = X, s = m3$lambda.min)
m4hat = predict(m4, newx = X, s = m4$lambda.min)
if(xfit) { # cross fit predictions for observations used to train model
m1hat[d1] = fitGet(m1)
m2hat[d0] = fitGet(m2)
m3hat[d1] = fitGet(m3)
m4hat[d0] = fitGet(m4)
}
mean( (m1hat - m2hat) - (m3hat - m4hat) )
}
# %%
ipwDID = function(X, D, Y1, Y0, xfit = T){
#' Abadie (2005) semiparametric IPW DiD estimator for the
#' ATT fit using LASSO
#' @param X NxK matrix of covariates for propensity score
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Abadie(2005)
#' @export
psfit = glmnet::cv.glmnet(X, D, family="binomial", alpha=1, keep = TRUE)
if(!xfit){
ehat = predict(psfit, newx = X, type = "response", s = psfit$lambda.min)
} else{ # cross fit predictions
ehat = expit(fitGet(psfit))
}
φ = (1/mean(D)) * (D-ehat)/(1-ehat)
# IPW estimate
mean( φ * (Y1-Y0) )
}
aipwDID = function(X, D, Y1, Y0, xfit = T){
#' Chang (2020) Double-Robust semiparametric difference in differences
#' estimator for the ATT fit using LASSO
#' @param X NxK matrix of covariates for propensity score
#' @param D N-vector of treatment assignments (treatment only applies in period 2)
#' @param Y1 N-vector of outcomes in the second period
#' @param Y0 N-vector of outcomes in the first period
#' @return ATT estimate
#' @references Abadie(2005), Chang(2020), Zhao and Sant'Anna (2021)
#' @export
# ps model
psfit = glmnet::cv.glmnet(X, D, family="binomial", alpha=1, keep = TRUE)
# trend model
index = which(D==0)
y = Y1[index]-Y0[index]
ofit = glmnet::cv.glmnet(X[index, ], y, alpha=1, keep = TRUE)
if(!xfit){
ehat = predict(psfit, newx = X, type = "response", s = psfit$lambda.min)
mhat = predict(ofit, newx = X, type = "response", s = ofit$lambda.min)
} else{ # cross fit predictions
ehat = expit(fitGet(psfit))
mhat = predict(ofit, newx = X, type = "response", s = ofit$lambda.min)
# fill in obs that model was trained on using oob predictions
mhat[index] = fitGet(ofit)
}
mean(
(Y1-Y0)/mean(D) * (D-ehat)/(1-ehat) - (D-ehat)/mean(D)/(1-ehat) * mhat
)
}
# %% internals
expit = \(x) exp(x)/(1+exp(x))
fitGet = \(m){
#' Get cross-fit (prevalidated) predictions from a cv.glmnet fit without refitting:
#' keep only lambdas computed in every fold, then pick the column for lambda.min
ok = !is.na(colSums(m$fit.preval))
m$fit.preval[, ok][, m$lambda[ok] == m$lambda.min]
}