class: center, middle, inverse, title-slide

.title[
# DoubleML: Machine learning for causal inference
]
.author[
### Apoorva Lal
]
.date[
### 02 June 2022
]

---

<!--- For HTML renders - selection from math_shortcuts.tex --->

`\(\DeclareMathOperator*{\argmin}{argmin}\)`
`\(\newcommand{\defeq}{:=}\)`
`\(\newcommand{\eqdef}{=:}\)`
`\(\newcommand{\var}{\mathrm{Var}}\)`
`\(\newcommand{\epsi}{\varepsilon}\)`
`\(\newcommand{\phii}{\varphi}\)`
`\(\newcommand\Bigpar[1]{\left( #1 \right )}\)`
`\(\newcommand\Bigbr[1]{\left[ #1 \right ]}\)`
`\(\newcommand\Bigcr[1]{\left\{ #1 \right \}}\)`
`\(\newcommand\SetB[1]{\left\{ #1 \right\}}\)`
`\(\newcommand\Sett[1]{\mathcal{#1}}\)`
`\(\newcommand{\Data}{\mathcal{O}}\)`
`\(\newcommand{\ooN}{\frac{1}{n}}\)`
`\(\newcommand{\Ubr}[2]{\underbrace{#1}_{\text{#2}}}\)`
`\(\newcommand{\Obr}[2]{ \overbrace{#1}^{\text{#2}}}\)`
`\(\newcommand{\sumiN}{\sum_{i=1}^N}\)`
`\(\newcommand{\dydx}[2]{\frac{\partial #1}{\partial #2}}\)`
`\(\newcommand\Indic[1]{\mathds{1}_{#1}}\)`
`\(\newcommand{\Realm}[1]{\mathbb{R}^{#1}}\)`
`\(\newcommand{\Exp}[1]{\mathbb{E}\left[#1\right]}\)`
`\(\newcommand{\Expt}[2]{\mathbb{E}_{#1}\left[#2\right]}\)`
`\(\newcommand{\Var}[1]{\mathbb{V}\left[#1\right]}\)`
`\(\newcommand{\Covar}[1]{\text{Cov}\left[#1\right]}\)`
`\(\newcommand{\Prob}[1]{\mathbf{Pr}\left(#1\right)}\)`
`\(\newcommand{\Supp}[1]{\text{Supp}\left[#1\right]}\)`
`\(\newcommand{\doyx}{\Prob{Y \, |\,\mathsf{do} (X = x)}}\)`
`\(\newcommand{\doo}[1]{\Prob{Y |\,\mathsf{do} (#1) }}\)`
`\(\newcommand{\R}{\mathbb{R}}\)`
`\(\newcommand{\Z}{\mathbb{Z}}\)`
`\(\newcommand{\wh}[1]{\widehat{#1}} % Wide hat\)`
`\(\newcommand{\wt}[1]{\widetilde{#1}} % Wide tilde\)`
`\(\newcommand{\wb}[1]{\overline{#1}} % Wide bar\)`
`\(\newcommand\Ol[1]{\overline{#1}}\)`
`\(\newcommand\Ul[1]{\underline{#1}}\)`
`\(\newcommand\Str[1]{#1^{*}}\)`
`\(\newcommand{\F}{\mathsf{F}}\)`
`\(\newcommand{\ff}{\mathsf{f}}\)`
`\(\newcommand{\Cdf}[1]{\mathbb{F}\left(#1\right)}\)`
`\(\newcommand{\Cdff}[2]{\mathbb{F}_{#1}\left(#2\right)}\)`
`\(\newcommand{\Pdf}[1]{\mathsf{f}\left(#1\right)}\)`
`\(\newcommand{\Pdff}[2]{\mathsf{f}_{#1}\left(#2\right)}\)`
`\(\newcommand{\dd}{\mathsf{d}}\)`
`\(\newcommand\Normal[1]{\mathcal{N} \left( #1 \right )}\)`
`\(\newcommand\Unif[1]{\mathsf{U} \left[ #1 \right ]}\)`
`\(\newcommand\Bern[1]{\mathsf{Bernoulli} \left( #1 \right )}\)`
`\(\newcommand\Binom[1]{\mathsf{Bin} \left( #1 \right )}\)`
`\(\newcommand\Pois[1]{\mathsf{Poi} \left( #1 \right )}\)`
`\(\newcommand\BetaD[1]{\mathsf{Beta} \left( #1 \right )}\)`
`\(\newcommand\Diri[1]{\mathsf{Dir} \left( #1 \right )}\)`
`\(\newcommand\Gdist[1]{\mathsf{Gamma} \left( #1 \right )}\)`
`\(\def\mbf#1{\mathbf{#1}}\)`
`\(\def\mrm#1{\mathrm{#1}}\)`
`\(\def\mbi#1{\boldsymbol{#1}}\)`
`\(\def\ve#1{\mbi{#1}} % Vector notation\)`
`\(\def\vee#1{\mathbf{#1}} % Vector notation\)`
`\(\newcommand{\Mat}[1]{\mathbf{#1}}\)`
`\(\newcommand{\abs}[1]{\left\vert {#1} \right\vert}\)`
`\(\newcommand{\eucN}[1]{\norm{#1}}\)`
`\(\newcommand{\norm}[1]{\left\Vert{#1} \right\Vert}\)`
`\(\newcommand{\lzero}[1]{\norm{#1}_0}\)`
`\(\newcommand{\lone}[1]{\norm{#1}_1}\)`
`\(\newcommand{\ltwo}[1]{\norm{#1}_2}\)`
`\(\newcommand{\pnorm}[1]{\norm{#1}_p}\)`

<style type="text/css">
.remark-slide-content {
  padding-top: 1px;
  padding-left: 10px;
  padding-right: 10px;
  padding-bottom: 5px;
  font-size: 28px
}
.remark-slide-content > h1:first-of-type {
  margin-top: 0px;
}
.scroll-output {
  height: 90%;
  overflow-y: scroll;
}
</style>

---

class: inverse, middle

# Uses of Machine Learning for causal inference

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>
---

# Notation Refresher

+ Data `\(\SetB{D_i, Y_i, \vee{X}_i}_{i=1}^N \in \SetB{0,1} \times \R \times \Sett{X}\)`
+ Interested in estimating the treatment effect of `\(D\)` on `\(Y\)`
  * ATE: `\(\tau = \Exp{Y_i(1) - Y_i(0)}\)`

--

## Selection on Observables Assumptions

* Unconfoundedness: `\(Y_i(1), Y_i(0) \perp D_i | \vee{X}_i\)`
* Overlap: `\(0 < \Prob{D_i = 1 | \vee{X}_i = \vee{x}} < 1\)`

---

# ATE Estimators using CEFs

+ ML is excellent at fitting conditional expectation functions (CEFs)
+ For causal inference, we use the following CEFs repeatedly
  * Conditional mean: `\(\mu_{(d)} (\vee{x}) \defeq \Exp{Y_i(d) | \vee{X}_i = \vee{x}}\)`
  * Propensity score: `\(e(\vee{x}) \defeq \Prob{D_i = 1 | \vee{X}_i = \vee{x}}\)`

.pull-left[
## Inverse Propensity Weighting

$$ \tau = \Exp{\frac{D_i Y_i}{e(\vee{X}_i)} - \frac{(1-D_i) Y_i}{1 - e(\vee{X}_i)}} $$

Plug-in principle: replace `\(e\)` with `\(\wh{e}\)` and take the sample average
]

.pull-right[
## Conditional Mean Estimation

`\begin{align*}
\tau(\vee{x}) & = \Exp{Y_i(1) - Y_i(0) | \vee{X}_i = \vee{x}} \\
& = \mu_{(1)}(\vee{x}) - \mu_{(0)}(\vee{x}) \\
\wh{\tau} &= \ooN \sumiN (\wh{\mu}_{(1)}(\vee{X}_i) - \wh{\mu}_{(0)}(\vee{X}_i))
\end{align*}`
]

---

# Combining the two approaches

+ Augmented IPW / 'Double Robust' estimator (Robins, Rotnitzky, and Zhao 1994)

`\begin{align*}
\hat{\tau}_{AIPW}& =\frac{1}{n} \sum_{i=1}^{n}\left(\hat{\mu}_{(1)}\left(X_{i}\right)-\hat{\mu}_{(0)}\left(X_{i}\right)\right. \\
&\left.+D_{i} \frac{Y_{i}-\hat{\mu}_{(1)}\left(X_{i}\right)}{\hat{e}\left(X_{i}\right)}-\left(1-D_{i}\right) \frac{Y_{i}-\hat{\mu}_{(0)}\left(X_{i}\right)}{1-\hat{e}\left(X_{i}\right)}\right)
\end{align*}`

+ Double Robustness: consistent if either `\(\wh{\mu}_{(d)}(\vee{x})\)` OR `\(\wh{e}(\vee{x})\)` is consistent
+ Honesty principle: use 'cross-fitting' (split the sample, fit `\(\wh{\mu}\)` and `\(\wh{e}\)` on one half and predict on the other)

---

# Orthogonal Scores / Double Machine Learning

+ [Chernozhukov et al.](https://economics.mit.edu/files/12538)

`\begin{align*}
Y = \tau D + \mu(\vee{X}) + \epsi \; \; , \Exp{\epsi | D, \vee{X}} = 0 \\
D = e(\vee{X}) + \eta \; \; , \Exp{\eta | \vee{X}} = 0
\end{align*}`

+ Can use linear approximations for `\(\mu\)` and `\(e\)` [Double selection: Belloni, Chernozhukov & Hansen 2014]
+ Or use generic machine learners and estimate a residuals-on-residuals regression

$$ Y-\widehat{\mathbb{E}}[Y \mid \vee{X}]=\tau(D-\widehat{\mathbb{E}}[D \mid \vee{X}])+\nu $$

---

class: inverse, middle

# Application to Gerber, Green, Larimer (2008)

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

+ [Athey et al Tutorial](https://gsbdbi.github.io/ml_tutorial/ate_tutorial/ate_tutorial.html)

---

# Data Prep

.scroll-output[

```r
# If you need to, you can download the data from GitHub
if(!file.exists(here::here("data", "socialneighbor.csv"))){
  datafile <- "https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Social/ProcessedData/socialpresswgeooneperhh_NEIGH.csv"
  # get data file and load into memory
  data_raw <- read.csv(datafile)
  # create a data folder if not already available
  dir.create(here::here("data"), showWarnings = FALSE)
  # Save the data
  write.csv(data_raw, here::here("data", "socialneighbor.csv"))
}else{
  # Read in the data
  data_raw <- read.csv(here::here("data", "socialneighbor.csv"))
}

# These are the covariates we'll use
cts_variables_names <- c("yob", "hh_size", "totalpopulation_estimate",
                         "percent_male", "median_age", "percent_62yearsandover",
                         "percent_white", "percent_black", "percent_asian",
"median_income", "employ_20to64", "highschool", "bach_orhigher", "percent_hispanicorlatino") binary_variables_names <- c("sex","g2000", "g2002", "p2000", "p2002", "p2004") #Alternatively, implement means-encoding for the categorical variable 'city': city_means <- data_raw %>% group_by(city) %>% dplyr::select(cts_variables_names) %>% summarise_all(funs(mean)) ``` ``` ## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(cts_variables_names)` instead of `cts_variables_names` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. ## Adding missing grouping variables: `city` ``` ```r colnames(city_means)<-paste0("city_mean_", colnames(city_means)) data_raw<-inner_join(data_raw, city_means, by= c("city" = "city_mean_city")) cts_variables_names<-append(cts_variables_names, colnames(city_means)[-1]) covariates <- c(cts_variables_names, binary_variables_names) all_variables_names <- c(covariates, "outcome_voted", "treat_neighbors") # For this tutorial, we will not use all observations -- it would take too long to run all the methods below # Instead, keep 2000 controls and 1000 treated samples keep <- c(base::sample(which(data_raw$treat_neighbors == 0), 2000), base::sample(which(data_raw$treat_neighbors == 1), 1000) ) data_subset <- data_raw[keep,] %>% dplyr::select(all_variables_names) ``` ``` ## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(all_variables_names)` instead of `all_variables_names` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. ``` ```r # Extracting and scaling continuous variables scaled_cts_covariates <- base::scale(data_subset[,cts_variables_names]) # Extracting indicator variables binary_covariates <- data_subset[,binary_variables_names] # Extracting outcome and treatment variables outcome <- data_subset$outcome_voted treatment <- data_subset$treat_neighbors # Setting up the data, renaming columns df <- data.frame(scaled_cts_covariates, binary_covariates, W = treatment, Y = outcome ) # Discard rows with na (if any) df <- stats::na.omit(df) ``` ] --- # Experimental ATE using Linear Regression ```r (tauhat_linear <- lm(Y~W, data=df)) ``` ``` ## ## Call: ## lm(formula = Y ~ W, data = df) ## ## Coefficients: ## (Intercept) W ## 0.2925 0.0855 ``` --- ## Adding Confounding ```r likely_voter = (((df$g2002 == 1) & (df$sex == 0)) | (df$p2004 == 1)) prob = 0.25 # proportion of voters to drop #drop proportion of likely voters from the treated group drop_from_treat <- base::sample(which(likely_voter & df$W == 1), round(prob * sum(likely_voter))) #drop proportion of non-likely voters from the control group drop_from_control <- base::sample(which(!likely_voter & df$W == 0), round(prob * sum(!likely_voter))) # Modified dataset df_mod <- df[-c(drop_from_treat, drop_from_control),] # The difference in means is now biased! 
(tauhat_conf_linear <- lm(Y~W, data=df_mod))
```

```
## 
## Call:
## lm(formula = Y ~ W, data = df_mod)
## 
## Coefficients:
## (Intercept)            W  
##      0.3018       0.0473
```

---

# Estimating the Pscore

```r
covariates_names <- names(df_mod[,!names(df_mod) %in% c("Y", "W")])

# Create a matrix of covariates other than the treatment (W) and outcome (Y)
Xmod = df_mod[,!names(df_mod) %in% c("Y", "W")]

# Create vectors for the outcome variable (Y) and the treatment (W)
Ymod = df_mod$Y
Wmod = df_mod$W

# Computing the propensity score by logistic regression of W on all the covariates except the outcome (Y).
# Declare formula for logistic regression
p_logistic.frmla <- as.formula(paste("W ~ ", paste(covariates_names, collapse = "+")))
p_logistic.fit <- glm(p_logistic.frmla, data = df_mod, family = "binomial")

# These are predicted probabilities of getting the treatment (W) as a function of the covariates (X)
# Add the predicted probabilities to our original data frame (df_mod)
df_mod$p_logistic <- predict(p_logistic.fit, type = "response")

# The following plot is the histogram of the propensity scores.
p <- ggplot(df_mod, aes(x = p_logistic, fill = as.factor(W))) +
  geom_histogram(alpha = 0.2, position = "identity", bins = 35) +
  geom_vline(aes(xintercept = 1), color = "red", linetype = "dashed", size = 1, alpha = 0.5) +
  geom_vline(aes(xintercept = 0), color = "red", linetype = "dashed", size = 1, alpha = 0.5) +
  labs(title = "", x = "Propensity Score", y = "Frequency") +
  ggthemes::theme_few() +
  scale_fill_colorblind(name = "", labels = c("Control", "Treatment")) +
  theme(legend.position = "top")
```

---

<img src="s9_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Calibration Plot for pscore

```r
log_calibration <- smooth.spline(df_mod$p_logistic, Wmod, df = 4)
log_calibration <- as.data.frame(cbind(log_calibration$x, log_calibration$y))

log_spline <- ggplot(log_calibration, aes(x = V1, y = V2)) +
  geom_point(color = "orange") +
  labs(title = "Logistic Calibration Check",
       x = "Fitted Treatment Assignment (propensity score)",
       y = "True Treatment Assignment (smoothed)") +
  geom_abline(intercept = 0) +
  xlim(c(0,1)) + ylim(c(0,1)) +
  ggthemes::theme_few()
```

---

<img src="s9_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# Conditional Mean Estimation

```r
libreq(estimatr)
```

```
##      wants      loaded
## [1,] "estimatr" TRUE
```

```r
lm_interact = lm_robust(Y ~ (.
- p_logistic) * W , data = df_mod) lm_interact %>% tidy %>% filter(term == "W") ``` ``` ## term estimate std.error statistic p.value conf.low conf.high df outcome ## 1 W 0.08971 0.07527 1.192 0.2335 -0.0579 0.2373 2180 Y ``` --- # IPW ```r signif.level <- qnorm(0.025, lower.tail = FALSE) ipw <- function(dataset, p) { W <- dataset$W Y <- dataset$Y G <- ((W - p) * Y) / (p * (1 - p)) tau.hat <- mean(G) se.hat <- sqrt(var(G) / (length(G) - 1)) c(ATE=tau.hat, lower_ci = tau.hat - signif.level * se.hat, upper_ci = tau.hat + signif.level * se.hat) } (tauhat_logistic_ipw <- ipw(df_mod, df_mod$p_logistic)) ``` ``` ## ATE lower_ci upper_ci ## 0.10919 0.02585 0.19254 ``` --- # AIPW ```r aipw_ols <- function(dataset, p) { dataset = dataset[,-which(names(dataset) %in% c("p_logistic"))] ols.fit = lm(Y ~ W * ., data = dataset) # predict on dataset as if everyone was treated dataset.treatall = copy(dataset); dataset.treatall$W = 1 μ1 = predict(ols.fit, dataset.treatall) # predict on dataset as if no one was treated dataset.treatnone = copy(dataset); dataset.treatnone$W = 0 μ0 = predict(ols.fit, dataset.treatnone) # predict on actual status actual_pred = predict(ols.fit, dataset) G <- μ1 - μ0+ ((dataset$W - p) * (dataset$Y - actual_pred)) / (p * (1 - p)) tau.hat <- mean(G) se.hat <- sqrt(var(G) / (length(G) - 1)) c(ATE=tau.hat, lower_ci = tau.hat - signif.level * se.hat, upper_ci = tau.hat + signif.level * se.hat) } (tauhat_lin_logistic_aipw <- aipw_ols(df_mod, df_mod$p_logistic)) ``` ``` ## ATE lower_ci upper_ci ## 0.07458 0.01020 0.13896 ``` --- # AIPW using forests ```r library(grf) cf <- causal_forest(Xmod, Ymod, Wmod, num.trees = 500) ate_cf_aipw <- average_treatment_effect(cf) ate_cf_aipw %>% summary ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0281 0.0447 0.0613 0.0613 0.0779 0.0945 ``` ```r p_rf <- as.data.frame(cf$W.hat) %>% rename(p_cf='cf$W.hat') (tauhat_ols_rf_aipw <- aipw_ols(df_mod, p_rf$p_cf)) ``` ``` ## ATE lower_ci upper_ci ## 0.08653 0.03350 0.13956 ``` --- class: inverse, center, middle name: DoubleML # DoubleML Implementation <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> + Use `DoubleML` (Bach et al 2021) in real applications [built on top of `mlr3`] --- # DoubleML 'by hand' .scroll-output[ ```r libreq(hdm, AER, randomForest, glmnet) ``` ``` ## wants loaded ## [1,] "hdm" TRUE ## [2,] "AER" TRUE ## [3,] "randomForest" TRUE ## [4,] "glmnet" TRUE ``` ```r data(GrowthData) # Barro-Lee growth data y= as.matrix(GrowthData[,1]) # outcome: growth rate d= as.matrix(GrowthData[,3]) # treatment: initial wealth x= as.matrix(GrowthData[,-c(1,2,3)]) # controls: country characteristics DML2.for.PLM <- function(x, d, y, dreg, yreg, nfold=2) { nobs <- nrow(x) #number of observations foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)] #define folds indices I <- split(1:nobs, foldid) #split observation indices into folds ytil <- dtil <- rep(NA, nobs) cat("fold: ") for(b in 1:length(I)){ dfit <- dreg(x[-I[[b]],], d[-I[[b]]]) #take a fold out yfit <- yreg(x[-I[[b]],], y[-I[[b]]]) # take a fold out dhat <- predict(dfit, x[I[[b]],], type="response") #predict the left-out fold yhat <- predict(yfit, x[I[[b]],], type="response") #predict the left-out fold dtil[I[[b]]] <- (d[I[[b]]] - dhat) #record residual for the left-out fold ytil[I[[b]]] <- (y[I[[b]]] - yhat) #record residual for the left-out fold cat(b," ") } rfit <- lm(ytil ~ dtil) #estimate the main parameter by regressing one residual on the other coef.est <- coef(rfit)[2] #extract 
coefficient se <- sqrt(vcovHC(rfit)[2,2]) #record robust standard error cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est , se)) #printing output return( list(coef.est =coef.est , se=se, dtil=dtil, ytil=ytil) ) #save output and residuals } ``` ] --- # DML with different learners ```r set.seed(1) dreg <- \(x,d) glmnet(x, d, lambda = 0) #ML method= OLS using glmnet; using lm gives bugs yreg <- \(x,y) glmnet(x, y, lambda = 0) #ML method = OLS DML2.OLS = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = 0.01013 (0.0167061) ``` ```r dreg <- \(x,d) rlasso(x,d, post=FALSE) #ML method= lasso from hdm yreg <- \(x,y) rlasso(x,y, post=FALSE) #ML method = lasso from hdm DML2.lasso = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = -0.0417523 (0.016092) ``` ```r dreg <- \(x,d) randomForest(x, d) #ML method=Forest yreg <- \(x,y) randomForest(x, y) #ML method=Forest DML2.RF = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = -0.038251 (0.0149232) ```
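
---

# DML with the `DoubleML` package

+ The by-hand loop above is exactly what the `DoubleML` package automates: specify `mlr3` learners for the two nuisance functions and it handles cross-fitting and the orthogonal score
+ A minimal sketch for the same partially linear model on the Barro-Lee data (not run here); it assumes the `DoubleML`, `mlr3`, `mlr3learners`, and `ranger` packages are installed, and a recent `DoubleML` release (the outcome-learner argument is `ml_l`; older versions call it `ml_g`)

```r
libreq(DoubleML, mlr3, mlr3learners)

# wrap the matrices defined above into a DoubleML data backend
dml_data = double_ml_data_from_matrix(X = x, y = y, d = d)

# random-forest learners for the two nuisance regressions
ml_l = lrn("regr.ranger", num.trees = 500)  # E[Y | X]
ml_m = lrn("regr.ranger", num.trees = 500)  # E[D | X]

# partially linear regression model with 10-fold cross-fitting
# NB: requires DoubleML >= 0.5.0 for the ml_l argument name (older versions use ml_g)
dml_plr = DoubleMLPLR$new(dml_data, ml_l = ml_l, ml_m = ml_m, n_folds = 10)
dml_plr$fit()
dml_plr$summary()  # coefficient on d with a cross-fitted standard error
```

+ Swapping in other learners (e.g. `lrn("regr.cv_glmnet")` for the lasso) reproduces the comparison across ML methods above without rewriting the cross-fitting loop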