class: center, middle, inverse, title-slide

.title[
# DoubleML: Machine learning for causal inference
]
.author[
### Apoorva Lal
]
.date[
### 02 June 2022
]

---

<!--- For HTML renders - selection from math_shortcuts.tex --->

`\(\DeclareMathOperator*{\argmin}{argmin}\)`
`\(\newcommand{\defeq}{:=}\)`
`\(\newcommand{\eqdef}{=:}\)`
`\(\newcommand{\var}{\mathrm{Var}}\)`
`\(\newcommand{\epsi}{\varepsilon}\)`
`\(\newcommand{\phii}{\varphi}\)`
`\(\newcommand\Bigpar[1]{\left( #1 \right )}\)`
`\(\newcommand\Bigbr[1]{\left[ #1 \right ]}\)`
`\(\newcommand\Bigcr[1]{\left\{ #1 \right \}}\)`
`\(\newcommand\SetB[1]{\left\{ #1 \right\}}\)`
`\(\newcommand\Sett[1]{\mathcal{#1}}\)`
`\(\newcommand{\Data}{\mathcal{O}}\)`
`\(\newcommand{\ooN}{\frac{1}{n}}\)`
`\(\newcommand{\Ubr}[2]{\underbrace{#1}_{\text{#2}}}\)`
`\(\newcommand{\Obr}[2]{ \overbrace{#1}^{\text{#2}}}\)`
`\(\newcommand{\sumiN}{\sum_{i=1}^N}\)`
`\(\newcommand{\dydx}[2]{\frac{\partial #1}{\partial #2}}\)`
`\(\newcommand\Indic[1]{\mathds{1}_{#1}}\)`
`\(\newcommand{\Realm}[1]{\mathbb{R}^{#1}}\)`
`\(\newcommand{\Exp}[1]{\mathbb{E}\left[#1\right]}\)`
`\(\newcommand{\Expt}[2]{\mathbb{E}_{#1}\left[#2\right]}\)`
`\(\newcommand{\Var}[1]{\mathbb{V}\left[#1\right]}\)`
`\(\newcommand{\Covar}[1]{\text{Cov}\left[#1\right]}\)`
`\(\newcommand{\Prob}[1]{\mathbf{Pr}\left(#1\right)}\)`
`\(\newcommand{\Supp}[1]{\text{Supp}\left[#1\right]}\)`
`\(\newcommand{\doyx}{\Prob{Y \, |\,\mathsf{do} (X = x)}}\)`
`\(\newcommand{\doo}[1]{\Prob{Y |\,\mathsf{do} (#1) }}\)`
`\(\newcommand{\R}{\mathbb{R}}\)`
`\(\newcommand{\Z}{\mathbb{Z}}\)`
`\(\newcommand{\wh}[1]{\widehat{#1}} % Wide hat\)`
`\(\newcommand{\wt}[1]{\widetilde{#1}} % Wide tilde\)`
`\(\newcommand{\wb}[1]{\overline{#1}} % Wide bar\)`
`\(\newcommand\Ol[1]{\overline{#1}}\)`
`\(\newcommand\Ul[1]{\underline{#1}}\)`
`\(\newcommand\Str[1]{#1^{*}}\)`
`\(\newcommand{\F}{\mathsf{F}}\)`
`\(\newcommand{\ff}{\mathsf{f}}\)`
`\(\newcommand{\Cdf}[1]{\mathbb{F}\left(#1\right)}\)`
`\(\newcommand{\Cdff}[2]{\mathbb{F}_{#1}\left(#2\right)}\)`
`\(\newcommand{\Pdf}[1]{\mathsf{f}\left(#1\right)}\)`
`\(\newcommand{\Pdff}[2]{\mathsf{f}_{#1}\left(#2\right)}\)`
`\(\newcommand{\dd}{\mathsf{d}}\)`
`\(\newcommand\Normal[1]{\mathcal{N} \left( #1 \right )}\)`
`\(\newcommand\Unif[1]{\mathsf{U} \left[ #1 \right ]}\)`
`\(\newcommand\Bern[1]{\mathsf{Bernoulli} \left( #1 \right )}\)`
`\(\newcommand\Binom[1]{\mathsf{Bin} \left( #1 \right )}\)`
`\(\newcommand\Pois[1]{\mathsf{Poi} \left( #1 \right )}\)`
`\(\newcommand\BetaD[1]{\mathsf{Beta} \left( #1 \right )}\)`
`\(\newcommand\Diri[1]{\mathsf{Dir} \left( #1 \right )}\)`
`\(\newcommand\Gdist[1]{\mathsf{Gamma} \left( #1 \right )}\)`
`\(\def\mbf#1{\mathbf{#1}}\)`
`\(\def\mrm#1{\mathrm{#1}}\)`
`\(\def\mbi#1{\boldsymbol{#1}}\)`
`\(\def\ve#1{\mbi{#1}} % Vector notation\)`
`\(\def\vee#1{\mathbf{#1}} % Vector notation\)`
`\(\newcommand{\Mat}[1]{\mathbf{#1}}\)`
`\(\newcommand{\abs}[1]{\left\vert {#1} \right\vert}\)`
`\(\newcommand{\eucN}[1]{\norm{#1}}\)`
`\(\newcommand{\norm}[1]{\left\Vert{#1} \right\Vert}\)`
`\(\newcommand{\lzero}[1]{\norm{#1}_0}\)`
`\(\newcommand{\lone}[1]{\norm{#1}_1}\)`
`\(\newcommand{\ltwo}[1]{\norm{#1}_2}\)`
`\(\newcommand{\pnorm}[1]{\norm{#1}_p}\)`

<style type="text/css">
.remark-slide-content {
  padding-top: 1px;
  padding-left: 10px;
  padding-right: 10px;
  padding-bottom: 5px;
  font-size: 28px
}
.remark-slide-content > h1:first-of-type {
  margin-top: 0px;
}
.scroll-output {
  height: 90%;
  overflow-y: scroll;
}
</style>

---

class: inverse, middle

# Uses of Machine Learning for causal inference

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>
---

# Notation Refresher

+ Data `\(\SetB{D_i, Y_i, \vee{X}_i}_{i=1}^N \in \SetB{0,1} \times \R \times \Sett{X}\)`
+ Interested in estimating the treatment effect of `\(D\)` on `\(Y\)`
  * ATE: `\(\tau = \Exp{Y_i(1) - Y_i(0)}\)`

--

## Selection on Observables Assumptions

* Unconfoundedness: `\(Y_i(1), Y_i(0) \perp D_i | \vee{X}_i\)`
* Overlap: `\(0 < \Prob{D_i = 1 | \vee{X}_i = \vee{x}} < 1\)`

---

# ATE Estimators using CEFs

+ ML is excellent at fitting conditional expectation functions (CEFs)
+ For causal inference, we use the following CEFs repeatedly
  * Conditional mean: `\(\mu_{(d)} (\vee{x}) \defeq \Exp{Y_i(d) | \vee{X}_i = \vee{x}}\)`
  * Propensity score: `\(e(\vee{x}) \defeq \Prob{D_i = 1 | \vee{X}_i = \vee{x}}\)`

.pull-left[
## Inverse Propensity Weighting

$$ \tau = \Exp{\frac{D_i Y_i}{e(\vee{X}_i)} - \frac{(1-D_i) Y_i}{1 - e(\vee{X}_i)}} $$

Plug-in principle: replace `\(e\)` with `\(\wh{e}\)` and take the sample average
]

.pull-right[
## Conditional Mean Estimation

`\begin{align*}
\tau(\vee{x}) & = \Exp{Y_i(1) - Y_i(0) | \vee{X}_i = \vee{x}} \\
& = \mu_{(1)}(\vee{x}) - \mu_{(0)}(\vee{x}) \\
\wh{\tau} &= \ooN \sumiN (\wh{\mu}_{(1)}(\vee{X}_i) - \wh{\mu}_{(0)}(\vee{X}_i))
\end{align*}`
]

---

# Combining the two approaches

+ Augmented IPW / 'Double Robust' estimator (Robins, Rotnitzky, and Zhao 1994)

`\begin{align*}
\hat{\tau}_{AIPW}& =\frac{1}{n} \sum_{i=1}^{n}\left(\hat{\mu}_{(1)}\left(X_{i}\right)-\hat{\mu}_{(0)}\left(X_{i}\right)\right. \\
&\left.+D_{i} \frac{Y_{i}-\hat{\mu}_{(1)}\left(X_{i}\right)}{\hat{e}\left(X_{i}\right)}-\left(1-D_{i}\right) \frac{Y_{i}-\hat{\mu}_{(0)}\left(X_{i}\right)}{1-\hat{e}\left(X_{i}\right)}\right)
\end{align*}`

+ Double Robustness: consistent if either `\(\wh{\mu}_{(d)}(\vee{x})\)` OR `\(\wh{e}(\vee{x})\)` is consistent
+ Honesty principle: use 'cross-fitting' (split the sample, fit `\(\wh{\mu}\)` and `\(\wh{e}\)` on one half and predict on the other)

---

# Orthogonal Scores / Double Machine Learning

+ [Chernozhukov et al.](https://economics.mit.edu/files/12538)

`\begin{align*}
Y = \tau D + \mu(\vee{X}) + \epsi \; \; , \Exp{\epsi | D, \vee{X}} = 0 \\
D = e(\vee{X}) + \eta \; \; , \Exp{\eta | \vee{X}} = 0
\end{align*}`

+ Can use linear approximations for `\(\mu\)` and `\(e\)` [Double selection: Belloni, Chernozhukov & Hansen 2014]
+ Or use generic machine learners and estimate a residuals-on-residuals regression

$$ Y-\widehat{\mathbb{E}}[Y \mid \vee{X}]=\tau(D-\widehat{\mathbb{E}}[D \mid \vee{X}])+\nu $$

---

class: inverse, middle

# Application to Gerber, Green, Larimer (2008)

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

+ [Athey et al Tutorial](https://gsbdbi.github.io/ml_tutorial/ate_tutorial/ate_tutorial.html)

---

# Data Prep

.scroll-output[

```r
# If you need to, you can download the data from GitHub
if(!file.exists(here::here("data", "socialneighbor.csv"))){
  datafile <- "https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Social/ProcessedData/socialpresswgeooneperhh_NEIGH.csv"
  # get data file and load into memory
  data_raw <- read.csv(datafile)
  # create a data folder if not already available
  dir.create(here::here("data"), showWarnings = FALSE)
  # Save the data
  write.csv(data_raw, here::here("data", "socialneighbor.csv"))
}else{
  # Read in the data
  data_raw <- read.csv(here::here("data", "socialneighbor.csv"))
}

# These are the covariates we'll use
cts_variables_names <- c("yob", "hh_size", "totalpopulation_estimate",
                         "percent_male", "median_age", "percent_62yearsandover",
                         "percent_white", "percent_black", "percent_asian",
"median_income", "employ_20to64", "highschool", "bach_orhigher", "percent_hispanicorlatino") binary_variables_names <- c("sex","g2000", "g2002", "p2000", "p2002", "p2004") #Alternatively, implement means-encoding for the categorical variable 'city': city_means <- data_raw %>% group_by(city) %>% dplyr::select(cts_variables_names) %>% summarise_all(funs(mean)) ``` ``` ## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(cts_variables_names)` instead of `cts_variables_names` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. ## Adding missing grouping variables: `city` ``` ```r colnames(city_means)<-paste0("city_mean_", colnames(city_means)) data_raw<-inner_join(data_raw, city_means, by= c("city" = "city_mean_city")) cts_variables_names<-append(cts_variables_names, colnames(city_means)[-1]) covariates <- c(cts_variables_names, binary_variables_names) all_variables_names <- c(covariates, "outcome_voted", "treat_neighbors") # For this tutorial, we will not use all observations -- it would take too long to run all the methods below # Instead, keep 2000 controls and 1000 treated samples keep <- c(base::sample(which(data_raw$treat_neighbors == 0), 2000), base::sample(which(data_raw$treat_neighbors == 1), 1000) ) data_subset <- data_raw[keep,] %>% dplyr::select(all_variables_names) ``` ``` ## Note: Using an external vector in selections is ambiguous. ## ℹ Use `all_of(all_variables_names)` instead of `all_variables_names` to silence this message. ## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>. ## This message is displayed once per session. ``` ```r # Extracting and scaling continuous variables scaled_cts_covariates <- base::scale(data_subset[,cts_variables_names]) # Extracting indicator variables binary_covariates <- data_subset[,binary_variables_names] # Extracting outcome and treatment variables outcome <- data_subset$outcome_voted treatment <- data_subset$treat_neighbors # Setting up the data, renaming columns df <- data.frame(scaled_cts_covariates, binary_covariates, W = treatment, Y = outcome ) # Discard rows with na (if any) df <- stats::na.omit(df) ``` ] --- # Experimental ATE using Linear Regression ```r (tauhat_linear <- lm(Y~W, data=df)) ``` ``` ## ## Call: ## lm(formula = Y ~ W, data = df) ## ## Coefficients: ## (Intercept) W ## 0.2925 0.0855 ``` --- ## Adding Confounding ```r likely_voter = (((df$g2002 == 1) & (df$sex == 0)) | (df$p2004 == 1)) prob = 0.25 # proportion of voters to drop #drop proportion of likely voters from the treated group drop_from_treat <- base::sample(which(likely_voter & df$W == 1), round(prob * sum(likely_voter))) #drop proportion of non-likely voters from the control group drop_from_control <- base::sample(which(!likely_voter & df$W == 0), round(prob * sum(!likely_voter))) # Modified dataset df_mod <- df[-c(drop_from_treat, drop_from_control),] # The difference in means is now biased! 
(tauhat_conf_linear <- lm(Y~W, data=df_mod))
```

```
## 
## Call:
## lm(formula = Y ~ W, data = df_mod)
## 
## Coefficients:
## (Intercept)            W  
##      0.3018       0.0473
```

---

# Estimating the Pscore

```r
covariates_names <- names(df_mod[,!names(df_mod) %in% c("Y", "W")])

# Create a matrix of covariates other than the treatment (W) and outcome (Y)
Xmod = df_mod[,!names(df_mod) %in% c("Y", "W")]

# Create vectors for the outcome variable (Y) and the treatment (W)
Ymod = df_mod$Y
Wmod = df_mod$W

# Computing the propensity score by logistic regression of W on all the covariates except the outcome (Y).
# Declare formula for logistic regression
p_logistic.frmla <- as.formula(paste("W ~ ", paste(covariates_names, collapse = "+")))
p_logistic.fit <- glm(p_logistic.frmla, data = df_mod, family = "binomial")

# These are predicted probabilities of getting the treatment (W) as a function of the covariates (X)
# Add the predicted probabilities to our original data frame (df_mod)
df_mod$p_logistic <- predict(p_logistic.fit, type = "response")

# The following plot is the histogram of the propensity scores.
p <- ggplot(df_mod, aes(x = p_logistic, fill = as.factor(W))) +
  geom_histogram(alpha = 0.2, position = "identity", bins = 35) +
  geom_vline(aes(xintercept = 1), color = "red", linetype = "dashed", size = 1, alpha = 0.5) +
  geom_vline(aes(xintercept = 0), color = "red", linetype = "dashed", size = 1, alpha = 0.5) +
  labs(title = "", x = "Propensity Score", y = "Frequency") +
  ggthemes::theme_few() +
  scale_fill_colorblind(name = "", labels = c("Control", "Treatment")) +
  theme(legend.position = "top")
```

---

<img src="s9_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Calibration Plot for pscore

```r
log_calibration <- smooth.spline(df_mod$p_logistic, Wmod, df = 4)
log_calibration <- as.data.frame(cbind(log_calibration$x, log_calibration$y))

log_spline <- ggplot(log_calibration, aes(x = V1, y = V2)) +
  geom_point(color = "orange") +
  labs(title = "Logistic Calibration Check",
       x = "Fitted Treatment Assignment (propensity score)",
       y = "True Treatment Assignment (smoothed)") +
  geom_abline(intercept = 0) +
  xlim(c(0,1)) + ylim(c(0,1)) +
  ggthemes::theme_few()
```

---

<img src="s9_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# Conditional Mean Estimation

```r
libreq(estimatr)
```

```
##      wants      loaded
## [1,] "estimatr" TRUE
```

```r
lm_interact = lm_robust(Y ~ (.
- p_logistic) * W , data = df_mod) lm_interact %>% tidy %>% filter(term == "W") ``` ``` ## term estimate std.error statistic p.value conf.low conf.high df outcome ## 1 W 0.08971 0.07527 1.192 0.2335 -0.0579 0.2373 2180 Y ``` --- # IPW ```r signif.level <- qnorm(0.025, lower.tail = FALSE) ipw <- function(dataset, p) { W <- dataset$W Y <- dataset$Y G <- ((W - p) * Y) / (p * (1 - p)) tau.hat <- mean(G) se.hat <- sqrt(var(G) / (length(G) - 1)) c(ATE=tau.hat, lower_ci = tau.hat - signif.level * se.hat, upper_ci = tau.hat + signif.level * se.hat) } (tauhat_logistic_ipw <- ipw(df_mod, df_mod$p_logistic)) ``` ``` ## ATE lower_ci upper_ci ## 0.10919 0.02585 0.19254 ``` --- # AIPW ```r aipw_ols <- function(dataset, p) { dataset = dataset[,-which(names(dataset) %in% c("p_logistic"))] ols.fit = lm(Y ~ W * ., data = dataset) # predict on dataset as if everyone was treated dataset.treatall = copy(dataset); dataset.treatall$W = 1 μ1 = predict(ols.fit, dataset.treatall) # predict on dataset as if no one was treated dataset.treatnone = copy(dataset); dataset.treatnone$W = 0 μ0 = predict(ols.fit, dataset.treatnone) # predict on actual status actual_pred = predict(ols.fit, dataset) G <- μ1 - μ0+ ((dataset$W - p) * (dataset$Y - actual_pred)) / (p * (1 - p)) tau.hat <- mean(G) se.hat <- sqrt(var(G) / (length(G) - 1)) c(ATE=tau.hat, lower_ci = tau.hat - signif.level * se.hat, upper_ci = tau.hat + signif.level * se.hat) } (tauhat_lin_logistic_aipw <- aipw_ols(df_mod, df_mod$p_logistic)) ``` ``` ## ATE lower_ci upper_ci ## 0.07458 0.01020 0.13896 ``` --- # AIPW using forests ```r library(grf) cf <- causal_forest(Xmod, Ymod, Wmod, num.trees = 500) ate_cf_aipw <- average_treatment_effect(cf) ate_cf_aipw %>% summary ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0281 0.0447 0.0613 0.0613 0.0779 0.0945 ``` ```r p_rf <- as.data.frame(cf$W.hat) %>% rename(p_cf='cf$W.hat') (tauhat_ols_rf_aipw <- aipw_ols(df_mod, p_rf$p_cf)) ``` ``` ## ATE lower_ci upper_ci ## 0.08653 0.03350 0.13956 ``` --- class: inverse, center, middle name: DoubleML # DoubleML Implementation <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> + Use `DoubleML` (Bach et al 2021) in real applications [built on top of `mlr3`] --- # DoubleML 'by hand' .scroll-output[ ```r libreq(hdm, AER, randomForest, glmnet) ``` ``` ## wants loaded ## [1,] "hdm" TRUE ## [2,] "AER" TRUE ## [3,] "randomForest" TRUE ## [4,] "glmnet" TRUE ``` ```r data(GrowthData) # Barro-Lee growth data y= as.matrix(GrowthData[,1]) # outcome: growth rate d= as.matrix(GrowthData[,3]) # treatment: initial wealth x= as.matrix(GrowthData[,-c(1,2,3)]) # controls: country characteristics DML2.for.PLM <- function(x, d, y, dreg, yreg, nfold=2) { nobs <- nrow(x) #number of observations foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)] #define folds indices I <- split(1:nobs, foldid) #split observation indices into folds ytil <- dtil <- rep(NA, nobs) cat("fold: ") for(b in 1:length(I)){ dfit <- dreg(x[-I[[b]],], d[-I[[b]]]) #take a fold out yfit <- yreg(x[-I[[b]],], y[-I[[b]]]) # take a fold out dhat <- predict(dfit, x[I[[b]],], type="response") #predict the left-out fold yhat <- predict(yfit, x[I[[b]],], type="response") #predict the left-out fold dtil[I[[b]]] <- (d[I[[b]]] - dhat) #record residual for the left-out fold ytil[I[[b]]] <- (y[I[[b]]] - yhat) #record residual for the left-out fold cat(b," ") } rfit <- lm(ytil ~ dtil) #estimate the main parameter by regressing one residual on the other coef.est <- coef(rfit)[2] #extract 
coefficient se <- sqrt(vcovHC(rfit)[2,2]) #record robust standard error cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est , se)) #printing output return( list(coef.est =coef.est , se=se, dtil=dtil, ytil=ytil) ) #save output and residuals } ``` ] --- # DML with different learners ```r set.seed(1) dreg <- \(x,d) glmnet(x, d, lambda = 0) #ML method= OLS using glmnet; using lm gives bugs yreg <- \(x,y) glmnet(x, y, lambda = 0) #ML method = OLS DML2.OLS = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = 0.01013 (0.0167061) ``` ```r dreg <- \(x,d) rlasso(x,d, post=FALSE) #ML method= lasso from hdm yreg <- \(x,y) rlasso(x,y, post=FALSE) #ML method = lasso from hdm DML2.lasso = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = -0.0417523 (0.016092) ``` ```r dreg <- \(x,d) randomForest(x, d) #ML method=Forest yreg <- \(x,y) randomForest(x, y) #ML method=Forest DML2.RF = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10) ``` ``` ## fold: 1 2 3 4 5 6 7 8 9 10 ## coef (se) = -0.038251 (0.0149232) ```
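
---

# DML with the `DoubleML` package

+ The by-hand loop above is exactly what the `DoubleML` package automates: specify `mlr3` learners for the two nuisance functions and it handles cross-fitting and the orthogonal score
+ A minimal sketch for the same partially linear model on the Barro-Lee data (not run here); it assumes the `DoubleML`, `mlr3`, `mlr3learners`, and `ranger` packages are installed, and a recent `DoubleML` release (the outcome-learner argument is `ml_l`; older versions call it `ml_g`)

```r
libreq(DoubleML, mlr3, mlr3learners)

# wrap the matrices defined above into a DoubleML data backend
dml_data = double_ml_data_from_matrix(X = x, y = y, d = d)

# random-forest learners for the two nuisance regressions
ml_l = lrn("regr.ranger", num.trees = 500)  # E[Y | X]
ml_m = lrn("regr.ranger", num.trees = 500)  # E[D | X]

# partially linear regression model with 10-fold cross-fitting
# NB: requires DoubleML >= 0.5.0 for the ml_l argument name (older versions use ml_g)
dml_plr = DoubleMLPLR$new(dml_data, ml_l = ml_l, ml_m = ml_m, n_folds = 10)
dml_plr$fit()
dml_plr$summary()  # coefficient on d with a cross-fitted standard error
```

+ Swapping in other learners (e.g. `lrn("regr.cv_glmnet")` for the lasso) reproduces the comparison across ML methods above without rewriting the cross-fitting loop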