1. Preliminaries

started off as Chapter-by-chapter notes for Angrist and Pischke (2008); based on material from Josh Gottlieb’s research design class and Thomas Lemieux’s applied econometrics class at UBC, added material from Cameron and Trivedi (2005)

Replication code adapted from this repository, UCLA IDRE Code Repo, and relevant R documentation and vignettes.

SessionInfo

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /opt/microsoft/ropen/3.5.0/lib64/R/lib/libRblas.so
## LAPACK: /opt/microsoft/ropen/3.5.0/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] splines   stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggthemes_3.5.0       VGAM_1.0-5           MASS_7.3-49         
##  [4] reshape2_1.4.3       systemfit_1.1-22     gridExtra_2.3       
##  [7] nnet_7.3-12          margins_0.3.23       quantreg_5.36       
## [10] SparseM_1.77         AER_1.2-5            sandwich_2.4-0      
## [13] lmtest_0.9-36        zoo_1.8-3            car_3.0-0           
## [16] carData_3.0-1        lubridate_1.7.4      Hmisc_4.1-1         
## [19] Formula_1.2-3        survival_2.41-3      lattice_0.20-35     
## [22] magrittr_1.5         rdrobust_0.99.4      lfe_2.8             
## [25] Matrix_1.2-14        stargazer_5.2.2      data.table_1.11.4   
## [28] forcats_0.3.0        stringr_1.3.1        dplyr_0.7.6         
## [31] purrr_0.2.5          readr_1.1.1          tidyr_0.8.1         
## [34] tibble_1.4.2         ggplot2_3.0.0        tidyverse_1.2.1     
## [37] rmarkdown_1.9        RevoUtils_11.0.0     LalRUtils_0.0.0.9000
## [40] RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137        RColorBrewer_1.1-2  httr_1.3.1         
##  [4] rprojroot_1.3-2     tools_3.5.0         backports_1.1.2    
##  [7] R6_2.2.2            rpart_4.1-13        lazyeval_0.2.1     
## [10] colorspace_1.3-2    withr_2.1.2         tidyselect_0.2.4   
## [13] curl_3.2            compiler_3.5.0      cli_1.0.0          
## [16] rvest_0.3.2         htmlTable_1.12      xml2_1.2.0         
## [19] scales_0.5.0        checkmate_1.8.5     digest_0.6.15      
## [22] foreign_0.8-70      rio_0.5.10          base64enc_0.1-3    
## [25] pkgconfig_2.0.1     htmltools_0.3.6     htmlwidgets_1.2    
## [28] rlang_0.2.1         readxl_1.1.0        rstudioapi_0.7     
## [31] bindr_0.1.1         jsonlite_1.5        acepack_1.4.1      
## [34] zip_1.0.0           Rcpp_0.12.17        munsell_0.5.0      
## [37] abind_1.4-5         prediction_0.3.6    stringi_1.2.3      
## [40] yaml_2.1.19         plyr_1.8.4          grid_3.5.0         
## [43] parallel_3.5.0      crayon_1.3.4        haven_1.1.1        
## [46] hms_0.4.2           knitr_1.20          pillar_1.3.0       
## [49] glue_1.2.0          evaluate_0.10.1     latticeExtra_0.6-28
## [52] modelr_0.1.2        MatrixModels_0.4-1  cellranger_1.1.0   
## [55] gtable_0.2.0        assertthat_0.2.0    openxlsx_4.1.0     
## [58] xtable_1.8-2        broom_0.5.0         bindrcpp_0.2.2     
## [61] cluster_2.0.7-1

2. The Experimental Ideal

Potential Outcomes (Rubin 1974, Holland 1986)

\(Y_i\) is the observed outcome, \(D_i\) is the treatment

\[ Y_i = \begin{cases} Y_{1i} \text{ if } D_i = 1 \\ Y_{0i} \text{ if } D_i = 0 \\ \end{cases} \]

Fundamental Problem of Causal Inference : We never see both potential outcomes for any given unit.

We observe

\[ Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) D_i \]

\[\begin{align*} \mathbb{E}[Y_i|D_i = 1] - \mathbb{E}[Y_i|D_i = 0] & = \mathbb{E}[Y_{1i}|D_i=1] - \mathbb{E}[Y_{0i}|D_i=1] + \mathbb{E}[Y_{0i}|D_i=1] - \mathbb{E}[Y_{0i}|D_i=0]\\ \text{observed effect} & = \text{Treatment on Treated} + \text{Selection Bias} \end{align*}\]

Randomisation

Solves the selection problem by making: \[Y_{0i} ⊥D_i\]

\[\begin{align*} Y_i & = \alpha + \rho X_i + \eta_i \\ \end{align*}\]

\(\alpha = \mathbb{E}[Y_{0i}]\)
\(\rho = (Y_{1i} - Y_{0i})\)
\(\eta_i = Y_{0i} - \mathbb{E}(Y_{0i})\)

Selection bias: \(Cov(D_i, \eta_i) \neq 0\)

Papers:

MTO - Chetty, Hendren, and Katz (2016)
OHE - Finkelstein et al. (2012)
oDesk Experiment - Pallais (2014)

Experimental Design - Treatment effects

Estimating equation is:

Treatment group: \(y_i = X^T \beta + \epsilon_i\) Expected value of treatment group: \(E(y|T) = x^T + E(\epsilon_i|T)\)

Control group: \(y_i = X^C \beta + \epsilon_i\) Expected value of control group: \(E(y|C) = x^C + E(\epsilon_i|C)\)

Because of randomisation, \(E(\epsilon_i|T) = E(\epsilon_i|C)\), which implies \[ E(y|T) - E(y|C) = (x^T - x^C)\beta \]

Causal Parameter of interest:

\[ \beta = \frac{E(y|T)-E(y|C)}{x^T-x^C} \]

For most experiments with A/B testing structure, \(x^T=1\) and \(x^C=0\), so the treatment effect collapses to a simple difference in means \(\beta = E(y|T)-E(y|C)\)

3. Regression Fundamentals

Conditional Expectation Function (CEF)

\[ \mathbb{E}[Y|X=x] = \int y f(y|x) dy \]

Law of Iterated Expectation \[\mathbb{E}[Y_i] = \mathbb{E}[\mathbb{E}[Y_i|X_i]] \]

#%%
# Download data and unzip the data - run once in working directory
# download.file('http://economics.mit.edu/files/397', 'asciiqob.zip')
# unzip('asciiqob.zip')
#%%
# Read the data into a dataframe
pums =        read.table('asciiqob.txt',
                          header           = FALSE,
                          stringsAsFactors = FALSE)
names(pums) <- c('lwklywge', 'educ', 'yob', 'qob', 'pob')
pums$age <- 80 - pums$yob
#%%
# Estimate OLS regression
reg.model <- lm(lwklywge ~ educ, data = pums)
#%%
# Calculate means by educ attainment and predicted values
pums.data.table <- data.table(pums)
educ.means      <- pums.data.table[
                , list(mean = mean(lwklywge)), by = educ]
educ.means$yhat <- predict(reg.model, educ.means)

CEF \(E(Wages|Education)\)

# Create plot
p <- ggplot(data = educ.means, aes(x = educ))       +
     geom_point(aes(y = mean))                      +
     geom_line(aes(y = mean))                       +
     geom_line(aes(y = yhat), linetype='dashed' )   +
     labs(title='CEF (Wages | Education)',
            y = "Log weekly earnings, $2003",
            x = "Years of completed education")
print(p)

stargazer(reg.model,type=exp_type)


	Dependent variable:

	lwklywge

educ	0.071^***
	(0.0003)

Constant	5.000^***
	(0.004)


Observations	329,509
R²	0.120
Adjusted R²	0.120
Residual Std. Error	0.640 (df = 329507)
F Statistic	43,783.000^*** (df = 1; 329507)

Note:	p<0.1; p<0.05; p<0.01

CEF Decomposition

\[ Y_i = \mathbb{E}(Y_i|X_i) + \epsilon_i \]

CEF Prediction Property

\[ \mathbb{E}[Y_i|X_i] = {arg min}_{m(x_i)} \mathbb{E}[(Y_i - m(X_i))^2] \]

ANOVA Theorem

\[ \mathbb{V}(Y_i) = \mathbb{V}(\mathbb{E}(Y_i|X_i)) + \mathbb{E}(\mathbb{V}(Y_I|X_I)) \]

Variance of Y = Variance of CEF (Explained Variance) + Variance of residual

anova(reg.model)

## Analysis of Variance Table
## 
## Response: lwklywge
##               Df Sum Sq Mean Sq F value Pr(>F)    
## educ           1  17809   17809   43783 <2e-16 ***
## Residuals 329507 134029       0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Anatomy / FWL

\[ \beta_k = \frac{Cov(Y_i, \tilde{x_{ki}})}{V(\tilde{x_{ki}})} \]

where \(\tilde{x_{ki}}\) is the residual from a regression of \(x_{ki}\) on all other covariates.

Justifications for Linear Regression:

Regression reveals conditional expectatiosn if CEF is linear
Regression reveals best linear predictor of \(Y_i|X_i\)
Regression reveals best linear predictor of \(E[Y_i|X_i]\)

Conditional Independence Assumption (CIA)

\[ Y_{si} \coprod s_i | X_i , \forall s \]

Conditional on \(X_i\), \(s_i\) can be said to be ‘as good as randomly assigned’. Not Testable

Omitted Variable Bias Formula

\[ \frac{Cov(Y_i,O_i)}{V(O_i)} = \rho + \gamma' \delta_{UO} \] Where the LHS is a vector of coefficients from the ‘short’ regression (regressing Y on \(O_i\)), \(O_i\) is a vector of observed predictors (included in the regression), and \(\rho\) is the corresponding vector of coefficients from regressing Y on \(O_i\) and \(U_i\) (long regression estimates). \(U_i\) is a vector of unobserved predictors, and \(\gamma\) is the corresponding set of coefficients from the long regression, and \(\delta_{UO}\) is the vector of coefficients from regressing \(U_i\) on \(O_i\). In words, the OVB formula says:

Coefficient in Short Regression = Coefficient in long regression + effect of omitted \(\times\) regression of omitted on included

lm0 = lm(lwklywge ~ educ+age, data = pums)
lm1 = lm(lwklywge ~ educ, data = pums)
lm2 = lm(lwklywge ~ age, data = pums)
lm3 = lm(educ ~ age, data = pums)

stargazer(lm0,lm1,lm2,lm3,type=exp_type)


	Dependent variable:

	lwklywge			educ
	(1)	(2)	(3)	(4)

educ	0.071^***	0.071^***
	(0.0003)	(0.0003)

age	0.005^***		0.001	-0.060^***
	(0.0004)		(0.0004)	(0.002)

Constant	4.800^***	5.000^***	5.900^***	15.000^***
	(0.018)	(0.004)	(0.019)	(0.089)


Observations	329,509	329,509	329,509	329,509
R²	0.120	0.120	0.00001	0.003
Adjusted R²	0.120	0.120	0.00000	0.003
Residual Std. Error	0.640 (df = 329506)	0.640 (df = 329507)	0.680 (df = 329507)	3.300 (df = 329507)
F Statistic	21,984.000^*** (df = 2; 329506)	43,783.000^*** (df = 1; 329507)	2.600 (df = 1; 329507)	927.000^*** (df = 1; 329507)

Note:	p<0.1; p<0.05; p<0.01

The coefficient for education does not change across the long regression (col 1) [0.07] and the short regression (col 2) [0.07] because \(\delta_{uo}\) is effectively zero (col 3) [0].

Bad Controls:

Controls that are themselves outcomes of the notional experiment (i.e. they might as well be a DV) - white-collar jobs in mincer regression
Proxy controls - controls that are affected by \(Y\) - late ability in mincer regression

Saturated Models

Saturated regression models are regression models with discrete explanatory variables s.t. the model includes a separate parameter for all possible values take on by the explanatory variables.

sat_reg = lm(lwklywge ~ factor(educ), data=pums)
pums.data.table <- data.table(pums)
educ.means      <- pums.data.table[
                , list(mean = mean(lwklywge)), by = educ]
educ.means$yhat <- predict(sat_reg, educ.means)

# Create plot
p <- ggplot(data = educ.means, aes(x = educ))       +
     geom_point(aes(y = mean))                      +
     geom_line(aes(y = mean))                       +
     geom_line(aes(y = yhat), linetype='dashed' )   +
     labs(title='CEF (Wages | Education)',
            y = "Log weekly earnings, $2003",
            x = "Years of completed education")
print(p)

\(\hat{y}\) and \(E[Y_i|X_i]\) are identical, since the saturated regression fits the CEF perfectly

Measurement error

Let \(y^*\) be the true value and \(u_y\) is an idiosyncratic error in measurement. Then, \(y = y^* + u_y\), which means \(y = x^*\beta+\epsilon + u_y\).

Measurement error in y does not affect the consistency of OLS estimates of \(\beta\), but it increases the variance of \(\hat{\beta}\) from \((X'X)^{-1} var(\epsilon)\) to \((X'X)^{-1}[var(\epsilon) + var(u_y)]\).

Measurement error in x results in a bias of OLS estimates

\[ plim(\hat{\beta}) = \frac{var(x^*)}{var(x^*) + var(u_x)} \beta \]

\(\hat{\beta}\) converges in probability to a fraction of the true \(\beta\). This is attenuation bias.

# add random noise to educ
pums %<>% mutate(educ_noisy = educ + rnorm(length(pums$educ), 0, 2))

lm1 = lm(lwklywge ~ educ, data=pums)
lm2 = lm(lwklywge ~ educ_noisy, data=pums)

stargazer(lm1, lm2, type=exp_type)


	Dependent variable:

	lwklywge
	(1)	(2)

educ	0.071^***
	(0.0003)

educ_noisy		0.052^***
		(0.0003)

Constant	5.000^***	5.200^***
	(0.004)	(0.004)


Observations	329,509	329,509
R²	0.120	0.086
Adjusted R²	0.120	0.086
Residual Std. Error (df = 329507)	0.640	0.650
F Statistic (df = 1; 329507)	43,783.000^***	31,140.000^***

Note:	p<0.1; p<0.05; p<0.01

p1 = ggplot(data = pums) +
  geom_point(mapping = aes(x=educ, y = lwklywge)) +
  stat_smooth(mapping = aes(x=educ, y = lwklywge),
    method = "lm", col = "red") +
  stat_smooth(mapping = aes(x=educ_noisy, y = lwklywge),
    method = "lm", col = "blue")
p1

3.5 Standard Errors

Asymptotic distribution of \(\hat{\beta}\) is

\[ \sqrt{N} (\hat{\beta} - \beta) \sim N(0, \Omega) \]

where \(\Omega\) is the asymptotic covariance matrix

\[ \Omega = E[X'X]^{-1} X' \epsilon \epsilon' X (X'X)^{-1} \]

When residuals are homoskedastic (spherical error variance - i.e. \(E[\epsilon \epsilon'] = \sigma^2 N\) ), the covariance matrix simplifies to \(\Omega = \sigma^2 (X'X)^{-1}\) , where \(\sigma^2 = E(\epsilon^2)\).

This may not be true:

residuals = resid(reg.model)
x_and_err = data.frame(pums$educ, residuals)
ggplot(data = x_and_err, aes(x=pums.educ,y=residuals)) +
  geom_point()

  # geom_smooth(method=lm)

But standard errors barely change in this case

lm1 = felm(lwklywge ~ educ, data = pums)
lm2 = robustify(felm(lwklywge ~ educ, data = pums))

stargazer(lm1, lm2, type=exp_type)


	Dependent variable:

	lwklywge
	(1)	(2)

educ	0.071^***	0.071^***
	(0.0003)	(0.0004)

Constant	5.000^***	5.000^***
	(0.004)	(0.005)


Observations	329,509	329,509
R²	0.120	0.120
Adjusted R²	0.120	0.120
Residual Std. Error (df = 329507)	0.640	0.640

Note:	p<0.1; p<0.05; p<0.01

Homoscedasticity mechanically fails in LPM since

\[ \epsilon_i = \begin{cases} -x_i \beta \text{ if } Y_i = 0 \\ 1 - x_i \beta \text{ if } Y_i = 1 \end{cases} \]

Then, \(Var(\epsilon_i | X_i) = x_i \beta ( 1- x_i \beta)\)

Let \((X'X) = E(X_i X_i') = Q_{xx}\). Then, standard errors are calculated as \(V(\hat{\beta}) = Q^{-1} \Omega Q^{-1}\) for the following \(\Omega\):

\[\begin{align*} \text{Classical} & = X' \frac{e'e}{N-K} X \text{ : Homoskedastic Standard Errors}\\ \text{HC0} & = X' diag[e_i^2] X \\ \text{HC1} & = X' \frac{N}{N-K} diag[e_i^2] X \\ \text{HC2} & = X' diag[\frac{e_i^2}{1-h_{ii}}] X \\ \text{HC3} & = X' diag[\frac{e_i^2}{(1-h_{ii})^2}] X \\ \end{align*}\]

where \(h_{ii} := x_i (X'X)^{-1} x_i'\) = leverage of observation \(i\), and \(e_i := y_i - x_i \hat{\beta}\) = empirical residuals

Clustered standard errors

Assume error structure of

\[ \epsilon_{ij} = \theta_j + u_{ij} \]

where \(\theta_j\) is a common shock for each subgroup \(j\) in the dataset. Variance covariance matrix is no longer diagonal but block- diagonal. \(var(\epsilon_{ij}) = var(\theta_j) + var(u_{ij})\); \(cov(\epsilon_{ij}, \epsilon_{i'j}) = var(\theta_j) \forall i, i' \in j\), (two observations in the same group) while \(cov(\epsilon_{ij}, \epsilon_{i'j'}) =0\) for (two observations not in the same group)

\[ Var(\hat{\beta}) = (\sum_{j=1}^J \sum_{i \in C_j} x_{ij}' x_{ij})^{-1} \sum_{j=1}^J (\sum_{i \in C_j} \sum_{i' \in C_j} (x_{ij}' \hat{\epsilon}_{ij}) (\hat{\epsilon}_{i'j}' x_{i'j} ) ) (\sum_{j=1}^J \sum_{i \in C_j} x_{ij}' x_{ij})^{-1} \]

where \(J\) is the number of clusters.

Bootstrapped standard errors

Compute \(\hat{\beta}(r)\) for each random sample (drawn with replacement from the overall sample), then calcuate the empirical standard deviation of \(\hat{\beta}(r)\)

\[ Var(\hat{\beta})^{boot} = \frac{1}{R} \sum_{r=1}^R (\hat{\beta}(r) - \hat{\bar{\beta}})^2 \]

Block bootstrap to preserve group structure.

TBD: Miller, Cameron, Gelbach double-cluster, Conley spatial HAC

4. Instrumental Variables

Endogeneity : \(Cov(X_i, \epsilon_i) \neq 0\)

Instrumental Variable requirements:

Relevance : \(Cov(X_i, Z_i) \neq 0\)
Exclusion Restriction: \(Cov(Z_i, \epsilon_i) = 0\)

IV estimation

\[ \beta_{IV} = \frac{Cov(Y_i,Z_i)}{Cov(X_i, Z_i)} = \frac{\beta_{RF}}{\beta_{FS}} \]

General form:

\[ \hat{\beta}_{IV} = (X'P_z X)^{-1}X'P_z Y \] where \(P_z\) is the projection matrix of Z:= \(Z(Z'Z)^{-1}Z'\)

More general estimation strategies:

2 Stage Least Squares:
- Use OLS to regress X on Z and get \(\hat{X} = P_z X\)
- Use OLS to regress y on \(\hat{X}\) to get \(\hat{\beta}_{iv}\)
Control Function approach:
- Use OLS (or some other estimation strategy) to regress X on Z and get estimated residuals \(\hat{v} = I - P_z X = M_z X\)
- Use OLS to regress y on X and \(\hat{v}\) to get \(\hat{\beta}_{iv}\)

IV estimation with covariates

conditional independence assumption

\[ {Y_{1i}, Y_{0i}, D_{1i}, D_{0i}} \coprod Z_i | X_i \]

Angrist and Krueger (1991) - Compulsory Schooling / Quarter of Birth Instrument

Visual test of Instrument Relevance

Z = Quarter of Birth , Y = wages, X = education

pums.qob.means <- pums %>% group_by(yob, qob) %>% summarise_all(funs(mean))
#%%
# Add dates
pums.qob.means$yqob <- ymd(paste0("19",
                                  pums.qob.means$yob,
                                  pums.qob.means$qob * 3),
                           truncated = 2)
# Function for plotting data
plot.qob <- function(ggplot.obj, ggtitle, ylab) {
  gg.colours <- c("firebrick", rep("black", 3), "white")
  ggplot.obj + geom_line()                                              +
               geom_point(aes(colour = factor(qob)),
                              size = 5)                                 +
               geom_text(aes(label = qob, colour = "white"),
                         size  = 3,
                         hjust = 0.5, vjust = 0.5,
                         show_guide = FALSE)                            +
               scale_colour_manual(values = gg.colours, guide = FALSE)  +
               ggtitle(ggtitle)                                         +
               xlab("Year of birth")                                    +
               ylab(ylab)
}
# Plot
p.educ     <- plot.qob(ggplot(pums.qob.means, aes(x = yqob, y = educ)),
                       "A. Average education by quarter of birth (first stage)",
                       "Years of education")
p.lwklywge <- plot.qob(ggplot(pums.qob.means, aes(x = yqob, y = lwklywge)),
                       "B. Average weekly wage by quarter of birth (reduced form)",
                       "Log weekly earnings")

p.ivgraph  <- grid.arrange(p.educ, p.lwklywge)

Replication (Table 4.1.1 in MHE)

pums        <- fread('asciiqob.txt',
                     header           = FALSE,
                     stringsAsFactors = FALSE)
names(pums) <- c('lwklywge', 'educ', 'yob', 'qob', 'pob')

qobs      <- unique(pums$qob)
qobs.vars <- sapply(qobs, function(x) paste0('qob', x))
pums[, (qobs.vars) := lapply(qobs, function(x) 1*(qob == x))]
#%%
# Column 1: OLS
col1 <- felm(lwklywge ~ 1|0|(educ~educ)|0, pums)
# Column 2: OLS with YOB, POB dummies
col2 <- felm(lwklywge ~ 1| factor(yob) + factor(pob)|(educ~educ)|0, pums)
#%%
col3 <- felm(lwklywge ~ 1 | 0 | (educ~qob1) |0, data=pums)
# Column 3: 2SLS with instrument QOB = 1
#%%
# Column 4: 2SLS with YOB, POB dummies and instrument QOB = 1
col4 <- felm(lwklywge ~ 1 | 0 |(educ~qob1 + qob2 + qob3)|0 , data=pums)
#%%
# Column 5: 2SLS with YOB, POB dummies and instrument (QOB = 1)
col5 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
                        (educ ~ qob1)|0,data=pums)
# Column 6: 2SLS with YOB, POB dummies and full QOB dummies
col6 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
                        (educ ~ factor(qob))|0,data=pums)
# Column 7: 2SLS with YOB, POB dummies and full QOB dummies interacted with YOB
col7 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
                       (educ ~ factor(qob) * factor(yob)),
              data = pums)

stargazer(col1,col2,col3,col4,col5,col6,col7,
    keep.stat="n",
    covariate.labels=c('Years of education'),
    add.lines = list(
        c("YOB fixed effects", '', 'X','','','X','X','X' ),
        c("POB fixed effects", '', 'X','','','X','X','X' ),
        c("Instruments",       '', '','','','','' ),
        c("QOB=1",             '', '','X','X','X','X' ),
        c("QOB=2",             '', '','','X','','X' ),
        c("QOB=3",             '', '','','X','','X' ),
        c("QOB X YOB dummies", '', '','','','','','X' )),
    omit= "Constant",
    type=exp_type)


	Dependent variable:

	lwklywge
	(1)	(2)	(3)	(4)	(5)	(6)	(7)

Years of education	0.071^***	0.067^***	0.100^***	0.100^***	0.100^***	0.110^***	0.087^***
	(0.0003)	(0.0003)	(0.024)	(0.020)	(0.026)	(0.020)	(0.016)


YOB fixed effects		X			X	X	X
POB fixed effects		X			X	X	X
Instruments
QOB=1			X	X	X	X
QOB=2				X		X
QOB=3				X		X
QOB X YOB dummies							X
Observations	329,509	329,509	329,509	329,509	329,509	329,509	329,509

Note:	p<0.1; p<0.05; p<0.01

The Wald Estimator

Structural equation:

\[ Y_i = \alpha + \rho S_i + \eta_i \]

\(Cov(\eta_i, s_i) \neq 0\). Let \(Z_i \in {0,1}\) s.t. \(P(Z_i=1) = p\), \[ Cov(Y_i, Z_i) = \{ E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]\}p(1-p) \]

Then, \[ \rho = \frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[s_i|Z_i = 1] - E[s_i|Z_i=0]} \]

pums$z <- (pums$qob == 3 | pums$qob == 4) * 1

# Compare means (and differences)
ttest.lwklywge <- t.test(lwklywge ~ z, pums)
ttest.educ     <- t.test(educ ~ z, pums)

# Compute Wald estimate
sur  <- systemfit(list(first  = educ ~ z,
                       second = lwklywge ~ z),
                  data   = pums,
                  method = "SUR")
wald <- deltaMethod(sur, "second_z / first_z")

wald.estimate <- (mean(pums$lwklywge[pums$z == 1]) - mean(pums$lwklywge[pums$z == 0])) /
                 (mean(pums$educ[pums$z == 1]) - mean(pums$educ[pums$z == 0]))
# wald.se       <- wald.estimate^2 * ()

# OLS estimate
ols <- lm(lwklywge ~ educ, pums)

# Construct table
lwklywge.row <- c(ttest.lwklywge$estimate[1],
                  ttest.lwklywge$estimate[2],
                  ttest.lwklywge$estimate[2] - ttest.lwklywge$estimate[1])
educ.row     <- c(ttest.educ$estimate[1],
                  ttest.educ$estimate[2],
                  ttest.educ$estimate[2] - ttest.educ$estimate[1])
wald.row.est <- c(NA, NA, wald$Estimate)
wald.row.se  <- c(NA, NA, wald$SE)

ols.row.est <- c(NA, NA, summary(ols)$coef['educ' , 'Estimate'])
ols.row.se  <- c(NA, NA, summary(ols)$coef['educ' , 'Std. Error'])

table           <- rbind(lwklywge.row, educ.row,
                         wald.row.est, wald.row.se,
                         ols.row.est, ols.row.se)
colnames(table) <- c("Born in the 1st or 2nd quarter of year",
                     "Born in the 3rd or 4th quarter of year",
                     "Difference")
rownames(table) <- c("ln(weekly wage)",
                     "Years of education",
                     "Wald estimate",
                     "Wald std error",
                     "OLS estimate",
                     "OLS std error")

knitr::kable(table)

	Born in the 1st or 2nd quarter of year	Born in the 3rd or 4th quarter of year	Difference
ln(weekly wage)	5.9	5.9	0.01
Years of education	12.7	12.8	0.11
Wald estimate	NA	NA	0.11
Wald std error	NA	NA	0.02
OLS estimate	NA	NA	0.07
OLS std error	NA	NA	0.00

The Local Average Treatment Effect (LATE)

Types of validity in research design:

Internal Validity - whether a given design successfully uncovers causal effects for the population being studied
External Validity - predictive value of the study’s finding in a different context

LATE Theorem

Assumptions

Instrument is as good as randomly assigned - it is independent of the vector of potential outcomes and potential treatment assignments \[ [\{ Y_i(d,z); \forall d, z \}, D_{1i}, D_{0i}] \bot z_i \] Sufficient for a causal interpretation of the reduced form
First Stage - \(Cov(d,z) \neq 0\) ; \(E[D_{1i}- D_{0i}] \neq 0\)
Exclusion restriction - \(Y_i(d,z) =f(d)\),i.e. only a function of \(d\) ; \(Y_i(d,0) = Y_i(d,1) \equiv Y_{di}\) for \(d \in {0,1}\)
Monotonicity - Either \(\pi_{1i} \geq 0 \forall i \lor \pi_{1i} \leq 0 \forall i\) (everyone affected by the instrument are affected in the same way - red flag for jackknife instruments)

Given the above conditions, The wald estimand can be interpreted as the causal effect of the endogenous variable of interest on the outcome on those whose treatment status can be changed by the instrument

\[\begin{align*} \rho & = \frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[s_i|Z_i = 1] - E[s_i|Z_i=0]} \\ & = E[Y_{1i} - Y_{0i} | D_{1i} > D_{0i}] \\ & = E[\rho_i | \pi_{1i} > 0] \end{align*}\]

Populations in IV

Compliers : Subpopulation with \(D_{1i} = 1 \land D_{0i} = 0\)
Defiers: Subpopulation with \(D_{1i} = 0 \land D_{0i} = 1\) ; Violates monotonicity - should be \(\emptyset\)
Always takers: Subpopulation with \(D_{1i} = D_{0i} = 1\)
Never takers: Subpopulation with \(D_{1i} = D_{0i} = 0\)

LATE is the effect of the treatment on the population of compliers. Not informative about never-takers and always-takers

Treatment Effects

Average Treatment Effect (ATE) = \(E[Y_{1i}-Y_{0i}]\)
Intent to Treat (ITT) = \(E[Y_{i}|Z_i = 1] - E[Y_{i}|Z_i = 0]\) - difference in outcomes between those who were offered treatment and those who weren’t
Treatment on Treated (TOT) = Weighted average of effect on always- takers and effect on compliers = \(E[Y_{1i} - Y_{0i}|D_i=1]\) - mean effect for those who actually participated in the program
Local Average Treatment Effect (LATE) : \(E[Y_{1i} - Y_{0i} | D_{1i} > D_{0i}]\)

Under randomisation and perfect compliance, ITT == ATE

In a randomised trial with one-sided non-compliance:

Bloom Result

\[ ToT = E[Y_{1i} - Y_{0i}|D_i = 1] = \frac{E[Y_i|z_i = 1] - E[Y_i|Z_i=0]}{E[D_i|z_i =1]} = \frac{RF}{FS} = \frac{ITT}{Compliance} \]

Standard errors

Manually running IV using 2 stages does not give you correct standard errors.

OLS residual variance is the variance of \(\eta + \rho(s_i - \hat{s_i})\) while for proper 2SLS standard errors, we want the variance of \(\eta_i\) only.

Multiple Instruments

2 mutually exclusive dummy instruments (\(z_{1i}, z_{2i}\)). Resulting 2SLS_estimate of \(\rho\) is a weighted average of causal effects for instrument-specific compliant subpopulations

\[ \rho_j = \frac{Cov(Y_i, z_{ji})}{Cov(D_i, z_{ji})} , j = 1,2 \]

First stage fitted values are \(\hat{D_i} = \pi_{11}z_{1i} + \pi_{12}z_{2i}\)

\[\rho_{2sls} = \psi \rho_1 + (1-\psi) \rho_2 \]

where

\[ \psi = \frac{\pi_{11} Cov(D_1, Z_{1i})}{\pi_{11}Cov(D_1, Z_{1i})+ \pi_{21}Cov(D_1, Z_{2i})} \]

\(\psi\) is a number between 0 and 1 that depends on the relative strength of each instrument in the first stage

2SLS vs LIML

Only relevant When the model is overidentified. 2SLS estimation is done by fitting the first stage and reduced form separately; at each step, \(\beta\) and \(\pi\) are estimated to minimise errors in that equation. On the other hand, limited information maximum-likelihood jointly minimises the sum of squared errors in both stages.

Overidentification

Basic idea: when model is overidentified, we can test exclusion restriction (\(Cov(z, e)=0\)) by estimating the model with the two instruments separately and testing whether \(Cov(z_1, e_2)=0\) (where \(e_2\) is the error term from the structural equation when \(x\) is instrumented by \(z_2\)) and vice versa.

Steps:

estimate the model with all available instruments, keep residual \(\hat{e}\)
regress \(\hat{e}\) on \(z_1\) and \(z_2\), store \(R^2\).
Under the null that overid restrictions are satisfied, \(nR^2 \sim \chi^2(q)\), where \(q\) is the number of overidentifying restrictions (1 in this case)

Hausman Test

IV comes at a price of less precision in estimates, so better to use OLS_unless it can be shown that IV significantly changes results. Define:

\[ H = \frac{(\hat{\beta}_{2sls} - \hat{\beta}_{OLS})^2} {\hat{V}(\hat{\beta}_{2sls}) - \hat{V}(\hat{\beta}_{OLS})} \]

Under the null that OLS is consistent, \(H \sim \chi^2_1\). This is equivalent to testing \(\gamma = 0\) in the equation \(y = \beta x + \gamma \hat{u} + \epsilon\), where \(\hat{u}\) are the residuals from the first stage of IV (regression of x on z).

Bartik IV

Code from John Joseph Horton’s blog

#%%
#%% Bartik Instrument
set.seed(6152011)
# Number of cities
L <- 1000
# Number of industries
K <- 3
# National growth rates by  industry
g.raw <- runif(K)
g.k <- g.raw/sum(g.raw)
#%%
CreateCity <- function(g.k, eta.s){
    #  Give each city some industry shares
    K <- length(g.k)
    z.raw <- runif(K)
    z <- z.raw / sum(z.raw)
    # Give each city some idiosyncratic growth by industry
    g.kl.tilde <- g.raw/sum(g.raw)
    # Change in employement is inner product of the national
    # + idiosyncratic component with industry share
    x <- (g.kl.tilde + g.k) %*% z
    # Bartik instrument
    # (national growth & city industry share inner product)
    B <- g.k %*% z
    # Realized change in wage
    y <- (1/eta.s) * x + rnorm(1,0,1/10)
    list("y" = y, "B" = B, "x" = x)
}
# Create a city data frame
city.data <- rep(CreateCity(g.k, 1), L)
df <- data.frame(matrix(unlist(city.data), ncol = 3))
colnames(df) <- c("y", "B", "x")
# Estimates
m.ols <- felm(y ~ x | 0 | 0 | 0, data = df)
m.iv <- felm(y ~ 1 | 0 | (x ~ B) | 0, data = df)

stargazer(m.ols, m.iv, type = exp_type)


	Dependent variable:

	y
	(1)	(2)

x	-0.500^***
	(0.027)

`x(fit)`		1.000^***
		(0.110)

Constant	1.000^***	-0.0004
	(0.020)	(0.077)


Observations	1,000	1,000
R²	0.250	-2.000
Adjusted R²	0.250	-2.000
Residual Std. Error (df = 998)	0.180	0.350

Note:	p<0.1; p<0.05; p<0.01

Weak Instruments

Given first stage \(s_i = \pi z_i + \epsilon_i\) and structural equation \(y_i = \rho s_i + \eta_i\), define \(\sigma_{\epsilon \eta} = Cov(\epsilon, \eta)\) and \(\sigma_{\epsilon}^2 = Var(\epsilon)\). Let \(F\) be the F statistic of the first-stage. Then, bias in IV term is:

\[ \rho_{2sls} - \rho = \frac{\sigma_{\epsilon,\eta}}{\sigma_{\epsilon}^2} \frac{1}{F+1} \]

So, as \(F \rightarrow \infty\), bias \(\rightarrow 0\).

Bound, Jaeger, and Baker (1995) IV may be biased / inconsistent if the instruments are only weakly correlate with the endogenous variables

\[ \rho = \frac{Cov(Y_i, Z_i)}{Cov(D_i,Z_i)} + \frac{Cov(\eta_i, Z_i)}{Cov(D_i, Z_i)} \]

The second term gets inflated if the instrument is weak.

\[ plim \hat{\beta}_{iv} = \beta + \frac{plim \Delta \bar{\epsilon}}{plim \Delta \bar{x}} \]

\[ \frac{plim \hat{\beta}_{iv} - \beta}{plim \hat{\beta}_{ols} - \beta} = \frac{cov(\hat{x}, \epsilon) / cov(x,\epsilon)}{R^2_{x,z}} \]

Staiger and Stock (1997) rule of thumb - first-stage F statistic \(\geq\) 10 (for 1 endogenous variable)

Olea and Pflueger (2013) - construct Effective F Statistic

\[ \hat{F}_{eff} = \frac{1}{S} \frac{x'ZZ'x}{tr(\hat{W}_2)} \]

5. Differences-in-Differences, Fixed Effects, and Panel Data

Fixed Effects

Define an individual fixed effect for individual \(i\)

\[ A_i = \begin{cases} 1 \text{ if the observation involves unit i} \\ 0 \text{ otherwise} \end{cases} \]

and define the same for each time period for panel data.

If \(D_{it}\) is as good as randomly assigned conditional on \(A_i\):

\[ E[Y_{0it}|A_i, X_{it}, t, D_{it}] = E[Y_{0it}|A_i, X_{it}, t] \]

Then, assuming \(A_i\) enter linearly,

\[ E[Y_{0it}|A_i, X_{it}, t, D_{it}] = \alpha + \lambda_t + A_i'\gamma + X_{it}'\beta \]

Assuming the causal effect of the treatment is additive and constant,

\[ E[Y_{1it}|A_i, X_{it}, t] = E[Y_{0it}|A_i, X_{it}, t] + \rho \]

where \(\rho\) is the causal effect of interest.

Restrictions

Linear
Additive functional form
Variation in \(D_{it}\), over time, for \(i\), must be as good as random

Then, we can write:

\[\begin{align*} Y_it & = \alpha_i +\lambda_t +_\rho D_{it} + X_{it}' \beta + \epsilon_{it} \\ \epsilon_{it} & = Y_{0it} - E[Y_{0it}|A_i, X_{it}, t] \\ \alpha_i & = \alpha + A_i \gamma \end{align*}\]

\(\alpha_i\) captures all observable and unobservable time- invariant for individual i
\(\lambda_t\) captures all observable and unobservable time- varying characteristics that are constant across the population

Random Effects

When there is autocorrelation in time series (i.e. \(\epsilon_t\) s are correlated over time ), GLS estimates can be obtained by estimating OLS on quasi-differenced data.

\[ y_{it} - \lambda \bar{y_{i}} = (x_{it} - \lambda \bar{x_i}) \beta + (1-\lambda)\theta_i + \nu_{it} - \lambda \bar{\nu}_i \]

where \[\lambda = 1 - \left[ \frac{\sigma_\nu^2} {\sigma_\nu^2 + T \sigma_\theta^2} \right]^{\frac{1}{2}} \]

If the error component \(\theta\) is correlated with \(x\), RE estimates are not consistent. Perform Hausman test for random vs fixed effects (where under the null, \(Cov(\theta_i, x_{it}) = 0\))

Within Estimator

Calculate individual averages by averaging the FE equation

\[ \bar{Y_i} = \alpha_i + \bar{\lambda} + \rho \bar{D_i} + \bar{X_i}'\beta + \epsilon_i \]

then subtract from the FE equation

\[ Y_{it} - \bar{Y_i} = \lambda_{t} - \bar{\lambda} + \rho (D_{it} - \bar{D_i}) + (X_{it} - \bar{X_i}')\beta + (\epsilon_{it} - \epsilon_i) \]

yields the same \(\rho\)

First Differencing

Define \(\Delta Y_{i,t} = Y_{i,t} - Y_{i,t-1}\) and the corresponding first-differences for \(D\) and \(X\)

Then, the first-differences estimating equation is:

\[ \Delta Y_{it} = \Delta \lambda_{t} + \rho \Delta D_{it} + \Delta X_{it}' \beta + \Delta \epsilon_{it} \]

Numerically identical from the FD/Within estimator for \(T=2\), not otherwise. Both consistent, but FE is efficient.

Differences-in-Differences

Assuming \(E[Y_{0it}] = \gamma_i +\lambda_t\), where \(\gamma_i\) is a unit fixed effect and \(\lambda_t\) is a time fixed-effect. Let \(D_{it}\) be a dummy for treated unit-time combinations. Assuming \(E[Y_{1it} - Y_{0it}|i,t]\) is a constant denoted by \(\beta\). We have:

\[ Y_{it} = \gamma_i + \lambda_t + \beta D_{it} + \epsilon_{it} \]

In practice, the regression formulation is:

\[ Y_{it} = \alpha + \gamma d_i + \lambda Post_{t} + \beta (d_i \times Post_{t}) + \epsilon_{it} \]

DiD and Within estimator equivalence for N=2

For a sample with two time periods (\(t \in {0,1}\)), the estimate \(\hat{\rho}\) of the causal parameter \(\rho\) is identical when estimated using the two approaches (1) individual fixed effects and (2) first differences.

For Individual Fixed Effects estimation, using the FWL theorem / regression anatomy formula, we can write:

\[ \hat{\rho}_{fe} = \frac{Cov(\tilde{Y_{it}}, \tilde{X_{it}} )}{Var(\tilde{X_{it}})} \]

where \(\tilde{Y_{it}}\) and \(\tilde{X_{it}}\) are residuals from regressing \(Y_{it}\) and \(X_{it}\) on individual fixed effects \(\gamma_i\) and a constant term \(\alpha\). Regressing these variables on individual fixed effects and a constant is equivalent to estimating individual means (\(\bar{Y}_i\) and \(\bar{X}_i\)), so the residuals are deviations from the mean (\(Y_{i,t} - \bar{Y}_i\)) and \((X_{i,t} -\bar{X}_i)\).

Furthermore, since \(t = 2\), \(\bar{Y}_i = Y_{i,t}+\frac{\Delta Y_{i,t}}{2}\) and \(\bar{X}_i = X_{i,t}+\frac{\Delta X_{i,t}}{2}\)

Thus,

\[\begin{align*} \hat{\rho}_{fe} & = \frac{Cov(\tilde{Y_{it}}, \tilde{X_{it}} )}{Var(\tilde{X_{it}})} \\ & = \frac{Cov(Y_{i,t} - \bar{Y}_i, X_{i,t} -\bar{X}_i)}{Var(X_{i,t} -\bar{X}_i))}\\ & = \frac{Cov(Y_{i,t} - Y_{i,t} - \frac{\Delta Y_{i,t}}{2} , X_{i,t} - X_{i,t} - \frac{\Delta X_{i,t}}{2})}{Var(X_{i,t} - X_{i,t} - \frac{\Delta X_{i,t}}{2})}\\ & = \frac{-Cov(\Delta Y_{i,t},\Delta X_{i,t} )}{-Var(\Delta X_{i,t})} \\ & = \frac{Cov(\Delta Y_{i,t},\Delta X_{i,t} )}{Var(\Delta X_{i,t})} \\ & = \hat{\rho}_{fd} \end{align*}\]

Card and Krueger - Minimum Wage (1994)

Card and Krueger (1994)

# Import data
njmin        <- read.table('public.dat',
                          header           = FALSE,
                          stringsAsFactors = FALSE,
                          na.strings       = c("", ".", "NA"))
#%%
names(njmin) <- c('SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ',
                  'NORTHJ', 'PA1', 'PA2', 'SHORE', 'NCALLS', 'EMPFT', 'EMPPT',
                  'NMGRS', 'WAGE_ST', 'INCTIME', 'FIRSTINC', 'BONUS', 'PCTAFF',
                  'MEALS', 'OPEN', 'HRSOPEN', 'PSODA', 'PFRY', 'PENTREE', 'NREGS',
                  'NREGS11', 'TYPE2', 'STATUS2', 'DATE2', 'NCALLS2', 'EMPFT2',
                  'EMPPT2', 'NMGRS2', 'WAGE_ST2', 'INCTIME2', 'FIRSTIN2', 'SPECIAL2',
                  'MEALS2', 'OPEN2R', 'HRSOPEN2', 'PSODA2', 'PFRY2', 'PENTREE2',
                  'NREGS2', 'NREGS112')
#%%
# Calculate FTE employement
njmin$FTE  <- njmin$EMPFT  + 0.5 * njmin$EMPPT  + njmin$NMGRS
njmin$FTE2 <- njmin$EMPFT2 + 0.5 * njmin$EMPPT2 + njmin$NMGRS2
#%%
# Create function for calculating standard errors of mean
semean <- function(x, na.rm = FALSE) {
    n <- ifelse(na.rm, sum(!is.na(x)), length(x))
    sqrt(var(x, na.rm = na.rm) / n)
}
#%%
# Calucate means
summary.means <- njmin[ , c("FTE", "FTE2", "STATE")]         %>%
                 group_by(STATE)                             %>%
                 summarise_all(funs(mean(., na.rm = TRUE)))
summary.means <- as.data.frame(t(summary.means[ , -1]))
#%%
colnames(summary.means)  <- c("PA", "NJ")
summary.means$dSTATE     <- summary.means$NJ - summary.means$PA
summary.means            <- rbind(summary.means,
                                  summary.means[2, ] - summary.means[1, ])
row.names(summary.means) <- c("FTE employment before, all available observations",
                              "FTE employment after, all available observations",
                              "Change in mean FTE employment")
#%%
# Calculate
summary.semeans <- njmin[ , c("FTE", "FTE2", "STATE")]         %>%
                 group_by(STATE)                               %>%
                 summarise_all(funs(semean(., na.rm = TRUE)))
summary.semeans <- as.data.frame(t(summary.semeans[ , -1]))
#%%
colnames(summary.semeans)  <- c("PA", "NJ")
summary.semeans$dSTATE     <- sqrt(summary.semeans$NJ + summary.semeans$PA) # / length
njmin         <- njmin[ , c("FTE", "FTE2", "STATE")]
njmin         <- melt(njmin,
                      id.vars       = c("STATE"),
                      variable.name = "Period",
                      value.name    = "FTE")

summary.means <- njmin                   %>%
                mutate(STATE = dplyr::recode(STATE, `0` = 'PA (control)',
                `1` = 'NJ (treatment)')) %>%
                group_by(STATE, Period) %>%
                summarise_all(
                     funs(
                         mean(., na.rm = TRUE),
                         semean(., na.rm = TRUE)
                     ))

difftable = data.table::dcast(setDT(summary.means),
            STATE ~ Period, value.var = c('mean','semean'))
difftable[, diff:=mean_FTE2 - mean_FTE]
knitr::kable(difftable)

STATE	mean_FTE	mean_FTE2	semean_FTE	semean_FTE2	diff
NJ (treatment)	20	21	0.51	0.52	0.59
PA (control)	23	21	1.35	0.94	-2.17

Main Diff-in-diff estimate

\[ {E[Y_{ist} | s = NJ, t = Nov] - E[Y_{ist}| s = NJ, t = Feb]} - {E[Y_{ist} | s = PA, t = Nov] - E[Y_{ist}| s = PA, t = Feb]} = \delta \]

difftable[1,'diff'] - difftable[2,'diff']

##    diff
## 1:  2.8

emps <- summary.means  %>%
    dplyr::select(state = STATE, period = Period, employment = mean)

ggplot(data=emps,
        aes(x=period, y=employment, group=state, colour=state)) +
    geom_point() +
    geom_line()

Synthetic controls

Construct synthetic control group by finding optimal weights for control regins so that pre-trends between treated (\(r=1\)) and control regions (\(r = 2 \cdots R\)) are best matched. If the treatment occurs at \(t=0\) and we have data that goes back to \(-\tau\), we minimise:

\[ min_{w_r} \sum_{t=-\tau}^{0} (\Delta Y_{r=1,t} - \sum_{r=2}^R w_r \Delta Y_{rt}) \]

Triple-difference

Useful to uncover heterogeneous treatment effects. For general formulation, let \(\mathbb{1}_{I_i = 1}\) denote a dummy for some subgroup of interest, then the estimation equation becomes:

\[ Y_{it} = \alpha + \gamma_i + \lambda^0_t\mathbb{1}_{I_i = 0} + \lambda^1_t \mathbb{1}_{I_i = 1} + \beta (d_i \times Post_t) + \psi (d_i \times Post_t \times \mathbb{1}_{I_i = 1}) + \epsilon_{it} \]

More flexibly allowing for heterogeneous treatment effects where the treatment heteregeneity is along the \(Z\) dimension lets us write:

\[ Y_{it} = \alpha + \gamma_i + \lambda_t + \beta (d_{i} \times Post_t) + \phi (d_i \times Post_t \times Z_i ) + \epsilon_{it} \]

Granger Causality

Causes should happen before consequences, and not vice-versa

Check on whether, conditional on state and year fixed effects, past \(D_{st}\) predicts \(Y_{ist}\) while future \(D_{st}\) does not. If \(D_{st}\) causes \(Y_{ist}\) but not vice versa, then dummies for future policy changes should not matter in an equation like:

\[ Y_{ist} = \gamma_s + \lambda_t + \sum_{\tau=0}^m \delta_{-\tau} D_{s, t - \tau} + \sum_{\tau=1}^q \delta_{+\tau} D_{s,t+\tau} + X_{ist}' \beta + \epsilon_{ist} \]

where the sums on the RHS allow for m lags / post-treatment effects, and q leads / pre-treatment effects.

Autor (2003) Granger tests:

autor <- haven::read_dta('table7/autor-jole-2003.dta')

# Log total employment: from BLS employment & earnings
autor$lnemp <- log(autor$annemp)

# Non-business-service sector employment from CBP
autor$nonemp  <- autor$stateemp - autor$svcemp
autor$lnnon   <- log(autor$nonemp)
autor$svcfrac <- autor$svcemp / autor$nonemp

# Total business services employment from CBP
autor$bizemp <- autor$svcemp + autor$peremp
autor$lnbiz  <- log(autor$bizemp)

# Restrict sample
autor <- autor[which(autor$year >= 79 & autor$year <= 95), ]
autor <- autor[which(autor$state != 98), ]

# State dummies, year dummies, and state*time trends
autor$t  <- autor$year - 78
autor$t2 <- autor$t^2

# Generate more aggregate demographics
autor$clp     <- autor$clg    + autor$gtc
autor$a1624   <- autor$m1619  + autor$m2024 + autor$f1619 + autor$f2024
autor$a2554   <- autor$m2554  + autor$f2554
autor$a55up   <- autor$m5564  + autor$m65up + autor$f5564 + autor$f65up
autor$fem     <- autor$f1619  + autor$f2024 + autor$f2554 + autor$f5564 + autor$f65up
autor$white   <- autor$rs_wm  + autor$rs_wf
autor$black   <- autor$rs_bm  + autor$rs_bf
autor$other   <- autor$rs_om  + autor$rs_of
autor$married <- autor$marfem + autor$marmale

# Modify union variable (1. Don't interpolate 1979, 1981; 2. Rescale into percentage)
autor$unmem[79 == autor$year | autor$year == 81] <- NA
autor$unmem                                      <- autor$unmem * 100

# Create state and year factors
autor$state <- factor(autor$state)
autor$year  <- factor(autor$year)

# Diff-in-diff regression
did <- felm(lnths ~ lnemp   + admico_2 + admico_1 + admico0  + admico1  + admico2 +
                    admico3 + mico4    + admppa_2 + admppa_1 + admppa0  + admppa1 +
                    admppa2 + admppa3  + mppa4    + admgfa_2 + admgfa_1 + admgfa0 +
                    admgfa1 + admgfa2  + admgfa3  + mgfa4
                    | state + year + state:t | 0 | state, data = autor)

Estimated Impact of implied-contract exceptions to the employment- at-will doctrine on the use of temporary workers

# Plot results
lags_leads  <- c("admico_2", "admico_1", "admico0",
                 "admico1" , "admico2" , "admico3",
                 "mico4")
labels      <- c("2 yr prior", "1 yr prior", "Yr of adopt",
                 "1 yr after", "2 yr after", "3 yr after",
                 "4+ yr after")
results.did <- data.frame(label = factor(labels, levels = labels),
                          coef  = summary(did)$coef[lags_leads, "Estimate"] * 100,
                          se    = summary(did)$coef[lags_leads, "Cluster s.e."] * 100)
g           <- ggplot(results.did, aes(label, coef, group = 1))
p           <- g + geom_point()                                +
                   geom_line(linetype = "dotted")              +
                   geom_errorbar(aes(ymax = coef + 1.96 * se,
                                     ymin = coef - 1.96 * se)) +
                   geom_hline(yintercept = 0)                  +
                   ylab("Log points")                          +
                   xlab(paste("Time passage relative to year of",
                              "adoption of implied contract exception"))
p

Dynamic panel data

For FE to be valid, we rely on strict exogeneity wherein \(E(\nu_{it}|x_i)= E(\nu_{it}|x_{i1}, \cdots x_{iT})=0\) (i.e. errors are uncorrelated with lags and leads of x). This fails when one of the x’s is a lagged dependent variable, e.g.

\[ y_{it} = x_{it}\beta + y_{it-1}\delta + \theta_i + \nu_{it} \]

this yields a correlation between a RHS variable \(y_{it-1}\) and the error term of the form:

\[ cov([y_{it-1} - \bar{y}_{it-1}], [\nu_{it} - \bar{\nu}_i]) = -(2/T)\sigma_\nu^2 + [(T-1)/T^2]\sigma_\eta^2 \]

This covariance \(\rightarrow 0\) as \(T \rightarrow \infty\).

IV solution

use lagged values of x and y (e.g. \(x_{it-1}\) and/or \(y_{it-2}\) as instruments for \(\delta y_{it-1}\). \(y_{it-2}\) will not be a valid instrument if \(\nu_{it}\) is serially correlated. Arellano and Bond (1991)

Quasi-Panel

Deaton (1985)

let \(i\) index individuals, \(j\) index birth cohorts, and \(t\) index time. Then, without a true panel to estimate the regression

\[ y_{ijt} = x_{ijt}'\beta + \theta_{ij} + \delta_{t} + \nu_{ijt} \]

we can estimate:

\[ \bar{y}_{jt} = \bar{x}_{jt}' \beta + \bar{\theta}_j + \delta_t + \bar{\nu}_{jt} \]

6. Regression Discontinuity Design

Sharp RD

Treatment changes discontinuously at some particular value \(x_0\) of x

\[ D_i = \begin{cases} 1 \text{ if} x_i \geq x_0, \\ 0 \text{ if} x_i < x_0, \end{cases} \]

This value \(x\) is likely to be arbitrary in the neighbourhood of \(x_0\)

RD_takes advantage of a discrete change in treatment in response to a small change in the running variable x. Causal inference relies on the (local) arbitrariness of whether unit \(i\) has \(x_i < x_0 | x_i > x_0\)

Formally,

\[\begin{align*} E[Y_{0i}|x_i] & = \alpha + \beta x_i \\ Y_{1i} = Y_{0i} + \rho \end{align*}\]

which leads to the regression:

\[ Y_i = \alpha + \beta x_i + \tau D_i + \eta_i \]

To permit nonlinearity, we estimate:

\[ Y_i = f(x_i) + \tau D_i + \eta_i \]

General Setup with normalised running/forcing variable

Define \(c := x_0\). The treatment effect can be uncovered by running a pooled regression

\[ Y = \alpha_l +_\tau D + f(X - c) + \epsilon \]

It is recommended to let the regression function differ on the two sides of the cutoff by including interaction terms between D and X.

Then, the pooled regression is

\[ Y = \alpha_l + \tau D + \beta_l (X - c) + (\beta_r - \beta_l) D (X-c) + \epsilon \]

Higher order polynomials can be included to better approximate the regression functions (per Stone Weirstrass thm).

# Generate series
nobs      = 100
x         <- runif(nobs)
y.linear  <- x + (x > 0.5) * 0.25 + rnorm(n = nobs, mean = 0, sd = 0.1)
y.nonlin  <- 0.5 * sin(6 * (x - 0.5)) + 0.5 + (x > 0.5) * 0.25 + rnorm(n = nobs, mean = 0, sd = 0.1)
y.mistake <- 1 / (1 + exp(-25 * (x - 0.5))) + rnorm(n = nobs, mean = 0, sd = 0.1)
rd.series <- data.frame(x, y.linear, y.nonlin, y.mistake)

# Make graph with ggplot2
g.data   <- ggplot(rd.series, aes(x = x, group = x > 0.5))

p.linear <- g.data + geom_point(aes(y = y.linear))  +
                     stat_smooth(aes(y = y.linear),
                                 method = "lm",
                                 se     = FALSE)    +
                     geom_vline(xintercept = 0.5)   +
                     ylab("Outcome")                +
                     ggtitle(bquote('A. Linear E[' * Y["0i"] * '|' * X[i] * ']'))

p.nonlin <- g.data + geom_point(aes(y = y.nonlin))  +
                     stat_smooth(aes(y = y.nonlin),
                                 method  = "lm",
                                 formula = y ~ poly(x, 2),
                                 se      = FALSE)   +
                     geom_vline(xintercept = 0.5)   +
                     ylab("Outcome")                +
                     ggtitle(bquote('B. Nonlinear E[' * Y["0i"] * '|' * X[i] * ']'))

f.mistake <- function(x) {1 / (1 + exp(-25 * (x - 0.5)))}
p.mistake <- g.data + geom_point(aes(y = y.mistake))     +
                      stat_smooth(aes(y = y.mistake),
                                  method = "lm",
                                  se     = FALSE)        +
                      stat_function(fun      = f.mistake,
                                    linetype = "dashed") +
                      geom_vline(xintercept = 0.5)       +
                      ylab("Outcome")                    +
                      ggtitle('C. Nonlinearity mistaken for discontinuity')

grid.arrange(p.linear, p.nonlin, p.mistake, ncol = 1)

Lee (2008)

Lee (2008) main figure

# Load the .dta file as data.table
lee <- data.table(haven::read_dta('Lee2008/individ_final.dta'))

# Subset by non-missing in the outcome and running variable for panel (a)
panel.a <- na.omit(lee[, c("myoutcomenext", "difshare"), with = FALSE])

# Create indicator when crossing the cut-off
panel.a <- panel.a[ , d := (difshare >= 0) * 1.0]

# Predict with local polynomial logit of degree 4
logit   <- glm(formula = myoutcomenext ~ poly(difshare, degree = 4) +
                 poly(difshare, degree = 4) * d,
                 family  = binomial(link = "logit"),
                 data    = panel.a)
panel.a <- panel.a[ , pmyoutcomenext := predict(logit, panel.a, type = "response")]

# Create local average by 0.005 interval of the running variable
breaks  <- round(seq(-1, 1, by = 0.005), 3)
panel.a <- panel.a[ , i005 := as.numeric(as.character(cut(difshare,
                                                           breaks = breaks,
                                                           labels = head(breaks, -1),
                                                           right  = TRUE))), ]

panel.a <- panel.a[ , list(m_next  = mean(myoutcomenext),
                             mp_next = mean(pmyoutcomenext)),
                   by = i005]

# Plot panel (a)
panel.a <- panel.a[which(panel.a$i005 > -0.251 & panel.a$i005 < 0.251), ]
plot.a  <- ggplot(data = panel.a, aes(x = i005))                       +
           geom_point(aes(y = m_next))                                 +
           geom_line(aes(y = mp_next, group = i005 >= 0))              +
           geom_vline(xintercept = 0, linetype = 'longdash')           +
           xlab('Democratic Vote Share Margin of Victory, Election t') +
           ylab('Probability of Victory, Election t+1')                +
           ggtitle('a')

# Subset the outcome for panel (b)
panel.b <- lee[ , i005 := as.numeric(as.character(cut(difshare,
                                                       breaks = breaks,
                                                       labels = head(breaks, -1),
                                                       right  = TRUE))), ]

panel.b <- panel.b[ , list(m_vic  = mean(mofficeexp, na.rm = TRUE),
                             mp_vic = mean(mpofficeexp, na.rm = TRUE)),
                   by = i005]

panel.b <- panel.b[which(panel.b$i005 > -0.251 & panel.b$i005 < 0.251), ]
plot.b  <- ggplot(data = panel.b, aes(x = i005))                       +
           geom_point(aes(y = m_vic))                                  +
           geom_line(aes(y = mp_vic, group = i005 >= 0))               +
           geom_vline(xintercept = 0, linetype = 'longdash')           +
           xlab('Democratic Vote Share Margin of Victory, Election t') +
           ylab('No. of Past Victories as of Election t')              +
           ggtitle('b')

grid.arrange(plot.a, plot.b)

RDRobust

Bandwidth selection per Calonico, Cattaneo, and Titiunik (2014)

#%%
rdrobust_senate=read.csv("/home/alal/Desktop/code/tutorials/revision/rdrobust_senate.csv")
attach(rdrobust_senate)

### rdplot with 95% confidence intervals
(rdplot(y=vote, x=margin, scale= 5, binselect="es", # scale = 2, ci=95,
         title="RD Plot: U.S. Senate Election Data",
         y.label="Vote Share in Election at time t+1",
         x.label="Vote Share in Election at time t"))

## Call: rdplot
## 
## Number of Obs.                 1297
## Kernel                      Uniform
## 
## Number of Obs.                 595            702
## Eff. Number of Obs.            595            702
## Order poly. fit (p)              4              4
## BW poly. fit (h)           100.000        100.000
## Number of bins scale             5              5

Visual Treatment Effects

#%%
rdplot(y=vote,x=margin,h=17.7080, subset=margin<=17.7080&margin>=-17.7080,
         binselect="esmv", kernel="triangular", p=1,
         title="RD Plot: U.S. Senate Election Data",
         y.label="Vote Share in Election at time t+1",
         x.label="Vote Share in Election at time t")

#%%

Bandwidth

#%%
summary(rdbwselect(y=vote,x=margin, all = T))

## Call: rdbwselect
## 
## Number of Obs.                 1297
## BW type                         All
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 595         702
## Order est. (p)                   1           1
## Order bias  (q)                  2           2
## 
## =======================================================
##                   BW est. (h)    BW bias (b)
##             Left of c Right of c  Left of c Right of c
## =======================================================
##      mserd    17.708     17.708     27.984     27.984
##     msetwo    16.154     18.009     27.096     29.205
##     msesum    18.326     18.326     31.280     31.280
##   msecomb1    17.708     17.708     27.984     27.984
##   msecomb2    17.708     18.009     27.984     29.205
##      cerrd    12.374     12.374     27.984     27.984
##     certwo    11.288     12.585     27.096     29.205
##     cersum    12.806     12.806     31.280     31.280
##   cercomb1    12.374     12.374     27.984     27.984
##   cercomb2    12.374     12.585     27.984     29.205
## =======================================================

#%%

Estimation

### rdrobust default
summary(rdrobust(y=vote,x=margin, all = T))

## Call: rdrobust
## 
## Number of Obs.                 1297
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 595         702
## Eff. Number of Obs.            359         322
## Order est. (p)                   1           1
## Order bias  (p)                  2           2
## BW est. (h)                 17.708      17.708
## BW bias (b)                 27.984      27.984
## rho (h/b)                    0.633       0.633
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     7.416     1.460     5.078     0.000     [4.554 , 10.278]    
## Bias-Corrected     7.510     1.460     5.143     0.000     [4.648 , 10.372]    
##         Robust     7.510     1.743     4.310     0.000     [4.094 , 10.925]    
## =============================================================================

#%%

#%%
### rdrobust with covariates
summary(rdrobust(y=vote,x=margin,covs=cbind(class,termshouse,termssenate)))

## Call: rdrobust
## 
## Number of Obs.                 1108
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 491         617
## Eff. Number of Obs.            313         283
## Order est. (p)                   1           1
## Order bias  (p)                  2           2
## BW est. (h)                 17.987      17.987
## BW bias (b)                 28.943      28.943
## rho (h/b)                    0.621       0.621
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     6.851     1.408     4.866     0.000     [4.091 , 9.611]     
##         Robust         -         -     4.200     0.000     [3.729 , 10.254]    
## =============================================================================

### rdrobust with  cluster-robust variance estimation, and underlying data-driven bandwidth selectors.
summary(rdrobust(y=vote,x=margin,cluster=state))

## Call: rdrobust
## 
## Number of Obs.                 1297
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 595         702
## Eff. Number of Obs.            359         320
## Order est. (p)                   1           1
## Order bias  (p)                  2           2
## BW est. (h)                 17.509      17.509
## BW bias (b)                 27.032      27.032
## rho (h/b)                    0.648       0.648
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     7.422     1.522     4.875     0.000     [4.438 , 10.406]    
##         Robust         -         -     4.266     0.000     [4.091 , 11.046]    
## =============================================================================

#%%

Fuzzy RD

Discontinuity doesn’t necessarily change treatment, but only the probability of treatment. Then, we can use the discontinuity as an instrumental variable for treatment status.

\[ P[D_i = 1 | x_i] = \begin{cases} g_0 (x_i) if x_i < x_0 \\ g_1 (x_i) if x_i \geq x_0 \end{cases} \]

\(g_0(x_i) \neq g_1(x_i)\). Assuming \(g_1(x_0) > g_0(x_0)\), the probability of treatment relates to \(x_i\) via:

\[ E[D_i|x_i] = P[D_i=1|x_i] = g_0(x_i) + [g_1(x_i) - g_0(x_i)] T_i \] where \(T_i = \mathbb{1} (x_i \geq x_0)\) := point of discontinuity

\[ \rho = lim_{\delta \rightarrow 0} \frac{E[Y_i|x_0 < x_i < x_0 + \delta] - E[Y_i|x_0 - \delta < x_i < x_0]} {E[D_i|x_0 < x_i < x_0 + \delta] - E[D_i|x_0 - \delta < x_i < x_0]} \]

Angrist and Lavy (1999)

Angrist and Lavy (1999) uses Maimonides’ rule:

\[ m_{sc} = \frac{e_s}{int[\frac{e_s - 1}{40}] + 1} \]

library(foreign)

# Load the data
grade4 <- read.dta("final4.dta")
grade5 <- read.dta("final5.dta")
#%%
# Restrict sample
grade4 <- grade4[which(grade4$classize & grade4$classize < 45 & grade4$c_size > 5), ]
grade5 <- grade5[which(grade5$classize & grade5$classize < 45 & grade5$c_size > 5), ]

# Find means class size by grade size
grade4cmeans <- aggregate(grade4$classize,
                          by  = list(grade4$c_size),
                          FUN = mean,
                          na.rm = TRUE)

grade5cmeans <- aggregate(grade5$classize,
                          by  = list(grade5$c_size),
                          FUN = mean,
                          na.rm = TRUE)

# Rename aggregaed columns
colnames(grade4cmeans) <- c("c_size", "classize.mean")
colnames(grade5cmeans) <- c("c_size", "classize.mean")
#%%
# Create function for Maimonides Rule
maimonides.rule <- function(x) {x / (floor((x - 1)/40) + 1)}
#%%
# Plot each grade
g4 <- ggplot(data = grade4cmeans, aes(x = c_size))
p4 <- g4 + geom_line(aes(y = classize.mean))            +
           stat_function(fun      = maimonides.rule,
                         linetype = "dashed")           +
           expand_limits(y = 0)                         +
           scale_x_continuous(breaks = seq(0, 220, 20)) +
           ylab("Class size")                           +
           xlab("Enrollment count")                     +
           ggtitle("B. Fourth grade")

g5 <- ggplot(data = grade5cmeans, aes(x = c_size))
p5 <- g5 + geom_line(aes(y = classize.mean))            +
           stat_function(fun      = maimonides.rule,
                         linetype = "dashed")           +
           expand_limits(y = 0)                         +
           scale_x_continuous(breaks = seq(0, 220, 20)) +
           ylab("Class size")                           +
           xlab("Enrollment count")                     +
           ggtitle("A. Fifth grade")

grid.arrange(p5, p4, ncol = 1)

7. Quantile regression

based on Kleiber and Zeileis (2008)

Least-squares models the conditional mean of a response, but one may be interested in treatment effects at different quantiles of the conditional distribution of Y. The Quantile regression model is given by

\[ Q_y(\tau|x) = x_i' \beta_\tau \]

where \(Q_y(\tau|x)\) denotes the \(\tau\)-quantile of \(y\) conditional on \(x\).

The conditional quantile function at quantile \(\tau\) is defined as

\[ Q_\tau(Y_i | X_i) = F_y^{-1} (\tau|X_i) \]

CQF solves the following problem:

\[ Q_\tau (Y_i | X_i) = argmin_{q(x)} E [ \rho_\tau (Y_i - q(X_i))] \]

where \(\rho_\tau(u)\) is the check function, which weights positive and negative terms asymmetrically (unless \(\tau=0.5\), in which case it collapses to a Least-Absolute-Deviation function). \[ \rho_\tau(u) = (\tau - 1(u \leq 0))u = 1(u > 0) \tau |u| + 1(u\leq 0) (1-\tau)|u| \]

tau_25 = function(x) (x>0) * 0.25 * abs(x) + (x<=0) * (1 - 0.25) * abs(x)
tau_50 = function(x) (x>0) * 0.50 * abs(x) + (x<=0) * (1 - 0.50) * abs(x)
tau_75 = function(x) (x>0) * 0.75 * abs(x) + (x<=0) * (1 - 0.75) * abs(x)

# dummy dataset
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))

# plot functions
p +
   stat_function(fun = tau_25, aes(colour='tau_25')) +
   stat_function(fun = tau_50, aes(colour='tau_50')) +
   stat_function(fun = tau_75, aes(colour='tau_75')) +
   xlim(-10,10)+
   labs(title='Check functions')

#%%
data('CPS1988')

cps_f <- log(wage) ~ experience + I(experience^2) + education
# models
cps_lm <- lm(cps_f, data= CPS1988)
cps_lad <- rq(cps_f, data= CPS1988)
cps_p25 <-  rq(cps_f, tau = 0.25, data= CPS1988)
cps_p75 <-  rq(cps_f, tau = 0.75, data= CPS1988)
#%%

#%%
stargazer(cps_lm, cps_lad, cps_p25, cps_p75, type=exp_type)


	Dependent variable:

	log(wage)
	OLS	quantile
		regression
	(1)	(2)	(3)	(4)

experience	0.078^***	0.077^***	0.092^***	0.064^***
	(0.001)	(0.001)	(0.002)	(0.001)

I(experience2)	-0.001^***	-0.001^***	-0.002^***	-0.001^***
	(0.00002)	(0.00003)	(0.00004)	(0.00002)

education	0.087^***	0.094^***	0.093^***	0.094^***
	(0.001)	(0.001)	(0.002)	(0.001)

Constant	4.300^***	4.200^***	3.800^***	4.700^***
	(0.019)	(0.022)	(0.029)	(0.020)


Observations	28,155	28,155	28,155	28,155
R²	0.330
Adjusted R²	0.330
Residual Std. Error	0.590 (df = 28151)
F Statistic	4,546.000^*** (df = 3; 28151)

Note:	p<0.1; p<0.05; p<0.01

#%%

Variation in coefficients as a function of \(\tau\)

#%%
cps_rqbig <- rq(cps_f, tau = seq(0.05, 0.95, by = 0.05),
  data = CPS1988)
#%%
cps_rqbigs <- summary(cps_rqbig)
plot(cps_rqbig)

8. Discrete Choice Models

Generalised Linear Models - Nelder and Baker (1972)

Adapted from Rodriguez

Let \(y_i \sim N(\mu_i, \sigma^2)\), and \(\mu_i = x_i'\beta\). Further, let \(y_i\) be drawn from an exponential PDF (which includes normal, binomial, Poisson, exponential, gamma, and inverse gaussian as special cases).

\[ f(y_i) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right] \]

where \(\theta_i\) and \(\phi\) are parameters and \(a_i(\phi)\), \(b(\theta_i)\), and \(c(y_i, \phi)\) are known functions, and \(a_i(\phi)\) has the form \(a_i(\phi) = \phi/p_i\) where \(p_i\) is a known prior weight, usually 1.

\(\theta_i\) and \(\phi\) are location and scale parameters.

\(E(Y_i) = \mu_i = b'(\theta_i)\)

\(V(Y_i) = \sigma_i^2 = b''(\theta_i)a_i(\phi)\)

Link function - instead of directly modelling the mean, a one-to-one continuous differentiable link function \(g(\mu_i)\) is introduced s.t. \(\eta_i = g(\mu_i)\), where \(\eta_i\) is called the linear predictor. Then,

\[ \mu_i = g^{-1}(x_i'\beta) \]

Model	Error	Link
OLS	Normal	Identity
Logistic	binomial	Logit
Probit	binomial	Normal
Poisson	Poisson	Log

Binary Response

Assume latent variable \(y^*\) where \(y^* = x_i'\beta + e\), and observed \(y = 1\) if \(y^* >0, 0\) otherwise.

Assuming \(e \sim N(0,1)\), \(Pr(y = 1|x) = Pr(y^* > 0) = Pr(e> -x\beta) =1 - F(-x\beta) = F(x\beta)\).

Common choices for F are:

Probit model: standard normal \(\Phi\)
Logit model: logistic CDF \(\Lambda(X_i'\beta) = \frac{exp(X_i'\beta)}{exp(X_i'\beta) + 1} =\frac{1}{1 + exp(-X_i'\beta)}\)

For either F, the model is estimated using maximum likelihood, where the likelihood function is

\[ p(y_i, \beta | x_i) = F(x_i'\beta)^{y_i} (1-F(x_i'\beta))^{(1-y_i)} \]

Thus, the log likelihood is

\[ log L_n(\beta) = n^{-1} \sum_{i=1}^n y_i F(x_i'\beta) + n^{-1} \sum_{i=1}^n (1-y_i) log(1-F(x_i'\beta)) \]

This is maximised to obtain the ML estimate.

Marginal effects:

\[ \frac{\partial E(Y_i|X_I)}{\partial X_i} = \frac{\partial Pr(y=1|x)}{\partial x} = \frac{\partial F(x\beta)}{\partial x} = f(x\beta)\beta \]

library(margins)
e = exp(1)
logistic = function (x) (e^x/(1+e^x))
linear = function(x) x*0.05

# dummy dataset
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))

# plot functions
p +
   stat_function(fun = linear, aes(colour='Linear')) +
   stat_function(fun = logistic, aes(colour='Logit')) +
   stat_function(fun=pnorm, aes(colour='Probit')) +
   xlim(-10,10)+
   labs(title='LDV CEFs')

Probit

x <- glm(am ~ cyl + hp * wt, data = mtcars, family = binomial(link='probit'))
summary(margins(x)) # marginal effects

##  factor     AME     SE       z      p   lower   upper
##     cyl  0.0226 0.0492  0.4598 0.6456 -0.0738  0.1190
##      hp  0.0026 0.0021  1.2244 0.2208 -0.0015  0.0067
##      wt -0.5088 0.2479 -2.0523 0.0401 -0.9948 -0.0229

Logit

x <- glm(am ~ cyl + hp * wt, data = mtcars, family = binomial)
summary(margins(x)) # marginal effects

##  factor     AME     SE       z      p   lower  upper
##     cyl  0.0216 0.0493  0.4377 0.6616 -0.0750 0.1181
##      hp  0.0027 0.0023  1.1596 0.2462 -0.0018 0.0072
##      wt -0.5158 0.2685 -1.9209 0.0547 -1.0421 0.0105

Multinomial Logit (cf McFadden (1984) )

Define random Utility model where indirect utilities (V) are latent variables.

For a choice of J classes with \(J\) as base category,

\[ Pr(y = j|x) = \frac{\exp(x\beta_j)} {1+\exp(x\beta_j) + \cdots + \exp(x \beta_{J-1})} \enspace \forall j \in {1,\cdots, J-1} \]

\[ Pr(y = J|x) = \frac{1}{1+\exp(x\beta_j) + \cdots + \exp(x \beta_{J-1})} \]

#%%
ml <- haven::read_dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
ml$prog = as.factor(ml$prog)
ml %<>% filter(complete.cases(.))
mlog_mod <- nnet::multinom(prog ~ ses + write, data = ml)

## # weights:  12 (6 variable)
## initial  value 219.722458 
## iter  10 value 182.220707
## final  value 182.220698 
## converged

summary(mlog_mod)

## Call:
## nnet::multinom(formula = prog ~ ses + write, data = ml)
## 
## Coefficients:
##   (Intercept)  ses  write
## 2        -3.6 0.64  0.058
## 3         2.4 0.19 -0.054
## 
## Std. Errors:
##   (Intercept)  ses write
## 2         1.2 0.27 0.021
## 3         1.2 0.30 0.023
## 
## Residual Deviance: 364 
## AIC: 376

#%%

Count Data (Poisson)

Poisson PDF:

\[ f_i(y_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i !} \]

specification: \(\mu(x) = exp(x\beta)\).

This yields marginal effect:

\[ \frac{\partial E(y|x)}{\partial x} = exp(x\beta)\beta \]

It also follows that \(\beta\) is the semi-elasticity of y wrt x

\[ \beta = \frac{\partial E(y|x)}{\partial x} \times \frac{1}{E(y|x)} = \frac{\partial log E(y|x)}{\partial X} \]

#%%
p <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic",
                                                     "Vocational"))
  id <- factor(id)
})
summary(m1 <- glm(num_awards ~ prog + math, family="poisson", data=p))

## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.204  -0.844  -0.511   0.256   2.680  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -5.2471     0.6585   -7.97  1.6e-15 ***
## progAcademic     1.0839     0.3583    3.03   0.0025 ** 
## progVocational   0.3698     0.4411    0.84   0.4018    
## math             0.0702     0.0106    6.62  3.6e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

#%%

9. Models with self-selection and censoring

Tobit

Many economic variables are censored at 0 (e.g. hours of work, consumption). Variables that are observed at greater than or equal to zero (\(y \geq 0\)) are censored, and variables that are only observed when positive are called truncated.

Define a latent variable \(y^*=x\beta+\epsilon\), where \(\epsilon \sim N(0,\sigma^2)\).

\[ y = \begin{cases} y^* &\text{ if } y^* > 0\\ 0 &\text{ if } y^* \leq 0 \end{cases} \]

OLS does not yield consistent estimates. Taking conditional expectations of Y yields:

\[ E(y|x) = Pr(y^* \leq 0) \times 0 + Pr(y^* > 0) \times E(x\beta + \epsilon | x, y^* > 0) = Pr(y^* > 0) \times [x\beta + E(\epsilon|x,y^* > 0)] \neq x\beta \]

\(Pr(y^*>0) = Pr(\epsilon > -x\beta) = Pr(\epsilon/\sigma > -x\beta/\sigma) = 1 - \Phi(-x\beta/\sigma) = \Phi(x\beta/\sigma)\)

Given normality of \(\epsilon\),

\[ E(\epsilon|x, \epsilon > -x\beta) = \sigma \frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)} = \sigma \lambda \]

Where \(\lambda(x\beta/\sigma) =\frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)}\) is the inverse Mills Ratio.

Plugging this back into the CEF yields:

\[ E(y|x) = \Phi(x\beta/\sigma) \times [x\beta + \sigma \lambda(x\beta/\sigma)] \]

With truncation (missing instead of zero for \(y^* \leq 0\), this changes to

\[ E(y|x) = x\beta + \sigma \lambda(x\beta/\sigma) \]

Density of the observed variable is

\[ f(y|x_i) = {1 - \Phi(x_i\beta/\sigma)}^{1[y_i=0]} {(1/\sigma)\phi[(y^* - x\beta)/\sigma]}^{1[y>0]} \]

Estimation is done through MLE, where the log likelihood is

\[ l_i(\beta, \sigma) = 1[y_i=0] log[1 - \Phi(x_i\beta/\sigma)] + 1[y_i > 0]log[(1/\sigma) \phi [(y^* - x_i\beta)/\sigma]] \]

#%%
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/tobit.csv")
f <- function(x, var, bw = 15) {
  dnorm(x, mean = mean(var), sd(var)) * length(var)  * bw
}

# setup base plot
p <- ggplot(dat, aes(x = apt, fill=prog))

# histogram, coloured by proportion in different programs
# with a normal distribution overlayed
p + stat_bin(binwidth=15) +
  stat_function(fun = f, size = 1,
    args = list(var = dat$apt))

summary(m <- vglm(apt ~ read + math + prog, tobit(Upper = 800), data = dat))

## 
## Call:
## vglm(formula = apt ~ read + math + prog, family = tobit(Upper = 800), 
##     data = dat)
## 
## 
## Pearson residuals:
##             Min     1Q  Median    3Q  Max
## mu       -2.568 -0.731 -0.0398 0.753 2.80
## loge(sd) -0.969 -0.636 -0.3336 0.236 4.84
## 
## Coefficients: 
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  209.5596    32.5459    6.44  1.2e-10 ***
## (Intercept):2    4.1848     0.0523   79.94  < 2e-16 ***
## read             2.6980     0.6193    4.36  1.3e-05 ***
## math             5.9146     0.7054    8.38  < 2e-16 ***
## proggeneral    -12.7146    12.4086   -1.02  0.30552    
## progvocational -46.1433    13.7067   -3.37  0.00076 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: mu, loge(sd)
## 
## Log-likelihood: -1041 on 394 degrees of freedom
## 
## Number of iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates

Heckit

An alternative is the 2 step Heckman procedure, where the first stage is a model to estimate \(Pr(y^* > 0) = \Phi(x\beta/\sigma) = \Phi(x\gamma)\), where \(\gamma = \beta/\sigma\). The probit yields a consistent estimate of \(\gamma\) as \(\hat{\gamma}\), which can be plugged into the inverse Mills ratio formula to get

\[ \hat{\lambda} = \frac{\phi(x\hat{\gamma})}{\Phi(x\hat{\gamma})} \]

Estimate probit to get \(\hat{\gamma}\), compute \(\hat{\lambda}_i\)
Use this generated regressor to estimate the model \(y = x \beta + \sigma \hat{\lambda} + \epsilon\)

Problem - not efficient:

\(\hat{\lambda}\) is a generated regressor - bootstrap the whole procedure
\(\hat{\lambda}\) mechanically correlated with error, leading to heteroskedasticity. Use robuset standard errors

For truncated variables, estimate probit for missing or positive in step (1) of heckit.

10. Misc

Kernel Density Estimation

Approximate density function

\[ f(y_k) = \lim_{h \rightarrow 0} \frac{Pr(y_k - h \leq Y < y_k + h/2)}{h} \]

by analogy, estimated kernel density is:

\[ \hat{f}(y_k) = \frac{1}{Nh} \sum_{i=1}^{N} K \left[ \frac{y_k - Y_i}{h} \right] \]

\(K(z)\) is the Kernel. Usual candidates are Gaussian and Epanechnikov

Commonly used kernels (wiki)

Gaussian

\[ K(z) = \frac{1}{\sqrt{2\pi}} exp\left[\frac{-z^2}{2}\right] \]

Epanechnikov

\[ K(z) = \begin{cases} \frac{3}{4} (1-\frac{1}{5}z^2) / \sqrt{5} &\text{ if } |z| < \sqrt{5}\\ 0 \text{ otherwise} \end{cases} \]

#%%
ggplot(dat, aes(x = apt)) + geom_density(kernel='epanechnikov') +
  labs(title='Epanechnikov Kernel')

#%%

#%%
ggplot(dat, aes(x = apt)) + geom_density(kernel='gaussian') +
  labs(title='Gaussian Kernel')

#%%

#%%
ggplot(dat, aes(x = apt)) + geom_density(kernel='rectangular') +
  labs(title='Rectangular Kernel')

#%%

Stacked Density Plots

#%%
ggplot(dat, aes(x = apt, colour=prog)) + geom_density(kernel='gaussian') +
  labs(title='Stacked Density Plot')

#%%

Conditional Density Plots

#%%
ggplot(dat, aes(x = apt, fill=prog)) +
  geom_density(kernel='gaussian', position='fill') +
  labs(title='Conditional Density Plots')

#%%

References

Angrist, J. D., and A. B. Krueger. (1991): “Does compulsory school attendance affect schooling and earnings?” The Quarterly Journal of Economics, 106, 979–1014.

Angrist, J. D., and V. Lavy. (1999): “Using Maimonides’ rule to estimate the effect of class size on scholastic achievement,” The Quarterly Journal of Economics, 114, 533–75.

Angrist, J. D., and J.-S. Pischke. (2008): Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton university press.

Arellano, M., and S. Bond. (1991): “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations,” The review of economic studies, 58, 277–97.

Autor, D. H. (2003): “Outsourcing at will: The contribution of unjust dismissal doctrine to the growth of employment outsourcing,” Journal of labor economics, 21, 1–42.

Bound, J., D. A. Jaeger, and R. M. Baker. (1995): “Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak,” Journal of the American statistical association, 90, 443–50.

Calonico, S., M. D. Cattaneo, and R. Titiunik. (2014): “Robust nonparametric confidence intervals for regression‐discontinuity designs,” Econometrica, 82, 2295–2326.

Cameron, A. C., and P. K. Trivedi. (2005): Microeconometrics: Methods and Applications, Cambridge university press.

Card, D., and A. B. Krueger. (1994): “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review, 84, 772–93.

Chetty, R., N. Hendren, and L. F. Katz. (2016): “The effects of exposure to better neighborhoods on children: New evidence from the moving to opportunity experiment,” American Economic Review, 106, 855–902.

Deaton, A. (1985): “Panel data from time series of cross-sections,” Journal of econometrics, 30, 109–26.

Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and O. H. S. Group. (2012): “The Oregon health insurance experiment: Evidence from the first year,” The Quarterly journal of economics, 127, 1057–1106.

Kleiber, C., and A. Zeileis. (2008): Applied Econometrics with R, Springer Science & Business Media.

Lee, D. S. (2008): “Randomized experiments from non-random selection in US House elections,” Journal of Econometrics, 142, 675–97.

McFadden, D. L. (1984): “Econometric analysis of qualitative response models,” Handbook of econometrics, 2, 1395–1457.

Nelder, J. A., and R. J. Baker. (1972): Generalized Linear Models, Wiley Online Library.

Olea, J. L. M., and C. Pflueger. (2013): “A robust test for weak instruments,” Journal of Business & Economic Statistics, 31, 358–69.

Pallais, A. (2014): “Inefficient hiring in entry-level labor markets,” American Economic Review, 104, 3565–99.

Rodriguez, G. “WWS 509,”http://data.princeton.edu/wws509/notes, Accessed 05/31/2018.

Staiger, D., and J. H. Stock. (1997): “Instrumental Variables Regression with Weak Instruments,” Econometrica, 65, 557–86.

Applied Econometrics and Causal Inference Notes

Apoorva Lal