options(digits=2)
rm(list=ls())
library(LalRUtils)
load_or_install(c('tidyverse','data.table', 'stargazer','lfe','lmtest', 'panelView', 'gsynth',
'magrittr', 'gridExtra', 'rio', 'plm', 'lme4', 'Rcpp', 'foreach', 'doParallel',
'GGally', 'abind'))
theme_set(theme_bw())
exp_type = 'text'
root = '/home/alal/Dropbox/0_GradSchool/0_classes/450b/applied-econometrics-notes/'
data = paste0(root, 'data/')
Notation: let $Y_d(t)$ denote the potential outcome under treatment status $d \in \{0, 1\}$ in period $t \in \{0, 1\}$, and let $D$ indicate membership in the treated group (treated only in period 1).
Given this, we can write the DiD estimand:
$$ E\left[ Y_1(1) - Y_0(1) | D = 1 \right] = \{ E[Y(1)|D=1] - E[Y(1)|D=0] \} - \{ E[Y(0)|D=1] - E[Y(0)|D=0] \} $$

Rearranging the parallel trends assumption yields:
$$ E\left[ Y_0(1) | D = 0 \right] = E\left[ Y_0(1) | D = 1 \right] - E\left[ Y_0(0) | D = 1 \right] + E\left[ Y_0(0) | D = 0 \right] $$

# Import data
njmin <- read.table(paste0(data, 'public.dat'),
                    header = FALSE,
                    stringsAsFactors = FALSE,
                    na.strings = c("", ".", "NA"))
#%%
names(njmin) <- c('SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ',
'NORTHJ', 'PA1', 'PA2', 'SHORE', 'NCALLS', 'EMPFT', 'EMPPT',
'NMGRS', 'WAGE_ST', 'INCTIME', 'FIRSTINC', 'BONUS', 'PCTAFF',
'MEALS', 'OPEN', 'HRSOPEN', 'PSODA', 'PFRY', 'PENTREE', 'NREGS',
'NREGS11', 'TYPE2', 'STATUS2', 'DATE2', 'NCALLS2', 'EMPFT2',
'EMPPT2', 'NMGRS2', 'WAGE_ST2', 'INCTIME2', 'FIRSTIN2', 'SPECIAL2',
'MEALS2', 'OPEN2R', 'HRSOPEN2', 'PSODA2', 'PFRY2', 'PENTREE2',
'NREGS2', 'NREGS112')
njmin %>% head()
#%%
# Calculate FTE employment (full-time + 0.5 * part-time + managers)
njmin$FTE <- njmin$EMPFT + 0.5 * njmin$EMPPT + njmin$NMGRS
njmin$FTE2 <- njmin$EMPFT2 + 0.5 * njmin$EMPPT2 + njmin$NMGRS2
#%%
# Create function for calculating standard errors of mean
semean <- function(x, na.rm = FALSE) {
n <- ifelse(na.rm, sum(!is.na(x)), length(x))
sqrt(var(x, na.rm = na.rm) / n)
}
#%%
# Calculate means
summary.means <- njmin[ , c("FTE", "FTE2", "STATE")] %>%
group_by(STATE) %>%
summarise_all(funs(mean(., na.rm = TRUE)))
summary.means <- as.data.frame(t(summary.means[ , -1]))
#%%
colnames(summary.means) <- c("PA", "NJ")
summary.means$dSTATE <- summary.means$NJ - summary.means$PA
summary.means <- rbind(summary.means,
summary.means[2, ] - summary.means[1, ])
row.names(summary.means) <- c("FTE employment before, all available observations",
"FTE employment after, all available observations",
"Change in mean FTE employment")
#%%
# Calculate standard errors of the means
summary.semeans <- njmin[ , c("FTE", "FTE2", "STATE")] %>%
group_by(STATE) %>%
summarise_all(funs(semean(., na.rm = TRUE)))
summary.semeans <- as.data.frame(t(summary.semeans[ , -1]))
#%%
colnames(summary.semeans) <- c("PA", "NJ")
summary.semeans$dSTATE <- sqrt(summary.semeans$NJ^2 + summary.semeans$PA^2) # SE of a difference in means: sqrt(se1^2 + se2^2)
njmin <- njmin[ , c("FTE", "FTE2", "STATE")]
njmin <- melt(njmin,
id.vars = c("STATE"),
variable.name = "Period",
value.name = "FTE")
summary.means <- njmin %>%
mutate(STATE = dplyr::recode(STATE, `0` = 'PA (control)',
`1` = 'NJ (treatment)')) %>%
group_by(STATE, Period) %>%
summarise_all(
funs(
mean(., na.rm = TRUE),
semean(., na.rm = TRUE)
))
difftable = data.table::dcast(setDT(summary.means),
STATE ~ Period, value.var = c('mean','semean'))
difftable[, diff:=mean_FTE2 - mean_FTE]
knitr::kable(difftable)
difftable[1,'diff'] - difftable[2,'diff'] # DiD estimate: NJ (row 1, sorted alphabetically) - PA (row 2)
emps <- summary.means %>%
dplyr::select(state = STATE, period = Period, employment = mean)
ggplot(data=emps,
aes(x=period, y=employment, group=state, colour=state)) +
geom_point() +
geom_line()
Decompose the error into two terms:

$$ e_i = \mathbf{1}_i \theta_i + \epsilon_i $$

Observation level:

$$ y_{it} = x_{it}'\beta + \theta_i + \epsilon_{it} $$

Individual level:

$$ y_i = \mathbf{X}_i \beta + \mathbf{1}_i \theta_i + \epsilon_i $$

One-way fixed effects and random effects both use this form, although they make different assumptions about $\theta_i$.
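To see what is at stake in those assumptions, here is a minimal simulation sketch (using only packages loaded above; all object names are illustrative): when $\theta_i$ is correlated with the regressor, pooled OLS is biased while the within (FE) estimator recovers $\beta$.
set.seed(94305)
N <- 500; Tp <- 5
panel <- CJ(id = 1:N, t = 1:Tp)                # balanced panel skeleton
panel[, theta := rnorm(1), by = id]            # individual effect theta_i
panel[, x := 0.8 * theta + rnorm(.N)]          # x correlated with theta_i
panel[, y := 2 * x + theta + rnorm(.N)]        # true beta = 2
coef(lm(y ~ x, panel))['x']                    # pooled OLS: biased upward
coef(felm(y ~ x | id, panel))['x']             # within estimator: close to 2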
Define an individual fixed effect for individual $i$:

$$ A_i = \begin{cases} 1 & \text{if the observation involves unit } i \\ 0 & \text{otherwise} \end{cases} $$

and define the same for each time period for panel data.
If $D_{it}$ is as good as randomly assigned conditional on $A_i$:

$$ E[Y_{0it}|A_i, X_{it}, t, D_{it}] = E[Y_{0it}|A_i, X_{it}, t] $$

Then, assuming $A_i$ enters linearly,

$$ E[Y_{0it}|A_i, X_{it}, t, D_{it}] = \alpha + \lambda_t + A_i'\gamma + X_{it}'\beta $$

Assuming the causal effect of the treatment is additive and constant,

$$ E[Y_{1it}|A_i, X_{it}, t] = E[Y_{0it}|A_i, X_{it}, t] + \rho $$

where $\rho$ is the causal effect of interest.
These restrictions (linearity in $A_i$; an additive, constant treatment effect) let us write:

$$ \begin{aligned} Y_{it} & = \alpha_i + \lambda_t + \rho D_{it} + X_{it}' \beta + \epsilon_{it} \\ \epsilon_{it} & = Y_{0it} - E[Y_{0it}|A_i, X_{it}, t] \\ \alpha_i & = \alpha + A_i'\gamma \end{aligned} $$

Include individual-specific time trends for more flexible estimation.
Strict exogeneity: $E[\epsilon_{it} | A_i, X_{i1}, \dots, X_{iT}] = 0$, i.e. errors are uncorrelated with the regressors in all periods, not just period $t$.
Calculate individual averages by averaging the FE equation:

$$ \bar{Y}_i = \alpha_i + \bar{\lambda} + \rho \bar{D}_i + \bar{X}_i'\beta + \bar{\epsilon}_i $$

then subtract from the FE equation:

$$ Y_{it} - \bar{Y}_i = \lambda_t - \bar{\lambda} + \rho (D_{it} - \bar{D}_i) + (X_{it} - \bar{X}_i)'\beta + (\epsilon_{it} - \bar{\epsilon}_i) $$

which yields the same $\rho$.
Alternatively, demeaning can be done by premultiplying by the individual-specific demeaning operator

$$ \mathbf{M}_i = \mathbf{I}_i - \mathbf{1}_i (\mathbf{1}_i ' \mathbf{1}_i)^{-1} \mathbf{1}_i' $$

Demeaned regressors: $\tilde{Y}_i = \mathbf{M}_i Y_i$ and $\tilde{X}_i = \mathbf{M}_i X_i$.
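As a quick check (reusing the simulated `panel` from the sketch above), manual demeaning, the dummy-variable (LSDV) regression, and `felm`'s absorbed factor all give the same point estimate; only the naive degrees of freedom differ.
panel[, `:=`(y_dm = y - mean(y), x_dm = x - mean(x)), by = id]
coef(lm(y_dm ~ x_dm - 1, panel))               # OLS on demeaned data
coef(lm(y ~ x + factor(id), panel))['x']       # LSDV: same slope
coef(felm(y ~ x | id, panel))['x']             # absorbed FE: same slope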
Define $\Delta Y_{it} = Y_{it} - Y_{i,t-1}$ and the corresponding first differences for $D$ and $X$. The transformation is

$$ \Delta y_{it} = y_{it} - y_{i,t-1} $$

At the individual level, this is written as:

$$ \Delta y_i = \mathbf{F}_i y_i \qquad \Delta X_i = \mathbf{F}_i X_i $$

where $\mathbf{F}_i$ is the $(T_i - 1) \times T_i$ differencing operator

$$ \mathbf{F}_i = \begin{bmatrix} -1 & 1 & 0 & \dots & 0 & 0 \\ 0 & -1 & 1 & \dots & 0 & 0 \\ \vdots & & \ddots & & \vdots & \vdots \\ 0 & 0 & 0 & \dots & -1 & 1 \\ \end{bmatrix} $$

Then, the first-differences estimating equation is:

$$ \Delta Y_{it} = \Delta \lambda_t + \rho \Delta D_{it} + \Delta X_{it}' \beta + \Delta \epsilon_{it} $$

The FD estimator is numerically identical to the within (FE) estimator for $T=2$, but not otherwise. Both are consistent; under serially uncorrelated, homoskedastic errors, FE is efficient.
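A sketch of the $T = 2$ equivalence with `plm` (again reusing the simulated data from above): the within estimator with period effects and the FD estimator, whose intercept absorbs the common time change, return the same slope.
pnl2 <- pdata.frame(as.data.frame(panel[t <= 2]), index = c('id', 't'))
coef(plm(y ~ x, pnl2, model = 'within', effect = 'twoways'))  # within, T = 2
coef(plm(y ~ x, pnl2, model = 'fd'))['x']                     # FD: same slope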
Assume $\theta_i \coprod X_i$ (a strong assumption). In other words, the entire error term $e_{it} = \nu_{it} + \theta_{i}$ is independent of $X$.
When there is autocorrelation in time series (i.e. $\epsilon_t$ s are correlated over time ), GLS estimates can be obtained by estimating OLS on quasi-differenced data. This allows us to estimate the effects of time-invariant characteristics (assuming the independence condition is met).
Equi-correlation model:

$$ \begin{aligned} E (e_{it} | \mathbf{X}_i) & = 0 \\ E (e_{it}^2 | \mathbf{X}_i) & = \sigma^2 \\ E (e_{ij} \; e_{it} | \mathbf{X}_i) & = \rho \sigma^2 \quad (j \neq t) \\ \end{aligned} $$

This implies the following:
Implementation
$$ y_{it} - \lambda \bar{y}_i = (x_{it} - \lambda \bar{x}_i)'\beta + (1-\lambda)\theta_i + \nu_{it} - \lambda \bar{\nu}_i $$

where

$$ \lambda = 1 - \left[ \frac{\sigma_\nu^2} {\sigma_\nu^2 + T \sigma_\theta^2} \right]^{\frac{1}{2}} $$
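`plm` estimates these variance components when fitting the random effects model; a minimal sketch (simulated data from above; `plm` labels the quasi-demeaning parameter `theta`):
pnl <- pdata.frame(as.data.frame(panel), index = c('id', 't'))
re <- plm(y ~ x, pnl, model = 'random')
re$ercomp$theta   # plm's 'theta' is the lambda in the equation above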
If the error component $\theta$ is correlated with $x$, RE estimates are not consistent. Perform a Hausman test for random vs. fixed effects: under the null, $Cov(\theta_i, x_{it}) = 0$, so both estimators are consistent and RE is efficient.
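`plm::phtest` implements the test; continuing the sketch (since $\theta_i$ was built to be correlated with $x$, the test should reject here):
fe <- plm(y ~ x, pnl, model = 'within')
phtest(fe, re)   # large statistic => reject RE in favour of FE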
freak = import(paste0(data, 'prison.dta'))
freak %>% head()
sumvars = c(
'pris', 'criv', 'crip', 'polpc', 'incpc', 'unem',
'final1',
'black', 'metro', 'ag0_14', 'ag15_17', 'ag18_24', 'ag25_34'
)
sumlabs = c(
'Prison Population', 'Violent Crime', 'Property Crime',
'Police Employees', 'Per Capita Income', 'Unemployment Rate',
'Final Decision (Litigation)',
'Share Black', 'Metro Areas', 'Age 0-14', 'Age 15-17',
'Age 18-24', 'Age 25-34'
)
stargazer(freak[, sumvars], type = exp_type, header=F,
covariate.labels=sumlabs, digits = 2,
title = 'Summary Statistics: Levitt 1996',
notes = c('Per 100,000 residents where applicable',
paste0(nrow(freak), ' observations'),
'Levitt (1996) has 1173 observations'),
summary.stat = c('mean', 'sd', 'min', 'max')
)
Pooled OLS, Random Effects, and Fixed Effects
freak %<>% group_by(state) %>%
mutate(lag_pris = dplyr::lag(pris, order_by = year)) %>% ungroup()
outcome = 'log(criv)'
covars = c('lag_pris', 'log(incpc)', 'unem', 'log(polpc)', 'black', 'metro',
'ag0_14', 'ag15_17', 'ag18_24', 'ag25_34')
#formulas for felm / plm
base = formula_stitcher(outcome, covars)
clust = formula_lfe(outcome, covars, C = ('state'))
m1 = robustify(felm(base, freak)) # pooled OLS
m2 = felm(clust, freak) # pooled OLS with clustered standard errors
# random effects
m3 <- plm(base, data=freak, index = c("state", "year"), model="random")
# Cluster by state
m3_se = coeftest(m3, vcov=vcovHC(m3,type="HC2",cluster="group"))[, 2]
# fixed effects
femod = formula_lfe(outcome, covars, D = c('state'), C = ('state'))
m4 = felm(femod, freak)
# export
sumlabs2 = c( 'Lagged Prison Population', 'Log Per Capita Income',
'Unemployment Rate', 'Log Police', 'Share Black', 'Metro Areas',
'Age 0-14', 'Age 15-17', 'Age 18-24', 'Age 25-34' )
stargazer(m1, m2, m3, m4, type = exp_type, header=F,
title = "Levitt 1996 Replication",
column.separate = c(2, 1, 1),
column.labels = c("Pooled OLS", "Random Effects", "Fixed Effects"),
se = list(NULL, NULL, m3_se, NULL),
dep.var.caption = "Outcome: Log Crime", model.names = F,
dep.var.labels.include = FALSE, multicolumn = F,
omit.stat = c('f', 'adj.rsq', 'ser'),
covariate.labels=sumlabs2, digits = 2, df=F,
notes = c('SE in Col 1: Robust (HC2)',
'SE in Columns 2-4: Cluster Robust by State')
)
swiss = import(paste0(data, 'swissnat_PS6.Rdata'))
# time trends
swiss$t = swiss$year - min(swiss$year)
swiss$t2 = swiss$t^2
swiss$year2 = factor(swiss$year)
swiss$ort_name= factor(swiss$ort_name)
# for formulae
outcome = 'nat_rate_ord'
treat = 'DirDem'
muni = 'ort_name'
time = 'year2'
swiss <- pdata.frame(swiss, index = c(muni, time))
(fm1 = formula_lfe(outcome, treat, C = muni)) # pooled OLS
(fm2 = formula_lfe(outcome, treat, D = muni, C = muni)) # muni fixed effects
(fm3 = formula_lfe(outcome, c(treat, 't'), # muni FE + common time trend
D = muni, C = muni))
(fm4 = formula_lfe(outcome, treat, # muni FE + year FE
D = c(muni, time), C = muni))
# fit models
m1 = felm(fm1, swiss)
m2 = felm(fm2, swiss)
m3 = felm(fm3, swiss)
m4 = felm(fm4, swiss)
m5 = felm(fm4, swiss %>% filter(sprachreg == 'G'))
panelView(nat_rate_ord ~ DirDem, data = swiss, index = c('ort_name', 'year'),
          legend.labs = c('Direct Democracy', 'Committee'), by.treatment = TRUE)
# export
sumlabs3 = c( 'Direct Democracy', 'Time trend')
stargazer(m1, m2, m3, m4, m5, type = exp_type, header=F,
title = "Hainmueller Hangartner (2015) Replication",
dep.var.caption = "Outcome: Naturalisation Rate", model.names = F,
dep.var.labels.include = FALSE, multicolumn = F,
keep = c('DirDem'),
omit.stat = c('f', 'adj.rsq', 'ser'),
column.separate = c(4, 1),
column.labels = c("All Municipalities", "German Speaking"),
add.lines = list(
c('Municipality Fixed Effects', '', 'X', 'X', 'X', 'X'),
c('Year Fixed Effects', '', '', '', 'X', 'X'),
c('Common Linear Time Trend', '', '', 'X', '', '')
),
notes = c('Standard Errors clustered by Municipality'),
covariate.labels=sumlabs3, digits = 2, df=F
)
(fm5 = formula_lfe(outcome, treat, # muni FE + year FE + muni linear time trends
D = c(muni, time, 'ort_name:t'), C = muni))
(fm6 = formula_lfe(outcome, treat, # muni FE + year FE + muni linear and quadratic time trends
D = c(muni, time, 'ort_name:t', 'ort_name:t2'), C = muni))
m6 = felm(fm5, swiss)
m8 = felm(fm6, swiss)
m7 = felm(fm5, swiss %>% filter(sprachreg == 'G'))
m9 = felm(fm6, swiss %>% filter(sprachreg == 'G'))
stargazer(m6, m8, m7, m9, type = exp_type, header=F,
title = "Hainmueller Hangartner (2015) Replication with Municipality Time Trends",
dep.var.caption = "Outcome: Naturalisation Rate", model.names = F,
dep.var.labels.include = FALSE, multicolumn = F,
keep = c('DirDem'),
omit.stat = c('f', 'adj.rsq', 'ser'),
column.separate = c(2, 2),
column.labels = c("All Municipalities", "German Speaking"),
add.lines = list(
c('Muni Fixed Effects', 'X', 'X', 'X', 'X'),
c('Year Fixed Effects', 'X', 'X', 'X', 'X'),
c('Muni Linear Time Trend','X', 'X', 'X', 'X'),
c('Muni Quadratic Time Trend','', 'X', '', 'X')
),
notes = c('Standard Errors clustered by Municipality'),
covariate.labels=sumlabs3, digits = 2, df=F
)
The test maintains that $\beta_{FE}$ is consistent; because that premise is untested, the test is often abused.
Causes should happen before consequences, and not vice versa.
Define the switching indicator $D_{it}$ as 1 if $i$ switched from control to treatment between $t-1$ and $t$.
Check whether, conditional on state and year fixed effects, past $D_{st}$ predicts $Y_{ist}$ while future $D_{st}$ does not. If $D_{st}$ causes $Y_{ist}$ but not vice versa, then dummies for future policy changes should not matter in an equation like:

$$ Y_{ist} = \gamma_s + \lambda_t + \sum_{\tau=0}^{m} \delta_{-\tau} D_{s,t-\tau} + \sum_{\tau=1}^{q} \delta_{+\tau} D_{s,t+\tau} + X_{ist}'\beta + \epsilon_{ist} $$

where the sums on the RHS allow for $m$ lags (post-treatment effects) and $q$ leads (pre-treatment effects). This is implemented on the Autor (2003) data below.
autor <- import(paste0(data, 'table7/autor-jole-2003.dta'))
# Log total employment: from BLS employment & earnings
autor$lnemp <- log(autor$annemp)
# Non-business-service sector employment from CBP
autor$nonemp <- autor$stateemp - autor$svcemp
autor$lnnon <- log(autor$nonemp)
autor$svcfrac <- autor$svcemp / autor$nonemp
# Total business services employment from CBP
autor$bizemp <- autor$svcemp + autor$peremp
autor$lnbiz <- log(autor$bizemp)
# Restrict sample
autor <- autor[which(autor$year >= 79 & autor$year <= 95), ]
autor <- autor[which(autor$state != 98), ]
# State dummies, year dummies, and state*time trends
autor$t <- autor$year - 78
autor$t2 <- autor$t^2
# Generate more aggregate demographics
autor$clp <- autor$clg + autor$gtc
autor$a1624 <- autor$m1619 + autor$m2024 + autor$f1619 + autor$f2024
autor$a2554 <- autor$m2554 + autor$f2554
autor$a55up <- autor$m5564 + autor$m65up + autor$f5564 + autor$f65up
autor$fem <- autor$f1619 + autor$f2024 + autor$f2554 + autor$f5564 + autor$f65up
autor$white <- autor$rs_wm + autor$rs_wf
autor$black <- autor$rs_bm + autor$rs_bf
autor$other <- autor$rs_om + autor$rs_of
autor$married <- autor$marfem + autor$marmale
# Modify union variable (1. Don't interpolate 1979, 1981; 2. Rescale into percentage)
autor$unmem[79 == autor$year | autor$year == 81] <- NA
autor$unmem <- autor$unmem * 100
# Create state and year factors
autor$state <- factor(autor$state)
autor$year <- factor(autor$year)
# Diff-in-diff regression
did <- felm(lnths ~ lnemp + admico_2 + admico_1 + admico0 + admico1 + admico2 +
admico3 + mico4 + admppa_2 + admppa_1 + admppa0 + admppa1 +
admppa2 + admppa3 + mppa4 + admgfa_2 + admgfa_1 + admgfa0 +
admgfa1 + admgfa2 + admgfa3 + mgfa4
| state + year + state:t | 0 | state, data = autor)
Estimated impact of implied-contract exceptions to the employment-at-will doctrine on the use of temporary workers
# Plot results
lags_leads <- c("admico_2", "admico_1", "admico0",
"admico1" , "admico2" , "admico3",
"mico4")
labels <- c("2 yr prior", "1 yr prior", "Yr of adopt",
"1 yr after", "2 yr after", "3 yr after",
"4+ yr after")
results.did <- data.frame(label = factor(labels, levels = labels),
coef = summary(did)$coef[lags_leads, "Estimate"] * 100,
se = summary(did)$coef[lags_leads, "Cluster s.e."] * 100)
g <- ggplot(results.did, aes(label, coef, group = 1))
p <- g + geom_point() +
geom_line(linetype = "dotted") +
geom_errorbar(aes(ymax = coef + 1.96 * se,
ymin = coef - 1.96 * se)) +
geom_hline(yintercept = 0) +
ylab("Log points") +
xlab(paste("Time passage relative to year of",
"adoption of implied contract exception"))
p
For FE to be valid, we rely on strict exogeneity, wherein $E(\nu_{it}|x_i)= E(\nu_{it}|x_{i1}, \cdots, x_{iT})=0$ (i.e. errors are uncorrelated with lags and leads of $x$). This fails when one of the $x$'s is a lagged dependent variable, e.g.

$$ y_{it} = x_{it}'\beta + \delta y_{i,t-1} + \theta_i + \nu_{it} $$

Within-demeaning this equation yields a correlation between the RHS variable $y_{i,t-1}$ and the error term of the form:

$$ Cov\left( y_{i,t-1} - \bar{y}_{i,-1}, \; \nu_{it} - \bar{\nu}_i \right) = -\frac{2}{T}\sigma_\nu^2 + \frac{T-1}{T^2}\sigma_\eta^2 $$

where $\bar{y}_{i,-1}$ is the individual mean of the lagged outcome. This covariance $\rightarrow 0$ as $T \rightarrow \infty$.
One fix: use lagged values of $x$ and $y$ (e.g. $x_{i,t-1}$ and/or $y_{i,t-2}$) as instruments for $\Delta y_{i,t-1}$ in the first-differenced equation. $y_{i,t-2}$ will not be a valid instrument if $\nu_{it}$ is serially correlated.
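A sketch of the GMM version of this idea using `plm::pgmm` (Arellano-Bond, which generalises the lagged-level instruments) on `plm`'s bundled `EmplUK` data; the specification is illustrative, not a replication of anything above.
data("EmplUK", package = "plm")
# first-differenced dynamic panel: instrument the lagged outcome
# with levels dated t-2 and earlier
ab <- pgmm(log(emp) ~ lag(log(emp), 1) + lag(log(wage), 0:1) | lag(log(emp), 2:99),
           data = EmplUK, effect = "twoways", model = "twosteps")
summary(ab)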
Let $i$ index individuals, $j$ index birth cohorts, and $t$ index time. Then, without a true panel to estimate the regression

$$ y_{ijt} = x_{ijt}'\beta + \theta_{ij} + \delta_{t} + \nu_{ijt} $$

we can estimate the cohort-level regression on repeated cross-sections:

$$ \bar{y}_{jt} = \bar{x}_{jt}' \beta + \bar{\theta}_j + \delta_t + \bar{\nu}_{jt} $$
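A minimal simulation sketch of the cohort-average idea (all names illustrative): no individual is observed twice, but collapsing to cohort-year cells and absorbing cohort and year effects recovers $\beta$.
set.seed(1)
cs <- data.table(cohort = sample(1950:1959, 5000, replace = TRUE),
                 year   = sample(1990:1999, 5000, replace = TRUE))
cs[, theta_j := 0.5 * (cohort - 1955)]          # cohort effect
cs[, x := rnorm(.N) + theta_j]                  # x correlated with theta_j
cs[, y := 2 * x + theta_j + rnorm(.N)]          # true beta = 2
cells <- cs[, .(ybar = mean(y), xbar = mean(x), n = .N), by = .(cohort, year)]
felm(ybar ~ xbar | cohort + year, cells, weights = cells$n)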
With interactive fixed effects, the error contains a factor structure $\lambda_i' f_t$, where the loadings $\lambda_i$ and factors $f_t$ are $F \times 1$ vectors. $F$ is chosen based on factor analysis of the residuals, and is usually small ($1$, $2$, or $3$).
data(gsynth)
head(simdata)
system.time(
  out <- gsynth(Y ~ D + X1 + X2, data = simdata, index = c("id", "time"),
                force = "two-way", CV = TRUE, r = c(0, 5), se = TRUE,
                inference = "parametric", nboots = 1000,
                parallel = TRUE, cores = detectCores())
)
print(out)
plot(out) # default: gap plot of the estimated ATT over time
Construct a synthetic control group by finding optimal weights for control regions so that pre-trends between the treated region ($r=1$) and control regions ($r = 2, \cdots, R$) are best matched. If the treatment occurs at $t=0$ and we have data that goes back to $-\tau$, we minimise:

$$ \min_{w_r} \sum_{t=-\tau}^{0} \left( \Delta Y_{r=1,t} - \sum_{r=2}^R w_r \Delta Y_{rt} \right)^2 $$

More generally, the weights are chosen to minimise the distance between pre-treatment characteristics of the treated unit ($X_1$) and of the weighted controls ($X_0 W$):

$$ \left\|X_{1}-X_{0} W\right\|_{V}=\sqrt{\left(X_{1}-X_{0} W\right)^{\prime} V\left(X_{1}-X_{0} W\right)} $$
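A minimal sketch with the `Synth` package (not in the `load_or_install` list above, so treat it as an assumption) and its bundled `basque` data; the predictor set here is deliberately abbreviated and illustrative.
library(Synth)
data(basque)
dp <- dataprep(foo = basque,
               predictors = c("school.illit", "school.prim", "invest"),
               predictors.op = "mean",
               time.predictors.prior = 1964:1969,
               dependent = "gdpcap",
               unit.variable = "regionno",
               unit.names.variable = "regionname",
               time.variable = "year",
               treatment.identifier = 17,        # Basque Country
               controls.identifier = c(2:16, 18),
               time.optimize.ssr = 1960:1969,
               time.plot = 1955:1997)
out <- synth(dp)                                 # solves for W given V
path.plot(synth.res = out, dataprep.res = dp)    # treated vs synthetic path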