3. Regression Fundamentals
Conditional Expectation Function (CEF)
\[
\mathbb{E}[Y|X=x] = \int y f(y|x) dy
\]
Law of Iterated Expectation \[\mathbb{E}[Y_i] = \mathbb{E}[\mathbb{E}[Y_i|X_i]] \]
#%%
# Download data and unzip the data - run once in working directory
# download.file('http://economics.mit.edu/files/397', 'asciiqob.zip')
# unzip('asciiqob.zip')
#%%
# Read the data into a dataframe
pums = read.table('asciiqob.txt',
header = FALSE,
stringsAsFactors = FALSE)
names(pums) <- c('lwklywge', 'educ', 'yob', 'qob', 'pob')
pums$age <- 80 - pums$yob
#%%
# Estimate OLS regression
reg.model <- lm(lwklywge ~ educ, data = pums)
#%%
# Calculate means by educ attainment and predicted values
pums.data.table <- data.table(pums)
educ.means <- pums.data.table[
, list(mean = mean(lwklywge)), by = educ]
educ.means$yhat <- predict(reg.model, educ.means)
CEF \(E(Wages|Education)\)
# Create plot
p <- ggplot(data = educ.means, aes(x = educ)) +
geom_point(aes(y = mean)) +
geom_line(aes(y = mean)) +
geom_line(aes(y = yhat), linetype='dashed' ) +
labs(title='CEF (Wages | Education)',
y = "Log weekly earnings, $2003",
x = "Years of completed education")
print(p)
stargazer(reg.model,type=exp_type)
Dependent variable: lwklywge

|                     |                                |
|---------------------|--------------------------------|
| educ                | 0.071*** (0.0003)              |
| Constant            | 5.000*** (0.004)               |
| Observations        | 329,509                        |
| R2                  | 0.120                          |
| Adjusted R2         | 0.120                          |
| Residual Std. Error | 0.640 (df = 329507)            |
| F Statistic         | 43,783.000*** (df = 1; 329507) |

Note: *p<0.1; **p<0.05; ***p<0.01
CEF Decomposition
\[
Y_i = \mathbb{E}(Y_i|X_i) + \epsilon_i
\]
CEF Prediction Property
\[
\mathbb{E}[Y_i|X_i] = \operatorname{arg\,min}_{m(X_i)} \mathbb{E}[(Y_i - m(X_i))^2]
\]
ANOVA Theorem
\[
\mathbb{V}(Y_i) = \mathbb{V}(\mathbb{E}(Y_i|X_i)) +
\mathbb{E}(\mathbb{V}(Y_i|X_i))
\]
Variance of Y = Variance of CEF (Explained Variance) + Variance of residual
anova(reg.model)
## Analysis of Variance Table
##
## Response: lwklywge
## Df Sum Sq Mean Sq F value Pr(>F)
## educ 1 17809 17809 43783 <2e-16 ***
## Residuals 329507 134029 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
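As a check, the decomposition can also be computed directly from the education-cell means and variances (a sketch reusing the pums.data.table object created above; the identity holds up to degrees-of-freedom corrections in the sample variances):
decomp <- pums.data.table[, .(cef  = mean(lwklywge),
                              cvar = var(lwklywge),
                              n    = .N), by = educ]
# Variance of the CEF across education cells (explained variance)
v_cef  <- decomp[, sum(n * (cef - weighted.mean(cef, n))^2) / sum(n)]
# Average within-cell variance (residual variance)
e_cvar <- decomp[, weighted.mean(cvar, n)]
c(var_y = var(pums$lwklywge), decomposition = v_cef + e_cvar)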
Regression Anatomy / FWL
\[
\beta_k = \frac{Cov(Y_i, \tilde{x_{ki}})}{V(\tilde{x_{ki}})}
\]
where \(\tilde{x_{ki}}\) is the residual from a regression of \(x_{ki}\) on all other covariates.
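A quick numerical illustration of the anatomy formula (a sketch: age, constructed above, plays the role of the other covariate, and the partialled-out coefficient is compared with the multivariate OLS coefficient):
# Long regression with an additional covariate
long <- lm(lwklywge ~ educ + age, data = pums)
# Residualise educ on the other covariate(s)
educ_tilde <- resid(lm(educ ~ age, data = pums))
# Regression-anatomy formula reproduces the multivariate coefficient on educ
c(anatomy = cov(pums$lwklywge, educ_tilde) / var(educ_tilde),
  lm      = unname(coef(long)["educ"]))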
Justifications for Linear Regression:
- Regression reveals the conditional expectation function if the CEF is linear
- Regression reveals best linear predictor of \(Y_i|X_i\)
- Regression reveals best linear predictor of \(E[Y_i|X_i]\)
Conditional Independence Assumption (CIA)
\[
Y_{si} \coprod s_i | X_i , \forall s
\]
Conditional on \(X_i\), \(s_i\) can be said to be ‘as good as randomly assigned’. This assumption is not testable.
Saturated Models
Saturated regression models are regression models with discrete explanatory variables such that the model includes a separate parameter for every possible value taken on by the explanatory variables.
sat_reg = lm(lwklywge ~ factor(educ), data=pums)
pums.data.table <- data.table(pums)
educ.means <- pums.data.table[
, list(mean = mean(lwklywge)), by = educ]
educ.means$yhat <- predict(sat_reg, educ.means)
# Create plot
p <- ggplot(data = educ.means, aes(x = educ)) +
geom_point(aes(y = mean)) +
geom_line(aes(y = mean)) +
geom_line(aes(y = yhat), linetype='dashed' ) +
labs(title='CEF (Wages | Education)',
y = "Log weekly earnings, $2003",
x = "Years of completed education")
print(p)
\(\hat{y}\) and \(E[Y_i|X_i]\) are identical, since the saturated regression fits the CEF perfectly
Measurement error
Let \(y^*\) be the true value and \(u_y\) is an idiosyncratic error in measurement. Then, \(y = y^* + u_y\), which means \(y = x^*\beta+\epsilon + u_y\).
Measurement error in y does not affect the consistency of OLS estimates of \(\beta\), but it increases the variance of \(\hat{\beta}\) from \((X'X)^{-1} var(\epsilon)\) to \((X'X)^{-1}[var(\epsilon) + var(u_y)]\).
Measurement error in x results in a bias of OLS estimates
\[
plim(\hat{\beta}) = \frac{var(x^*)}{var(x^*) + var(u_x)} \beta
\]
\(\hat{\beta}\) converges in probability to a fraction of the true \(\beta\). This is attenuation bias.
# add random noise to educ
pums %<>% mutate(educ_noisy = educ + rnorm(length(pums$educ), 0, 2))
lm1 = lm(lwklywge ~ educ, data=pums)
lm2 = lm(lwklywge ~ educ_noisy, data=pums)
stargazer(lm1, lm2, type=exp_type)
Dependent variable: lwklywge

|                                   | (1)               | (2)               |
|-----------------------------------|-------------------|-------------------|
| educ                              | 0.071*** (0.0003) |                   |
| educ_noisy                        |                   | 0.052*** (0.0003) |
| Constant                          | 5.000*** (0.004)  | 5.200*** (0.004)  |
| Observations                      | 329,509           | 329,509           |
| R2                                | 0.120             | 0.086             |
| Adjusted R2                       | 0.120             | 0.086             |
| Residual Std. Error (df = 329507) | 0.640             | 0.650             |
| F Statistic (df = 1; 329507)      | 43,783.000***     | 31,140.000***     |

Note: *p<0.1; **p<0.05; ***p<0.01
p1 = ggplot(data = pums) +
geom_point(mapping = aes(x=educ, y = lwklywge)) +
stat_smooth(mapping = aes(x=educ, y = lwklywge),
method = "lm", col = "red") +
stat_smooth(mapping = aes(x=educ_noisy, y = lwklywge),
method = "lm", col = "blue")
p1
3.5 Standard Errors
Asymptotic distribution of \(\hat{\beta}\) is
\[
\sqrt{N} (\hat{\beta} - \beta) \sim N(0, \Omega)
\]
where \(\Omega\) is the asymptotic covariance matrix
\[
\Omega = E[X'X]^{-1} E[X' \epsilon \epsilon' X] E[X'X]^{-1}
\]
When residuals are homoskedastic (spherical error variance, i.e. \(E[\epsilon \epsilon'] = \sigma^2 I_N\)), the covariance matrix simplifies to \(\Omega = \sigma^2 E[X'X]^{-1}\), where \(\sigma^2 = E(\epsilon^2)\).
This may not be true:
residuals = resid(reg.model)
x_and_err = data.frame(pums$educ, residuals)
ggplot(data = x_and_err, aes(x=pums.educ,y=residuals)) +
geom_point()
# geom_smooth(method=lm)
But standard errors barely change in this case
lm1 = felm(lwklywge ~ educ, data = pums)
lm2 = robustify(felm(lwklywge ~ educ, data = pums))
stargazer(lm1, lm2, type=exp_type)
Dependent variable: lwklywge

|                                   | (1)               | (2)               |
|-----------------------------------|-------------------|-------------------|
| educ                              | 0.071*** (0.0003) | 0.071*** (0.0004) |
| Constant                          | 5.000*** (0.004)  | 5.000*** (0.005)  |
| Observations                      | 329,509           | 329,509           |
| R2                                | 0.120             | 0.120             |
| Adjusted R2                       | 0.120             | 0.120             |
| Residual Std. Error (df = 329507) | 0.640             | 0.640             |

Note: *p<0.1; **p<0.05; ***p<0.01
Homoskedasticity mechanically fails in the linear probability model (LPM), since
\[
\epsilon_i =
\begin{cases}
-x_i \beta \text{ if } Y_i = 0 \\
1 - x_i \beta \text{ if } Y_i = 1
\end{cases}
\]
Then, \(Var(\epsilon_i | X_i) = x_i \beta ( 1- x_i \beta)\)
Let \((X'X) = E(X_i X_i') = Q_{xx}\). Then, standard errors are calculated as \(V(\hat{\beta}) = Q^{-1} \Omega Q^{-1}\) for the following \(\Omega\):
\[\begin{align*}
\text{Classical} & = X' \frac{e'e}{N-K} X
\text{ : Homoskedastic Standard Errors}\\
\text{HC0} & = X' diag[e_i^2] X \\
\text{HC1} & = X' \frac{N}{N-K} diag[e_i^2] X \\
\text{HC2} & = X' diag[\frac{e_i^2}{1-h_{ii}}] X \\
\text{HC3} & = X' diag[\frac{e_i^2}{(1-h_{ii})^2}] X \\
\end{align*}\]
where \(h_{ii} := x_i (X'X)^{-1} x_i'\) = leverage of observation \(i\), and \(e_i := y_i - x_i \hat{\beta}\) = empirical residuals
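These estimators are implemented in the `sandwich` package; a sketch building HC1 by hand for the schooling regression and checking it against `vcovHC` (assumes `sandwich` and `lmtest` are installed):
library(sandwich)
library(lmtest)
X <- model.matrix(reg.model)      # design matrix
e <- resid(reg.model)             # empirical residuals
n <- nrow(X); k <- ncol(X)
bread <- solve(crossprod(X))      # (X'X)^{-1}
meat  <- crossprod(X * e)         # X' diag(e_i^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
all.equal(unname(vcov_hc1), unname(vcovHC(reg.model, type = "HC1")))
coeftest(reg.model, vcov = vcovHC(reg.model, type = "HC1"))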
Clustered standard errors
Assume error structure of
\[
\epsilon_{ij} = \theta_j + u_{ij}
\]
where \(\theta_j\) is a common shock for each subgroup \(j\) in the dataset. The variance-covariance matrix is no longer diagonal but block-diagonal: \(var(\epsilon_{ij}) = var(\theta_j) + var(u_{ij})\); \(cov(\epsilon_{ij}, \epsilon_{i'j}) = var(\theta_j) \; \forall i, i' \in j\) (two observations in the same group), while \(cov(\epsilon_{ij}, \epsilon_{i'j'}) = 0\) for two observations not in the same group.
\[
Var(\hat{\beta}) =
(\sum_{j=1}^J \sum_{i \in C_j} x_{ij}' x_{ij})^{-1}
\sum_{j=1}^J (\sum_{i \in C_j} \sum_{i' \in C_j} (x_{ij}' \hat{\epsilon}_{ij})
(\hat{\epsilon}_{i'j}' x_{i'j} ) )
(\sum_{j=1}^J \sum_{i \in C_j} x_{ij}' x_{ij})^{-1}
\]
where \(J\) is the number of clusters.
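In R the same estimator is available as `sandwich::vcovCL`, or by passing a cluster variable to `felm`; a sketch clustering the schooling regression by place of birth (an arbitrary grouping chosen purely for illustration):
library(sandwich)
library(lmtest)
# Cluster-robust variance, clustering on place of birth
coeftest(reg.model, vcov = vcovCL(reg.model, cluster = pums$pob))
# Equivalent felm call: the cluster variable goes in the fourth part of the formula
summary(felm(lwklywge ~ educ | 0 | 0 | pob, data = pums))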
Bootstrapped standard errors
Compute \(\hat{\beta}(r)\) for each random sample (drawn with replacement from the overall sample), then calculate the empirical standard deviation of \(\hat{\beta}(r)\)
\[
Var(\hat{\beta})^{boot} = \frac{1}{R} \sum_{r=1}^R (\hat{\beta}(r) -
\hat{\bar{\beta}})^2
\]
Block bootstrap to preserve group structure.
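A minimal nonparametric bootstrap of the schooling coefficient (a sketch resampling rows with replacement; a block bootstrap would resample whole clusters instead):
set.seed(42)
R <- 200                                # bootstrap replications
boot_beta <- replicate(R, {
  idx <- sample(nrow(pums), replace = TRUE)
  coef(lm(lwklywge ~ educ, data = pums[idx, ]))["educ"]
})
c(boot_se = sd(boot_beta),
  ols_se  = summary(reg.model)$coef["educ", "Std. Error"])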
TBD: Miller, Cameron, Gelbach double-cluster, Conley spatial HAC
4. Instrumental Variables
Endogeneity : \(Cov(X_i, \epsilon_i) \neq 0\)
Instrumental Variable requirements:
- Relevance : \(Cov(X_i, Z_i) \neq 0\)
- Exclusion Restriction: \(Cov(Z_i, \epsilon_i) = 0\)
IV estimation
\[
\beta_{IV} = \frac{Cov(Y_i,Z_i)}{Cov(X_i, Z_i)}
= \frac{\beta_{RF}}{\beta_{FS}}
\]
General form:
\[
\hat{\beta}_{IV} = (X'P_z X)^{-1}X'P_z Y
\] where \(P_z := Z(Z'Z)^{-1}Z'\) is the projection matrix of \(Z\)
More general estimation strategies (see the sketch after this list):
- 2 Stage Least Squares:
- Use OLS to regress X on Z and get \(\hat{X} = P_z X\)
- Use OLS to regress y on \(\hat{X}\) to get \(\hat{\beta}_{iv}\)
- Control Function approach:
- Use OLS (or some other estimation strategy) to regress X on Z and get estimated residuals \(\hat{v} = (I - P_z) X = M_z X\)
- Use OLS to regress y on X and \(\hat{v}\) to get \(\hat{\beta}_{iv}\)
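A sketch of both strategies with a single instrument, previewing the quarter-of-birth dummy used below (`z1` is an illustrative variable name); in the just-identified case the two approaches return the same point estimate:
pums$z1 <- (pums$qob == 1) * 1               # illustrative instrument
fs <- lm(educ ~ z1, data = pums)             # first stage
# (1) 2SLS by hand: regress y on the first-stage fitted values
tsls <- lm(lwklywge ~ fitted(fs), data = pums)
# (2) Control function: regress y on educ and the first-stage residual
cf <- lm(lwklywge ~ educ + resid(fs), data = pums)
c(tsls = unname(coef(tsls)[2]), control_fn = unname(coef(cf)["educ"]))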
IV estimation with covariates
conditional independence assumption
\[
{Y_{1i}, Y_{0i}, D_{1i}, D_{0i}} \coprod Z_i | X_i
\]
Angrist and Krueger (1991) - Compulsory Schooling / Quarter of Birth Instrument
Visual test of Instrument Relevance
Z = Quarter of Birth , Y = wages, X = education
pums.qob.means <- pums %>% group_by(yob, qob) %>% summarise_all(funs(mean))
#%%
# Add dates
pums.qob.means$yqob <- ymd(paste0("19",
pums.qob.means$yob,
pums.qob.means$qob * 3),
truncated = 2)
# Function for plotting data
plot.qob <- function(ggplot.obj, ggtitle, ylab) {
gg.colours <- c("firebrick", rep("black", 3), "white")
ggplot.obj + geom_line() +
geom_point(aes(colour = factor(qob)),
size = 5) +
geom_text(aes(label = qob, colour = "white"),
size = 3,
hjust = 0.5, vjust = 0.5,
show_guide = FALSE) +
scale_colour_manual(values = gg.colours, guide = FALSE) +
ggtitle(ggtitle) +
xlab("Year of birth") +
ylab(ylab)
}
# Plot
p.educ <- plot.qob(ggplot(pums.qob.means, aes(x = yqob, y = educ)),
"A. Average education by quarter of birth (first stage)",
"Years of education")
p.lwklywge <- plot.qob(ggplot(pums.qob.means, aes(x = yqob, y = lwklywge)),
"B. Average weekly wage by quarter of birth (reduced form)",
"Log weekly earnings")
p.ivgraph <- grid.arrange(p.educ, p.lwklywge)
Replication (Table 4.1.1 in MHE)
pums <- fread('asciiqob.txt',
header = FALSE,
stringsAsFactors = FALSE)
names(pums) <- c('lwklywge', 'educ', 'yob', 'qob', 'pob')
qobs <- unique(pums$qob)
qobs.vars <- sapply(qobs, function(x) paste0('qob', x))
pums[, (qobs.vars) := lapply(qobs, function(x) 1*(qob == x))]
#%%
# Column 1: OLS
col1 <- felm(lwklywge ~ 1|0|(educ~educ)|0, pums)
# Column 2: OLS with YOB, POB dummies
col2 <- felm(lwklywge ~ 1| factor(yob) + factor(pob)|(educ~educ)|0, pums)
#%%
# Column 3: 2SLS with instrument QOB = 1
col3 <- felm(lwklywge ~ 1 | 0 | (educ ~ qob1) | 0, data = pums)
#%%
# Column 4: 2SLS with instruments QOB = 1, QOB = 2, QOB = 3
col4 <- felm(lwklywge ~ 1 | 0 | (educ ~ qob1 + qob2 + qob3) | 0, data = pums)
#%%
# Column 5: 2SLS with YOB, POB dummies and instrument (QOB = 1)
col5 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
(educ ~ qob1)|0,data=pums)
# Column 6: 2SLS with YOB, POB dummies and full QOB dummies
col6 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
(educ ~ factor(qob))|0,data=pums)
# Column 7: 2SLS with YOB, POB dummies and full QOB dummies interacted with YOB
col7 <- felm(lwklywge ~ 1 | factor(yob) + factor(pob) |
(educ ~ factor(qob) * factor(yob)),
data = pums)
stargazer(col1,col2,col3,col4,col5,col6,col7,
keep.stat="n",
covariate.labels=c('Years of education'),
add.lines = list(
c("YOB fixed effects", '', 'X','','','X','X','X' ),
c("POB fixed effects", '', 'X','','','X','X','X' ),
c("Instruments", '', '','','','','' ),
c("QOB=1", '', '','X','X','X','X' ),
c("QOB=2", '', '','','X','','X' ),
c("QOB=3", '', '','','X','','X' ),
c("QOB X YOB dummies", '', '','','','','','X' )),
omit= "Constant",
type=exp_type)
Dependent variable: lwklywge

|                    | (1)      | (2)      | (3)      | (4)      | (5)      | (6)      | (7)      |
|--------------------|----------|----------|----------|----------|----------|----------|----------|
| Years of education | 0.071*** | 0.067*** | 0.100*** | 0.100*** | 0.100*** | 0.110*** | 0.087*** |
|                    | (0.0003) | (0.0003) | (0.024)  | (0.020)  | (0.026)  | (0.020)  | (0.016)  |
| YOB fixed effects  |          | X        |          |          | X        | X        | X        |
| POB fixed effects  |          | X        |          |          | X        | X        | X        |
| Instruments:       |          |          |          |          |          |          |          |
| QOB = 1            |          |          | X        | X        | X        | X        |          |
| QOB = 2            |          |          |          | X        |          | X        |          |
| QOB = 3            |          |          |          | X        |          | X        |          |
| QOB x YOB dummies  |          |          |          |          |          |          | X        |
| Observations       | 329,509  | 329,509  | 329,509  | 329,509  | 329,509  | 329,509  | 329,509  |

Note: *p<0.1; **p<0.05; ***p<0.01
The Wald Estimator
Structural equation:
\[
Y_i = \alpha + \rho S_i + \eta_i
\]
\(Cov(\eta_i, s_i) \neq 0\). Let \(Z_i \in {0,1}\) s.t. \(P(Z_i=1) = p\), \[
Cov(Y_i, Z_i) = \{ E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]\}p(1-p)
\]
Then, \[
\rho = \frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[s_i|Z_i = 1] - E[s_i|Z_i=0]}
\]
pums$z <- (pums$qob == 3 | pums$qob == 4) * 1
# Compare means (and differences)
ttest.lwklywge <- t.test(lwklywge ~ z, pums)
ttest.educ <- t.test(educ ~ z, pums)
# Compute Wald estimate
sur <- systemfit(list(first = educ ~ z,
second = lwklywge ~ z),
data = pums,
method = "SUR")
wald <- deltaMethod(sur, "second_z / first_z")
wald.estimate <- (mean(pums$lwklywge[pums$z == 1]) - mean(pums$lwklywge[pums$z == 0])) /
(mean(pums$educ[pums$z == 1]) - mean(pums$educ[pums$z == 0]))
# wald.se <- wald.estimate^2 * ()
# OLS estimate
ols <- lm(lwklywge ~ educ, pums)
# Construct table
lwklywge.row <- c(ttest.lwklywge$estimate[1],
ttest.lwklywge$estimate[2],
ttest.lwklywge$estimate[2] - ttest.lwklywge$estimate[1])
educ.row <- c(ttest.educ$estimate[1],
ttest.educ$estimate[2],
ttest.educ$estimate[2] - ttest.educ$estimate[1])
wald.row.est <- c(NA, NA, wald$Estimate)
wald.row.se <- c(NA, NA, wald$SE)
ols.row.est <- c(NA, NA, summary(ols)$coef['educ' , 'Estimate'])
ols.row.se <- c(NA, NA, summary(ols)$coef['educ' , 'Std. Error'])
table <- rbind(lwklywge.row, educ.row,
wald.row.est, wald.row.se,
ols.row.est, ols.row.se)
colnames(table) <- c("Born in the 1st or 2nd quarter of year",
"Born in the 3rd or 4th quarter of year",
"Difference")
rownames(table) <- c("ln(weekly wage)",
"Years of education",
"Wald estimate",
"Wald std error",
"OLS estimate",
"OLS std error")
knitr::kable(table)
|                    | Born in the 1st or 2nd quarter of year | Born in the 3rd or 4th quarter of year | Difference |
|--------------------|----------------------------------------|----------------------------------------|------------|
| ln(weekly wage)    | 5.9                                    | 5.9                                    | 0.01       |
| Years of education | 12.7                                   | 12.8                                   | 0.11       |
| Wald estimate      | NA                                     | NA                                     | 0.11       |
| Wald std error     | NA                                     | NA                                     | 0.02       |
| OLS estimate       | NA                                     | NA                                     | 0.07       |
| OLS std error      | NA                                     | NA                                     | 0.00       |
The Local Average Treatment Effect (LATE)
Types of validity in research design:
- Internal Validity - whether a given design successfully uncovers causal effects for the population being studied
- External Validity - predictive value of the study’s finding in a different context
LATE Theorem
Assumptions
- Instrument is as good as randomly assigned - it is independent of the vector of potential outcomes and potential treatment assignments \[
[\{ Y_i(d,z); \forall d, z \}, D_{1i}, D_{0i}] \bot z_i
\] Sufficient for a causal interpretation of the reduced form
- First Stage - \(Cov(d,z) \neq 0\) ; \(E[D_{1i}- D_{0i}] \neq 0\)
- Exclusion restriction - \(Y_i(d,z) = f(d)\), i.e. only a function of \(d\); \(Y_i(d,0) = Y_i(d,1) \equiv Y_{di}\) for \(d \in \{0,1\}\)
- Monotonicity - Either \(\pi_{1i} \geq 0 \; \forall i\) or \(\pi_{1i} \leq 0 \; \forall i\) (everyone affected by the instrument is affected in the same way - a red flag for jackknife instruments)
Given the above conditions, the Wald estimand can be interpreted as the causal effect of the endogenous variable of interest on the outcome for those whose treatment status is changed by the instrument:
\[\begin{align*}
\rho & = \frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[s_i|Z_i = 1] - E[s_i|Z_i=0]} \\
& = E[Y_{1i} - Y_{0i} | D_{1i} > D_{0i}] \\
& = E[\rho_i | \pi_{1i} > 0]
\end{align*}\]
Populations in IV
- Compliers : Subpopulation with \(D_{1i} = 1 \land D_{0i} = 0\)
- Defiers: Subpopulation with \(D_{1i} = 0 \land D_{0i} = 1\) ; Violates monotonicity - should be \(\emptyset\)
- Always takers: Subpopulation with \(D_{1i} = D_{0i} = 1\)
- Never takers: Subpopulation with \(D_{1i} = D_{0i} = 0\)
LATE is the effect of the treatment on the population of compliers. It is not informative about never-takers and always-takers.
Treatment Effects
- Average Treatment Effect (ATE) = \(E[Y_{1i}-Y_{0i}]\)
- Intent to Treat (ITT) = \(E[Y_{i}|Z_i = 1] - E[Y_{i}|Z_i = 0]\) - difference in outcomes between those who were offered treatment and those who weren’t
- Treatment on Treated (TOT) = Weighted average of the effect on always-takers and the effect on compliers = \(E[Y_{1i} - Y_{0i}|D_i=1]\) - mean effect for those who actually participated in the program
- Local Average Treatment Effect (LATE) : \(E[Y_{1i} - Y_{0i} | D_{1i} > D_{0i}]\)
Under randomisation and perfect compliance, ITT == ATE
In a randomised trial with one-sided non-compliance:
Bloom Result
\[
ToT = E[Y_{1i} - Y_{0i}|D_i = 1] = \frac{E[Y_i|z_i = 1] -
E[Y_i|Z_i=0]}{E[D_i|z_i =1]} = \frac{RF}{FS} = \frac{ITT}{Compliance}
\]
Standard errors
Manually running IV using 2 stages does not give you correct standard errors.
The OLS residual variance from the manual second stage is the variance of \(\eta_i + \rho(s_i - \hat{s}_i)\), while for proper 2SLS standard errors we want the variance of \(\eta_i\) only.
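A sketch of the difference, comparing the manual second stage with a proper IV routine such as `AER::ivreg` (coefficients agree, standard errors do not); the QOB dummy constructed above is used as the instrument:
library(AER)
fs     <- lm(educ ~ qob1, data = pums)
manual <- lm(lwklywge ~ fitted(fs), data = pums)     # manual second stage
proper <- ivreg(lwklywge ~ educ | qob1, data = pums) # correct 2SLS standard errors
rbind(manual = summary(manual)$coef[2, 1:2],
      proper = summary(proper)$coef["educ", 1:2])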
Multiple Instruments
Consider two mutually exclusive dummy instruments (\(z_{1i}, z_{2i}\)). The resulting 2SLS estimate of \(\rho\) is a weighted average of causal effects for the instrument-specific compliant subpopulations
\[
\rho_j = \frac{Cov(Y_i, z_{ji})}{Cov(D_i, z_{ji})} , j = 1,2
\]
First stage fitted values are \(\hat{D_i} = \pi_{11}z_{1i} + \pi_{12}z_{2i}\)
\[\rho_{2sls} = \psi \rho_1 + (1-\psi) \rho_2 \]
where
\[
\psi = \frac{\pi_{11} Cov(D_i, z_{1i})}{\pi_{11}Cov(D_i, z_{1i})+
\pi_{12}Cov(D_i, z_{2i})}
\]
\(\psi\) is a number between 0 and 1 that depends on the relative strength of each instrument in the first stage
2SLS vs LIML
Only relevant when the model is overidentified. 2SLS estimation is done by fitting the first stage and reduced form separately; at each step, \(\beta\) and \(\pi\) are estimated to minimise errors in that equation. On the other hand, limited-information maximum likelihood (LIML) jointly minimises the sum of squared errors in both stages.
Overidentification
Basic idea: when the model is overidentified, we can test the exclusion restriction (\(Cov(z, e)=0\)) by estimating the model with the two instruments separately and testing whether \(Cov(z_1, e_2)=0\) (where \(e_2\) is the error term from the structural equation when \(x\) is instrumented by \(z_2\)) and vice versa.
Steps (sketched in code below):
- estimate the model with all available instruments, keep residual \(\hat{e}\)
- regress \(\hat{e}\) on \(z_1\) and \(z_2\), store \(R^2\).
- Under the null that overid restrictions are satisfied, \(nR^2 \sim \chi^2(q)\), where \(q\) is the number of overidentifying restrictions (1 in this case)
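A sketch of this test with two quarter-of-birth dummies instrumenting education (one overidentifying restriction); `summary(..., diagnostics = TRUE)` in `AER` reports the same idea as the Sargan statistic:
library(AER)
iv2  <- ivreg(lwklywge ~ educ | qob1 + qob2, data = pums)
ehat <- resid(iv2)
aux  <- lm(ehat ~ qob1 + qob2, data = pums)
stat <- nrow(pums) * summary(aux)$r.squared
c(nR2 = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
summary(iv2, diagnostics = TRUE)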
Hausman Test
IV comes at the price of less precise estimates, so it is better to use OLS unless it can be shown that IV significantly changes results. Define:
\[
H = \frac{(\hat{\beta}_{2sls} - \hat{\beta}_{OLS})^2}
{\hat{V}(\hat{\beta}_{2sls}) - \hat{V}(\hat{\beta}_{OLS})}
\]
Under the null that OLS is consistent, \(H \sim \chi^2_1\). This is equivalent to testing \(\gamma = 0\) in the equation \(y = \beta x + \gamma \hat{u} + \epsilon\), where \(\hat{u}\) are the residuals from the first stage of IV (regression of x on z).
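A sketch of the regression-based version of this test (equivalent to the Wu-Hausman diagnostic reported by `AER::ivreg`), again using a quarter-of-birth dummy as the instrument:
fs  <- lm(educ ~ qob1, data = pums)
aug <- lm(lwklywge ~ educ + resid(fs), data = pums)
summary(aug)$coef["resid(fs)", ]   # t-test of gamma = 0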
Bartik IV
Code from John Joseph Horton’s blog
#%%
#%% Bartik Instrument
set.seed(6152011)
# Number of cities
L <- 1000
# Number of industries
K <- 3
# National growth rates by industry
g.raw <- runif(K)
g.k <- g.raw/sum(g.raw)
#%%
CreateCity <- function(g.k, eta.s){
# Give each city some industry shares
K <- length(g.k)
z.raw <- runif(K)
z <- z.raw / sum(z.raw)
# Give each city some idiosyncratic growth by industry
g.kl.tilde <- g.raw/sum(g.raw)
# Change in employment is inner product of the national
# + idiosyncratic component with industry share
x <- (g.kl.tilde + g.k) %*% z
# Bartik instrument
# (national growth & city industry share inner product)
B <- g.k %*% z
# Realized change in wage
y <- (1/eta.s) * x + rnorm(1,0,1/10)
list("y" = y, "B" = B, "x" = x)
}
# Create a city data frame
city.data <- lapply(1:L, function(i) CreateCity(g.k, 1))
df <- data.frame(matrix(unlist(city.data), ncol = 3, byrow = TRUE))
colnames(df) <- c("y", "B", "x")
# Estimates
m.ols <- felm(y ~ x | 0 | 0 | 0, data = df)
m.iv <- felm(y ~ 1 | 0 | (x ~ B) | 0, data = df)
stargazer(m.ols, m.iv, type = exp_type)
Dependent variable: y

|                                | (1)               | (2)              |
|--------------------------------|-------------------|------------------|
| x                              | -0.500*** (0.027) |                  |
| x(fit)                         |                   | 1.000*** (0.110) |
| Constant                       | 1.000*** (0.020)  | -0.0004 (0.077)  |
| Observations                   | 1,000             | 1,000            |
| R2                             | 0.250             | -2.000           |
| Adjusted R2                    | 0.250             | -2.000           |
| Residual Std. Error (df = 998) | 0.180             | 0.350            |

Note: *p<0.1; **p<0.05; ***p<0.01
Weak Instruments
Given first stage \(s_i = \pi z_i + \epsilon_i\) and structural equation \(y_i = \rho s_i + \eta_i\), define \(\sigma_{\epsilon \eta} = Cov(\epsilon, \eta)\) and \(\sigma_{\epsilon}^2 = Var(\epsilon)\). Let \(F\) be the F statistic of the first-stage. Then, bias in IV term is:
\[
\rho_{2sls} - \rho = \frac{\sigma_{\epsilon,\eta}}{\sigma_{\epsilon}^2}
\frac{1}{F+1}
\]
So, as \(F \rightarrow \infty\), bias \(\rightarrow 0\).
Bound, Jaeger, and Baker (1995): IV may be biased / inconsistent if the instruments are only weakly correlated with the endogenous variables
\[
\rho = \frac{Cov(Y_i, Z_i)}{Cov(D_i,Z_i)} + \frac{Cov(\eta_i, Z_i)}{Cov(D_i, Z_i)}
\]
The second term gets inflated if the instrument is weak.
\[
plim \hat{\beta}_{iv} = \beta + \frac{plim \Delta \bar{\epsilon}}{plim
\Delta \bar{x}}
\]
\[
\frac{plim \hat{\beta}_{iv} - \beta}{plim \hat{\beta}_{ols} - \beta} =
\frac{cov(\hat{x}, \epsilon) / cov(x,\epsilon)}{R^2_{x,z}}
\]
Staiger and Stock (1997) rule of thumb - first-stage F statistic \(\geq\) 10 (for 1 endogenous variable)
Olea and Pflueger (2013) - construct Effective F Statistic
\[
\hat{F}_{eff} = \frac{1}{S} \frac{x'ZZ'x}{tr(\hat{W}_2)}
\]
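A sketch of the rule-of-thumb check for the quarter-of-birth design: the F statistic from the first-stage regression of schooling on the excluded instruments (here the QOB dummies are the only regressors, so the overall F statistic is the relevant one):
fs <- lm(educ ~ qob1 + qob2 + qob3, data = pums)
summary(fs)$fstatistic    # compare against the Staiger-Stock threshold of 10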
5. Differences-in-Differences, Fixed Effects, and Panel Data
Fixed Effects
Define an individual fixed effect for individual \(i\)
\[
A_i =
\begin{cases}
1 \text{ if the observation involves unit i} \\
0 \text{ otherwise}
\end{cases}
\]
and define the same for each time period for panel data.
If \(D_{it}\) is as good as randomly assigned conditional on \(A_i\):
\[
E[Y_{0it}|A_i, X_{it}, t, D_{it}] = E[Y_{0it}|A_i, X_{it}, t]
\]
Then, assuming \(A_i\) enter linearly,
\[
E[Y_{0it}|A_i, X_{it}, t, D_{it}] = \alpha + \lambda_t + A_i'\gamma
+ X_{it}'\beta
\]
Assuming the causal effect of the treatment is additive and constant,
\[
E[Y_{1it}|A_i, X_{it}, t] = E[Y_{0it}|A_i, X_{it}, t] + \rho
\]
where \(\rho\) is the causal effect of interest.
Restrictions
- Linear
- Additive functional form
- Variation in \(D_{it}\) over time, for each \(i\), must be as good as random
Then, we can write:
\[\begin{align*}
Y_{it} & = \alpha_i + \lambda_t + \rho D_{it} + X_{it}' \beta + \epsilon_{it} \\
\epsilon_{it} & = Y_{0it} - E[Y_{0it}|A_i, X_{it}, t] \\
\alpha_i & = \alpha + A_i \gamma
\end{align*}\]
- \(\alpha_i\) captures all observable and unobservable time-invariant characteristics of individual \(i\)
- \(\lambda_t\) captures all observable and unobservable time-varying characteristics that are constant across the population
Random Effects
When there is autocorrelation in the time series (i.e. the \(\epsilon_t\)s are correlated over time), GLS estimates can be obtained by estimating OLS on quasi-differenced data.
\[
y_{it} - \lambda \bar{y_{i}} = (x_{it} - \lambda \bar{x_i}) \beta +
(1-\lambda)\theta_i + \nu_{it} - \lambda \bar{\nu}_i
\]
where \[\lambda = 1 -
\left[ \frac{\sigma_\nu^2}
{\sigma_\nu^2 + T \sigma_\theta^2} \right]^{\frac{1}{2}}
\]
If the error component \(\theta\) is correlated with \(x\), RE estimates are not consistent. Perform Hausman test for random vs fixed effects (where under the null, \(Cov(\theta_i, x_{it}) = 0\))
Within Estimator
Calculate individual averages by averaging the FE equation
\[
\bar{Y_i} = \alpha_i + \bar{\lambda} + \rho \bar{D_i} +
\bar{X_i}'\beta + \bar{\epsilon}_i
\]
then subtract from the FE equation
\[
Y_{it} - \bar{Y_i} = \lambda_{t} - \bar{\lambda} + \rho (D_{it} -
\bar{D_i}) + (X_{it} - \bar{X_i})'\beta + (\epsilon_{it} -
\bar{\epsilon}_i)
\]
yields the same \(\rho\)
First Differencing
Define \(\Delta Y_{i,t} = Y_{i,t} - Y_{i,t-1}\) and the corresponding first-differences for \(D\) and \(X\)
Then, the first-differences estimating equation is:
\[
\Delta Y_{it} = \Delta \lambda_{t} + \rho \Delta D_{it} +
\Delta X_{it}' \beta + \Delta \epsilon_{it}
\]
This is numerically identical to the within estimator for \(T=2\), but not otherwise. Both are consistent; FE is more efficient when the \(\epsilon_{it}\) are serially uncorrelated.
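A simulated two-period check of this equivalence (a sketch; `felm` absorbs the unit and time effects for the within estimator, and the first-difference regression keeps its intercept to absorb the common time effect):
set.seed(1)
N <- 500
panel <- data.frame(id = rep(1:N, each = 2), t = rep(1:2, N))
alpha <- rnorm(N)[panel$id]                       # unit fixed effects
panel$d <- rbinom(2 * N, 1, 0.5)
panel$y <- alpha + 0.5 * panel$d + 0.2 * (panel$t == 2) + rnorm(2 * N)
# Within estimator
fe <- felm(y ~ d | factor(id) + factor(t), data = panel)
# First differences
wide <- reshape(panel, idvar = "id", timevar = "t", direction = "wide")
fd <- lm(I(y.2 - y.1) ~ I(d.2 - d.1), data = wide)
c(fe = unname(coef(fe)["d"]), fd = unname(coef(fd)[2]))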
Differences-in-Differences
Assume \(E[Y_{0it}] = \gamma_i + \lambda_t\), where \(\gamma_i\) is a unit fixed effect and \(\lambda_t\) is a time fixed effect. Let \(D_{it}\) be a dummy for treated unit-time combinations, and assume the treatment effect \(E[Y_{1it} - Y_{0it}|i,t]\) is a constant denoted by \(\beta\). We have:
\[
Y_{it} = \gamma_i + \lambda_t + \beta D_{it} + \epsilon_{it}
\]
In practice, the regression formulation is:
\[
Y_{it} = \alpha + \gamma d_i + \lambda Post_{t} + \beta (d_i \times
Post_{t}) + \epsilon_{it}
\]
DiD and Within estimator equivalence for T=2
For a sample with two time periods (\(t \in {0,1}\)), the estimate \(\hat{\rho}\) of the causal parameter \(\rho\) is identical when estimated using the two approaches (1) individual fixed effects and (2) first differences.
For Individual Fixed Effects estimation, using the FWL theorem / regression anatomy formula, we can write:
\[ \hat{\rho}_{fe} = \frac{Cov(\tilde{Y_{it}}, \tilde{X_{it}}
)}{Var(\tilde{X_{it}})} \]
where \(\tilde{Y_{it}}\) and \(\tilde{X_{it}}\) are residuals from regressing \(Y_{it}\) and \(X_{it}\) on individual fixed effects \(\gamma_i\) and a constant term \(\alpha\). Regressing these variables on individual fixed effects and a constant is equivalent to estimating individual means (\(\bar{Y}_i\) and \(\bar{X}_i\)), so the residuals are deviations from the mean (\(Y_{i,t} - \bar{Y}_i\)) and \((X_{i,t} -\bar{X}_i)\).
Furthermore, since \(T = 2\), \(\bar{Y}_i = Y_{i,t}+\frac{\Delta Y_{i,t}}{2}\) and \(\bar{X}_i = X_{i,t}+\frac{\Delta X_{i,t}}{2}\)
Thus,
\[\begin{align*}
\hat{\rho}_{fe} & = \frac{Cov(\tilde{Y_{it}}, \tilde{X_{it}}
)}{Var(\tilde{X_{it}})} \\
& = \frac{Cov(Y_{i,t} - \bar{Y}_i, X_{i,t}
-\bar{X}_i)}{Var(X_{i,t} -\bar{X}_i))}\\
& = \frac{Cov(Y_{i,t} - Y_{i,t} - \frac{\Delta Y_{i,t}}{2} , X_{i,t}
- X_{i,t} - \frac{\Delta X_{i,t}}{2})}{Var(X_{i,t}
- X_{i,t} - \frac{\Delta X_{i,t}}{2})}\\
& = \frac{-Cov(\Delta Y_{i,t},\Delta X_{i,t} )}{-Var(\Delta X_{i,t})} \\
& = \frac{Cov(\Delta Y_{i,t},\Delta X_{i,t} )}{Var(\Delta X_{i,t})} \\
& = \hat{\rho}_{fd}
\end{align*}\]
Card and Krueger - Minimum Wage (1994)
Card and Krueger (1994)
# Import data
njmin <- read.table('public.dat',
header = FALSE,
stringsAsFactors = FALSE,
na.strings = c("", ".", "NA"))
#%%
names(njmin) <- c('SHEET', 'CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ',
'NORTHJ', 'PA1', 'PA2', 'SHORE', 'NCALLS', 'EMPFT', 'EMPPT',
'NMGRS', 'WAGE_ST', 'INCTIME', 'FIRSTINC', 'BONUS', 'PCTAFF',
'MEALS', 'OPEN', 'HRSOPEN', 'PSODA', 'PFRY', 'PENTREE', 'NREGS',
'NREGS11', 'TYPE2', 'STATUS2', 'DATE2', 'NCALLS2', 'EMPFT2',
'EMPPT2', 'NMGRS2', 'WAGE_ST2', 'INCTIME2', 'FIRSTIN2', 'SPECIAL2',
'MEALS2', 'OPEN2R', 'HRSOPEN2', 'PSODA2', 'PFRY2', 'PENTREE2',
'NREGS2', 'NREGS112')
#%%
# Calculate FTE employment
njmin$FTE <- njmin$EMPFT + 0.5 * njmin$EMPPT + njmin$NMGRS
njmin$FTE2 <- njmin$EMPFT2 + 0.5 * njmin$EMPPT2 + njmin$NMGRS2
#%%
# Create function for calculating standard errors of mean
semean <- function(x, na.rm = FALSE) {
n <- ifelse(na.rm, sum(!is.na(x)), length(x))
sqrt(var(x, na.rm = na.rm) / n)
}
#%%
# Calculate means
summary.means <- njmin[ , c("FTE", "FTE2", "STATE")] %>%
group_by(STATE) %>%
summarise_all(funs(mean(., na.rm = TRUE)))
summary.means <- as.data.frame(t(summary.means[ , -1]))
#%%
colnames(summary.means) <- c("PA", "NJ")
summary.means$dSTATE <- summary.means$NJ - summary.means$PA
summary.means <- rbind(summary.means,
summary.means[2, ] - summary.means[1, ])
row.names(summary.means) <- c("FTE employment before, all available observations",
"FTE employment after, all available observations",
"Change in mean FTE employment")
#%%
# Calculate
summary.semeans <- njmin[ , c("FTE", "FTE2", "STATE")] %>%
group_by(STATE) %>%
summarise_all(funs(semean(., na.rm = TRUE)))
summary.semeans <- as.data.frame(t(summary.semeans[ , -1]))
#%%
colnames(summary.semeans) <- c("PA", "NJ")
summary.semeans$dSTATE <- sqrt(summary.semeans$NJ^2 + summary.semeans$PA^2) # SE of the difference in means
njmin <- njmin[ , c("FTE", "FTE2", "STATE")]
njmin <- melt(njmin,
id.vars = c("STATE"),
variable.name = "Period",
value.name = "FTE")
summary.means <- njmin %>%
mutate(STATE = dplyr::recode(STATE, `0` = 'PA (control)',
`1` = 'NJ (treatment)')) %>%
group_by(STATE, Period) %>%
summarise_all(
funs(
mean(., na.rm = TRUE),
semean(., na.rm = TRUE)
))
difftable = data.table::dcast(setDT(summary.means),
STATE ~ Period, value.var = c('mean','semean'))
difftable[, diff:=mean_FTE2 - mean_FTE]
knitr::kable(difftable)
| STATE          | mean_FTE | mean_FTE2 | semean_FTE | semean_FTE2 | diff  |
|----------------|----------|-----------|------------|-------------|-------|
| NJ (treatment) | 20       | 21        | 0.51       | 0.52        | 0.59  |
| PA (control)   | 23       | 21        | 1.35       | 0.94        | -2.17 |
Main Diff-in-diff estimate
\[
\{E[Y_{ist} | s = NJ, t = Nov] - E[Y_{ist}| s = NJ, t = Feb]\} -
\{E[Y_{ist} | s = PA, t = Nov] - E[Y_{ist}| s = PA, t = Feb]\} = \delta
\]
difftable[1,'diff'] - difftable[2,'diff']
## diff
## 1: 2.8
emps <- summary.means %>%
dplyr::select(state = STATE, period = Period, employment = mean)
ggplot(data=emps,
aes(x=period, y=employment, group=state, colour=state)) +
geom_point() +
geom_line()
Synthetic controls
Construct a synthetic control group by finding optimal weights for control regions so that pre-trends between the treated region (\(r=1\)) and the control regions (\(r = 2 \cdots R\)) are best matched. If the treatment occurs at \(t=0\) and we have data that goes back to \(-\tau\), we minimise:
\[
\min_{w_r} \sum_{t=-\tau}^{0} \left( \Delta Y_{r=1,t} - \sum_{r=2}^R w_r
\Delta Y_{rt} \right)^2
\]
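A minimal sketch of the weight-finding step on simulated data, parameterising the weights through a softmax so they stay non-negative and sum to one (dedicated implementations such as the `Synth` package handle covariate matching and inference on top of this):
set.seed(2)
Tpre <- 20; R <- 10
ctrl    <- matrix(rnorm(Tpre * R), Tpre, R)               # pre-period changes, control regions
w_true  <- c(0.5, 0.3, 0.2, rep(0, R - 3))
treated <- ctrl %*% w_true + rnorm(Tpre, 0, 0.05)         # treated region tracks a convex combination
obj <- function(theta) {
  w <- exp(theta) / sum(exp(theta))                        # weights on the simplex
  sum((treated - ctrl %*% w)^2)
}
fit <- optim(rep(0, R), obj, method = "BFGS")
round(exp(fit$par) / sum(exp(fit$par)), 2)                 # approximately recovers w_true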
Triple-difference
Useful to uncover heterogeneous treatment effects. For general formulation, let \(\mathbb{1}_{I_i = 1}\) denote a dummy for some subgroup of interest, then the estimation equation becomes:
\[
Y_{it} = \alpha + \gamma_i + \lambda^0_t\mathbb{1}_{I_i = 0} +
\lambda^1_t \mathbb{1}_{I_i = 1} + \beta (d_i \times Post_t) +
\psi (d_i \times Post_t \times \mathbb{1}_{I_i = 1}) + \epsilon_{it}
\]
More flexibly allowing for heterogeneous treatment effects, where the treatment heterogeneity is along the \(Z\) dimension, lets us write:
\[
Y_{it} = \alpha + \gamma_i + \lambda_t + \beta (d_{i} \times Post_t) +
\phi (d_i \times Post_t \times Z_i ) + \epsilon_{it}
\]
Granger Causality
Causes should happen before consequences, and not vice-versa
Check whether, conditional on state and year fixed effects, past \(D_{st}\) predicts \(Y_{ist}\) while future \(D_{st}\) does not. If \(D_{st}\) causes \(Y_{ist}\) but not vice versa, then dummies for future policy changes should not matter in an equation like:
\[
Y_{ist} = \gamma_s + \lambda_t + \sum_{\tau=0}^m \delta_{-\tau}
D_{s, t - \tau} + \sum_{\tau=1}^q \delta_{+\tau} D_{s,t+\tau} +
X_{ist}' \beta + \epsilon_{ist}
\]
where the sums on the RHS allow for \(m\) lags (post-treatment effects) and \(q\) leads (pre-treatment / anticipatory effects).
Autor (2003) Granger tests:
autor <- haven::read_dta('table7/autor-jole-2003.dta')
# Log total employment: from BLS employment & earnings
autor$lnemp <- log(autor$annemp)
# Non-business-service sector employment from CBP
autor$nonemp <- autor$stateemp - autor$svcemp
autor$lnnon <- log(autor$nonemp)
autor$svcfrac <- autor$svcemp / autor$nonemp
# Total business services employment from CBP
autor$bizemp <- autor$svcemp + autor$peremp
autor$lnbiz <- log(autor$bizemp)
# Restrict sample
autor <- autor[which(autor$year >= 79 & autor$year <= 95), ]
autor <- autor[which(autor$state != 98), ]
# State dummies, year dummies, and state*time trends
autor$t <- autor$year - 78
autor$t2 <- autor$t^2
# Generate more aggregate demographics
autor$clp <- autor$clg + autor$gtc
autor$a1624 <- autor$m1619 + autor$m2024 + autor$f1619 + autor$f2024
autor$a2554 <- autor$m2554 + autor$f2554
autor$a55up <- autor$m5564 + autor$m65up + autor$f5564 + autor$f65up
autor$fem <- autor$f1619 + autor$f2024 + autor$f2554 + autor$f5564 + autor$f65up
autor$white <- autor$rs_wm + autor$rs_wf
autor$black <- autor$rs_bm + autor$rs_bf
autor$other <- autor$rs_om + autor$rs_of
autor$married <- autor$marfem + autor$marmale
# Modify union variable (1. Don't interpolate 1979, 1981; 2. Rescale into percentage)
autor$unmem[79 == autor$year | autor$year == 81] <- NA
autor$unmem <- autor$unmem * 100
# Create state and year factors
autor$state <- factor(autor$state)
autor$year <- factor(autor$year)
# Diff-in-diff regression
did <- felm(lnths ~ lnemp + admico_2 + admico_1 + admico0 + admico1 + admico2 +
admico3 + mico4 + admppa_2 + admppa_1 + admppa0 + admppa1 +
admppa2 + admppa3 + mppa4 + admgfa_2 + admgfa_1 + admgfa0 +
admgfa1 + admgfa2 + admgfa3 + mgfa4
| state + year + state:t | 0 | state, data = autor)
Estimated impact of implied-contract exceptions to the employment-at-will doctrine on the use of temporary workers
# Plot results
lags_leads <- c("admico_2", "admico_1", "admico0",
"admico1" , "admico2" , "admico3",
"mico4")
labels <- c("2 yr prior", "1 yr prior", "Yr of adopt",
"1 yr after", "2 yr after", "3 yr after",
"4+ yr after")
results.did <- data.frame(label = factor(labels, levels = labels),
coef = summary(did)$coef[lags_leads, "Estimate"] * 100,
se = summary(did)$coef[lags_leads, "Cluster s.e."] * 100)
g <- ggplot(results.did, aes(label, coef, group = 1))
p <- g + geom_point() +
geom_line(linetype = "dotted") +
geom_errorbar(aes(ymax = coef + 1.96 * se,
ymin = coef - 1.96 * se)) +
geom_hline(yintercept = 0) +
ylab("Log points") +
xlab(paste("Time passage relative to year of",
"adoption of implied contract exception"))
p
Dynamic panel data
For FE to be valid, we rely on strict exogeneity wherein \(E(\nu_{it}|x_i)= E(\nu_{it}|x_{i1}, \cdots x_{iT})=0\) (i.e. errors are uncorrelated with lags and leads of x). This fails when one of the x’s is a lagged dependent variable, e.g.
\[
y_{it} = x_{it}\beta + y_{it-1}\delta + \theta_i + \nu_{it}
\]
this yields a correlation between a RHS variable \(y_{it-1}\) and the error term of the form:
\[
cov([y_{it-1} - \bar{y}_{it-1}], [\nu_{it} - \bar{\nu}_i]) =
-(2/T)\sigma_\nu^2 + [(T-1)/T^2]\sigma_\eta^2
\]
This covariance \(\rightarrow 0\) as \(T \rightarrow \infty\).
IV solution
Use lagged values of x and y (e.g. \(x_{it-1}\) and/or \(y_{it-2}\)) as instruments for \(\Delta y_{it-1}\). \(y_{it-2}\) will not be a valid instrument if \(\nu_{it}\) is serially correlated. Arellano and Bond (1991)
Quasi-Panel
Deaton (1985)
let \(i\) index individuals, \(j\) index birth cohorts, and \(t\) index time. Then, without a true panel to estimate the regression
\[
y_{ijt} = x_{ijt}'\beta + \theta_{ij} + \delta_{t} + \nu_{ijt}
\]
we can estimate:
\[
\bar{y}_{jt} = \bar{x}_{jt}' \beta + \bar{\theta}_j + \delta_t
+ \bar{\nu}_{jt}
\]
6. Regression Discontinuity Design
Sharp RD
Treatment changes discontinuously at some particular value \(x_0\) of x
\[
D_i =
\begin{cases}
1 \text{ if } x_i \geq x_0, \\
0 \text{ if } x_i < x_0
\end{cases}
\]
This value \(x\) is likely to be arbitrary in the neighbourhood of \(x_0\)
RD takes advantage of a discrete change in treatment in response to a small change in the running variable x. Causal inference relies on the (local) arbitrariness of whether unit \(i\) falls just below or just above \(x_0\).
Formally,
\[\begin{align*}
E[Y_{0i}|x_i] & = \alpha + \beta x_i \\
Y_{1i} & = Y_{0i} + \rho
\end{align*}\]
which leads to the regression:
\[
Y_i = \alpha + \beta x_i + \tau D_i + \eta_i
\]
To permit nonlinearity, we estimate:
\[
Y_i = f(x_i) + \tau D_i + \eta_i
\]
General Setup with normalised running/forcing variable
Define \(c := x_0\). The treatment effect can be uncovered by running a pooled regression
\[
Y = \alpha_l + \tau D + f(X - c) + \epsilon
\]
It is recommended to let the regression function differ on the two sides of the cutoff by including interaction terms between D and X.
Then, the pooled regression is
\[
Y = \alpha_l + \tau D + \beta_l (X - c) + (\beta_r - \beta_l) D (X-c)
+ \epsilon
\]
Higher order polynomials can be included to better approximate the regression functions (per the Stone-Weierstrass theorem).
# Generate series
nobs = 100
x <- runif(nobs)
y.linear <- x + (x > 0.5) * 0.25 + rnorm(n = nobs, mean = 0, sd = 0.1)
y.nonlin <- 0.5 * sin(6 * (x - 0.5)) + 0.5 + (x > 0.5) * 0.25 + rnorm(n = nobs, mean = 0, sd = 0.1)
y.mistake <- 1 / (1 + exp(-25 * (x - 0.5))) + rnorm(n = nobs, mean = 0, sd = 0.1)
rd.series <- data.frame(x, y.linear, y.nonlin, y.mistake)
# Make graph with ggplot2
g.data <- ggplot(rd.series, aes(x = x, group = x > 0.5))
p.linear <- g.data + geom_point(aes(y = y.linear)) +
stat_smooth(aes(y = y.linear),
method = "lm",
se = FALSE) +
geom_vline(xintercept = 0.5) +
ylab("Outcome") +
ggtitle(bquote('A. Linear E[' * Y["0i"] * '|' * X[i] * ']'))
p.nonlin <- g.data + geom_point(aes(y = y.nonlin)) +
stat_smooth(aes(y = y.nonlin),
method = "lm",
formula = y ~ poly(x, 2),
se = FALSE) +
geom_vline(xintercept = 0.5) +
ylab("Outcome") +
ggtitle(bquote('B. Nonlinear E[' * Y["0i"] * '|' * X[i] * ']'))
f.mistake <- function(x) {1 / (1 + exp(-25 * (x - 0.5)))}
p.mistake <- g.data + geom_point(aes(y = y.mistake)) +
stat_smooth(aes(y = y.mistake),
method = "lm",
se = FALSE) +
stat_function(fun = f.mistake,
linetype = "dashed") +
geom_vline(xintercept = 0.5) +
ylab("Outcome") +
ggtitle('C. Nonlinearity mistaken for discontinuity')
grid.arrange(p.linear, p.nonlin, p.mistake, ncol = 1)
Lee (2008)
Lee (2008) main figure
# Load the .dta file as data.table
lee <- data.table(haven::read_dta('Lee2008/individ_final.dta'))
# Subset by non-missing in the outcome and running variable for panel (a)
panel.a <- na.omit(lee[, c("myoutcomenext", "difshare"), with = FALSE])
# Create indicator when crossing the cut-off
panel.a <- panel.a[ , d := (difshare >= 0) * 1.0]
# Predict with local polynomial logit of degree 4
logit <- glm(formula = myoutcomenext ~ poly(difshare, degree = 4) +
poly(difshare, degree = 4) * d,
family = binomial(link = "logit"),
data = panel.a)
panel.a <- panel.a[ , pmyoutcomenext := predict(logit, panel.a, type = "response")]
# Create local average by 0.005 interval of the running variable
breaks <- round(seq(-1, 1, by = 0.005), 3)
panel.a <- panel.a[ , i005 := as.numeric(as.character(cut(difshare,
breaks = breaks,
labels = head(breaks, -1),
right = TRUE))), ]
panel.a <- panel.a[ , list(m_next = mean(myoutcomenext),
mp_next = mean(pmyoutcomenext)),
by = i005]
# Plot panel (a)
panel.a <- panel.a[which(panel.a$i005 > -0.251 & panel.a$i005 < 0.251), ]
plot.a <- ggplot(data = panel.a, aes(x = i005)) +
geom_point(aes(y = m_next)) +
geom_line(aes(y = mp_next, group = i005 >= 0)) +
geom_vline(xintercept = 0, linetype = 'longdash') +
xlab('Democratic Vote Share Margin of Victory, Election t') +
ylab('Probability of Victory, Election t+1') +
ggtitle('a')
# Subset the outcome for panel (b)
panel.b <- lee[ , i005 := as.numeric(as.character(cut(difshare,
breaks = breaks,
labels = head(breaks, -1),
right = TRUE))), ]
panel.b <- panel.b[ , list(m_vic = mean(mofficeexp, na.rm = TRUE),
mp_vic = mean(mpofficeexp, na.rm = TRUE)),
by = i005]
panel.b <- panel.b[which(panel.b$i005 > -0.251 & panel.b$i005 < 0.251), ]
plot.b <- ggplot(data = panel.b, aes(x = i005)) +
geom_point(aes(y = m_vic)) +
geom_line(aes(y = mp_vic, group = i005 >= 0)) +
geom_vline(xintercept = 0, linetype = 'longdash') +
xlab('Democratic Vote Share Margin of Victory, Election t') +
ylab('No. of Past Victories as of Election t') +
ggtitle('b')
grid.arrange(plot.a, plot.b)
RDRobust
Bandwidth selection per Calonico, Cattaneo, and Titiunik (2014)
#%%
rdrobust_senate = read.csv("rdrobust_senate.csv") # assumes the file is in the working directory
attach(rdrobust_senate)
### rdplot with 95% confidence intervals
(rdplot(y=vote, x=margin, scale= 5, binselect="es", # scale = 2, ci=95,
title="RD Plot: U.S. Senate Election Data",
y.label="Vote Share in Election at time t+1",
x.label="Vote Share in Election at time t"))
## Call: rdplot
##
## Number of Obs. 1297
## Kernel Uniform
##
## Number of Obs. 595 702
## Eff. Number of Obs. 595 702
## Order poly. fit (p) 4 4
## BW poly. fit (h) 100.000 100.000
## Number of bins scale 5 5
Visual Treatment Effects
#%%
rdplot(y=vote,x=margin,h=17.7080, subset=margin<=17.7080&margin>=-17.7080,
binselect="esmv", kernel="triangular", p=1,
title="RD Plot: U.S. Senate Election Data",
y.label="Vote Share in Election at time t+1",
x.label="Vote Share in Election at time t")
#%%
Bandwidth
#%%
summary(rdbwselect(y=vote,x=margin, all = T))
## Call: rdbwselect
##
## Number of Obs. 1297
## BW type All
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 595 702
## Order est. (p) 1 1
## Order bias (q) 2 2
##
## =======================================================
## BW est. (h) BW bias (b)
## Left of c Right of c Left of c Right of c
## =======================================================
## mserd 17.708 17.708 27.984 27.984
## msetwo 16.154 18.009 27.096 29.205
## msesum 18.326 18.326 31.280 31.280
## msecomb1 17.708 17.708 27.984 27.984
## msecomb2 17.708 18.009 27.984 29.205
## cerrd 12.374 12.374 27.984 27.984
## certwo 11.288 12.585 27.096 29.205
## cersum 12.806 12.806 31.280 31.280
## cercomb1 12.374 12.374 27.984 27.984
## cercomb2 12.374 12.585 27.984 29.205
## =======================================================
#%%
Estimation
### rdrobust default
summary(rdrobust(y=vote,x=margin, all = T))
## Call: rdrobust
##
## Number of Obs. 1297
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 595 702
## Eff. Number of Obs. 359 322
## Order est. (p) 1 1
## Order bias (p) 2 2
## BW est. (h) 17.708 17.708
## BW bias (b) 27.984 27.984
## rho (h/b) 0.633 0.633
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 7.416 1.460 5.078 0.000 [4.554 , 10.278]
## Bias-Corrected 7.510 1.460 5.143 0.000 [4.648 , 10.372]
## Robust 7.510 1.743 4.310 0.000 [4.094 , 10.925]
## =============================================================================
#%%
#%%
### rdrobust with covariates
summary(rdrobust(y=vote,x=margin,covs=cbind(class,termshouse,termssenate)))
## Call: rdrobust
##
## Number of Obs. 1108
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 491 617
## Eff. Number of Obs. 313 283
## Order est. (p) 1 1
## Order bias (p) 2 2
## BW est. (h) 17.987 17.987
## BW bias (b) 28.943 28.943
## rho (h/b) 0.621 0.621
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 6.851 1.408 4.866 0.000 [4.091 , 9.611]
## Robust - - 4.200 0.000 [3.729 , 10.254]
## =============================================================================
### rdrobust with cluster-robust variance estimation, and underlying data-driven bandwidth selectors.
summary(rdrobust(y=vote,x=margin,cluster=state))
## Call: rdrobust
##
## Number of Obs. 1297
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 595 702
## Eff. Number of Obs. 359 320
## Order est. (p) 1 1
## Order bias (p) 2 2
## BW est. (h) 17.509 17.509
## BW bias (b) 27.032 27.032
## rho (h/b) 0.648 0.648
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 7.422 1.522 4.875 0.000 [4.438 , 10.406]
## Robust - - 4.266 0.000 [4.091 , 11.046]
## =============================================================================
#%%
Fuzzy RD
Discontinuity doesn’t necessarily change treatment, but only the probability of treatment. Then, we can use the discontinuity as an instrumental variable for treatment status.
\[
P[D_i = 1 | x_i] =
\begin{cases}
g_0 (x_i) & \text{ if } x_i < x_0 \\
g_1 (x_i) & \text{ if } x_i \geq x_0
\end{cases}
\]
\(g_0(x_i) \neq g_1(x_i)\). Assuming \(g_1(x_0) > g_0(x_0)\), the probability of treatment relates to \(x_i\) via:
\[
E[D_i|x_i] = P[D_i=1|x_i] = g_0(x_i) + [g_1(x_i) - g_0(x_i)] T_i
\] where \(T_i = \mathbb{1} (x_i \geq x_0)\) and \(x_0\) is the point of discontinuity
\[
\rho = lim_{\delta \rightarrow 0}
\frac{E[Y_i|x_0 < x_i < x_0 + \delta] - E[Y_i|x_0 - \delta < x_i < x_0]}
{E[D_i|x_0 < x_i < x_0 + \delta] - E[D_i|x_0 - \delta < x_i < x_0]}
\]
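When take-up is observed alongside the running variable, `rdrobust` estimates this ratio directly via its `fuzzy` argument; a sketch on simulated data (the Senate example above is a sharp design, so the data here are made up):
set.seed(3)
n <- 2000
x <- runif(n, -1, 1)                                  # running variable
d <- rbinom(n, 1, ifelse(x >= 0, 0.8, 0.2))           # take-up jumps at the cutoff, imperfectly
y <- 0.5 * x + 1.0 * d + rnorm(n, 0, 0.5)
summary(rdrobust(y = y, x = x, fuzzy = d))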
Angrist and Lavy (1999)
Angrist and Lavy (1999) uses Maimonides’ rule:
\[
m_{sc} = \frac{e_s}{int[\frac{e_s - 1}{40}] + 1}
\]
library(foreign)
# Load the data
grade4 <- read.dta("final4.dta")
grade5 <- read.dta("final5.dta")
#%%
# Restrict sample
grade4 <- grade4[which(grade4$classize & grade4$classize < 45 & grade4$c_size > 5), ]
grade5 <- grade5[which(grade5$classize & grade5$classize < 45 & grade5$c_size > 5), ]
# Find means class size by grade size
grade4cmeans <- aggregate(grade4$classize,
by = list(grade4$c_size),
FUN = mean,
na.rm = TRUE)
grade5cmeans <- aggregate(grade5$classize,
by = list(grade5$c_size),
FUN = mean,
na.rm = TRUE)
# Rename aggregaed columns
colnames(grade4cmeans) <- c("c_size", "classize.mean")
colnames(grade5cmeans) <- c("c_size", "classize.mean")
#%%
# Create function for Maimonides Rule
maimonides.rule <- function(x) {x / (floor((x - 1)/40) + 1)}
#%%
# Plot each grade
g4 <- ggplot(data = grade4cmeans, aes(x = c_size))
p4 <- g4 + geom_line(aes(y = classize.mean)) +
stat_function(fun = maimonides.rule,
linetype = "dashed") +
expand_limits(y = 0) +
scale_x_continuous(breaks = seq(0, 220, 20)) +
ylab("Class size") +
xlab("Enrollment count") +
ggtitle("B. Fourth grade")
g5 <- ggplot(data = grade5cmeans, aes(x = c_size))
p5 <- g5 + geom_line(aes(y = classize.mean)) +
stat_function(fun = maimonides.rule,
linetype = "dashed") +
expand_limits(y = 0) +
scale_x_continuous(breaks = seq(0, 220, 20)) +
ylab("Class size") +
xlab("Enrollment count") +
ggtitle("A. Fifth grade")
grid.arrange(p5, p4, ncol = 1)
7. Quantile regression
based on Kleiber and Zeileis (2008)
Least-squares models the conditional mean of a response, but one may be interested in treatment effects at different quantiles of the conditional distribution of Y. The Quantile regression model is given by
\[
Q_y(\tau|x) = x_i' \beta_\tau
\]
where \(Q_y(\tau|x)\) denotes the \(\tau\)-quantile of \(y\) conditional on \(x\).
The conditional quantile function at quantile \(\tau\) is defined as
\[
Q_\tau(Y_i | X_i) = F_y^{-1} (\tau|X_i)
\]
CQF solves the following problem:
\[
Q_\tau (Y_i | X_i) = argmin_{q(x)} E [ \rho_\tau (Y_i - q(X_i))]
\]
where \(\rho_\tau(u)\) is the check function, which weights positive and negative terms asymmetrically (unless \(\tau=0.5\), in which case it collapses to a Least-Absolute-Deviation function). \[
\rho_\tau(u) = (\tau - 1(u \leq 0))u = 1(u > 0) \tau |u| +
1(u\leq 0) (1-\tau)|u|
\]
tau_25 = function(x) (x>0) * 0.25 * abs(x) + (x<=0) * (1 - 0.25) * abs(x)
tau_50 = function(x) (x>0) * 0.50 * abs(x) + (x<=0) * (1 - 0.50) * abs(x)
tau_75 = function(x) (x>0) * 0.75 * abs(x) + (x<=0) * (1 - 0.75) * abs(x)
# dummy dataset
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
# plot functions
p +
stat_function(fun = tau_25, aes(colour='tau_25')) +
stat_function(fun = tau_50, aes(colour='tau_50')) +
stat_function(fun = tau_75, aes(colour='tau_75')) +
xlim(-10,10)+
labs(title='Check functions')
#%%
data('CPS1988')
cps_f <- log(wage) ~ experience + I(experience^2) + education
# models
cps_lm <- lm(cps_f, data= CPS1988)
cps_lad <- rq(cps_f, data= CPS1988)
cps_p25 <- rq(cps_f, tau = 0.25, data= CPS1988)
cps_p75 <- rq(cps_f, tau = 0.75, data= CPS1988)
#%%
#%%
stargazer(cps_lm, cps_lad, cps_p25, cps_p75, type=exp_type)
Dependent variable: log(wage)

|                     | (1) OLS                      | (2) τ = 0.5         | (3) τ = 0.25        | (4) τ = 0.75        |
|---------------------|------------------------------|---------------------|---------------------|---------------------|
| experience          | 0.078*** (0.001)             | 0.077*** (0.001)    | 0.092*** (0.002)    | 0.064*** (0.001)    |
| I(experience^2)     | -0.001*** (0.00002)          | -0.001*** (0.00003) | -0.002*** (0.00004) | -0.001*** (0.00002) |
| education           | 0.087*** (0.001)             | 0.094*** (0.001)    | 0.093*** (0.002)    | 0.094*** (0.001)    |
| Constant            | 4.300*** (0.019)             | 4.200*** (0.022)    | 3.800*** (0.029)    | 4.700*** (0.020)    |
| Observations        | 28,155                       | 28,155              | 28,155              | 28,155              |
| R2                  | 0.330                        |                     |                     |                     |
| Adjusted R2         | 0.330                        |                     |                     |                     |
| Residual Std. Error | 0.590 (df = 28151)           |                     |                     |                     |
| F Statistic         | 4,546.000*** (df = 3; 28151) |                     |                     |                     |

Note: *p<0.1; **p<0.05; ***p<0.01
#%%
Variation in coefficients as a function of \(\tau\)
#%%
cps_rqbig <- rq(cps_f, tau = seq(0.05, 0.95, by = 0.05),
data = CPS1988)
#%%
cps_rqbigs <- summary(cps_rqbig)
plot(cps_rqbig)
8. Discrete Choice Models
Generalised Linear Models - Nelder and Wedderburn (1972)
Adapted from Rodriguez
Let \(y_i \sim N(\mu_i, \sigma^2)\), and \(\mu_i = x_i'\beta\). More generally, let \(y_i\) be drawn from an exponential-family density (which includes the normal, binomial, Poisson, exponential, gamma, and inverse Gaussian as special cases).
\[
f(y_i) = \exp \left[ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)}
+ c(y_i, \phi) \right]
\]
where \(\theta_i\) and \(\phi\) are parameters and \(a_i(\phi)\), \(b(\theta_i)\), and \(c(y_i, \phi)\) are known functions, and \(a_i(\phi)\) has the form \(a_i(\phi) = \phi/p_i\) where \(p_i\) is a known prior weight, usually 1.
\(\theta_i\) and \(\phi\) are location and scale parameters.
\(E(Y_i) = \mu_i = b'(\theta_i)\)
\(V(Y_i) = \sigma_i^2 = b''(\theta_i)a_i(\phi)\)
Link function - instead of directly modelling the mean, a one-to-one continuous differentiable link function \(g(\mu_i)\) is introduced s.t. \(\eta_i = g(\mu_i)\), where \(\eta_i\) is called the linear predictor. Then,
\[
\mu_i = g^{-1}(x_i'\beta)
\]
| Model    | Family   | Link                 |
|----------|----------|----------------------|
| OLS      | Normal   | Identity             |
| Logistic | Binomial | Logit                |
| Probit   | Binomial | Probit (Normal CDF)  |
| Poisson  | Poisson  | Log                  |
Binary Response
Assume latent variable \(y^*\) where \(y^* = x_i'\beta + e\), and observed \(y = 1\) if \(y^* >0, 0\) otherwise.
Assuming \(e \sim N(0,1)\), \(Pr(y = 1|x) = Pr(y^* > 0) = Pr(e> -x\beta) =1 - F(-x\beta) = F(x\beta)\).
Common choices for F are:
- Probit model: standard normal \(\Phi\)
- Logit model: logistic CDF \(\Lambda(X_i'\beta) = \frac{exp(X_i'\beta)}{exp(X_i'\beta) + 1} =\frac{1}{1 + exp(-X_i'\beta)}\)
For either F, the model is estimated using maximum likelihood, where the likelihood function is
\[
p(y_i, \beta | x_i) = F(x_i'\beta)^{y_i} (1-F(x_i'\beta))^{(1-y_i)}
\]
Thus, the log likelihood is
\[
\log L_n(\beta) = n^{-1} \sum_{i=1}^n y_i \log F(x_i'\beta)
+ n^{-1} \sum_{i=1}^n (1-y_i) \log(1-F(x_i'\beta))
\]
This is maximised to obtain the ML estimate.
Marginal effects:
\[
\frac{\partial E(Y_i|X_i)}{\partial X_i}
= \frac{\partial Pr(y=1|x)}{\partial x}
= \frac{\partial F(x\beta)}{\partial x} = f(x\beta)\beta
\]
library(margins)
e = exp(1)
logistic = function (x) (e^x/(1+e^x))
linear = function(x) x*0.05
# dummy dataset
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
# plot functions
p +
stat_function(fun = linear, aes(colour='Linear')) +
stat_function(fun = logistic, aes(colour='Logit')) +
stat_function(fun=pnorm, aes(colour='Probit')) +
xlim(-10,10)+
labs(title='LDV CEFs')
Probit
x <- glm(am ~ cyl + hp * wt, data = mtcars, family = binomial(link='probit'))
summary(margins(x)) # marginal effects
## factor AME SE z p lower upper
## cyl 0.0226 0.0492 0.4598 0.6456 -0.0738 0.1190
## hp 0.0026 0.0021 1.2244 0.2208 -0.0015 0.0067
## wt -0.5088 0.2479 -2.0523 0.0401 -0.9948 -0.0229
Logit
x <- glm(am ~ cyl + hp * wt, data = mtcars, family = binomial)
summary(margins(x)) # marginal effects
## factor AME SE z p lower upper
## cyl 0.0216 0.0493 0.4377 0.6616 -0.0750 0.1181
## hp 0.0027 0.0023 1.1596 0.2462 -0.0018 0.0072
## wt -0.5158 0.2685 -1.9209 0.0547 -1.0421 0.0105
Multinomial Logit (cf McFadden (1984) )
Define random Utility model where indirect utilities (V) are latent variables.
For a choice among \(J\) classes with class \(J\) as the base category,
\[
Pr(y = j|x) = \frac{\exp(x\beta_j)}
{1+\exp(x\beta_j) + \cdots + \exp(x \beta_{J-1})}
\enspace \forall j \in {1,\cdots, J-1}
\]
\[
Pr(y = J|x) = \frac{1}{1+\exp(x\beta_j) + \cdots + \exp(x \beta_{J-1})}
\]
#%%
ml <- haven::read_dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
ml$prog = as.factor(ml$prog)
ml %<>% filter(complete.cases(.))
mlog_mod <- nnet::multinom(prog ~ ses + write, data = ml)
## # weights: 12 (6 variable)
## initial value 219.722458
## iter 10 value 182.220707
## final value 182.220698
## converged
summary(mlog_mod)
## Call:
## nnet::multinom(formula = prog ~ ses + write, data = ml)
##
## Coefficients:
## (Intercept) ses write
## 2 -3.6 0.64 0.058
## 3 2.4 0.19 -0.054
##
## Std. Errors:
## (Intercept) ses write
## 2 1.2 0.27 0.021
## 3 1.2 0.30 0.023
##
## Residual Deviance: 364
## AIC: 376
#%%
Count Data (Poisson)
Poisson PDF:
\[
f_i(y_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i !}
\]
specification: \(\mu(x) = exp(x\beta)\).
This yields marginal effect:
\[
\frac{\partial E(y|x)}{\partial x} = exp(x\beta)\beta
\]
It also follows that \(\beta\) is the semi-elasticity of y wrt x
\[
\beta = \frac{\partial E(y|x)}{\partial x} \times \frac{1}{E(y|x)} =
\frac{\partial log E(y|x)}{\partial X}
\]
#%%
p <- read.csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")
p <- within(p, {
prog <- factor(prog, levels=1:3, labels=c("General", "Academic",
"Vocational"))
id <- factor(id)
})
summary(m1 <- glm(num_awards ~ prog + math, family="poisson", data=p))
##
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.204 -0.844 -0.511 0.256 2.680
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.2471 0.6585 -7.97 1.6e-15 ***
## progAcademic 1.0839 0.3583 3.03 0.0025 **
## progVocational 0.3698 0.4411 0.84 0.4018
## math 0.0702 0.0106 6.62 3.6e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
#%%
9. Models with self-selection and censoring
Tobit
Many economic variables are censored at 0 (e.g. hours of work, consumption). A variable is censored when values below the threshold are recorded at the threshold itself (we observe \(y = 0\) whenever the latent value is non-positive), and truncated when observations below the threshold are missing from the sample altogether.
Define a latent variable \(y^*=x\beta+\epsilon\), where \(\epsilon \sim N(0,\sigma^2)\).
\[
y =
\begin{cases}
y^* &\text{ if } y^* > 0\\
0 &\text{ if } y^* \leq 0
\end{cases}
\]
OLS does not yield consistent estimates. Taking conditional expectations of Y yields:
\[
E(y|x) = Pr(y^* \leq 0) \times 0 + Pr(y^* > 0) \times
E(x\beta + \epsilon | x, y^* > 0)
= Pr(y^* > 0) \times [x\beta + E(\epsilon|x,y^* > 0)] \neq x\beta
\]
\(Pr(y^*>0) = Pr(\epsilon > -x\beta) = Pr(\epsilon/\sigma > -x\beta/\sigma) = 1 - \Phi(-x\beta/\sigma) = \Phi(x\beta/\sigma)\)
Given normality of \(\epsilon\),
\[
E(\epsilon|x, \epsilon > -x\beta) = \sigma
\frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)} = \sigma \lambda
\]
Where \(\lambda(x\beta/\sigma) =\frac{\phi(x\beta/\sigma)}{\Phi(x\beta/\sigma)}\) is the inverse Mills Ratio.
Plugging this back into the CEF yields:
\[
E(y|x) = \Phi(x\beta/\sigma) \times [x\beta + \sigma \lambda(x\beta/\sigma)]
\]
With truncation (missing instead of zero for \(y^* \leq 0\)), this changes to
\[
E(y|x) = x\beta + \sigma \lambda(x\beta/\sigma)
\]
Density of the observed variable is
\[
f(y_i|x_i) = \left[1 - \Phi(x_i\beta/\sigma)\right]^{1[y_i=0]}
\left[(1/\sigma)\phi((y_i - x_i\beta)/\sigma)\right]^{1[y_i>0]}
\]
Estimation is done through MLE, where the log likelihood is
\[
l_i(\beta, \sigma) = 1[y_i=0] \log[1 - \Phi(x_i\beta/\sigma)] +
1[y_i > 0]\log[(1/\sigma) \phi ((y_i - x_i\beta)/\sigma)]
\]
#%%
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/tobit.csv")
f <- function(x, var, bw = 15) {
dnorm(x, mean = mean(var), sd(var)) * length(var) * bw
}
# setup base plot
p <- ggplot(dat, aes(x = apt, fill=prog))
# histogram, coloured by proportion in different programs
# with a normal distribution overlayed
p + stat_bin(binwidth=15) +
stat_function(fun = f, size = 1,
args = list(var = dat$apt))
summary(m <- vglm(apt ~ read + math + prog, tobit(Upper = 800), data = dat))
##
## Call:
## vglm(formula = apt ~ read + math + prog, family = tobit(Upper = 800),
## data = dat)
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## mu -2.568 -0.731 -0.0398 0.753 2.80
## loge(sd) -0.969 -0.636 -0.3336 0.236 4.84
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 209.5596 32.5459 6.44 1.2e-10 ***
## (Intercept):2 4.1848 0.0523 79.94 < 2e-16 ***
## read 2.6980 0.6193 4.36 1.3e-05 ***
## math 5.9146 0.7054 8.38 < 2e-16 ***
## proggeneral -12.7146 12.4086 -1.02 0.30552
## progvocational -46.1433 13.7067 -3.37 0.00076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: mu, loge(sd)
##
## Log-likelihood: -1041 on 394 degrees of freedom
##
## Number of iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
Heckit
An alternative is the 2 step Heckman procedure, where the first stage is a model to estimate \(Pr(y^* > 0) = \Phi(x\beta/\sigma) = \Phi(x\gamma)\), where \(\gamma = \beta/\sigma\). The probit yields a consistent estimate of \(\gamma\) as \(\hat{\gamma}\), which can be plugged into the inverse Mills ratio formula to get
\[
\hat{\lambda} = \frac{\phi(x\hat{\gamma})}{\Phi(x\hat{\gamma})}
\]
- Estimate probit to get \(\hat{\gamma}\), compute \(\hat{\lambda}_i\)
- Use this generated regressor to estimate the model \(y = x \beta + \sigma \hat{\lambda} + \epsilon\)
Problems - the two-step estimator is not efficient:
- \(\hat{\lambda}\) is a generated regressor - bootstrap the whole procedure
- \(\hat{\lambda}\) is mechanically correlated with the error, leading to heteroskedasticity. Use robust standard errors
For truncated variables, estimate the probit for missing vs. positive observations in step (1) of the Heckit.
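A sketch of the two-step procedure on simulated data (the `sampleSelection` package implements the same estimator as `heckit`); `z` is an illustrative selection shifter that satisfies the exclusion restriction by construction:
set.seed(4)
n <- 5000
x <- rnorm(n); z <- rnorm(n)
err <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
select <- (0.5 + x + z + err[, 1]) > 0                 # selection equation
y <- ifelse(select, 1 + 2 * x + err[, 2], NA)          # outcome observed only if selected
# Step 1: probit for selection, then the inverse Mills ratio
probit <- glm(select ~ x + z, family = binomial(link = "probit"))
xg     <- predict(probit)                              # x'gamma-hat (linear predictor)
lambda <- dnorm(xg) / pnorm(xg)
# Step 2: OLS on the selected sample, including the Mills ratio
coef(lm(y ~ x + lambda, subset = select))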