Chapter 03: Causal Inference

This note is a high-fidelity Markdown migration of the Causal Inference chapter from the LaTeX source.

Parent map: index

Prerequisites: probability-and-mathstats, linear-regression

Concept map

flowchart TD
  A[Potential Outcomes] --> B[Randomized Experiments]
  A --> C[Selection on Observables]
  C --> D[Regression Adjustment]
  C --> E[Matching / IPW / AIPW]
  A --> F[Instrumental Variables]
  F --> G[LATE / Compliers]
  F --> H[MTE]
  A --> I[RD / RDD]
  A --> J[DiD / Panel]
  J --> K[Staggered Adoption]
  J --> L[Synthetic Control]
  A --> M[Decomposition Methods]
  A --> N[Causal DAGs]

Foundations, Experiments

Potential Outcomes

Exposition from

$Y_{i}$ is the observed outcome, $D_{i}$ is the treatment with levels $d \in \mathcal{D}$, and potential outcomes are denoted $Y_{d i}$, $Y_{i}^{d}$, or $Y_{i} (d)$ (interchangeably).

$$Y_{i}^{obs} = Y_{i} (D_{i}) = \begin{cases} Y_{1 i} & \text{if } D_{i} = 1, \\ Y_{0 i} & \text{if } D_{i} = 0. \end{cases}$$

Equivalently, we have the switching equation

$$Y_{i} = D_{i} Y_{1 i} + (1 - D_{i}) Y_{0 i} = Y_{0 i} + (Y_{1 i} - Y_{0 i}) D_{i} .$$

This encodes what is known as the causal-consistency assumption (or SUTVA).

Generally, define a potential outcome

$Y_{i}^{d} = φ (d, X_{i}, U_{i})$

where $X_{i}$ is a vector of observed covariates and $U_{i}$ is a vector of unobservables, and $φ$ is an unknown measurable function. Typically, we are interested in non-parametric identification of $φ$ or some features of it.

Given a population of $N$ units, the assignment mechanism is a row-exchangeable function $\Pr (D ∣ X, Y (0), Y (1))$ taking values in $[0, 1]$ and satisfying

$$\sum_{d \in \{0, 1\}^{N}} \Pr (D = d ∣ X, Y (0), Y (1)) = 1.$$

A unit-level assignment probability for unit $i$ is

$$p_{i} (X, Y (0), Y (1)) = \sum_{d : d_{i} = 1} \Pr (D = d ∣ X, Y (0), Y (1)) .$$

A finite-population propensity score is

$$e (x) = \frac{1}{N (x)} \sum_{i : X_{i} = x} p_{i} (X, Y (0), Y (1)) ,$$

where $N (x) = \# \{i = 1, \dots, N : X_{i} = x\}$ is the number of units in the stratum defined by $X_{i} = x$.

An estimand is a row-exchangeable function of potential outcomes, treatment assignment, and covariates:

$τ = τ (Y (0), Y (1), X, d)$

where $Y (0), Y (1)$ are $n$-vectors of potential outcomes, $X$ is an $n \times p$ covariate matrix, and $d$ is an assignment vector.

The most intuitive estimand is the $n$-vector of unit-level effects $τ = Y (1) - Y (0)$. This is impossible to estimate because of the fundamental problem of causal inference (FPCI), so we instead use summaries such as its sample average or subgroup averages.

We never see both potential outcomes for any given unit.

Decompositions of Observed Differences:

$$\underbrace{E [Y_{i} ∣ D_{i} = 1] - E [Y_{i} ∣ D_{i} = 0]}_{\text{observed difference}} = \underbrace{E [Y_{1 i} ∣ D_{i} = 1] - E [Y_{0 i} ∣ D_{i} = 1]}_{\text{ATT}} + \underbrace{E [Y_{0 i} ∣ D_{i} = 1] - E [Y_{0 i} ∣ D_{i} = 0]}_{\text{selection bias}} .$$

$$\underbrace{E [Y_{i} ∣ D_{i} = 1] - E [Y_{i} ∣ D_{i} = 0]}_{\text{observed difference}} = \underbrace{E [Y_{1}] - E [Y_{0}]}_{\text{ATE}} + \underbrace{E [Y_{0 i} ∣ D_{i} = 1] - E [Y_{0 i} ∣ D_{i} = 0]}_{\text{selection bias}} + \underbrace{(1 - π) (ATT - ATU)}_{\text{heterogeneous treatment bias}} ,$$

where $π = E [D]$ is the share of the sample treated.

$(Y_{1 i}, Y_{0 i}) ⊥ ⊥ D_{i}$

This is a Missing Completely at Random (MCAR) assumption on potential outcomes.

Writing observed outcomes through the switching equation assumes that the potential outcomes of any unit do not vary with the treatment assigned to other units. In practice, this is a no-spillovers (no-interference) assumption.

$(Y_{1 i}, Y_{0 i}) ⊥ ⊥ D_{- i}$

Equivalently, let $D$ denote a treatment vector for $N$ units, and let $Y (D)$ be the potential outcome vector that would be observed if treatment were assigned according to allocation $D$. Then SUTVA requires that, for allocations $D, D^{'}$,

$Y_{i} (D) = Y_{i} (D^{'}) \text{ if } D_{i} = D_{i}^{'}$

Intuitively, SUTVA ensures that the ‘science table’ (Imbens & Rubin 2015) has 2 columns for the two potential outcomes, as opposed to $2^{N}$ (the number of potential outcomes per unit under arbitrary interference).

Treatment Effects

Estimands

$τ_{ATE} := E (Y_{1 i} - Y_{0 i})$
$τ_{ATT} := E (Y_{1 i} - Y_{0 i} ∣ D_{i} = 1) = E [Y_{1 i} ∣ D_{i} = 1] - E [Y_{0 i} ∣ D_{i} = 1]$

Under randomisation, $τ_{ATE} = τ_{ATT}$, since the treated are a random sample of the population. Under the weaker assumption $Y_{0 i} ⊥ ⊥ D_{i}$, only $τ_{ATT}$ is identified.

Difference in Means

$$\hat{τ}_{DiM} = \frac{1}{N_{1}} \sum_{i = 1}^{N} D_{i} Y_{i} - \frac{1}{N_{0}} \sum_{i = 1}^{N} (1 - D_{i}) Y_{i} .$$

The variance of the difference-in-means estimator is given by

$$Var (\hat{τ}_{DiM}) = \frac{S_{0}^{2}}{N_{0}} + \frac{S_{1}^{2}}{N_{1}} - \frac{S_{01}^{2}}{N} ,$$

where $S_{0}^{2}, S_{1}^{2}$ are the variances of $Y^{0}, Y^{1}$ respectively, and $S_{01}^{2}$ is the variance of the unit-level treatment effect

$$S_{01}^{2} = \frac{1}{N - 1} \sum_{i = 1}^{N} (Y_{i} (1) - Y_{i} (0) - τ)^{2} .$$

This variance is not identifiable because of the last term, which depends on both potential outcomes of the same unit. If the treatment effect is constant in the population, the last term is zero.

A (conservative) variance estimator drops the last term:

$$\hat{V} (\hat{τ}_{DiM}) = \frac{\hat{σ}_{1}^{2}}{N_{1}} + \frac{\hat{σ}_{0}^{2}}{N_{0}} ,$$

where

$$\hat{σ}_{d}^{2} = \frac{1}{N_{d} - 1} \sum_{i : D_{i} = d} (Y_{i} - \overline{Y}_{d})^{2}, \quad d \in \{0, 1\} .$$

These variance estimates can be used to construct 95% confidence intervals

$$C_{0.95} (τ) = \left(\hat{τ} - 1.96 \sqrt{\hat{V}}, \; \hat{τ} + 1.96 \sqrt{\hat{V}}\right) .$$
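As a minimal sketch of the difference-in-means estimator, its conservative variance, and the resulting interval, the snippet below uses only the Python standard library. The function name `diff_in_means` and the toy data are hypothetical, chosen purely for illustration.

```python
import math
from statistics import mean, variance


def diff_in_means(y, d):
    """Difference in means with the conservative (Neyman) variance estimator."""
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    tau_hat = mean(y1) - mean(y0)
    # statistics.variance divides by N_d - 1, matching the arm-level estimator.
    v_hat = variance(y1) / len(y1) + variance(y0) / len(y0)
    half_width = 1.96 * math.sqrt(v_hat)
    return tau_hat, v_hat, (tau_hat - half_width, tau_hat + half_width)


# Toy data (hypothetical): three treated and three control units.
y = [5.0, 7.0, 6.0, 2.0, 4.0, 3.0]
d = [1, 1, 1, 0, 0, 0]
tau_hat, v_hat, ci = diff_in_means(y, d)
print(tau_hat, v_hat, ci)
```

With so few units the normal-approximation interval is only indicative; the point is the mapping from the formulas to code.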

Regression Adjustment

$$Y_{i} = α + τ_{REG} D_{i} + η_{i} = \overline{Y}_{0} + (\overline{Y}_{1} - \overline{Y}_{0}) D_{i} + η_{i} .$$

$α = E [Y_{0 i}]$
$τ = E [Y_{1 i} - Y_{0 i}]$
$η_{i} = Y_{0 i} - E (Y_{0 i})$ [extra terms above come from allowing for heterogeneous TEs]

Selection bias: $Cov (D_{i}, η_{i}) \neq 0$

Suppose $50$ percent of the population gets the treatment. Let $X_{i}^{'} = [D_{i} \;\; 1]$. Then,

$$\hat{τ} = \bar{Y}_{T} - \bar{Y}_{C}, \quad \hat{α} = \bar{Y}_{C} .$$

Generalise to a fraction $p$ treated.

VCV under homoscedasticity

$$Var (\hat{τ}) \approx \frac{\hat{σ}^{2}}{p (1 - p) N}, \quad \hat{σ}^{2} = \frac{\sum_{i} \hat{u}_{i}^{2}}{N - 2} .$$

VCV under heteroskedasticity

$$Var_{HC} (\hat{τ}) = \frac{1}{N^{2}} \sum_{i = 1}^{N} \left(\frac{D_{i} \hat{u}_{i}^{2}}{\hat{p}^{2}} + \frac{(1 - D_{i}) \hat{u}_{i}^{2}}{(1 - \hat{p})^{2}}\right), \quad \hat{p} = \frac{N_{1}}{N} .$$

Including controls:

$Y_{i} = α + τ D_{i} + X_{i}^{'} β + η_{i}$

Corrects for chance covariate imbalances, improves precision by removing variation in outcome accounted for by pre-treatment characteristics.

Freedman (2008) Critique

Regression of the form $Y_{i} = α + τ_{reg} D_{i} + β_{1} X_{i} + ϵ_{i}$

$\hat{τ}_{reg}$ is consistent for the ATE but has small-sample bias (unless the model is true); the bias is of order $1/ n$

$\hat{τ}_{reg}$ need not gain precision from the inclusion of controls; including controls is harmful to precision if more than $3/4$ of units are assigned to one treatment condition

Lin (2013) recommends instead fitting the fully interacted specification

$$Y_{i} = α + τ_{lin} D_{i} + β_{0}^{⊤} (X_{i} - \bar{X}) + β_{1}^{⊤} D_{i} (X_{i} - \bar{X}) + ϵ_{i}$$

under which the two mean potential outcomes are estimated as

$$\hat{Y} (1) = \bar{Y}_{1} + (\bar{X} - \bar{X}_{1})^{⊤} (\hat{β}_{0} + \hat{β}_{1}), \quad \hat{Y} (0) = \bar{Y}_{0} + (\bar{X} - \bar{X}_{0})^{⊤} \hat{β}_{0} .$$

This has the same small-sample bias, but cannot hurt asymptotic precision even if the model is incorrect, and will likely increase precision if the covariates are predictive of the outcomes.
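The interacted specification can be illustrated with a single covariate. The sketch below relies on the fact that a fully interacted regression is equivalent to fitting a separate line in each arm, so that the treated-arm slope plays the role of $\hat{β}_{0} + \hat{β}_{1}$ and the control-arm slope that of $\hat{β}_{0}$. The helper names and the toy data (constructed so that the true effect is 3 and the covariate is imbalanced) are hypothetical.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    return sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx


def lin_estimator(y, d, x):
    """Interacted regression adjustment with one covariate, via arm-specific fits."""
    xbar = mean(x)
    y_hat = {}
    for arm in (0, 1):
        ya = [yi for yi, di in zip(y, d) if di == arm]
        xa = [xi for xi, di in zip(x, d) if di == arm]
        # Shift each arm's mean outcome to the pooled covariate mean.
        y_hat[arm] = mean(ya) + (xbar - mean(xa)) * ols_slope(xa, ya)
    return y_hat[1] - y_hat[0]


# Toy data (hypothetical): Y = 1 + 2 X + 3 D, with treated units having larger X.
x = [0.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0]
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + 2 * xi + 3 * di for xi, di in zip(x, d)]

naive = mean(y[4:]) - mean(y[:4])   # difference in means absorbs the imbalance
adjusted = lin_estimator(y, d, x)
print(naive, adjusted)
```

In this noiseless example the raw difference in means is 5 because of the chance imbalance in $X$, while the adjusted estimate recovers 3.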

Randomisation Inference

Sharp null: $Y_{1 i} = Y_{0 i} \; \forall i$. This implies $H_{0} : E [Y_{1}] = E [Y_{0}]$ against $H_{1} : E [Y_{1}] \neq E [Y_{0}]$.

To test the sharp null, set $Y_{1 i} = Y_{0 i}$ for all units and re-randomise treatment. Under complete randomisation of $2 N$ units with $N$ treated, there are $\binom{2 N}{N}$ assignment vectors, so the $P$-value can be as small as $1/ \binom{2 N}{N}$.

$Ω$ is the full set of randomisation realisations, and $ω$ is an element of the set (drawn either under complete randomisation or binomial randomisation), with associated probability $1/ \binom{2 N}{N}$ under complete randomisation.

One-sided P-value: $\Pr (\hat{τ} (ω) \geq \hat{τ}_{ATE})$, the share of realisations whose re-computed estimate is at least as large as the observed one.
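For small samples the randomisation distribution can be enumerated exactly. The following is a minimal sketch under complete randomisation; the function name `ri_pvalue` and the toy data are hypothetical, and for larger samples one would sample assignments rather than enumerate them.

```python
from itertools import combinations
from statistics import mean


def ri_pvalue(y, d):
    """One-sided exact p-value for the sharp null under complete randomisation."""
    n, n_treated = len(y), sum(d)

    def dim(treated):
        y1 = [y[i] for i in range(n) if i in treated]
        y0 = [y[i] for i in range(n) if i not in treated]
        return mean(y1) - mean(y0)

    observed = dim({i for i in range(n) if d[i] == 1})
    # Under the sharp null outcomes are fixed, so only the labels are reshuffled.
    draws = [dim(set(c)) for c in combinations(range(n), n_treated)]
    # Small tolerance guards against floating-point ties with the observed value.
    return sum(t >= observed - 1e-12 for t in draws) / len(draws)


# Toy data (hypothetical): the treated units happen to have the three largest outcomes.
y = [5.0, 7.0, 6.0, 2.0, 4.0, 3.0]
d = [1, 1, 1, 0, 0, 0]
p_value = ri_pvalue(y, d)
print(p_value)
```

Here there are $\binom{6}{3} = 20$ assignments and only the observed one produces a difference this large, so the p-value sits at its lower bound of $1/20$.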

Blocking

Stratify randomisation to ensure that groups start out with identical observable characteristics on blocked factors.

$V [\hat{τ}_{BR}] < V [\hat{τ}_{CR}]$ if $\frac{SSR_{\hat{ε}^{*}}}{n - k - 1} < \frac{SSR_{\hat{ε}}}{n - 2}$, where $\hat{ε}$ and $\hat{ε}^{*}$ are the residuals from the specifications omitting and including block dummies respectively.

For $J$ blocks,

Point estimate: $\hat{τ}_{B} = \sum_{j = 1}^{J} \frac{N_{j}}{N} \hat{τ}_{j}$

Variance: randomisations within each block are independent, so the variance is a sum of the block-level variances with squared weights. $Var (\hat{τ}_{B}) = \sum_{j = 1}^{J} (\frac{N_{j}}{N})^{2} Var (\hat{τ}_{j})$

Regression Formulation

$$y_{i} = τ D_{i} + \sum_{j = 2}^{J} β_{j} B_{ij} + ϵ_{i}$$

If treatment probabilities vary by block, then weight by

$$w_{ij} = \frac{D_{i}}{p_{ij}} + \frac{1 - D_{i}}{1 - p_{ij}} .$$
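The blocked point estimate and its variance can be sketched as a weighted combination of within-block differences in means. The function name `blocked_estimate` and the two-block toy data are hypothetical; the within-block variance used here is the conservative estimator, which is an illustrative choice.

```python
from statistics import mean, variance


def blocked_estimate(y, d, block):
    """Block-size-weighted average of within-block differences in means."""
    n = len(y)
    tau_b, var_b = 0.0, 0.0
    for b in sorted(set(block)):
        y1 = [yi for yi, di, bi in zip(y, d, block) if bi == b and di == 1]
        y0 = [yi for yi, di, bi in zip(y, d, block) if bi == b and di == 0]
        weight = (len(y1) + len(y0)) / n  # N_j / N
        tau_b += weight * (mean(y1) - mean(y0))
        # Conservative within-block variance; needs at least two units per arm.
        var_j = variance(y1) / len(y1) + variance(y0) / len(y0)
        var_b += weight ** 2 * var_j
    return tau_b, var_b


# Toy data (hypothetical): block "A" has 4 units, block "B" has 8 units.
y = [10.0, 12.0, 8.0, 10.0, 3.0, 5.0, 3.0, 5.0, 0.0, 2.0, 0.0, 2.0]
d = [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
block = ["A"] * 4 + ["B"] * 8
tau_b, var_b = blocked_estimate(y, d, block)
print(tau_b, var_b)
```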

Efficiency Gains from Blocking

Complete Randomisation: $Y_{i} = α + τ_{CR} D_{i} + ϵ_{i}$

Block Randomisation: $Y_{i} = α + τ_{BR} D_{i} + \sum_{j = 2}^{J} β_{j} B_{ij} + ϵ_{i}^{*}$

$$Var (\hat{τ}_{CR}) = \frac{\hat{σ}_{ε}^{2}}{\sum_{i = 1}^{n} (D_{i} - \bar{D})^{2}}, \quad \hat{σ}_{ε}^{2} = \frac{\sum_{i = 1}^{n} \hat{ε}_{i}^{2}}{n - 2} = \frac{SSR_{\hat{ε}}}{n - 2} .$$

$$Var (\hat{τ}_{BR}) = \frac{\hat{σ}_{ε^{*}}^{2}}{\sum_{i = 1}^{n} (D_{i} - \bar{D})^{2} (1 - R_{j}^{2})}, \quad \hat{σ}_{ε^{*}}^{2} = \frac{\sum_{i = 1}^{n} (\hat{ε}_{i}^{*})^{2}}{n - k - 1} = \frac{SSR_{\hat{ε}^{*}}}{n - k - 1} ,$$

where $R_{j}^{2}$ is the fit from regressing $D$ on all $B_{j}$ dummies. Since $R_{j}^{2} \approx 0$ by randomisation,

$$Var (\hat{τ}_{BR}) < Var (\hat{τ}_{CR}) ⟺ \frac{SSR_{\hat{ε}^{*}}}{n - k - 1} < \frac{SSR_{\hat{ε}}}{n - 2} .$$

Power Calculations

Basic idea: With large enough samples, $V [\overset{ˉ}{Y}_{1} - \overset{ˉ}{Y}_{0}] \approx \frac{σ _{1}^{2}}{pN} + \frac{σ _{0}^{2}}{( 1 - p ) N}$ [where $p = N_{1} / N$ is the share of sample treated]. Set $p$ to minimise overall variance. Yields $p^{*} = \frac{σ _{1}}{σ _{1} + σ _{0}}$ . With homoskedasticity, this is $\frac{1}{2}$ Treatment, $\frac{1}{2}$ control.

$τ = μ_{1} - μ_{0}$ (effect size)

A size-$α$ two-sided test has power of at least $1 - κ$ against effects satisfying $τ > (t_{1 - κ} + t_{1 - α /2}) SE (\hat{τ})$, where $t_{q}$ denotes the $q$-th quantile of the reference distribution.

For common variance $σ^{2}$ and equal allocation ($p = 1/2$, so that $SE (\hat{τ}) = 2 σ / \sqrt{N}$),

$$π = \Pr (∣ t ∣ > 1.96) = Φ \left(- 1.96 - \frac{τ \sqrt{N}}{2 σ}\right) + \left(1 - Φ \left(1.96 - \frac{τ \sqrt{N}}{2 σ}\right)\right) .$$

General formula for power with unequal variances

$$π = Φ \left(- 1.96 - \frac{τ}{\sqrt{\frac{σ_{1}^{2}}{p N} + \frac{σ_{0}^{2}}{(1 - p) N}}}\right) + 1 - Φ \left(1.96 - \frac{τ}{\sqrt{\frac{σ_{1}^{2}}{p N} + \frac{σ_{0}^{2}}{(1 - p) N}}}\right) .$$

This yields

Common variance (assumed)

$$MDE (τ) = M_{n - 2} \sqrt{\frac{σ^{2}}{N p (1 - p)}} ,$$

where $M_{n - 2} = t_{1 - α /2} + t_{1 - κ}$ is the critical t-value to reject the null plus the t-value for the alternative (where $1 - κ$ is power).

MDES (Minimum Detectable Effect Size in Standard Deviation Units):

$$MDES (τ) = M_{n - 2} \sqrt{\frac{1}{N p (1 - p)}}$$

With $α = 0.05$ and power $0.8$ in large samples, the multiplier $M_{n - 2}$ simplifies to $1.96 + 0.84 \approx 2.8$:

$$MDE \approx (0.84 + 1.96) SE (\hat{τ}) \approx 2.8 \, SE (\hat{τ}) .$$

Rearrange to get the necessary sample size for any given hypothesised MDE and expected variance:

$$N = (z_{1 - κ} + z_{1 - α /2})^{2} \cdot \frac{1}{p (1 - p)} \cdot \frac{σ^{2}}{MDE^{2}} .$$

MDES for Blocking

$$MDES (τ_{BR}) = M_{n - k - 1} \sqrt{\frac{1 - R_{B}^{2}}{N p (1 - p)}} ,$$

where $R_{B}^{2}$ is the R-squared from regressing $Y$ on block dummies.

To test $H_{0} : E [Y_{i} (1) - Y_{i} (0)] = 0$ against the two-sided alternative, we look at the t-statistic

$$T = \frac{\overline{Y}_{t}^{obs} - \overline{Y}_{c}^{obs}}{\sqrt{S_{y}^{2} / N_{t} + S_{y}^{2} / N_{c}}} \approx N \left(\frac{τ}{\sqrt{σ^{2} / N_{t} + σ^{2} / N_{c}}}, 1\right) .$$

Inverting this for a two-sided test of size $α$ with power $β$ and treated share $γ = N_{t} / N$ gives the required sample size

$$N = \frac{(Φ^{- 1} (β) + Φ^{- 1} (1 - α /2))^{2}}{(τ / σ)^{2} \cdot γ \cdot (1 - γ)} .$$

Typically, $β = 0.8$, $α = 0.05$, $γ = 0.5$, so by substitution:

$$N = \frac{(Φ^{- 1} (0.8) + Φ^{- 1} (0.975))^{2}}{(τ / σ)^{2} \cdot 0.5^{2}} .$$
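The sample-size and MDE formulas are easy to evaluate with the standard library's normal quantile function. This is a sketch under the large-sample normal approximation; the function names and the illustrative effect size of 0.2 standard deviations are assumptions, not part of the source.

```python
import math
from statistics import NormalDist


def required_sample_size(effect_sd, power=0.8, alpha=0.05, share_treated=0.5):
    """Total N for a two-sided test, with the effect in standard-deviation units."""
    z = NormalDist().inv_cdf
    multiplier = z(power) + z(1 - alpha / 2)  # roughly 0.84 + 1.96
    n = multiplier ** 2 / (effect_sd ** 2 * share_treated * (1 - share_treated))
    return math.ceil(n)


def mde(sigma, n, share_treated=0.5, power=0.8, alpha=0.05):
    """Minimum detectable effect for a given total sample size."""
    z = NormalDist().inv_cdf
    se = math.sqrt(sigma ** 2 / (n * share_treated * (1 - share_treated)))
    return (z(power) + z(1 - alpha / 2)) * se


n_needed = required_sample_size(0.2)
print(n_needed, mde(1.0, n_needed))
```

Moving the treated share away from one half raises the required $N$ through the $γ (1 - γ)$ term in the denominator.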

Selection On Observables

typology

- Regression estimators: rely on consistent estimation of $μ_{0} (x), μ_{1} (x)$
- Matching estimators
- Propensity score estimators: rely on estimation of $π (x)$
- Combination methods (augmented IPW, bias-corrected matching, etc.)

Regression Anatomy / FWL

$$β_{k} = \frac{Cov (Y_{i}, \tilde{x}_{ki})}{V [\tilde{x}_{ki}]} ,$$

where $\tilde{x}_{ki}$ is the residual from a regression of $x_{ki}$ on all other covariates.

If the structural (long) equation is $Y_{i} = α + τ D_{i} + W_{i}^{'} γ + ϵ_{i}$, with $W_{i}$ a vector of unobserved variables, and we estimate the short regression $Y_{i} = α + ρ D_{i} + ν_{i}$, then the short-regression error absorbs the omitted term, $ν_{i} = W_{i}^{'} γ + ϵ_{i}$, and

$ρ = \frac{Cov (Y_{i}, D_{i})}{V [D_{i}]} = τ + γ^{'} δ_{W D}$

where $δ_{W D}$ is the vector of coefficients from regressing each element of $W_{i}$ on $D_{i}$. Equivalently,

$$plim \; \hat{τ}_{OLS} = τ + δ γ = τ + \underbrace{plim [(N^{- 1} D^{'} D)^{- 1} N^{- 1} D^{'} W] γ}_{\text{omitted variables bias}} .$$

Coefficient in short regression = coefficient in long regression + effect of omitted $\times$ regression of omitted on included. This bias can be arbitrarily large.
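The omitted-variables formula can be checked numerically. The simulation below is a sketch with a single omitted variable; the data-generating process, the coefficient values, and the sample size are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def slope(x, y):
    """OLS slope of y on x with an intercept: Cov(x, y) / Var(x)."""
    xb, yb = mean(x), mean(y)
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)


random.seed(0)
tau, gamma = 1.0, 2.0  # hypothetical structural coefficients
n = 20000
w = [random.gauss(0, 1) for _ in range(n)]
# Treatment is more likely when the omitted variable W is high.
d = [1 if wi + random.gauss(0, 1) > 0 else 0 for wi in w]
y = [tau * di + gamma * wi + random.gauss(0, 1) for di, wi in zip(d, w)]

rho_hat = slope(d, y)      # short-regression coefficient on D
delta_hat = slope(d, w)    # regression of the omitted W on the included D
ovb_prediction = tau + gamma * delta_hat
print(rho_hat, ovb_prediction)
```

The short-regression coefficient lands close to $τ + γ δ$ rather than $τ$, with the remaining gap due to sampling noise in the structural error.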

Identification of Treatment Effects under Unconfoundedness

Unconfoundedness / Selection on Observables / Ignorability / Conditional Independence Assumption: $(Y_{0}, Y_{1}) ⊥ ⊥ D ∣ X$. In terms of densities, this is equivalent to the validity of the following density factorisation

$$f_{Y (d), D ∣ X} (y, d ∣ x) = f_{Y (d) ∣ X} (y ∣ x) f_{D ∣ X} (d ∣ x) = f_{Y ∣ D, X} (y ∣ d, x) f_{D ∣ X} (d ∣ x) .$$

Common support: $0 < \Pr (D = 1∣ X) < 1$

Under these two assumptions,

$$E [Y^{d}] = \int E [Y^{d} ∣ X = x] \, d P_{X} (x) = \int E [Y^{d} ∣ D = d, X = x] \, d P_{X} (x) = \int E [Y ∣ D = d, X = x] \, d P_{X} (x) .$$

The third quantity is estimable using observed data.

Estimators:

Discrete case: $x$ takes finitely many values indexed by $k = 1, \dots, K$, with generic entry $x_{k}$

$$τ_{ATE} = \sum_{k = 1}^{K} (E [Y ∣ D = 1, X = x_{k}] - E [Y ∣ D = 0, X = x_{k}]) \Pr (X = x_{k}) .$$

$$τ_{ATT} = \sum_{k = 1}^{K} (E [Y ∣ D = 1, X = x_{k}] - E [Y ∣ D = 0, X = x_{k}]) \Pr (X = x_{k} ∣ D = 1) .$$
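The discrete-case formulas translate directly into a plug-in estimator that averages cell-level contrasts with different weights. The function name `stratified_effects` and the two-cell toy data are hypothetical; the sketch assumes every cell contains both treated and control units.

```python
from collections import defaultdict
from statistics import mean


def stratified_effects(y, d, x):
    """Plug-in ATE and ATT for a discrete covariate with overlap in every cell."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for yi, di, xi in zip(y, d, x):
        cells[xi][di].append(yi)
    n, n_treated = len(y), sum(d)
    ate = att = 0.0
    for arms in cells.values():
        gap = mean(arms[1]) - mean(arms[0])  # raises if a cell lacks overlap
        ate += gap * (len(arms[0]) + len(arms[1])) / n   # weight Pr(X = x_k)
        att += gap * len(arms[1]) / n_treated            # weight Pr(X = x_k | D = 1)
    return ate, att


# Toy data (hypothetical): cell "a" has effect 2 and few treated, cell "b" has effect 4.
y = [3.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 6.0]
d = [1, 0, 0, 0, 1, 1, 1, 0]
x = ["a", "a", "a", "a", "b", "b", "b", "b"]
ate, att = stratified_effects(y, d, x)
print(ate, att)
```

The ATT exceeds the ATE here because treated units are concentrated in the cell with the larger effect.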

Multi-valued and Continuous Treatments

Treatment values: $D$ finite if multi-valued / $\subset R$ for continuous, with corresponding dose-responses $Y_{i} (d)$ . We are interested in dose-response function $μ (d) = E [Y_{i} (d)]$ , and contrasts.

First define Generalised propensity score : $R : = r (d, x) = f_{D ∣ X} (d ∣ x)$

Assumptions:

- Weak unconfoundedness: $Y (d) ⊥ ⊥ D ∣ X = x \;\; \forall d \in \mathcal{D}$
- Conditional density overlap: $f (D = d ∣ X = x) > 0$

Bias removal using the generalised propensity score:

1. Estimate the conditional expectation of the outcome as a function of treatment level $d$ and GPS $R$: $β (d, r) = E [Y (d) ∣ r (d, X) = r] = E [Y ∣ D = d, R = r]$
2. Estimate the dose-response function at treatment level $d$ by averaging this conditional expectation over the GPS evaluated at that level: $μ (d) = E [β (d, r (d, X))]$

Then compute contrasts to get first derivative (MTE)

$\frac{\partial}{\partial d} E [μ (d)]$

Estimators of $E [Y^{d}]$

Write $γ_{d} := E [Y^{d}]$. Estimators of $γ_{d}$ can be used to construct estimators of the ATE ($γ_{1} - γ_{0}$), the ATT ($γ_{1} - γ_{0} ∣ D = 1$), and other estimands. Reference: David Childers’ lecture notes.

Regression Adjustment

Estimate $μ_{d} (x) = E [Y ∣ D = d, X = x]$ by a nonparametric regression estimator $\hat{μ}_{d} (x)$.

$$\hat{γ}_{d}^{reg} := \frac{1}{n} \sum_{i = 1}^{n} \hat{μ}_{d} (x_{i}) .$$

Since, in most implementations, the average predicted outcome within each arm equals the average observed outcome in that arm (as with OLS including an intercept), one ATE form is

$$\hat{τ}_{reg}^{ATE} = \frac{1}{n} \sum_{i = 1}^{n} [D_{i} (Y_{i} - \hat{μ}_{0} (X_{i})) + (1 - D_{i}) (\hat{μ}_{1} (X_{i}) - Y_{i})] .$$

The SATT only requires imputation of one potential outcome:

$$\hat{τ}_{reg}^{ATT} = \frac{1}{N_{t}} \sum_{i = 1}^{n} D_{i} [Y_{i} - \hat{μ}_{0} (X_{i})] .$$

Inverse Propensity Weighting

Estimate the propensity score $π (d ∣ x) = \Pr (D = d ∣ X = x)$ by $\hat{π} (d ∣ x)$. Then

$$\hat{γ}_{d}^{IPW} := \frac{1}{n} \sum_{i = 1}^{n} \frac{Y_{i} 1 \{D_{i} = d\}}{\hat{π} (d ∣ x_{i})} .$$

Augmented Inverse Propensity Weighting / Combination Methods

Estimate $μ_{d} (x)$ and $π (d ∣ x)$, then average

$$\hat{γ}_{d}^{AIPW} := \frac{1}{n} \sum_{i = 1}^{n} \left[\hat{μ}_{d} (x_{i}) + \frac{(Y_{i} - \hat{μ}_{d} (x_{i})) 1 \{D_{i} = d\}}{\hat{π} (d ∣ x_{i})}\right] .$$

Normalized outcome regression:

$$μ_{1} = \frac{μ_{1} (x)}{π (x)}, \quad μ_{0} = \frac{μ_{0} (x)}{1 - π (x)} .$$

Subclassification / Blocking

Weighted combination of $K$ subclasses of covariate values, which partition the population

$$\hat{τ}^{ATE} = \sum_{k = 1}^{K} (\overline{Y}_{1}^{k} - \overline{Y}_{0}^{k}) \cdot \left(\frac{N^{k}}{N}\right) .$$

$$\hat{τ}^{ATT} = \sum_{k = 1}^{K} (\overline{Y}_{1}^{k} - \overline{Y}_{0}^{k}) \cdot \left(\frac{N_{1}^{k}}{N_{1}}\right) .$$

Regression Adjustment

A single regression with controls $X$ is potentially problematic because of Simpson’s paradox. To account for this in a parametric setup, assume a set of iid subjects $i = 1, \dots, n$, for each of whom we observe a tuple $(X_{i}, Y_{i}, D_{i})$ comprising

- feature vector $X_{i} \in \mathbb{R}^{p}$
- response $Y_{i} \in \mathbb{R}$
- treatment assignment $D_{i} \in \{0, 1\}$

Define conditional response surfaces as

$μ_{(d)} (x) : = E [Y_{i} ∣ X_{i} = x, D_{i} = d]$

First-pass regression adjustment estimator (using OLS)

$$\hat{τ} = \frac{1}{n} \sum_{i = 1}^{n} [\hat{μ}_{(1)} (X_{i}) - \hat{μ}_{(0)} (X_{i})] ,$$

where $\hat{μ}_{(d)} (x)$ is obtained via OLS. This generically doesn’t work for regularised regression.

With a known propensity score $π (X)$ (as in case of regression), an efficient estimator of the ATT weights all estimated treatment effects $\hat{μ}_{1} (X_{i}) - \hat{μ}_{0} (X_{i})$ by the propensity score:

$$\hat{τ}_{reg}^{ATT} = \frac{\sum_{i = 1}^{n} π (X_{i}) [\hat{μ}_{1} (X_{i}) - \hat{μ}_{0} (X_{i})]}{\sum_{i = 1}^{n} π (X_{i})} .$$

Additional assumptions for a consistent estimate of the ATE from OLS:

1. Constant treatment effects
2. Outcomes linear in $X$

$⟹ \hat{τ}_{OLS}$ will provide unbiased and consistent estimates of the ATE.

- If 2 fails, $τ_{OLS}$ is the best linear approximation of the average causal response function $E [Y ∣ D = 1, X] - E [Y ∣ D = 0, X]$.
- If 1 fails, $τ_{OLS}$ is a conditional-variance-weighted average of the underlying $τ$s.

Pretend there are $m$ strata of $X$. Then, OLS estimates

$$τ_{OLS} = \sum_{k = 1}^{m} (E [Y ∣ X = x_{k}, D = 1] - E [Y ∣ X = x_{k}, D = 0]) \, ω_{k} ,$$

where the weight is

$$ω_{k} = \frac{V [D ∣ X = x_{k}] \Pr (X = x_{k})}{\sum_{r = 1}^{m} V [D ∣ X = x_{r}] \Pr (X = x_{r})} .$$

$τ_{OLS}$ weights up groups where the treated and untreated populations are of roughly equal size, and weights down groups with large imbalances between the two.

$τ_{OLS}$ is guaranteed to equal the ATE only if treatment effects are constant across strata (or if $V [D ∣ X = x_{k}]$ does not vary with $k$).
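A small numerical illustration of the conditional-variance weights: with a binary treatment, $V [D ∣ X = x_{k}] = e_{k} (1 - e_{k})$ where $e_{k}$ is the treated share in stratum $k$. The strata, treated shares, and stratum effects below are hypothetical.

```python
def ols_weights(p_treat, p_x):
    """Conditional-variance weights omega_k implied by OLS with saturated controls."""
    # For binary D, Var(D | X = x_k) = e_k * (1 - e_k).
    raw = [e * (1 - e) * px for e, px in zip(p_treat, p_x)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical strata: Pr(D = 1 | X = x_k), Pr(X = x_k), and stratum-level effects.
p_treat = [0.5, 0.9]
p_x = [0.5, 0.5]
tau_k = [1.0, 3.0]

omega = ols_weights(p_treat, p_x)
tau_ols = sum(w * t for w, t in zip(omega, tau_k))
tau_ate = sum(px * t for px, t in zip(p_x, tau_k))
print(omega, tau_ols, tau_ate)
```

The balanced stratum receives most of the weight, so the OLS estimand is pulled toward its effect and away from the ATE.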

Matching

Regression estimators impute missing potential outcomes using $\hat{μ}_{d} (X_{i})$. Matching estimators instead impute the missing potential outcome using the observed outcomes of the ‘closest’ units in the opposite treatment group.

Define $ℓ_{m} (i)$ as the index that satisfies

$$\sum_{j : d_{j} \neq d_{i}} 1 \{∥ X_{j} - X_{i} ∥ \leq ∥ X_{ℓ_{m} (i)} - X_{i} ∥\} = m .$$

So, $ℓ_{m} (i)$ is the index of the unit in the opposite treatment group that is $m$-th closest to unit $i$ in covariate values, as measured by the norm $∥ \cdot ∥$. Let $J_{M} (i) := \{ℓ_{1} (i), \dots, ℓ_{M} (i)\}$ denote the indices of the first $M$ matches for unit $i$. Then, impute potential outcomes as

$$\hat{Y}_{i} (0) = \begin{cases} Y_{i} & \text{if } D_{i} = 0, \\ \frac{1}{M} \sum_{j \in J_{M} (i)} Y_{j} & \text{if } D_{i} = 1. \end{cases}$$

$$\hat{Y}_{i} (1) = \begin{cases} \frac{1}{M} \sum_{j \in J_{M} (i)} Y_{j} & \text{if } D_{i} = 0, \\ Y_{i} & \text{if } D_{i} = 1. \end{cases}$$

Then the simple matching (with replacement) estimator for the ATE is

$$\hat{τ}_{Match}^{ATE} = \frac{1}{n} \sum_{i = 1}^{n} [\hat{Y}_{i} (1) - \hat{Y}_{i} (0)] ,$$

and the corresponding ATT estimator (for $M = 1$, with $j (i) = ℓ_{1} (i)$) is

$$\hat{τ}_{Match}^{ATT} = \frac{1}{N_{1}} \sum_{i : D_{i} = 1} (Y_{i} - Y_{j (i)}) ,$$

where $M = 1$ corresponds to one-to-one matching and $M > 1$ to many-to-one matching. Matching with a fixed number of matches is in general not $\sqrt{n}$-consistent (Abadie and Imbens (2006)): it has a bias of order $O (N^{- 1/ k})$, where $k$ is the number of continuous covariates.

Bias-corrected (Abadie-Imbens)

$$\hat{τ}_{ATT} = \frac{1}{N_{1}} \sum_{i : D_{i} = 1} [(Y_{i} - Y_{j (i)}) - (\hat{μ}_{0} (X_{i}) - \hat{μ}_{0} (X_{j (i)}))] ,$$

where $μ_{0} (x) = E [Y ∣ X = x, D = 0]$ is the regression function under control.
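A minimal sketch of nearest-neighbour matching for the ATT with a scalar covariate and matching with replacement. The function name `matching_att` and the toy data are hypothetical, ties are broken by list order (an arbitrary choice), and no bias correction is applied.

```python
from statistics import mean


def matching_att(y, d, x, m=1):
    """ATT from m-nearest-neighbour matching with replacement on a scalar covariate."""
    controls = [(xi, yi) for yi, di, xi in zip(y, d, x) if di == 0]
    gaps = []
    for yi, di, xi in zip(y, d, x):
        if di != 1:
            continue
        # Sort controls by |X_j - X_i|; the stable sort breaks ties by list order.
        nearest = sorted(controls, key=lambda c: abs(c[0] - xi))[:m]
        gaps.append(yi - mean([c[1] for c in nearest]))
    return mean(gaps)


# Toy data (hypothetical): Y(0) = X and a constant effect of 5 for the treated.
x = [1.0, 2.0, 3.0, 10.0, 2.0, 3.0]
d = [0, 0, 0, 0, 1, 1]
y = [1.0, 2.0, 3.0, 10.0, 7.0, 8.0]

att_one = matching_att(y, d, x, m=1)
att_two = matching_att(y, d, x, m=2)
print(att_one, att_two)
```

With exact matches available, one-to-one matching recovers the effect of 5; adding a second, less similar match introduces the matching discrepancy that the bias-corrected estimator is designed to remove.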

Metrics

Euclidean distance

$$∥ X_{i} - X_{j} ∥ = ED (X_{i}, X_{j}) = \sqrt{(X_{i} - X_{j})^{'} (X_{i} - X_{j})} .$$

Stata diagonal distance

$$StataD (X_{i}, X_{j}) = \sqrt{(X_{i} - X_{j})^{'} diag (\hat{Σ}_{X})^{- 1} (X_{i} - X_{j})} ,$$

where the normalisation factor is the diagonal of $\hat{Σ}_{X}$, the estimated variance-covariance matrix of the covariates.

Mahalanobis distance (scale-invariant)

$$MD (X_{i}, X_{j}) = \sqrt{(X_{i} - X_{j})^{'} \hat{Σ}_{X}^{- 1} (X_{i} - X_{j})} ,$$

where $\hat{Σ}_{X}$ is the estimated variance-covariance matrix of the covariates.

Matching estimators have a normal distribution in large samples provided that bias is small.

For matching without replacement,

$$\hat{σ}_{ATT}^{2} = \frac{1}{N_{T}} \sum_{i : D_{i} = 1} \left(Y_{i} - \frac{1}{M} \sum_{m = 1}^{M} Y_{j_{m} (i)} - \hat{δ}_{ATT}\right)^{2} .$$

For matching with replacement,

$$\hat{σ}_{ATT}^{2} = \frac{1}{N_{T}} \sum_{i : D_{i} = 1} \left(Y_{i} - \frac{1}{M} \sum_{m = 1}^{M} Y_{j_{m} (i)} - \hat{δ}_{ATT}\right)^{2} + \frac{1}{N_{T}} \sum_{i : D_{i} = 0} \left(\frac{K_{i} (K_{i} - 1)}{M^{2}}\right) \hat{V} [ϵ ∣ X_{i}, D_{i} = 0] ,$$

where $K_{i}$ is the number of times observation $i$ is used in a match, and the last error variance term is itself estimated by matching. The bootstrap doesn’t work for matching.

PScore is a balancing score: conditioning on propensity score is equivalent to conditioning on covariates.

$$\Pr (D = 1 ∣ Y_{0}, Y_{1}, π (X)) = \Pr (D = 1 ∣ π (X)) = π (X) .$$

$$\{Y_{i} (0), Y_{i} (1)\} ⊥ ⊥ D_{i} ∣ X_{i} ⟺ \{Y_{i} (0), Y_{i} (1)\} ⊥ ⊥ D_{i} ∣ π (X_{i}) .$$

The semiparametric efficiency bound (SPEB) for the ATE is stated as follows: the asymptotic variance of any regular estimator $\hat{τ}$ of the population ATE $τ^{P}$ obeys

$\sqrt{n} (\hat{τ} - τ^{P}) \overset{d}{\to} N (0, V)$

where

$$V \geq V_{eff}^{PATE} := E \left[\frac{σ_{1}^{2} (X)}{π (X)} + \frac{σ_{0}^{2} (X)}{1 - π (X)} + (τ (X) - τ)^{2}\right] ,$$

and for the PATT ($γ$)

$$V_{eff}^{PATT} = E \left[\frac{π (X) σ_{1}^{2} (X)}{p^{2}} + \frac{π (X)^{2} σ_{0}^{2} (X)}{p^{2} (1 - π (X))} + \frac{(τ (X) - γ)^{2} π (X)}{p^{2}}\right] ,$$

where $σ_{d}^{2} (X) = V [Y^{d} ∣ X]$, $τ (X) : = E [Y^{1} - Y^{0} ∣ X]$, and $p : = E [π (X)]$.

Any regular estimator whose asymptotic variance achieves this efficiency bound is equal to $\frac{1}{n} \sum_{i = 1}^{n} ψ_{i} (μ) + o_{P} (n^{- 1/2})$, where

$$ψ_{i} (μ) = μ (1, X_{i}) - μ (0, X_{i}) + \frac{D_{i} (Y_{i} - μ (1, X_{i}))}{π (X_{i})} - \frac{(1 - D_{i}) (Y_{i} - μ (0, X_{i}))}{1 - π (X_{i})}$$

is the efficient influence function for estimating $τ$.

It can also be shown that

$$V_{eff}^{SATE} = V_{eff}^{PATE} - \underbrace{E [[(Y_{1} - Y_{0}) - τ^{P}]^{2}]}_{\text{variance of treatment effect}} .$$

Estimators in this section try to attain the SPEB.

$$τ_{ATE}^{IPW} = E \left[Y \cdot \frac{D - π (X)}{π (X) (1 - π (X))}\right] = E \left[\frac{Y D}{π (X)} - \frac{Y (1 - D)}{1 - π (X)}\right] ,$$

and

$$τ_{ATT}^{IPW} = \frac{1}{\Pr (D = 1)} E \left[Y \cdot \frac{D - π (X)}{1 - π (X)}\right] .$$

The counterfactual mean $E [Y^{0} ∣ D = 1] = μ_{0}^{1}$ can be identified as

$E [\frac{π (X_{i})}{ρ} \frac{1 - D_{i}}{1 - π (X_{i})} Y_{i}]$

where $ρ = \Pr (D = 1)$.

The sample analogue for the ATE is

$$\hat{τ}_{ipw}^{ate} = \frac{1}{n} \sum_{i = 1}^{n} \left[\frac{Y_{i} D_{i}}{\hat{π} (X_{i})} - \frac{Y_{i} (1 - D_{i})}{1 - \hat{π} (X_{i})}\right] = \frac{1}{n} \sum_{i = 1}^{n} Y_{i} \left[\frac{D_{i}}{\hat{π} (X_{i})} - \frac{1 - D_{i}}{1 - \hat{π} (X_{i})}\right] .$$

Normalise both pieces using a Hajek-style adjustment, since extreme values of $\hat{π}$ make the variance explode. It is often advisable to trim or to use Hajek weights, which introduce limited bias in exchange for large decreases in variance.

$$\hat{τ}_{ipw2}^{ate} = \frac{\sum_{i = 1}^{n} \frac{Y_{i} D_{i}}{\hat{π} (X_{i})}}{\sum_{i = 1}^{n} \frac{D_{i}}{\hat{π} (X_{i})}} - \frac{\sum_{i = 1}^{n} \frac{Y_{i} (1 - D_{i})}{1 - \hat{π} (X_{i})}}{\sum_{i = 1}^{n} \frac{1 - D_{i}}{1 - \hat{π} (X_{i})}} .$$

Similarly, for the effect on the treated

$$\hat{τ}_{ipw}^{att} = \frac{1}{n} \sum_{i = 1}^{n} \left[\frac{Y_{i}}{\Pr (D_{i} = 1)} \cdot \frac{D_{i} - \hat{π} (X_{i})}{1 - \hat{π} (X_{i})}\right] .$$

$$\hat{τ}_{ipw2}^{att} = \left[\frac{1}{N_{1}} \sum_{i : D_{i} = 1} Y_{i}\right] - \frac{\sum_{i : D_{i} = 0} Y_{i} \cdot \frac{\hat{π} (X_{i})}{1 - \hat{π} (X_{i})}}{\sum_{i : D_{i} = 0} \frac{\hat{π} (X_{i})}{1 - \hat{π} (X_{i})}} .$$
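The difference between the unnormalised (Horvitz-Thompson) and normalised (Hajek) ATE estimators shows up clearly when a constant is added to every outcome. The sketch below takes the propensity scores as given; the function name `ipw_ate` and the four-unit toy data are hypothetical.

```python
def ipw_ate(y, d, e):
    """Horvitz-Thompson and Hajek (normalised) IPW estimates of the ATE."""
    n = len(y)
    w1 = [di / ei for di, ei in zip(d, e)]
    w0 = [(1 - di) / (1 - ei) for di, ei in zip(d, e)]
    sum1 = sum(a * yi for a, yi in zip(w1, y))
    sum0 = sum(b * yi for b, yi in zip(w0, y))
    ht = sum1 / n - sum0 / n                  # weights divided by n
    hajek = sum1 / sum(w1) - sum0 / sum(w0)   # weights normalised within arm
    return ht, hajek


# Toy data (hypothetical): propensity 0.5 for everyone, but 3 of 4 units end up treated.
d = [1, 1, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]
y = [2.0, 2.0, 2.0, 0.0]

ht, hajek = ipw_ate(y, d, e)
ht_shifted, hajek_shifted = ipw_ate([yi + 10.0 for yi in y], d, e)
print(ht, hajek, ht_shifted, hajek_shifted)
```

The Hajek estimate is unchanged by the shift, while the Horvitz-Thompson estimate moves because its realised weights do not sum to $n$ within each arm.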

Horvitz-Thompson Estimator as Regression

Estimate $Y_{i} = α + τ D_{i} + ϵ_{i}$ by weighted least squares with IPW weights $λ_{i} = \frac{D_{i}}{π (X_{i})} + \frac{1 - D_{i}}{1 - π (X_{i})}$; the coefficient on $D_{i}$ is the normalised IPW estimate.

Define the weighted ATE (WATE) as

$$τ_{WATE} = \frac{\int E [Y^{1} - Y^{0} ∣ X = x] \, g (x) \, d F (x)}{\int g (x) \, d F (x)} ,$$

where $g (x)$ is a weighting function. The ATT is obtained when $g (x) = π (x)$.

The corresponding estimator is

$$\hat{τ}_{WATE} = \frac{\sum_{i = 1}^{n} g (x_{i}) \left[\frac{Y_{i} D_{i}}{π (x_{i})} - \frac{Y_{i} (1 - D_{i})}{1 - π (x_{i})}\right]}{\sum_{i = 1}^{n} g (x_{i})} .$$

The sample is drawn from $f (X)$, and a target population can be represented as $g (X) \propto f (X) h (X)$, where $h (\cdot)$ is the tilting function.

Define $f_{d} (x) = \Pr (X = x ∣ D = d)$, which gives $f_{1} (x) \propto f (x) π (x)$ and $f_{0} (x) \propto f (x) (1 - π (x))$.

For a given tilting function, to estimate $τ_{h}$, weight $f_{d} (x)$ by

$$w_{1} (x) = \frac{h (x)}{π (x)}, \quad w_{0} (x) = \frac{h (x)}{1 - π (x)} .$$

| Target | $h (x)$ | Estimand | $(w_{1}, w_{0})$ |
| --- | --- | --- | --- |
| Combined | $1$ | ATE | $(\frac{1}{π (x)}, \frac{1}{1 - π (x)})$ [IPW] |
| Treated | $π (x)$ | ATT | $(1, \frac{π (x)}{1 - π (x)})$ |
| Control | $1 - π (x)$ | ATC | $(\frac{1 - π (x)}{π (x)}, 1)$ |
| Overlap | $π (x) (1 - π (x))$ | ATO | $(1 - π (x), π (x))$ |

Overlap weights are defined by choosing the $h (x)$ that minimises the asymptotic variance of $\hat{τ}^{h}$. They achieve exact balance on the covariates included in the propensity score model when it is estimated by logistic regression.

$$\hat{τ} = \hat{μ}_{1}^{h} - \hat{μ}_{0}^{h} = \frac{\sum_{i = 1}^{N} w_{1} (x_{i}) D_{i} Y_{i}}{\sum_{i = 1}^{N} w_{1} (x_{i}) D_{i}} - \frac{\sum_{i = 1}^{N} w_{0} (x_{i}) (1 - D_{i}) Y_{i}}{\sum_{i = 1}^{N} w_{0} (x_{i}) (1 - D_{i})} .$$

$τ^{O W}$ can be interpreted as treatment effect among population that have good balance on observables.

Implemented in PSweight.
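The family of balancing weights can be sketched as a lookup from target population to the weight pair $(w_{1}, w_{0})$, followed by the normalised weighted difference in means. This is an illustration with known propensity scores, not the PSweight implementation; the function names and the two-stratum toy data are hypothetical.

```python
def tilting_weights(e, target):
    """Return (w1, w0) for propensity e under the chosen target population."""
    table = {
        "ate": (1 / e, 1 / (1 - e)),
        "att": (1.0, e / (1 - e)),
        "atc": ((1 - e) / e, 1.0),
        "ato": (1 - e, e),
    }
    return table[target]


def weighted_effect(y, d, e, target):
    """Normalised weighted difference in means for the chosen estimand."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, di, ei in zip(y, d, e):
        w1, w0 = tilting_weights(ei, target)
        num1 += w1 * di * yi
        den1 += w1 * di
        num0 += w0 * (1 - di) * yi
        den0 += w0 * (1 - di)
    return num1 / den1 - num0 / den0


# Toy data (hypothetical): stratum with e = 0.5 and effect 1, stratum with e = 0.8 and effect 3.
y = [1.0, 1.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 2.0]
d = [1, 1, 0, 0, 1, 1, 1, 1, 0]
e = [0.5] * 4 + [0.8] * 5

effects = {t: weighted_effect(y, d, e, t) for t in ("ate", "att", "ato")}
print(effects)
```

The overlap estimand puts the most weight on the balanced stratum, so it is the closest of the three to that stratum's effect of 1.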

Entropy weights $w_{i}$ for each control unit are chosen by a reweighting scheme

$\max_{w_{i}} H (w) = - \sum_{i : D_{i} = 0} w_{i} \log (w_{i})$

subject to balance/moment-condition and normalising constraints

$$\sum_{i : D_{i} = 0} w_{i} c_{r i} (X_{i}) = m_{r}, \quad r \in \{1, \dots, R\} ,$$

$$\sum_{i : D_{i} = 0} w_{i} = 1, \quad w_{i} \geq 0 \;\; \forall i \text{ such that } D_{i} = 0.$$

The above problem is convex but has dimensionality of $n_{0}$ (nonnegativity) + $R$ (moment conditions) + $1$ (normalisation). The dual, on the other hand, has dimensionality of only $R + 1$ and is unconstrained, which makes it considerably easier to solve using Newton-Raphson.

The covariate balancing propensity score (CBPS) modifies an initial propensity score estimate (e.g. by changing the coefficients of a logistic model) iteratively until a balance criterion is reached.

The basic insight is that when we use a logistic regression to estimate a propensity score, we assert that the propensity score takes the form $π_{β} (x_{i}) = Λ (x_{i}^{⊤} β) = \frac{\exp (x_{i}^{⊤} β)}{1 + \exp (x_{i}^{⊤} β)}$, and maximise the Bernoulli log-likelihood

$$\sum_{i = 1}^{n} [d_{i} \log (π_{β} (x_{i})) + (1 - d_{i}) \log (1 - π_{β} (x_{i}))] ,$$

whose maximiser solves the corresponding score condition

$$\frac{1}{n} \sum_{i = 1}^{n} \left[\frac{d_{i} π_{β}^{'} (x_{i})}{π_{β} (x_{i})} - \frac{(1 - d_{i}) π_{β}^{'} (x_{i})}{1 - π_{β} (x_{i})}\right] = 0.$$

This score balances a particular function of the covariates: $π_{β}^{'} (x_{i})$. Alternatively, we could choose that function $f$ ourselves by specifying a moment condition

$$E \left[\frac{d_{i} f (x_{i})}{π_{β} (x_{i})} - \frac{(1 - d_{i}) f (x_{i})}{1 - π_{β} (x_{i})}\right] = 0.$$

Analogously for the ATT, this moment condition is

$$E \left[d_{i} f (x_{i}) - \frac{π_{β} (x_{i})}{1 - π_{β} (x_{i})} (1 - d_{i}) f (x_{i})\right] = 0.$$

When this balance condition is solved on its own, the problem is just-identified. When it is used in conjunction with the conventional Bernoulli likelihood, the problem is over-identified. Implemented in CBPS::CBPS as well as balance.

The estimand is $μ_{1} = E [Y (1)]$ (with $μ_{0}$ defined analogously). The estimator for this quantity is written

$\hat{μ}_{1} = \frac{1}{n} \sum_{i = 1}^{n} D_{i} γ (X_{i}) Y_{i}$

where the weights $γ (\cdot)$ are chosen to satisfy the sample balance property

$$\frac{1}{n} \sum_{i = 1}^{n} D_{i} γ (X_{i}) f (X_{i}) \approx \frac{1}{n} \sum_{i = 1}^{n} f (X_{i})$$

for any bounded function $f$.

In words: for every function $f (x)$, the weighting function equates weighted averages of $f$ over the treated units to unweighted averages over the study population.

The weights solve an optimisation problem that trades off imbalance against some measure of complexity

$$\hat{γ} = \arg \min_{γ} \left\{ ζ \left(\text{imbalance}_{M}^{2} (γ)\right) + χ \left(\frac{σ^{2}}{n^{2}} \sum_{i : D_{i} = 1} γ (X_{i})^{2}\right) \right\}$$

with convex functions $ζ, χ$.

A common imbalance measure is

$$\text{imbalance}_{M}^{2} (γ) = \max_{j = 1, \dots, p} \left(\frac{1}{n} \sum_{i = 1}^{n} X_{ij} - \frac{1}{n} \sum_{i = 1}^{n} D_{i} γ (X_{i}) X_{ij}\right)^{2}$$

for $M = \{β \cdot x : ∥ β ∥_{1} \leq 1\}$

Hybrid Estimators

A doubly-robust estimator is consistent if one gets either the propensity score $\hat{π}$ or the regression $\hat{μ}$ right.

Oracle AIPW

$$τ_{AIPW}^{*} = \frac{1}{n} \sum_{i = 1}^{n} \left[μ_{(1)} (X_{i}) - μ_{(0)} (X_{i}) + \frac{D_{i} (y_{i} - μ_{(1)} (X_{i}))}{e (X_{i})} - \frac{(1 - D_{i}) (y_{i} - μ_{(0)} (X_{i}))}{1 - e (X_{i})}\right] .$$

Feasible AIPW

$$\hat{τ}_{AIPW} = \frac{1}{n} \sum_{i = 1}^{n} \left[\frac{D_{i} (Y_{i} - \hat{μ}_{1} (X_{i}))}{\hat{π} (X_{i})} - \frac{(1 - D_{i}) (Y_{i} - \hat{μ}_{0} (X_{i}))}{1 - \hat{π} (X_{i})} + (\hat{μ}_{1} (X_{i}) - \hat{μ}_{0} (X_{i}))\right] .$$

This is the Augmented Inverse-Propensity Weighting (AIPW) estimator. The general double-robustness property is also shared by targeted maximum-likelihood estimators (TMLE).

Similarly, the analogous estimator for the ATT is

$$\hat{τ}_{aipw}^{att} = \frac{1}{n} \sum_{i = 1}^{n} \left[\frac{D_{i} (Y_{i} - \hat{μ}_{0} (X_{i}))}{\hat{ρ}} - \frac{\hat{π} (X_{i}) (1 - D_{i}) (Y_{i} - \hat{μ}_{0} (X_{i}))}{\hat{ρ} (1 - \hat{π} (X_{i}))}\right] ,$$

where $ρ = \Pr (D_{i} = 1)$ and $\hat{ρ}$ is its empirical analogue.

The cross-fit version can be stated as

$$\hat{τ}_{AIPW} = \frac{1}{n} \sum_{i = 1}^{n} \left[\hat{μ}_{(1)}^{- k (i)} (X_{i}) - \hat{μ}_{(0)}^{- k (i)} (X_{i}) + D_{i} \frac{y_{i} - \hat{μ}_{(1)}^{- k (i)} (X_{i})}{\hat{π}^{- k (i)} (X_{i})} - (1 - D_{i}) \frac{y_{i} - \hat{μ}_{(0)}^{- k (i)} (X_{i})}{1 - \hat{π}^{- k (i)} (X_{i})}\right] ,$$

where $k (i)$ is a mapping that assigns each observation to one of the $K$ folds, and $\hat{μ}_{(1)}^{- k (i)}$ is an estimator fitted excluding the fold that contains observation $i$.

Define the individual treatment effect score as

$$Γ_{i} = μ_{(1)} (X_{i}) - μ_{(0)} (X_{i}) + \frac{D_{i} (Y_{i} - μ_{(1)} (X_{i}))}{π (X_{i})} - \frac{(1 - D_{i}) (Y_{i} - μ_{(0)} (X_{i}))}{1 - π (X_{i})} .$$

Then, $\hat{τ} = \frac{1}{n} \sum_{i = 1}^{n} Γ_{i}$

We can form level-$α$ CIs $I_{α}$:

$$I_{α} = \hat{τ} \pm z_{1 - α /2} \hat{V}^{1/2}, \quad \hat{V} = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} (Γ_{i} - \hat{τ})^{2} .$$
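The score-based AIPW estimate and its confidence interval can be sketched as follows. Nuisance values are passed in per unit, so the same function works with oracle or estimated inputs. The simulated design (propensity $0.2 + 0.6 x$, true ATE of 2), the seed, and the function name `aipw` are hypothetical; the second call uses a deliberately wrong outcome model together with the correct propensity score, to illustrate double robustness.

```python
import math
import random
from statistics import NormalDist, mean


def aipw(y, d, mu1, mu0, e, alpha=0.05):
    """AIPW point estimate and CI from per-unit nuisance values."""
    scores = [
        m1 - m0 + di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
        for yi, di, m1, m0, ei in zip(y, d, mu1, mu0, e)
    ]
    n = len(scores)
    tau_hat = mean(scores)
    v_hat = sum((g - tau_hat) ** 2 for g in scores) / (n * (n - 1))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(v_hat)
    return tau_hat, (tau_hat - half_width, tau_hat + half_width)


random.seed(1)
n = 20000
x = [random.random() for _ in range(n)]
e = [0.2 + 0.6 * xi for xi in x]                     # true propensity score
d = [1 if random.random() < ei else 0 for ei in e]
y = [xi + 2.0 * di + random.gauss(0, 0.5) for xi, di in zip(x, d)]  # true ATE is 2

tau_oracle, ci_oracle = aipw(y, d, [xi + 2.0 for xi in x], x, e)
# Deliberately wrong outcome model (all zeros) with the correct propensity score.
tau_wrong_mu, ci_wrong_mu = aipw(y, d, [0.0] * n, [0.0] * n, e)
print(tau_oracle, ci_oracle, tau_wrong_mu, ci_wrong_mu)
```

With the outcome model set to zero the estimator collapses to plain IPW, which remains centred on the truth but yields a wider interval than the oracle version.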

grf has a forest-based implementation of AIPW

cf = causal_forest(X, Y, D)
ate_hat = average_treatment_effect(cf)

partially-linear setup

$$y_{i} = d_{i} τ + g (x_{i}) + ε_{i}, \quad E [ε_{i} ∣ x_{i}, d_{i}] = 0,$$

$$d_{i} = m (x_{i}) + η_{i}, \quad E [η_{i} ∣ x_{i}] = 0,$$

where $d_{i}$ is a scalar treatment indicator. Observations are independent but not necessarily identically distributed. We are interested in inference about $τ$ that is robust to mistakes in model selection.

Approximate $g$ and $m$ with linear combinations of control terms $c_{i} = P (x_{i})$, which may contain interactions and non-linear transformations.

Assume approximate sparsity (i.e. there are only a small number of relevant controls, and the coefficients on the remaining controls are small with high probability).

Naive (incorrect) approach: use LASSO on an equation of the form

$$y_{i} = τ d_{i} + x_{i}^{'} β + ε_{i}, \quad h (β) = ∥ β ∥_{1} = \sum_{j} ∣ β_{j} ∣ ,$$

where the treatment coefficient $τ$ is not penalised. This means we drop any control that is highly correlated with the treatment if the control is only moderately correlated with the outcome. Then, if we use post-LASSO selection to estimate the treatment effect, the estimate will be contaminated with omitted variable bias.

recommended double-selection approach

1. Estimate $y_{i} = c_{i}^{'} β_{y} + ν_{i}$ with LASSO and collect the predictive variables (i.e. those with nonzero coefficients) in $A$
2. Estimate $d_{i} = c_{i}^{'} β_{d} + ω_{i}$ with LASSO and collect the predictive variables (i.e. those with nonzero coefficients) in $B$
3. Estimate $y_{i} = τ D_{i} + e_{i}^{'} κ + υ_{i}$ by OLS, where $e_{i}$ collects the controls in $A \cup B$ [i.e. control for variables that are selected in either the first or second regression]

Let $l_{1}, l_{2} \subset {1, \dots, p}$ be the indices of the selected controls for the outcome and treatment respectively.

The post-double-selection estimator is

(\check{τ}, \check{β}) = \arg\min_{τ \in R, β \in R^{p}} E_{n} [(y_{i} - d_{i} τ - x_{i}^{'} β)^{2}] s.t. β_{j} = 0 \forall j \notin l_{1} \cup l_{2} .

Inference can use a plug-in variance estimator based on residuals

σ_{n}^{- 1} \sqrt{n} (\check{τ} - τ) \overset{d}{\to} N (0, 1), σ_{n}^{2} = \frac{E [v_{i}^{2} ψ_{i}^{2}]}{E [v_{i}^{2}]^{2}} .

where the residuals are estimated by $\hat{ψ}_{i} = (y_{i} - d_{i} \check{τ} - x_{i}^{'} \check{β}) (\frac{n}{n - s - 1})^{1/2}$ and $\hat{v}_{i} = d_{i} - x_{i}^{'} \hat{β}$ , with $\hat{β} = \arg\min_{β \in R^{p}} {E_{n} [(d_{i} - x_{i}^{'} β)^{2}] : β_{j} = 0 \forall j \notin l}$ , $l = l_{1} \cup l_{2}$ and $s = ∣ l ∣$ .

Implemented in hdm::rlassoEffect(x, y, d, method = "double selection")

Let the target parameter $τ_{0}$ solve the equation $E [m (Z_{i}, τ_{0}, β_{0})] = 0$ for a known score function $m$ , vector of observables $Z_{i} : = (X_{i}, D_{i}, Y_{i})$ , $i = 1, \dots, n$ , and nuisance parameter $β_{0}$ . In fully parametric models, $m$ is simply the score function [derivative of the log-likelihood]. For ATE, $m (Z_{i}, τ, β) : = (Y_{i} - D_{i} τ - X_{i}^{'} β) D_{i}$ .

In the naive plug-in setting, this score is sensitive to the nuisance parameter: $E [\partial_{β} m (Z_{i}, τ_{0}, β_{0})] = π_{0} μ_{1} \neq 0$ . So, we replace $m$ with a Neyman-orthogonal score $ψ$ s.t.

$E [\partial_{η} ψ (Z_{i}, τ_{0}, η_{0})] = 0 .$

which yields the Orthogonalised Moment Condition $E [ψ (Z_{i}, τ_{0}, η_{0})] = 0$ for some real-valued score $ψ (.)$ .

Using a Neyman-orthogonal score eliminates the first-order bias that arises from replacing $η_{0}$ with its ML estimate $\hat{η}$ .

Reference:

Consider data $W : = (Y, D, X)$ with $D \in {0, 1}$

Partial linear setup $Y = D θ_{0} + g_{0} (X) + U; D = m_{0} (X) + V$ .

Score function is

$$\psi(W; \theta, \eta) = (Y - \underbrace{l(X)}_{E [Y \mid X]} - \theta (D - \underbrace{m(X)}_{E [D \mid X]})) (D - m(X))$$

Partially Linear IV

Y - D θ_{0} = g_{0} (X) + ζ, E (ζ ∣ Z, X) = 0.

Z = m_{0} (X) + V, E (V ∣ X) = 0.

Score is

$ψ (W, θ, η) : = (Y - l (X) - θ (D - r (X))) (Z - m (X))$ , where $l (X) = E [Y ∣ X]$ , $r (X) = E [D ∣ X]$ and $m (X) = E [Z ∣ X]$

Interactive Regression

$Y = g_{0} (D, X) + ε, E [ε ∣ D, X] = 0$ , with $g_{0} (D, X) = E [Y ∣ D, X]$

$D = m_{0} (X) + ξ, E [ξ ∣ X] = 0$ , with $m_{0} (X) = Pr (D = 1 ∣ X)$

Here, the estimands are

θ_{0}^{ATE} = E [g_{0} (1, X) - g_{0} (0, X)] .

θ_{0}^{ATT} = E [g_{0} (1, X) - g_{0} (0, X) ∣ D = 1] .

The score function for the ATE (Hahn (1998)) is

ψ^{ATE} (W, θ, η) = (g (1, X) - g (0, X)) + \frac{D (Y - g (1, X))}{m (X)} - \frac{(1 - D) (Y - g (0, X))}{1 - m (X)} - θ .

The nuisance parameter true value is $η_{0} = (g_{0}, m_{0})$ . For the ATT, with $π_{0} = Pr (D = 1)$ ,

ψ^{ATT} (W, θ, η) = \frac{1}{π_{0}} (D - \frac{m (X)}{1 - m (X)} (1 - D)) (Y - g (0, X)) - \frac{D}{π_{0}} θ .

1. Take a $K$-fold random partition $(I_{k})_{k = 1, \dots, K}$ of observation indices ${1, \dots, n}$ s.t. each fold $I_{k}$ has size $n / K$ . For each $k$ , define $I_{k}^{C} : = {1, \dots, n} \setminus I_{k}$ as the complement / auxiliary sample.
2. For each $k \in {1, \dots, K}$ , construct a ML estimator of $η_{0}$ using only the auxiliary sample $I_{k}^{C}$ : $\hat{η}_{k} = \hat{η} ((Z_{i})_{i \in I_{k}^{C}})$
3. For each $k \in {1, \dots, K}$ , using the main sample $I_{k}$ , construct the estimator $\check{τ}_{k}$ as the solution of $\frac{1}{n / K} \sum_{i \in I_{k}} ψ (Z_{i}, \check{τ}_{k}, \hat{η}_{k}) = 0$
4. Aggregate the estimators $\check{τ}_{k}$ from each main sample: $\check{τ} = \frac{1}{K} \sum_{k = 1}^{K} \check{τ}_{k}$

Simple implementation of cross-fitting for treatment effects (ATT)

1. Partition the data in two, such that each fold $I_{1}, I_{2}$ has size $n /2$ .
2. Using only sample $I_{1}$ , construct ML estimators of $g (0, X)$ and $m (X)$ , e.g. a feedforward neural net of $Y_{i}$ on $X_{i}$ among control units, denoted $g_{I_{1}} (x)$ , and a logit-lasso of $D_{i}$ on $X_{i}$ , denoted $m_{I_{1}} (x)$ .
3. Use the estimators on the hold-out sample $I_{2}$ to compute the treatment effect $\check{τ}_{I_{2}} = \frac{1}{\sum_{i \in I_{2}} D_{i}} \sum_{i \in I_{2}} [D_{i} - \frac{m_{I_{1}} (X_{i})}{1 - m_{I_{1}} (X_{i})} (1 - D_{i})] (Y_{i} - g_{I_{1}} (X_{i}))$
4. Repeat steps 2 and 3, swapping the roles of $I_{1}$ and $I_{2}$ , to get $\check{τ}_{I_{1}}$
5. Aggregate the estimators: $\check{τ} = \frac{\check{τ}_{I_{1}} + \check{τ}_{I_{2}}}{2}$
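The two-fold procedure above can be sketched as follows. This is only an illustration under stated assumptions: the covariate is discrete, so within-cell means stand in for the neural net and logit-lasso, and the helper names (`fit_nuisances`, `att_on_fold`, `crossfit_att`) and the simulated design are hypothetical.

```python
import random


def fit_nuisances(rows):
    """Cell-mean fits on one fold: g(0, x) from controls, m(x) = share treated."""
    g0, m = {}, {}
    for xv in {r[0] for r in rows}:
        cell = [r for r in rows if r[0] == xv]
        controls = [yv for (_, dv, yv) in cell if dv == 0]
        g0[xv] = sum(controls) / len(controls)
        m[xv] = sum(dv for (_, dv, _) in cell) / len(cell)
    return g0, m


def att_on_fold(rows, g0, m):
    """ATT score averaged on the hold-out fold, using nuisances from the other fold."""
    n_treated = sum(dv for (_, dv, _) in rows)
    total = 0.0
    for xv, dv, yv in rows:
        weight = dv - m[xv] / (1 - m[xv]) * (1 - dv)
        total += weight * (yv - g0[xv])
    return total / n_treated


def crossfit_att(rows, seed=0):
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    i1, i2 = rows[:half], rows[half:]
    tau_i2 = att_on_fold(i2, *fit_nuisances(i1))   # fit on I1, evaluate on I2
    tau_i1 = att_on_fold(i1, *fit_nuisances(i2))   # swap roles
    return (tau_i1 + tau_i2) / 2


# Toy data (assumed design): X in {0, 1, 2}, constant effect 1.5, so ATT = 1.5.
rng = random.Random(1)
data = []
for _ in range(6000):
    xv = rng.choice([0, 1, 2])
    dv = 1 if rng.random() < (0.3, 0.5, 0.7)[xv] else 0
    data.append((xv, dv, xv + 1.5 * dv + rng.gauss(0, 1)))

att_hat = crossfit_att(data)
print(round(att_hat, 3))
```

The point of the split is that each unit's score uses nuisance predictions from a model that never saw that unit.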

Implemented in DoubleML

Augmented Balancing

Loosely: AIPW without the (potentially fraught) inversion of the propensity score step. Exposition based on Bruns-Smith et al (2023)

Setup:

- Covariates $X \in X \subseteq R^{k}$ , outcome $Y \in R$
- Two populations $p$ and $q$ that are distributions over $(X, Y)$ ; $p$ is ‘source’, $q$ is ‘target’ (e.g. treatment group and overall sample)
- Estimand is $E_{q} [Y]$

Identification Assumptions:

- Conditional Mean Ignorability: $E_{p} [Y ∣ X] = E_{q} [Y ∣ X]$
- Population Overlap: $q (x)$ is absolutely continuous w.r.t. $p (x)$

Effect Functionals

Regression Functional

E_{q} [E_{p} [Y ∣ X]] = E_{q} [E_{q} [Y ∣ X]] = E_{q} [Y] .

Weighting Functional

E_{p} [\frac{d q}{d p} (X) Y] = E_{p} [\frac{d q}{d p} (X) E_{p} [Y ∣ X]] = E_{q} [E_{p} [Y ∣ X]] = E_{q} [Y] .

Doubly-Robust Functional

E_{q} [E_{p} [Y ∣ X]] + E_{p} [\frac{d q}{d p} (X) (Y - E_{p} [Y ∣ X])] .

Balancing Weights: Rationale

$\frac{d q}{d p} (X)$ is difficult to estimate using plug-in estimation. Alternative: weighting for balance $\equiv$ automatic estimation of the Riesz representer

Weighting to minimise covariate imbalance

\min_{w} {\underbrace{\sup_{f \in F} (E_{p} [w (X) f (X)] - E_{q} [f (X)])}_{Imbalance} + δ ∥ w ∥_{2}^{2}} .

Direct estimation of the density ratio

\min_{f \in F} E_{p} [(f (X) - \frac{d q}{d p} (X))^{2}] .

Minimum variance weights that balance $F$ are also guaranteed to balance all other measurable functions in $F$ .

In the linear setting, the relevant imbalance is captured entirely by feature mean imbalance:

- $X_{p}, Y_{p}$ are $n$ iid draws from $p$ ; $X_{q}$ are $m$ draws from $q$
- Define feature map $ϕ : X \to R^{d}$ ; construct feature matrices $Φ_{p} : = ϕ (X_{p})$ and $Φ_{q} : = ϕ (X_{q})$ , with target feature means $\overline{Φ}_{q} : = E_{q} [Φ_{q}]$
- Let $F = {f (x) = θ^{⊤} ϕ (x) : ∥ θ ∥ \leq r}$
- Let $∥ \cdot ∥_{*}$ denote the dual norm [ $ℓ_{2} : ℓ_{2}$ , $ℓ_{1} : ℓ_{\infty}$ ]

$Imbalance_{F} (w) = ∥ w Φ_{p} - \overline{Φ}_{q} ∥_{*}$

Three Equivalent Representations

Penalised form

\min_{w \in R^{n}} {∥ w Φ_{p} - \overline{Φ}_{q} ∥_{*}^{2} + δ_{1} ∥ w ∥_{2}^{2}} .

Constrained form

\min_{w \in R^{n}} ∥ w ∥_{2}^{2} s.t. ∥ w Φ_{p} - \overline{Φ}_{q} ∥_{*} \leq δ_{2} .

Automatic form

\min_{θ \in R^{d}} {θ^{⊤} (Φ_{p}^{⊤} Φ_{p}) θ - 2 θ^{⊤} \overline{Φ}_{q} + δ_{3} ∥ θ ∥} .

OLS is equivalent to a weighting estimator that exactly balances the feature means. Let $β_{OLS} = (Φ_{p}^{⊤} Φ_{p})^{- 1} Φ_{p}^{⊤} Y_{p}$ be the linear regression fit on $p$ (source sample). Then,

E_{q} [Φ_{q} β_{OLS}] = E_{p} [w_{exact} \circ Y_{p}] .

E_{q} [Φ_{q} (Φ_{p}^{⊤} Φ_{p})^{- 1} Φ_{p}^{⊤} Y_{p}] = E_{p} [\overline{Φ}_{q} (Φ_{p}^{⊤} Φ_{p})^{- 1} Φ_{p}^{⊤} Y_{p}] .
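A small numerical check of the OLS-as-weighting identity, under assumptions made for illustration: features are $ϕ (x) = (1, x)$ , the data are made up, and the weights are written on the sum scale (so the $1/n$ implicit in $E_{p}$ is absorbed into $w$ ). The function names are hypothetical.

```python
def exact_balancing_weights(x_p, xbar_q):
    """Weights w = phibar_q (Phi_p' Phi_p)^{-1} Phi_p' for features phi(x) = (1, x)."""
    n = len(x_p)
    sx = sum(x_p)
    sxx = sum(v * v for v in x_p)
    det = n * sxx - sx * sx
    # Closed-form inverse of the 2x2 Gram matrix [[n, sx], [sx, sxx]].
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    a0 = inv[0][0] + xbar_q * inv[1][0]
    a1 = inv[0][1] + xbar_q * inv[1][1]
    return [a0 + a1 * v for v in x_p]


def ols_fit(x_p, y_p):
    """Intercept and slope of the simple regression of y on x."""
    n = len(x_p)
    xbar, ybar = sum(x_p) / n, sum(y_p) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x_p, y_p))
    sxx = sum((a - xbar) ** 2 for a in x_p)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


x_p = [0.0, 1.0, 2.0, 3.0]          # source-sample covariate
y_p = [1.0, 2.9, 5.2, 6.9]          # source-sample outcome
xbar_q = 2.5                        # target-population mean of x

w = exact_balancing_weights(x_p, xbar_q)
b0, b1 = ols_fit(x_p, y_p)
weighted = sum(wi * yi for wi, yi in zip(w, y_p))   # weighting estimate
regression = b0 + b1 * xbar_q                       # regression estimate
print(w, weighted, regression)
```

The implied weights reweight the source features exactly to the target means, and one of them is negative here, which is how linear regression extrapolates beyond the source sample.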

Analogue for Ridge

E_{q} [Φ_{q} (Φ_{p}^{⊤} Φ_{p} + δ I)^{- 1} Φ_{p}^{⊤} Y_{p}] = E_{p} [\overline{Φ}_{q} (Φ_{p}^{⊤} Φ_{p} + δ I)^{- 1} Φ_{p}^{⊤} Y_{p}] .

For any $β_{reg}^{λ} \in R^{d}$ and any linear balancing weight estimator with estimated coefficients $θ^{δ} \in R^{d}$ , weights $w^{δ} = θ^{δ} Φ_{p}^{⊤}$ , and implied weighted feature means $Φ_{q}^{δ} = w^{δ} Φ_{p}$ ,

E_{q} [Φ_{q} β_{reg}^{λ}] + E_{p} [w^{δ} \circ (Y_{p} - Φ_{p} β_{reg}^{λ})] = E_{p} [w^{δ} \circ Y_{p}] + E_{q} [(\overline{Φ}_{q} - Φ_{q}^{δ}) β_{reg}^{λ}] = E_{q} [Φ_{q}^{δ} β_{OLS} + (\overline{Φ}_{q} - Φ_{q}^{δ}) β_{reg}^{λ}] = E_{q} [Φ_{q} β_{aug}] .

β_{aug, j} : = (1 - a_{j}^{δ}) β_{reg, j}^{λ} + a_{j}^{δ} β_{OLS, j}, a_{j}^{δ} : = \frac{Φ_{q, j}^{δ} - \overline{Φ}_{p, j}}{\overline{Φ}_{q, j} - \overline{Φ}_{p, j}} .

In words: when both outcome and weighting models are linear, the augmented estimator is equivalent to a linear model with coefficients that are element-wise affine combinations of the base learner coefficients $β_{reg}^{λ}$ and the coefficients $β_{OLS}$ from regressing $Y_{p}$ on $Φ_{p}$

Heterogeneous Treatment Effects with selection on observables

Conditional Average Treatment Effects (CATEs) ( $τ (x) = E [Y^{1} - Y^{0} ∣ X = x]$ ) are often of great policy interest for targeting those who have largest potential gains. However, conventional methods are prone to a severe risk of fishing from researchers (cf ‘conditional effects’ in most published work in the social sciences).

Instead, recent work proposes to use nonparametric estimators to find subgroups, use sample-splitting for honesty.

1. Transformed outcome regression: regress the propensity-score-transformed outcome $H = \frac{D Y}{p (X)} - \frac{(1 - D) Y}{1 - p (X)}$ on $X$ , using the fact that $E [H ∣ X = x] = τ (x)$ under unconfoundedness
2. Conditional mean regression: use the fact that under SOO $τ (x) = E [Y_{1} ∣ X = x] - E [Y_{0} ∣ X = x] = μ_{1} (x) - μ_{0} (x)$

Approach (1) is typically inefficient because the propensity score appears in the denominator, so most work focuses on (2). Random forests are a popular flexible estimator for this purpose.

Consider a model for $τ (x)$ where

$Y_{i} (d) = f (X_{i}) + d \cdot τ (X_{i}) + ε_{i} (d), Pr (d_{i} = 1 ∣ X_{i} = x) = e (x)$

where $τ (x) = ψ (x) β$ for some pre-determined set of basis functions $ψ : X \to R^{k}$ . We allow for non-parametric relationships between $X_{i}, y_{i}, d_{i}$ , but the treatment effect function itself is parametrised by $β \in R^{k}$ . Under unconfoundedness, Robinson's partialling-out argument lets us rewrite the semiparametric setup above as

Y_{i} - m (X_{i}) = (d_{i} - e (X_{i})) ψ (X_{i}) \cdot β + ε_{i} .

where

m (x) = E [Y_{i} ∣ X_{i} = x] = f (x) + e (x) τ (x) .

The oracle algorithm for estimating $β$ is: define $Y_{i}^{*} = Y_{i} - m (X_{i})$ and $Z_{i}^{*} = (d_{i} - e (X_{i})) ψ (X_{i})$ , then estimate the residual-on-residual regression of $Y_{i}^{*}$ on $Z_{i}^{*}$ . This procedure is $\sqrt{n}$-consistent and asymptotically normal.

Use cross-fitting to emulate the Oracle.

1. Run cross-fitted non-parametric regressions $Y \sim X$ and $D \sim X$ to get $\hat{m} (x), \hat{e} (x)$
2. Define transformed features $\tilde{Y}_{i} = Y_{i} - \hat{m}^{- k (i)} (X_{i})$ , $\tilde{Z}_{i} = (D_{i} - \hat{e}^{- k (i)} (X_{i})) ψ (X_{i})$
3. Estimate $\hat{β}$ by regressing $\tilde{Y}_{i} \sim \tilde{Z}_{i}$
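The cross-fitted residual-on-residual regression can be sketched as below for the simplest case $ψ (x) = 1$ (a constant effect). Everything here is an illustrative assumption: the covariate is discrete so that out-of-fold cell means can serve as the non-parametric regressions, folds are assigned by index, and the function names are hypothetical.

```python
import random


def cell_means(rows, index):
    """Mean of column `index` within each value of X (column 0)."""
    sums, counts = {}, {}
    for r in rows:
        sums[r[0]] = sums.get(r[0], 0.0) + r[index]
        counts[r[0]] = counts.get(r[0], 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def crossfit_residual_regression(rows, n_folds=5):
    """Residual-on-residual slope with nuisances fitted off-fold (psi(x) = 1)."""
    num = den = 0.0
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        held_out = [r for i, r in enumerate(rows) if i % n_folds == k]
        e_hat = cell_means(train, 1)   # e(x) = E[D | X = x]
        m_hat = cell_means(train, 2)   # m(x) = E[Y | X = x]
        for xv, dv, yv in held_out:
            y_res = yv - m_hat[xv]
            z_res = dv - e_hat[xv]
            num += z_res * y_res
            den += z_res * z_res
    return num / den


# Toy data (assumed design): confounded assignment, constant effect of 1.0.
rng = random.Random(3)
sample = []
for _ in range(6000):
    xv = rng.choice([0, 1, 2])
    dv = 1 if rng.random() < (0.3, 0.5, 0.7)[xv] else 0
    sample.append((xv, dv, 2.0 * xv + 1.0 * dv + rng.gauss(0, 1)))

beta_hat = crossfit_residual_regression(sample)
print(round(beta_hat, 3))
```

With a richer basis $ψ (x)$ the final step becomes a multivariate least-squares fit instead of a single ratio.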

To define the R-loss under a more general setup, restate unconfoundedness as follows

E [ε_{i} (d_{i}) ∣ X_{i}, d_{i}] = 0.

where

ε_{i} (d) : = Y_{i} (d) - (μ_{(0)} (X_{i}) + d τ (X_{i})) .

and follow Robinson’s approach to write

$Y_{i} - m (X_{i}) = (D_{i} - e (X_{i})) τ (X_{i}) + ε_{i}$

R-loss is then written

τ (\cdot) = \arg\min_{τ^{'}} E [(Y_{i} - m (X_{i}) - (d_{i} - e (X_{i})) τ^{'} (X_{i}))^{2}] .

Define $e (x) = Pr (D = 1∣ X = x)$ and $m (x) = E [Y ∣ X = x]$ The R-learner consists of the following steps

Use any method to estimate the response functions $e (x), m (x)$

Minimise R-loss using cross-fitting for nuisance components

\hat{τ} (\cdot) = \arg\min_{τ} {\frac{1}{n} \sum_{i = 1}^{n} ((Y_{i} - \hat{m}^{- k (i)} (X_{i})) - τ (X_{i}) (D_{i} - \hat{e}^{- k (i)} (X_{i})))^{2} + Λ_{n} (τ (\cdot))} .

where $Λ_{n}$ is a regulariser.

Causal forest as implemented by grf starts by fitting two separate regression forests to estimate $m, e$ , makes out-of-bag predictions [a form of cross-fitting] from these two first-stage forests, then grows a causal forest via

τ (x) = \frac{\sum _{i = 1}^{n} α _{i} ( x ) ( Y _{i} - m ( X _{i} ) ) ( D _{i} - e ( X _{i} ) )}{\sum _{i = 1}^{n} α _{i} ( x ) ( D _{i} - e ( X _{i} ) ) ^{2}} .

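where $α_{i} (x) = \frac{1}{B} \sum_{b = 1}^{B} \frac{1 {X_{i} \in L_{b} (x), i \in B}}{∣ {i : X_{i} \in L_{b} (x), i \in B} ∣}$ are the learned adaptive weights: the share of trees in which observation $i$ falls in the same leaf $L_{b} (x)$ as $x$ , normalised by leaf size.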

1. Draw a subsample of size $s$ from the sample without replacement and divide it into two disjoint sets $I, J$ with $∣ I ∣ = ∣ J ∣ = s /2$ .
2. Grow a tree via recursive partitioning, with splits chosen from $J$ (i.e. without using $Y$ observations from the $I$ sample)
3. Estimate leaf responses using only the $I$ sample
4. Finally, aggregate all trees over subsamples of size $s$

μ (x, Z_{1}, \dots, Z_{n}) = \binom{n}{s}^{- 1} \sum_{1 \leq i_{1} < \dots < i_{s} \leq n} E_{ξ \in Ξ} [T (x, ξ, Z_{i_{1}}, \dots, Z_{i_{s}})] \approx \frac{1}{B} \sum_{b = 1}^{B} T (x, ξ_{b}^{*}, Z_{b, 1}^{*}, \dots, Z_{b, s}^{*}) .

where $ξ$ summarises randomness in the selection of the variable when growing the tree, $Z_{i} : = (D_{i}, X_{i}, Y_{i})$ is shorthand for a training sample.

where the base learner $T (x; ξ_{b}^{*}, Z_{b, 1}^{*}, \dots, Z_{b, s}^{*}) = \sum_{i \in {i_{b, 1}, \dots i_{b, s}}} α_{i, b}^{*} (x) Y_{i, b}^{*}; α_{i, b}^{*} (x) = \frac{1 _{X_{i, b}^{*} \in L_{b}^{*} (x)}}{∣ i : X _{i, b}^{*} \in L _{b}^{*} ( x ) ∣}$

the ‘honesty’ property is making $α_{i, b}^{*} (x)$ independent of $Y_{i, b}^{*}$ , i.e. do not use the same data to select partition (splits) and make predictions.

Implemented in causalForest and grf.

Multi-action policy learning

$i = 1, \dots, N$ units are to be assigned to one of $J + 1$ actions $A_{i} \in {0, 1, \dots, J} = : A$ , which have corresponding rewards ${Y_{i}^{(0)}, Y_{i}^{(1)}, \dots, Y_{i}^{(J)}}$ . Each unit has covariates $X_{i} \in X \subseteq R^{d}$ . Define a policy function

$π : X \to A$

A given policy assigns each unit to a treatment level. Each policy has a corresponding value function

$V (π) : = E [Y (π (x))]$

An optimal policy $π^{*} \in Π$ is defined as

$π^{*} = arg max_{π \in Π} E [Y (π)]$

The shortfall of a policy relative to this optimum is called regret

$R (π) = E [Y (π^{*})] - E [Y (π)] = V (π^{*}) - V (π)$

Define a CEF as

$μ_{i} (a, x_{i}) = E [Y_{i}^{(a)} ∣ x_{i}]$

The first-best optimal rule is

$π_{i} (x_{i}) = arg max_{a \in A} {μ_{i} (a, x_{i})}$

In the binary action case, this simplifies to $π (x_{i}) = 1 {μ (1, x_{i}) \geq μ (0, x_{i})} = 1 {τ (x_{i}) \geq 0}$ , which is the conditional empirical success (CES) rule of Manski (2004).

Under unconfoundedness and overlap, we can estimate the $\hat{μ}$ s and construct an empirical analogue of the value function for a policy $π$ using the following familiar estimators

V_{RA} (π) = \frac{1}{n} \sum_{i = 1}^{n} μ_{i} (π (x_{i}), x_{i}) .

V_{IPW} (π) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1 [A_{i} = π (x_{i})]}{p_{A_{i}} (x_{i})} Y_{i} .

V_{AIPW} (π) = \frac{1}{n} \sum_{i = 1}^{n} [μ_{i} (π (x_{i}), x_{i}) + \frac{1 [A_{i} = π (x_{i})]}{p_{A_{i}} (x_{i})} (Y_{i} - μ_{i} (π (x_{i}), x_{i}))] .

A $\sqrt{n}$-consistent estimator of the value function is the Cross-fit Augmented Inverse Probability Weighted Learning (CAIPWL) estimator, which is constructed as a cross-fit analogue of the AIPW estimator.
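As a concrete illustration of the IPW value estimator, the sketch below evaluates two candidate policies on logged data. The data-generating process, the known propensity of 1/2, and the function names are assumptions made for the example; it does not implement CAIPWL.

```python
import random


def ipw_policy_value(rows, policy, propensity):
    """IPW value estimate: mean of 1[A_i = pi(X_i)] * Y_i / p_{A_i}(X_i)."""
    total = 0.0
    for xv, av, yv in rows:
        if av == policy(xv):
            total += yv / propensity(av, xv)
    return total / len(rows)


# Toy logged data (assumed design): actions randomised with probability 1/2,
# and action 1 pays x - 0.5 on average, so it helps only when x > 0.5.
rng = random.Random(2)
logged = []
for _ in range(20000):
    xv = rng.random()
    av = rng.randint(0, 1)
    logged.append((xv, av, av * (xv - 0.5) + rng.gauss(0, 0.5)))


def known_propensity(action, x):
    """Logging propensity p_a(x); constant here because actions were randomised."""
    return 0.5


v_targeted = ipw_policy_value(logged, lambda x: int(x > 0.5), known_propensity)
v_treat_all = ipw_policy_value(logged, lambda x: 1, known_propensity)
print(round(v_targeted, 3), round(v_treat_all, 3))
```

In this design the targeted rule has true value 0.125 and treating everyone has value 0, so the gap between the two estimates is an estimate of the regret of the blanket policy.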

Sensitivity Analysis

Check balance by computing the standardised difference (SDiff) for observable confounders

$Standardised Difference = \frac{\overline{X}_{t} - \overline{X}_{c}}{\sqrt{(s_{t}^{2} + s_{c}^{2}) /2}}$
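A minimal sketch of the balance statistic, assuming the usual convention of dividing by the square root of the average of the two sample variances. The function name and the numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def standardised_difference(x_treated, x_control):
    """Difference in means scaled by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Means 4 and 2, sample variances 4 and 1, so the statistic is 2 / sqrt(2.5).
sdiff = standardised_difference([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(sdiff, 4))
```

Unlike a t-statistic, this quantity does not grow with sample size, which is why it is preferred for comparing balance across datasets.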

Three valued treatment indicator: $T_{i} \in {- 1, 0, 1}$ corresponding with ineligibles, eligible nonparticipants, and participants. We can test unconfoundedness by comparing ineligibles with eligible nonparticipants, i.e. test

$Y_{i} ⊥ ⊥ 1 {T_{i} = 0} ∣ X_{i}, T_{i} \in {- 1, 0}$

Placebo Outcomes

Covariates include lagged outcomes $Y_{i, - 1}, \dots, Y_{i, - T}$ . Test $Y_{i, - 1} ⊥ ⊥ D_{i} ∣ Y_{i, - 2}, \dots, Y_{i, - T}, X_{i}$ , e.g. earnings in 1975 in the Lalonde data.

$U$ is an unobserved confounder, treated as a nuisance variable, such that

$Y_{1}, Y_{0} ⊥ ⊥ D ∣ X, U$

where $U \sim Bernoulli (0.5)$ , i.e. $P (U = 1) = P (U = 0) = 0.5$ , and $U ⊥ ⊥ X$ .

Propensity score is Logistic:

$P (D = 1∣ X, U) = \frac{e x p ( Xθ + γ U )}{1 + e x p ( Xθ + γ U )}$

$γ$ indicates strength of relationship between $U$ and $D ∣ X$ .

Y is conditionally normal $Y ∣ X, U \sim N (α D + Xβ + δ U, σ^{2})$

$δ$ indicates strength of relationship between $U$ and $Y ∣ X$ .

MLE setup

Construct a grid of $(γ, δ)$ values and, for each grid point, calculate the MLE $\hat{α} (γ, δ)$ by maximising $l (α, β, θ, γ, δ)$ over the remaining parameters with $(γ, δ)$ held fixed.

Use 2 partial $R^{2}$ s:

$R_{Y, p a r}^{2} (δ)$ : Residual variation in outcome explained by $U$ (after partialling out $X$ ).
$R_{D, p a r}^{2} (γ)$ : Residual variation in treatment assignment explained by $U$ (after partialling out $X$ ).

Draw threshold contours, should expect most covariates to be clustered around origin.

Rosenbaum (2002)

Tuning parameter $Γ \geq 1$ that measures departure from zero hidden bias.

For any two observations $i$ and $j$ with identical covariate values $X_{i} = X_{j}$ , under unconfoundedness, probability of assignment into treatment should be identical $π (X_{i}) = π (X_{j})$

Treatment assignment probability may differ due to unobserved binary confounder $U$ . We can bound this by the ratio:

\frac{1}{Γ} \leq \frac{π _{i} ( 1 - π _{j} )}{( 1 - π _{i} ) π _{j}} \leq Γ.

$Γ = 1 ⟹$ no hidden bias. $Γ = 2 ⟹$ the odds of treatment for $i$ can be up to twice those of $j$ despite identical $x$ .

$Γ$ is assumed to satisfy

\frac{1}{Γ} \leq \frac{Pr ( D = d ∣ X = x ) / ( 1 - Pr ( D = d ∣ X = x ) )}{Pr ( D = d ∣ X = x , Y ( d ) = y ) / ( 1 - Pr ( D = d ∣ X = x , Y ( d ) = y ) )} \leq Γ.

For any given candidate $Γ > 1$ , estimates of the treatment effect can be computed. Implemented in rbounds::hlsens.

Altonji, Elder, Taber (2005)

Only informative if selection on observables is informative about selection on unobservables.

How much does treatment effect move when controls are added? Estimate model with and without controls:

$Y_{i} = α^{F} D_{i} + Xβ + ϵ$
$Y_{i} = α^{R} D_{i} + ϵ$

AET ratio: $ρ = \frac{\hat{α}^{F}}{\hat{α}^{R} - \hat{α}^{F}}$

Want $ρ$ to be as big as possible (i.e. $\hat{α}^{R} - \hat{α}^{F} \to 0$ under unconfoundedness).

Define proportional selection coefficient

δ = \frac{Cov (ϵ, D) / V [ϵ]}{Cov (X^{'} γ, D) / V [X^{'} γ]} .

Then,

β^{*} \approx \tilde{β} - δ (\dot{β} - \tilde{β}) \frac{R_{max} - \tilde{R}}{\tilde{R} - \dot{R}}, β^{*} \overset{p}{\to} β .

where

$\dot{β}, \dot{R}$ are from a univariate regression of $Y$ on the treatment $D$

$\tilde{β}, \tilde{R}$ are from a regression including controls

$R_{ma x}$ is maximum achievable $R^{2}$

True model is $Y = τ D + X β + γ Z + ε$ , but we don’t observe $Z$ . We would like to quantify how biased the coefficient from the short regression $τ_{s}$ is for the long regression coefficient $τ$ . From the OVB formula, we know $τ_{s} = τ + γ δ$ where $γ$ is the conditional association between the omitted $Z$ and $Y$ (‘impact’) and $δ$ is the coefficient from regressing $Z$ on $D$ (‘imbalance’).

The bias from this omission is

∣ Bias ∣ = \sqrt{\frac{R_{Y \sim Z ∣ D, X}^{2} R_{D \sim Z ∣ X}^{2}}{1 - R_{D \sim Z ∣ X}^{2}}} \cdot \frac{sd (Y^{⊥ (X, D)})}{sd (D^{⊥ X})} .

They then define the robustness value

$RV_{q} = \frac{1}{2} [\sqrt{f_{q}^{4} + 4 f_{q}^{2}} - f_{q}^{2}]$

where $f_{q} : = q f_{Y \sim D ∣ X}$ where $f_{Y \sim D ∣ X}$ is the partial Cohen’s $f$ of the treatment with the outcome, and $q$ is the proportion of reduction on the treatment coefficient $τ$ that would be deemed problematic.
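The robustness value is a one-line computation once the partial Cohen's $f$ is available. The sketch below assumes the standard relation between a regression t-statistic and the partial $f$ ( $f = ∣ t ∣ / \sqrt{dof}$ ); the function names and inputs are hypothetical.

```python
import math


def partial_f(t_stat, dof):
    """Partial Cohen's f of the treatment, from its t-statistic and residual dof."""
    return abs(t_stat) / math.sqrt(dof)


def robustness_value(f, q=1.0):
    """RV_q = 0.5 * (sqrt(f_q^4 + 4 f_q^2) - f_q^2), with f_q = q * |f|."""
    fq = q * abs(f)
    return 0.5 * (math.sqrt(fq ** 4 + 4 * fq ** 2) - fq ** 2)


# Hypothetical regression output: t = 4 on 100 residual degrees of freedom.
f_treatment = partial_f(4.0, 100)
rv_full = robustness_value(f_treatment, q=1.0)    # confounding needed to erase the estimate
rv_half = robustness_value(f_treatment, q=0.5)    # confounding needed to halve it
print(round(rv_full, 3), round(rv_half, 3))
```

A confounder would need to explain at least a share `rv_full` of the residual variance of both treatment and outcome to bring the estimate to zero.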

Partial Identification

the ATE can be decomposed as

ATE = E [Y (1)] - E [Y (0)] = E [Y_{i} (1) ∣ D_{i} = 1] Pr (D_{i} = 1) + E [Y_{i} (1) ∣ D_{i} = 0] Pr (D_{i} = 0) - E [Y_{i} (0) ∣ D_{i} = 1] Pr (D_{i} = 1) - E [Y_{i} (0) ∣ D_{i} = 0] Pr (D_{i} = 0) .

The terms $E [Y_{i} (1) ∣ D_{i} = 0]$ and $E [Y_{i} (0) ∣ D_{i} = 1]$ are counterfactual means about which the data contain no information. Bounding approaches involve estimators for these missing quantities.

Suppose all we know is $Y^{d} \in [0, 1]$

w.l.o.g. given bounded support $[\underline{Y}, \overline{Y}]$ , we can always min-max rescale to $\frac{Y - \underline{Y}}{\overline{Y} - \underline{Y}}$

E [Y^{1} - Y^{0}] \in [{E [Y ∣ D = 1] Pr (D = 1) - E [Y ∣ D = 0] (1 - Pr (D = 1))} - Pr (D = 1), {E [Y ∣ D = 1] Pr (D = 1) - E [Y ∣ D = 0] (1 - Pr (D = 1))} + (1 - Pr (D = 1))] .

The identified interval always has width 1: it ranges from $[- 1, 0]$ at one extreme to $[0, 1]$ at the other, so the worst-case interval always contains 0. We need theory/assumptions to even get the sign right.

Assume bounded support for the outcome. Replace missing values with maximum ( $y^{U B}$ ) or minimum ( $y^{L B}$ ) of support. These are worst-case bounds and yield intervals that are basically uninformative.

E [Y (1)]^{U B} = E [Y ∣ D = 1] Pr (D = 1) + y^{U B} Pr (D = 0) .

E [Y (1)]^{L B} = E [Y ∣ D = 1] Pr (D = 1) + y^{L B} Pr (D = 0) .

E [Y (0)]^{U B} = y^{U B} Pr (D = 1) + E [Y ∣ D = 0] Pr (D = 0) .

E [Y (0)]^{L B} = y^{L B} Pr (D = 1) + E [Y ∣ D = 0] Pr (D = 0) .

And denote $Δ^{U B} : = E [Y (1)]^{U B} - E [Y (0)]^{L B}$ and $Δ^{L B} : = E [Y (1)]^{L B} - E [Y (0)]^{U B}$ .
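The worst-case bounds can be computed directly from three observable quantities. A minimal sketch, with a hypothetical function name and made-up inputs, assuming the outcome support is $[0, 1]$ by default:

```python
def worst_case_bounds(ey_treated, ey_control, p_treated, y_lb=0.0, y_ub=1.0):
    """Worst-case bounds on E[Y(1)] - E[Y(0)] for an outcome supported on [y_lb, y_ub]."""
    p1, p0 = p_treated, 1 - p_treated
    # Fill in the unobserved arm with the extremes of the support.
    ey1_lb = ey_treated * p1 + y_lb * p0
    ey1_ub = ey_treated * p1 + y_ub * p0
    ey0_lb = y_lb * p1 + ey_control * p0
    ey0_ub = y_ub * p1 + ey_control * p0
    return ey1_lb - ey0_ub, ey1_ub - ey0_lb


# E[Y | D = 1] = 0.7, E[Y | D = 0] = 0.4, Pr(D = 1) = 0.5.
lower, upper = worst_case_bounds(0.7, 0.4, 0.5)
print(lower, upper)
```

With these inputs the interval runs from about -0.35 to 0.65: its width equals the width of the outcome support and it straddles zero, as it does for any inputs.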

Monotone Treatment Response: assume the mean potential outcome under treatment cannot be lower than under control, $E [Y (1)] \geq E [Y (0)]$ , i.e. $Δ \geq 0$ . Then

$Δ^{L B} = max (E [Y (1)]^{L B} - E [Y (0)]^{U B}, 0)$

Monotone Treatment Selection: subjects select themselves into treatment in a way the mean potential outcomes of the treatment and control groups can be ordered. Positive MTS implies $E [Y (1) ∣ D = 1] \geq E [Y (1) ∣ D = 0]$ and $E [Y (0) ∣ D = 1] \geq E [Y (0) ∣ D = 0]$ . This implies $E [Y (0)]^{L B} = E [Y ∣ D = 0]$ and $E [Y (1)]^{U B} = E [Y ∣ D = 1]$

Let $τ_{i} : = Y_{1 i} - Y_{0 i}$ denote the treatment effect and $F$ denote its distribution function, and let $F_{1}, F_{0}$ denote the distribution functions of the two potential outcomes. Then, $F^{L} (b) \leq F (b) \leq F^{U} (b)$ where

F^{L} (b) = max {\max_{y} [F_{1} (y) - F_{0} (y - b)], 0}

F^{U} (b) = 1 + min {\min_{y} [F_{1} (y) - F_{0} (y - b)], 0}

Instrumental Variables

When SOO fails, $E [X_{i} ϵ_{i}] \neq 0$ because of OVB, and $\hat{β}_{O L S}$ is no longer consistent. Use $Z$ as an instrument for $D$ , which isolates variation in $D$ unrelated to the omitted variable.

Traditional IV Framework (Constant Treatment Effects)

Setup

Second Stage: $Y = α_{0} + α_{1} D + u_{2}$

First Stage: $D = π_{0} + π_{1} Z + u_{1}$

Reduced Form: $$\begin{aligned} Y & = \gamma_0 + \gamma_1 Z + u_3 \\ & = \alpha_0 + \alpha_1 ( \pi_0 + \pi_1 Z + u_1) + u_2 \\ & = (\alpha_0 + \alpha_1 \pi_0) + \underbrace{(\alpha_1 \pi_1)}_{\gamma_1} Z + (\alpha_1 u_1 + u_2) \end{aligned}$$

Exogeneity (as good as random conditional on covariates): $Cov (u_{1}, Z) = 0$

Exclusion Restriction: $Cov (u_{2}, Z) = 0$ , i.e. $Z$ has no effect on $Y$ except through $D$ .

Relevance: $Z$ affects $D$

With the above assumptions, we can write

$\hat{β}_{I V} = (Z^{'} X)^{- 1} Z^{'} y$

In the just-identified scalar case, this is equivalent to the ratio of the reduced form to the first stage; with a binary instrument it reduces further to the Wald estimator:

α_{1} = \frac{γ _{1}}{π _{1}} = \frac{Cov ( Y , Z )}{Cov ( D , Z )} = \frac{E [ Y ∣ Z = 1 ] - E [ Y ∣ Z = 0 ]}{E [ D ∣ Z = 1 ] - E [ D ∣ Z = 0 ]} .
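The equality between the covariance ratio and the Wald ratio is easy to verify numerically. The eight observations below are made up for illustration, and the helper names are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return mean((x - ma) * (y - mb) for x, y in zip(a, b))


# Toy data as (Z, D, Y) triples.
rows = [(1, 1, 5.0), (1, 1, 6.0), (1, 1, 7.0), (1, 0, 2.0),
        (0, 0, 1.0), (0, 0, 2.0), (0, 1, 6.0), (0, 0, 3.0)]
z = [r[0] for r in rows]
d = [r[1] for r in rows]
y = [r[2] for r in rows]

# Reduced form and first stage as differences in means by instrument value.
reduced_form = mean(yi for zi, yi in zip(z, y) if zi == 1) - mean(yi for zi, yi in zip(z, y) if zi == 0)
first_stage = mean(di for zi, di in zip(z, d) if zi == 1) - mean(di for zi, di in zip(z, d) if zi == 0)

wald = reduced_form / first_stage
cov_ratio = cov(y, z) / cov(d, z)
print(wald, cov_ratio)
```

Here the reduced form is 2, the first stage is 0.5, and both ratios equal 4.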

With multiple instruments or endogenous variables,

α_{2 S L S} = (X^{'} P_{z} X)^{- 1} X^{'} P_{z} y

where $P_{z} = Z (Z^{'} Z)^{- 1} Z^{'}$ is the projection matrix onto the column space of $Z$ . More generally, with $M_{z} : = I - P_{z}$ the residual-maker matrix, the $k$-class estimator is

α_{k} = (X^{'} (I - k M_{z}) X)^{- 1} X^{'} (I - k M_{z}) y

which nests OLS, 2SLS, LIML, and Fuller’s estimator as special cases. Specifically,

$k = 0 ⟹ α_{k}$ is OLS

$k = 1 ⟹ α_{k}$ is 2SLS

$k = k_{LIML} ⟹ α_{k}$ is LIML

$k = k_{LIML} - \frac{b}{n - L - p}; b > 0 ⟹ α_{k}$ is Fuller’s estimator

here, $k_{LIML}$ is the minimum value of $k$ that satisfies

det \begin{pmatrix} y^{⊤} (I - k M_{z}) y & y^{⊤} (I - k M_{z}) X \\ X^{⊤} (I - k M_{z}) y & X^{⊤} (I - k M_{z}) X \end{pmatrix} = 0

Implemented in ivmodel, which takes model fits from AER::ivreg and computes LIML / k-class estimates.

Asymptotically, all $k -$ class estimators are consistent for $α$ when $k \to 1, n \to \infty$ .

Inference

Under homoscedasticity, $V [\hat{α}_{2 S L S}] = σ^{2} (X^{'} P_{z} X)^{- 1}$

Under heteroskedasticity, $V (\hat{β}_{I V}) = (X^{'} P_{z} X)^{- 1} X^{'} P_{z} \hat{Ω} P_{z} X (X^{'} P_{z} X)^{- 1}; \hat{Ω} = Diag [\hat{u}_{i}^{2}]$

Endogeneity (Hausman) test statistic and null distribution: $H : = \frac{(\hat{β}_{2 s l s} - \hat{β}_{o l s})^{2}}{V (\hat{β}_{2 s l s}) - V (\hat{β}_{o l s})} \sim χ_{1}^{2}$

Equivalently, assuming the instrument $Z$ is valid, we can test whether $x$ is endogenous by estimating the following regression

$y_{i} = Z_{i}^{'} π + x_{i} α_{1} + \hat{v}_{i} ρ_{1} + ϵ_{i}$

where $\hat{v}$ are the (fitted) residuals from estimating the first stage regression $x_{i} = Z_{i}^{'} ψ + v_{i}$ . A standard t-test of $ρ_{1} = 0$ tests whether $x$ is exogenous, assuming $Z_{i}$ is a valid set of instruments. [Because the test presumes instrument validity, it is of limited use in practice.]

Weak Instruments

plim \hat{α}_{I V} = \frac{Cov (Y, Z)}{Cov (Z, D)} = α_{1} + \frac{Cov (Z, u_{2})}{Cov (Z, D)} .

The second term is non-zero if the instrument is not exogenous, and it is amplified when $Cov (Z, D)$ is small. Let $σ_{u_{1}, u_{2}} = Cov (u_{1}, u_{2})$ , let $σ_{u_{1}}^{2} = V [u_{1}]$ [variance of first stage error], and let $F$ be the F statistic of the first stage. Then, the finite-sample bias in IV is approximately

E [\hat{α}_{I V} - α_{1}] \approx \frac{σ_{u_{1}, u_{2}}}{σ_{u_{1}}^{2}} \frac{1}{F + 1}

If the first stage is weak ( $F \to 0$ ), the bias approaches $\frac{σ_{u_{1}, u_{2}}}{σ_{u_{1}}^{2}}$ , the bias of OLS. As $F \to \infty$ , $B_{I V} \to 0$ .

When instruments are weak, Anderson-Rubin (AR) confidence intervals are preferable to eyeballing F-statistics. Let $M$ be the $n \times 2$ matrix $(y, X)$ , let $a_{0} = (β_{0}, 1)^{⊤}$ , $b_{0} = (1, - β_{0})^{⊤}$ (where $β_{0}$ is typically 0), and let

Σ = \frac{M^{⊤} M_{z} M}{n - L - p}, M_{z} : = I - P_{z}

be an estimator for the covariance matrix of the errors.

Let $s, t$ be $L$-dimensional vectors defined as

s : = (Z^{⊤} Z)^{- \frac{1}{2}} Z^{⊤} M b_{0} (b_{0}^{⊤} Σ b_{0})^{- \frac{1}{2}}

and

t : = (Z^{⊤} Z)^{- \frac{1}{2}} Z^{⊤} M Σ^{- 1} a_{0} (a_{0}^{⊤} Σ^{- 1} a_{0})^{- \frac{1}{2}}

Define the scalars $Q_{1} = s^{⊤} s, Q_{2} = s^{⊤} t, Q_{3} = t^{⊤} t$

Based on these scalars, two tests of $H_{0} : β = β_{0}$ are fully robust to weak instruments: the Anderson-Rubin test (Anderson and Rubin 1949) and the Conditional Likelihood Ratio test (Moreira 2003)

A R (β_{0}) = \frac{Q_{1}}{L}

C L R (β_{0}) = \frac{1}{2} (Q_{1} - Q_{3} + \sqrt{(Q_{1} + Q_{3})^{2} - 4 (Q_{1} Q_{3} - Q_{2}^{2})})

IV with Heterogeneous Treatment Effects / LATE Theorem

binary instrument $Z_{i} \in {0, 1}$

binary treatment $D_{z} \in {0, 1}$ is potential treatment status given $Z = z$

potential outcomes: $Y_{i} (D, Z) = {Y (1, 1), Y (1, 0), Y (0, 1), Y (0, 0)}$

heterogeneous treatment effects $β_{i} = Y_{i} (1) - Y_{i} (0)$

Compliers: $D_{1} > D_{0}$ , $D_{0} = 0$ , $D_{1} = 1$

Always takers: $D_{0} = D_{1} = 1$

Never Takers : $D_{0} = D_{1} = 0$

Defiers: $D_{1} < D_{0}$

A1: Independence of Instrument : ${Y_{0}, Y_{1}, D_{0}, D_{1}} ⊥ ⊥ Z$

A2: Exclusion restriction : $Y_{i} (d, 0) = Y_{i} (d, 1) \equiv Y_{d i} for d = 0, 1$

A3: First Stage: $E [D_{1 i} - D_{0 i}] \neq 0$

A4: Monotonicity / No defiers: $D_{1 i} - D_{0 i} \geq 0 \forall i$ or vice versa

Under A1-A4,

α_{I V} = \frac{E [ Y ∣ Z = 1 ] - E [ Y ∣ Z = 0 ]}{E [ D ∣ Z = 1 ] - E [ D ∣ Z = 0 ]} = E [Y_{1 i} - Y_{0 i} ∣ D_{1 i} > D_{0 i}]

If A1:A4 are satisfied, the IV estimate is the Local Average Treatment Effect for the compliers.
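The LATE theorem can be checked on a tiny constructed population in which every unit's compliance type and both potential outcomes are known. The four units below are made up for illustration; each appears once under $Z = 0$ and once under $Z = 1$ , so the instrument is exactly independent of type.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Each unit: (D_0, D_1, Y_0, Y_1). Two compliers, one always-taker, one never-taker.
population = [
    (0, 1, 1.0, 4.0),    # complier, effect 3.0
    (0, 1, 2.0, 4.0),    # complier, effect 2.0
    (1, 1, 3.0, 3.5),    # always-taker, effect 0.5
    (0, 0, 0.0, 10.0),   # never-taker, effect 10.0
]

# Observed data: realised treatment and outcome under each instrument value.
observed = []
for d0, d1, y0, y1 in population:
    for z in (0, 1):
        d = d1 if z == 1 else d0
        observed.append((z, d, y1 if d == 1 else y0))

itt = mean(y for z, _, y in observed if z == 1) - mean(y for z, _, y in observed if z == 0)
first_stage = mean(d for z, d, _ in observed if z == 1) - mean(d for z, d, _ in observed if z == 0)
wald = itt / first_stage

complier_effect = mean(y1 - y0 for d0, d1, y0, y1 in population if d1 > d0)
ate = mean(y1 - y0 for _, _, y0, y1 in population)
print(wald, complier_effect, ate)
```

The Wald ratio recovers the average effect among compliers (2.5) and not the population ATE (3.875), which is driven by the never-taker's large effect.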

Writing $π_{1 i} : = D_{1 i} - D_{0 i}$ for unit $i$ 's first-stage effect and $β_{1 i}$ for its treatment effect,

LATE = \frac{E [β_{1 i} π_{1 i}]}{E [π_{1 i}]} = ATE + \frac{Cov (β_{1 i}, π_{1 i})}{E [π_{1 i}]}

So, LATE weights each unit's treatment effect by $π_{1 i}$ ; i.e. it is the treatment effect for those whose treatment status is most influenced by $Z_{i}$ .

IV in randomized trials with one-sided noncompliance: suppose A1-A4 hold and $E [D ∣ Z = 0] = Pr (D = 1∣ Z = 0) = 0$ . Then,

\frac{E [ Y ∣ Z = 1 ] - E [ Y ∣ Z = 0 ]}{Pr ( D = 1∣ Z = 1 )} = \frac{ITT}{Compliance} = E [Y_{1} - Y_{0} ∣ D = 1] = A TT

Precision for LATE Estimation

$SE_{LATE} \approx \frac{SE_{ITT}}{Compliance}$

Characterising Compliers

PO Model of IV allows for heterogeneous treatment effects but does not formally identify LATE conditional on X.

Abadie's approach extends these methods by allowing the treatment inducer (instrument) to be randomized only conditionally on the covariates, and by allowing the outcome to depend on the covariates as well as on treatment intake. It also uses semiparametric estimates of the probability of receiving the treatment inducement, which helps to identify the treatment effects in a more robust way.

Need the following assumptions (all conditional on $X)$ :

- Independence of instrument: $Z ⊥ ⊥ (D (z), Y (z^{'}, d)) ∣ X \forall z, z^{'}, d \in {0, 1}$ , i.e. SOO w.r.t. the instrument.
- Exclusion restriction: $Pr (Y (1, d) = Y (0, d) = Y (d) ∣ X) = 1$
- Monotonicity: $Pr (D (1) \geq D (0) ∣ X) = 1$
- First Stage: $E [D ∣ Z = 1, X] - E [D ∣ Z = 0, X] \neq 0$
- Common Support: $0 < Pr (Z = 1 ∣ X) < 1$

Specifically, when the treatment inducer Z is as good as randomized after conditioning on covariates X, Abadie proposed a two-stage procedure to estimate treatment effects.

Estimate the probability of receiving the treatment inducement $P (Z = 1∣ X)$ (preferably using a semiparametric estimator) in order to provide a set of pseudo-weights.

Second, the pseudo-weights are used to estimate the local average response function (LARF) of the outcome conditional on the treatment and covariates.

The estimated coefficient for the treatment intake D reflects the conditional treatment effect.

Given monotonicity, we can identify the proportion of compliers, never-takers, and always-takers respectively.

π_{compliers} = Pr (D_{1} > D_{0} ∣ X) = E [D ∣ X, Z = 1] - E [D ∣ X, Z = 0]

π_{always-takers} = Pr (D_{1} = D_{0} = 1∣ X) = E [D ∣ X, Z = 0]

π_{never-takers} = Pr (D_{1} = D_{0} = 0∣ X) = 1 - E [D ∣ X, Z = 1]

If nobody assigned to control has access to the treatment (i.e. $E [D ∣ Z = 0] = 0$ ), then $LATE = ATT$ .

By Bayes rule,

Pr (D_{1} > D_{0} ∣ D = 1) = \frac{Pr ( D = 1∣ D _{1} > D _{0} ) Pr ( D _{1} > D _{0} )}{Pr ( D = 1 )} = \frac{Pr ( Z = 1 ) [ E [ D ∣ Z = 1 ] - E [ D ∣ Z = 0 ] ]}{Pr ( D = 1 )}

Suppose assumptions of LATE thm hold conditional on covariates $X$ . Let $g (\cdot)$ be any measurable real function of $Y, D, X$ with finite expectation. We can show that the expectation of $g$ is a weighted sum of the expectation in the three groups

E [g ∣ X] = \underbrace{E [g ∣ X, D_{1} > D_{0}] Pr (D_{1} > D_{0} ∣ X)}_{Compliers} + \underbrace{E [g ∣ X, D_{1} = D_{0} = 1] Pr (D_{1} = D_{0} = 1∣ X)}_{Always takers} + \underbrace{E [g ∣ X, D_{1} = D_{0} = 0] Pr (D_{1} = D_{0} = 0 ∣ X)}_{Never takers}

Rearranging terms gives

E [g (Y, D, X) ∣ D_{1} > D_{0}] = \frac{E [ κ \cdot g ( Y , D , X ) ]}{Pr ( D _{1} > D _{0} )} = \frac{E [ κ \cdot g ( Y , D , X ) ]}{E [ κ ]}

where

κ_{i} = 1 - \frac{D ( 1 - Z )}{1 - Pr ( Z = 1 ∣ X )} - \frac{( 1 - D ) Z}{Pr ( Z = 1 ∣ X )}

This result can be applied to any characteristic or outcome to get its mean for compliers by removing the means for never-takers and always-takers. [p 181-183] provides an overview of estimation. The trick is that $κ_{i}$ is negative for observations with $D_{i} \neq Z_{i}$ (always-takers with $Z_{i} = 0$ and never-takers with $Z_{i} = 1$ ), so that in expectation the contributions of always-takers and never-takers cancel and only compliers receive positive weight.

To compute $κ$ , we need $Pr (Z = 1∣ X)$ , which can be computed using a standard logit/probit or a power-series.

Standard example: average covariate value among compliers:

$E [X ∣ D_{1} > D_{0}] = \frac{E [ κ X ]}{E [ κ ]}$

is the weighted average of covariate $X$ using Kappa weights.
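A small check that kappa weighting isolates compliers, using a constructed population with known compliance types. The four units, their covariate values, and the helper name `kappa` are assumptions for illustration; each unit appears once under each instrument value, so $Pr (Z = 1 ∣ X) = 0.5$ .

```python
def kappa(d, z, p_z):
    """Abadie's kappa: 1 - D(1 - Z)/(1 - p) - (1 - D)Z/p, with p = Pr(Z = 1 | X)."""
    return 1 - d * (1 - z) / (1 - p_z) - (1 - d) * z / p_z


# Each unit: (D_0, D_1, X). Two compliers, one always-taker, one never-taker.
population = [(0, 1, 1.0), (0, 1, 3.0), (1, 1, 10.0), (0, 0, 20.0)]

# Observed (Z, D, X) rows under each instrument value.
rows = []
for d0, d1, x in population:
    for z in (0, 1):
        rows.append((z, d1 if z == 1 else d0, x))

p_z = 0.5
weights = [kappa(d, z, p_z) for z, d, _ in rows]

share_compliers = sum(weights) / len(rows)                       # E[kappa]
complier_mean_x = sum(k * x for k, (_, _, x) in zip(weights, rows)) / sum(weights)
overall_mean_x = sum(x for _, _, x in rows) / len(rows)
print(share_compliers, complier_mean_x, overall_mean_x)
```

The always-taker and never-taker each contribute one weight of +1 and one of -1, so they drop out: the weighted mean of $X$ is 2.0 (the complier mean) against an overall mean of 8.5.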

Likelihood that Complier has a given value of (Bernoulli distributed) characteristic X relative to the rest of the population is given by

\frac{E [ D ∣ Z = 1 , X = 1 ] - E [ D ∣ Z = 0 , X = 1 ]}{E [ D ∣ Z = 1 ] - E [ D ∣ Z = 0 ]} = \frac{FS in Subgroup}{Overall FS}

Assume A1-A4 from LATE. Generalise $D$ to take values in the set ${0, 1, \dots, \check{D}}$ ; let $Y_{d i} : = f_{i} (d)$ denote the potential (or latent) outcome for person $i$ for treatment level $d$ . Then,

\frac{E [Y ∣ Z = 1] - E [Y ∣ Z = 0]}{E [D ∣ Z = 1] - E [D ∣ Z = 0]} = \sum_{d = 1}^{\check{D}} ω_{d} E [Y_{d i} - Y_{d - 1, i} ∣ d_{1 i} \geq d > d_{0 i}]

where the weights

ω_{d} = \frac{Pr (d_{1 i} \geq d > d_{0 i})}{\sum_{j = 1}^{\check{D}} Pr (d_{1 i} \geq j > d_{0 i})}

are non-negative and sum to 1.

CEF of $Y ∣ X, D$ for the subpopulation of compliers: $E [Y ∣ X, D, D_{1} > D_{0}]$

$E [Y ∣ X, D, D_{1} > D_{0}] = \frac{E [κ Y ∣ X, D]}{E [κ ∣ X, D]}$

Estimate $κ$

Estimate $E [Y ∣ X, D]$ in the whole population, weighting by $κ$

implemented in LARF::larf in R.

Treatment is $W$ . First define two additional quantities

- $P_{A, C, i} : = Pr (W_{1} > W_{0} \cup W_{0} = 1∣ X_{i} = x) = F (x_{i}^{'} θ_{A, C})$ is the conditional probability that unit $i$ is either a complier *or* an always-taker. Assume that this probability is a function of covariates $X_{i}$ , with corresponding parameter vector $θ_{A, C}$ and CDF $F$ that transforms it to the probability scale [taken to be the normal CDF $Φ$ henceforth, but can be relaxed].
- $P_{A ∣ A, C, i} : = Pr (W_{0} = 1∣ W_{1} > W_{0} \cup W_{0} = 1, X_{i} = x) = F (x_{i}^{'} ϕ_{A ∣ A, C})$ is the conditional probability that unit $i$ is an always-taker *conditional* on being either a complier or an always-taker. Assume that this probability is a function of covariates with corresponding parameter vector $ϕ_{A ∣ A, C}$ .

Next, they note that the probability of treatment for stratum $X_{i} = x_{i}$ can be written as

Pr (W = 1 ∣ X_{i} = x_{i}) = \underbrace{Pr (W_{1} > W_{0} ∣ X_{i} = x_{i}) Z_{i}}_{\text{compliers assigned to treatment}} + \underbrace{Pr (W_{0} = 1 ∣ X_{i} = x_{i})}_{\text{always takers}}

Using the two conditional probabilities defined above, this can be written as

Pr (W = 1 ∣ X_{i} = x_{i}) = P_{A, C, i} (1 - P_{A ∣ A, C, i}) Z_{i} + P_{A, C, i} P_{A ∣ A, C, i}

which, for binary treatment $W_{i}$ lets us write a Bernoulli likelihood for an observation

ℓ_{i} (P_{A ∣ A, C, i}, P_{A, C, i} ∣ W, Z) = (P_{A, C, i} (1 - P_{A ∣ A, C, i}) Z_{i} + P_{A, C, i} P_{A ∣ A, C, i})^{W_{i}} (1 - P_{A, C, i} (1 - P_{A ∣ A, C, i}) Z_{i} - P_{A, C, i} P_{A ∣ A, C, i})^{1 - W_{i}}

Plugging in the definitions of $P_{A, C, i}$ and $P_{A ∣ A, C, i}$ gives us the likelihood and its argmax defines the solution for $θ_{A, C}$ and $ϕ_{A ∣ A, C}$ . This is generically a difficult optimisation problem and improving its computation is a promising avenue for future research.

L (θ_{A, C}, ϕ_{A ∣ A, C} ∣ W, Z) = \prod_{i = 1}^{n} (F (x_{i}^{'} θ_{A, C}) (1 - F (x_{i}^{'} ϕ_{A ∣ A, C})) Z_{i} + F (x_{i}^{'} θ_{A, C}) F (x_{i}^{'} ϕ_{A ∣ A, C}))^{W_{i}} (1 - F (x_{i}^{'} θ_{A, C}) (1 - F (x_{i}^{'} ϕ_{A ∣ A, C})) Z_{i} - F (x_{i}^{'} θ_{A, C}) F (x_{i}^{'} ϕ_{A ∣ A, C}))^{1 - W_{i}}

The maximum likelihood estimates of the two parameter vectors can be plugged into $F$ to compute individual compliance scores

P_{C, i} = Pr (W_{1} > W_{0} ∣ X_{i} = x_{i}) = F (x_{i}^{'} θ_{A, C}) [1 - F (x_{i}^{'} ϕ_{A ∣ A, C})]

The inverse compliance score weighted estimator for the ATE with estimated weights $\hat{ω}_{C i} : = 1/ \hat{P}_{C, i}$ is then

\hat{τ}_{ICSW}^{ATE} = \frac{\frac{\sum_{i = 1}^{n} \hat{ω}_{C i} Z_{i} Y_{i}}{\sum_{i = 1}^{n} \hat{ω}_{C i} Z_{i}} - \frac{\sum_{i = 1}^{n} \hat{ω}_{C i} (1 - Z_{i}) Y_{i}}{\sum_{i = 1}^{n} \hat{ω}_{C i} (1 - Z_{i})}}{\frac{\sum_{i = 1}^{n} \hat{ω}_{C i} Z_{i} W_{i}}{\sum_{i = 1}^{n} \hat{ω}_{C i} Z_{i}} - \frac{\sum_{i = 1}^{n} \hat{ω}_{C i} (1 - Z_{i}) W_{i}}{\sum_{i = 1}^{n} \hat{ω}_{C i} (1 - Z_{i})}}

which is a weighted version of the familiar Wald estimator with a Hajek correction that normalises each expectation by the sum of weights in that treatment group.
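A minimal sketch of the Hajek-normalised weighted Wald ratio follows. It assumes the compliance scores have already been estimated and are supplied as a list; the function names and the toy data are hypothetical, and the maximum likelihood step is not shown.

```python
def hajek_mean(values, weights):
    """Weighted mean normalised by the sum of weights (Hajek form)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def icsw_wald(y, w, z, compliance_scores):
    """Weighted Wald ratio with weights 1 / compliance score.

    y: outcomes, w: treatment received, z: binary instrument (assignment).
    """
    omega = [1.0 / p for p in compliance_scores]
    assigned = [i for i, zi in enumerate(z) if zi == 1]
    not_assigned = [i for i, zi in enumerate(z) if zi == 0]

    def arm_mean(values, idx):
        return hajek_mean([values[i] for i in idx], [omega[i] for i in idx])

    reduced_form = arm_mean(y, assigned) - arm_mean(y, not_assigned)
    first_stage = arm_mean(w, assigned) - arm_mean(w, not_assigned)
    return reduced_form / first_stage


# Hypothetical toy data; constant scores reduce ICSW to the plain Wald ratio.
z = [1, 1, 1, 1, 0, 0, 0, 0]
w = [1, 1, 1, 0, 0, 0, 0, 1]
y = [3.0, 4.0, 5.0, 1.0, 1.0, 2.0, 1.0, 4.0]
tau_equal_scores = icsw_wald(y, w, z, [0.5] * 8)
```

With constant compliance scores the weights cancel, which gives a quick sanity check against the unweighted Wald estimator (reduced form 1.25 over first stage 0.5 here).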

SSIV setting from [notation and exposition from PGP’s slides]. We want to estimate the causal effect or structural parameter $τ$ in

$y_{l} = τ w_{l} + γ^{⊤} x_{l} + ε_{l}$

where $Cov (ε_{l}, w_{l}) \neq 0$ because the ‘treatment’ $w_{l}$ is typically a change in an economic quantity (e.g. employment) that is correlated with unobserved shocks to the outcome $y_{l}$ (e.g. wages). $l$ indexes locations.

An accounting identity that decomposes the treatment is

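w_{l} = \sum_{k = 1}^{K} \underbrace{z_{l k}}_{\text{location-industry shares}} \underbrace{g_{l k}}_{\text{location-industry shifts}}

where $k$ indexes industries. A second accounting identity for the location-industry shifts is

\underbrace{g_{l k}}_{\text{location-industry shift}} = \underbrace{g_{k}}_{\text{industry shift}} + \underbrace{\tilde{g}_{l k}}_{\text{location-industry shock (unobserved)}}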

As a GMM system

y_{lt} = D_{lt}^{⊤} β_{0} + τ w_{lt} + ε_{lt}

w_{lt} = D_{lt}^{⊤} γ_{0} + ψ B_{lt} + η_{lt}

g_{l k t} = g_{k t} + \tilde{g}_{l k t}

B_{lt} = \sum_{k = 1}^{K} z_{l k 0} g_{k t}

$D_{lt}$ denotes exogenous controls and fixed effects.

{{w_{lt}, D_{lt}, ε_{lt}}_{t = 1}^{T}}_{l = 1}^{L} are IID, L \to \infty

Under constant $τ$ , we need:

Exogeneity: $E [B_{lt} ε_{lt} ∣ D_{lt}] = 0$

Relevance: $Cov (B_{lt}, w_{lt} ∣ D_{lt}) \neq 0$

τ_{Bartik} = \frac{\sum _{l = 1}^{L} \sum _{t = 1}^{T} \sum _{k = 1}^{K} z _{l k t} g _{k t} y _{lt}^{⊥}}{\sum _{l = 1}^{L} \sum _{t = 1}^{T} \sum _{k = 1}^{K} z _{l k t} g _{k t} w _{lt}^{⊥}}

Two identification arguments are available:

‘shares’: focus on $z_{l k 0}$ . Analogy to DiD: $Δ_{g t} =$ changes in industry composition $g_{k t}$ .

‘shifts’: focus on $g_{k t}$ . Requires an argument for why shocks are randomly assigned.

$τ_{bartik} = \sum_{k} α_{k} τ_{k}$

with Rotemberg weight

$α_{k} = \frac{g _{k} Z _{k}^{⊤} W}{\sum _{k = 1}^{K} g _{k} Z _{k}^{⊤} W}$
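To make the shift-share construction concrete, here is a small sketch that builds the Bartik instrument from shares and shifts and computes the Rotemberg weights for a single cross-section. It assumes the treatment vector has already been residualised on controls; the function names and toy numbers are hypothetical.

```python
def bartik_instrument(shares, shifts):
    """B_l = sum_k z_lk * g_k for each location l."""
    return [sum(z * g for z, g in zip(row, shifts)) for row in shares]


def rotemberg_weights(shares, shifts, w_resid):
    """alpha_k = g_k * Z_k'W / sum_k g_k * Z_k'W.

    shares: L x K nested list, shifts: length K, w_resid: length L
    (treatment with controls partialled out, an assumption of this sketch).
    """
    contributions = []
    for k, g_k in enumerate(shifts):
        zk_w = sum(row[k] * w for row, w in zip(shares, w_resid))
        contributions.append(g_k * zk_w)
    total = sum(contributions)
    return [c / total for c in contributions]


# Hypothetical toy data: 3 locations, 2 industries.
shares = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
shifts = [0.1, -0.2]
w_resid = [1.0, -0.5, 2.0]

instrument = bartik_instrument(shares, shifts)
alphas = rotemberg_weights(shares, shifts, w_resid)
```

The weights sum to one by construction but need not be positive; in this toy example the second industry receives a negative weight, which is the case that complicates the interpretation of $τ_{bartik}$ as a weighted average of the $τ_{k}$ .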

Marginal Treatment Effects: Treatment effects under self selection

The marginal treatment effect (MTE) setup generalises the IV approach to continuous instruments and nests many estimands (it is a generalisation of the Roy (1951) model). It also treats self-selection more transparently.

Exposition based on . Define potential outcomes

Y_{0 i} = μ_{0} (x_{i}) + υ_{0 i}

Y_{1 i} = μ_{1} (x_{i}) + υ_{1 i}

where $μ_{j} (\cdot)$ is the conditional mean function and $υ_{ji}$ captures deviations, with $E [υ_{ji} ∣ x_{i}] = 0$ .

Treatment assignment assumes a weakly separable choice model

D_{i}^{*} = μ_{d} (x_{i}, z_{i}) - v_{i}

D_{i} = 1_{D_{i}^{*} \geq 0}

where $D_{i}^{*}$ is the latent propensity to take the treatment, interpreted as the net gain from treatment since treatment is only taken up if $D_{i}^{*} \geq 0$ . $z_{i}$ is an instrument. $v_{i}$ enters the selection equation negatively, and thus represents latent resistance to treatment.

The condition $D_{i}^{*} \geq 0$ can be rewritten as $μ_{d} (x_{i}, z_{i}) \geq v_{i}$ . Applying the CDF of $v$ , $F_{v}$ , to both sides yields

F_{v} (μ_{d} (x_{i}, z_{i})) \geq F_{v} (v_{i})

Define $P (x_{i}, z_{i}) = : F_{v} (μ_{d} (x_{i}, z_{i}))$ and $υ_{d i} = : F_{v} (v_{i})$ .

Both RHS and LHS are distributed on $[0, 1]$ . The treatment decision can now be written as

$D_{i} = 1_{P (x_{i}, z_{i}) \geq υ_{d i}}$ .

Now, we define treatment effects

Y_{i} = (1 - D_{i}) Y_{0 i} + D_{i} Y_{1 i} = Y_{0 i} + D_{i} (Y_{1 i} - Y_{0 i}) = μ_{0} (x_{i}) + D_{i} [μ_{1} (x_{i}) - μ_{0} (x_{i}) + υ_{1 i} - υ_{0 i}] + υ_{0 i}

Aggregating the unit-level effect $Δ_{i} : = Y_{1 i} - Y_{0 i}$ over different parts of the covariate distribution yields different estimands.

ATE (x) : = E [Δ_{i} ∣ x_{i} = x] = μ_{1} (x) - μ_{0} (x)

ATT (x) : = E [Δ_{i} ∣ x_{i} = x, D_{i} = 1] = μ_{1} (x) - μ_{0} (x) + E [υ_{1 i} - υ_{0 i} ∣ x_{i} = x, D_{i} = 1]

ATU (x) : = E [Δ_{i} ∣ x_{i} = x, D_{i} = 0] = μ_{1} (x) - μ_{0} (x) + E [υ_{1 i} - υ_{0 i} ∣ x_{i} = x, D_{i} = 0]

Integrating these over $x$ yields the conventional estimands. With self-selection based on $D_{i} = 1_{D_{i}^{*} \geq 0}$ , typically ATT > ATE > ATU.

The covariate-specific Wald estimator is

Wald (x) = \frac{E [ Y _{i} ∣ z _{i} = 1 , x _{i} = x ] - E [ Y _{i} ∣ z _{i} = 0 , x _{i} = x ]}{E [ D _{i} ∣ z _{i} = 1 , x _{i} = x ] - E [ D _{i} ∣ z _{i} = 0 , x _{i} = x ]}

Under the standard A1-A4 from AIR96,

LATE (x) : = E [Y_{1 i} - Y_{0 i} ∣ D_{1 i} > D_{0 i}, x_{i} = x] = μ_{1} (x) - μ_{0} (x) + E [υ_{1 i} - υ_{0 i} ∣ D_{1 i} > D_{0 i}, x_{i} = x]

These can be aggregated using the ‘saturate and weight’ theorem (Angrist and Imbens)

$IV = \sum_{x \in X} ω (x) LATE (x)$

with weights

ω (x_{i}) = \frac{p _{x} V [ D _{i} ∣ x _{i} , z _{i} ]}{V [ D _{i} ]}

For a continuous instrument, for a pair of instrument values $z, z^{'}$ , $LATE (z, z^{'}, x) = E [Y_{1 i} - Y_{0 i} ∣ D_{z i} > D_{z^{'} i}, x_{i} = x]$ .

MTE (x_{i} = x, V_{i} = v) : = E [Y_{1 i} - Y_{0 i} ∣ x_{i} = x, V_{i} = v] = \frac{\partial E [ Y _{i} ∣ x _{i} = x , p ( Z , X ) = p ( z , x ) ]}{\partial p ( z , x )}

MTE is defined as a continuum of treatment effects along the distribution of $υ_{D}$ .

Define two marginal treatment response (MTR) functions

m_{0} (u, x) = E [Y_{0} ∣ U = u, X = x]

m_{1} (u, x) = E [Y_{1} ∣ U = u, X = x]

Many useful parameters are identified using the following expression

β^{⋆} \equiv E [\int_{0}^{1} m_{0} (u, X) ω_{0}^{⋆} (u, X, Z) d u] + E [\int_{0}^{1} m_{1} (u, X) ω_{1}^{⋆} (u, X, Z) d u]

with weights specified in the figure below.

MTE weights from

Parametric Model: Assuming joint normality for $U_{0}, U_{1}, V$ ,

E [U_{0 i} ∣ D_{i} = 0, X_{i}, Z_{i}] = E [U_{0 i} ∣ V_{i} \geq (X_{i}, Z_{i}) β_{d}, X_{i}, Z_{i}] = ρ_{0} (\frac{ϕ ( ( X _{i} , Z _{i} ) β _{d} )}{1 - Φ ( ( X _{i} , Z _{i} ) β _{d} )})

E [U_{1 i} ∣ D_{i} = 1, X_{i}, Z_{i}] = E [U_{1 i} ∣ V_{i} < (X_{i}, Z_{i}) β_{d}, X_{i}, Z_{i}] = ρ_{1} (\frac{- ϕ ( ( X _{i} , Z _{i} ) β _{d} )}{Φ ( ( X _{i} , Z _{i} ) β _{d} )})

where $ρ_{0}$ is the correlation $Corr (U_{0 i}, V_{i})$ , and $ρ_{1} = Corr (U_{1 i}, V_{i})$ .

yields MTE estimator

MTE (x, u_{D}) = E [Y_{1 i} - Y_{0 i} ∣ X_{i} = x, U_{D i} = u_{D}] = x (β_{1} - β_{0}) + (ρ_{1} - ρ_{0}) Φ^{- 1} (u_{D})

Demean the covariates, i.e. replace $x_{i}$ with $x_{i} - \overline{x}$ . Write

Y_{i} = x_{i}^{⊤} α + D_{i} x_{i}^{⊤} θ + D_{i} δ_{i} + ε_{i}

where $δ_{i}$ is a random effect that captures treatment effect heterogeneity. We can rewrite this by demeaning the random effect, $\tilde{δ}_{i} : = δ_{i} - \overline{δ}$ , and absorbing $D_{i} \tilde{δ}_{i}$ into the error term:

Y_{i} = x_{i}^{⊤} α + D_{i} x_{i}^{⊤} θ + D_{i} \overline{δ} + υ_{0 i}

where $\overline{δ}$ captures the ATE at means of $X$ , which is the unconditional ATE under the linear specification.

Write the selection equation

D_{i} = x_{i}^{⊤} π_{1} + z_{i} π_{2} + ν_{i}

with $E [ν_{i} ∣ x_{i}, z_{i}] = 0$ .

Assumptions

$E [ε_{i} ∣ ν_{i}] = η ν_{i}$ : Conventional selection bias.

$E [δ_{i} ∣ ν_{i}] = ψ ν_{i}$ : unobservable part of treatment effect $δ_{i}$ depends linearly on the unobservables that affect treatment selection.

Including $ν_{i}$ and $ν_{i} D_{i}$ in the control-function outcome equation yields a consistent estimate of the ATE: $\overline{δ}$ .

High Dimensional IV selection

setup:

y_{i} = τ d_{i} + x_{i}^{'} β + ε_{i}

d_{i} = x_{i}^{'} γ_{0} + z_{i}^{'} δ_{0} + υ_{i}

where

$x_{i}$ is a vector of $p_{n}^{x}$ exogenous controls, including a constant.

$z_{i}$ is a vector of $p_{n}^{z}$ instruments

$d_{i}$ is an endogenous variable

$p_{n}^{x} \gg n$ and $p_{n}^{z} \gg n$

Run (post)LASSO of $d_{i}$ on $x_{i}, z_{i}$ to obtain $\hat{γ}, \hat{δ}$ .

Run (post)LASSO of $y_{i}$ on $x_{i}$ to get $\hat{θ}$ .

Run (post)LASSO of the fitted values $\hat{d}_{i} = x_{i}^{'} \hat{γ} + z_{i}^{'} \hat{δ}$ on $x_{i}$ to get $\hat{ϑ}$ .

Construct $\hat{ρ}_{i}^{y} : = y_{i} - x_{i}^{'} \hat{θ}$ , $\hat{ρ}_{i}^{d} : = d_{i} - x_{i}^{'} \hat{ϑ}$ and $\hat{v}_{i} : = x_{i}^{'} \hat{γ} + z_{i}^{'} \hat{δ} - x_{i}^{'} \hat{ϑ}$ .

Estimate $τ$ by a standard IV regression of $\hat{ρ}_{i}^{y}$ on $\hat{ρ}_{i}^{d}$ with $\hat{v}_{i}$ as instrument. Perform inference using score statistics or conventional heteroskedasticity-robust SEs.

implemented in hdm::rlassoIV(., select.X = T, select.Z = T).

Discussion in .

Principal Stratification

Treatment comparisons often need to be adjusted for post-treatment variables.

Binary treatment $Z_{i} \in {0, 1}$ , post-treatment intermediate variable $S_{i} (z_{i}) \in {0, 1}$ , outcome $Y_{i} \in {0, 1}$ . For each individual, the treatment assumes a single value, so only one of the two potential intermediate values is observed. Based on the joint potential values of the intermediate variable $(S_{i} (0), S_{i} (1))$ , we have 4 strata

00 = {i : S_{i} (0) = 0, S_{i} (1) = 0}

Never takers.

10 = {i : S_{i} (0) = 1, S_{i} (1) = 0}

Defiers.

01 = {i : S_{i} (0) = 0, S_{i} (1) = 1}

Compliers.

11 = {i : S_{i} (0) = 1, S_{i} (1) = 1}

Always takers.

The basic principal stratification $P_{0}$ w.r.t. the post-treatment variable $S$ is the partition of units $i = 1, \dots, n$ such that, within any set of $P_{0}$ , all units have the same vector $(S_{i} (0), S_{i} (1))$ . The principal stratum $G_{i} \in {00, 10, 01, 11}$ to which unit $i$ belongs is not affected by treatment assignment for any principal stratification, so it can be considered pre-treatment.

Treatment Ignorability implies $(Y_{i} (0), Y_{i} (1)) ⊥ ⊥ Z_{i} ∣ S_{i} (0), S_{i} (1), X$ (i.e. treatment and control units can be compared conditional on stratum)

Principal Causal Effect (PCE) $τ_{s_{0}, s_{1}} : = E [Y_{i} (1) - Y_{i} (0) ∣ S_{i} (0) = s_{0}, S_{i} (1) = s_{1}]$

A common example is the

Complier Average Causal Effect (CACE) = Causal Effect on Principal Stratum of Compliers (AIR96)

$CACE = E [Y_{i} (1) - Y_{i} (0) ∣ S_{i} (0) = 0, S_{i} (1) = 1]$

Recall that $G_{i} = (S_{0}, S_{1})$ concatenated. So, AIR96 in PS terms:

Monotonicity: $S_{1} \geq S_{0}$ , so the stratum ${G_{i} = 10}$ must be empty (no defiers).

Exclusion: $τ_{11} = τ_{00} = 0$ (no effect of assignment in strata where $S$ does not respond).

Estimation under principal ignorability

Treatment ignorability: $Z ⊥ ⊥ (S_{0}, S_{1}, Y_{0}, Y_{1}) ∣ X$

Monotonicity: $S_{1} \geq S_{0}$ , i.e. $G_{i} = 10$ is not allowed

Principal ignorability:

E [Y_{1} ∣ G = 11, X] = E [Y_{1} ∣ G = 01, X]

E [Y_{0} ∣ G = 00, X] = E [Y_{0} ∣ G = 01, X]

| | S = 0 | S = 1 |
|---|---|---|
| Z = 0 | G = 00 or 01 | G = 11 |
| Z = 1 | G = 00 | G = 11 or 01 |

Disentangle mixture distribution within strata by assuming same conditional expectation across mixture components (complier, never taker, always taker).

Define nuisance functions:

Treatment probability: $π (X) = Pr (Z = 1 ∣ X)$

Principal score: $e_{g} (X) = Pr (G = g ∣ X)$ , identified by

e_{01} (X) = p_{1} (X) - p_{0} (X)

e_{00} (X) = 1 - p_{1} (X)

e_{11} (X) = p_{0} (X)

where $p_{z} (X) = Pr (S = 1 ∣ Z = z, X)$ .

Outcome mean: $μ_{zs} (X) = E [Y ∣ Z = z, S = s, X]$ .

Treatment Probability and Principal Score

τ_{01} = E {\frac{e _{01} ( X )}{p _{1} - p _{0}} \frac{S}{p _{1} ( X )} \frac{Z}{π ( X )} Y} - E {\frac{e _{01} ( X )}{p _{1} - p _{0}} \frac{1 - S}{1 - p _{0} ( X )} \frac{1 - Z}{1 - π ( X )} Y}

τ_{00} = E {\frac{1 - S}{1 - p _{1}} \frac{Z}{π ( X )} Y} - E {\frac{e _{00} ( X )}{1 - p _{1}} \frac{1 - S}{1 - p _{0} ( X )} \frac{1 - Z}{1 - π ( X )} Y}

τ_{11} = E {\frac{e _{11} ( X )}{p _{0}} \frac{S}{p _{1} ( X )} \frac{Z}{π ( X )} Y} - E {\frac{S}{p _{0}} \frac{1 - Z}{1 - π ( X )} Y}

Treatment Probability and Outcome Mean

τ_{01} = E [\frac{SZ / π ( X ) - S ( 1 - Z ) / { 1 - π ( X )}}{p _{1} - p _{0}} {μ_{11} (X) - μ_{00} (X)}]

τ_{00} = E [\frac{( 1 - S ) Z / π ( X )}{1 - p _{1}} {μ_{10} (X) - μ_{00} (X)}]

τ_{11} = E [\frac{S ( 1 - Z ) / { 1 - π ( X )}}{p _{0}} {μ_{11} (X) - μ_{01} (X)}]

Principal Score and Outcome Mean

τ_{01} = E [\frac{p _{1} ( X ) - p _{0} ( X )}{p _{1} - p _{0}} {μ_{11} (X) - μ_{00} (X)}]

τ_{00} = E [\frac{1 - p _{1} ( X )}{1 - p _{1}} {μ_{10} (X) - μ_{00} (X)}]

τ_{11} = E [\frac{p _{0} ( X )}{p _{0}} {μ_{11} (X) - μ_{01} (X)}]

Direct and Indirect Effects via Principal Stratification

Direct effect of $Z$ conditional on $S$ exists if there is a causal effect of $Z$ on $Y$ for observations for whom the treatment does not affect selection $S$ , i.e. principal strata $00, 11$ . This is a zero-first-stage sample in IV-terms.

The Indirect Effect is mediated through $S$ .

Attrition as Selection Bias

Let $S$ denote a binary selection indicator for when $Y$ is observed. Let $S (1), S (0)$ denote potential selection states under treatment and nontreatment.

$S (1) = 0, S (0) = 0$ : never-selected

$S (1) = 1, S (0) = 1$ : always selected

$S (0) = 0, S (1) = 1$ : selection compliers

$S (0) = 1, S (1) = 0$ : selection defiers (ruled out by the monotonicity assumption behind Lee bounds)

Dominance assumption: $E [Y (1) ∣ S (1) = 1, S (0) = 1] \geq E [Y (1) ∣ S (1) = 1, S (0) = 0]$ and $E [Y (0) ∣ S (1) = 1, S (0) = 1] \geq E [Y (0) ∣ S (1) = 1, S (0) = 0]$ . The average potential outcome of the always selected dominates that of compliers under either treatment state.

Then, Zhang and Rubin (2003) bounds are

Δ^{U B} = E [Y ∣ D = 1, S = 1, Y \geq y^{*}] - E [Y ∣ D = 0, S = 1]

Δ^{L B} = E [Y ∣ D = 1, S = 1] - E [Y ∣ D = 0, S = 1]

where $y^{*}$ is the quantile of $Y$ among units with $D = 1, S = 1$ below which the share of outcomes equals the share of selection compliers in that group.

Assuming

randomisation: ${Y (1), Y (0), S (0), S (1), X} ⊥ ⊥ D$

monotonicity: $S (1) \geq S (0)$ a.s.

Lee (2009) focuses on the ATE among the always observed

$E [Y (1) - Y (0) ∣ S (0) = S (1) = 1]$

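The second quantity, $E [Y (0) ∣ S (1) = 1, S (0) = 1]$ , is point identified: under monotonicity every control unit with $S = 1$ is always-selected, so it equals $E [Y ∣ D = 0, S = 1]$ . In contrast, an observed outcome in the treatment group can belong to either an always-selected unit or a selection complier.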

Always selected share among the treated is

p_{0} = Pr (S (1) = 1, S (0) = 1∣ S (1) = 1) = Pr (S (0) = 1∣ S (1) = 1) = \frac{Pr ( S = 1∣ D = 0 )}{Pr ( S = 1∣ D = 1 )}

In the best case, the always-selected comprise the top $p_{0}$ quantile of the treatment outcomes. Then the largest possible value of $β$ is

β_{U} = E [Y ∣ Y \geq Q_{y ∣ S = 1, D = 1} (1 - p_{0}), D = 1, S = 1] - E [Y ∣ S = 1, D = 0]

The smallest possible one is

β_{L} = E [Y ∣ Y \leq Q_{y ∣ S = 1, D = 1} (p_{0}), D = 1, S = 1] - E [Y ∣ S = 1, D = 0]

this can be implemented conditional on covariates by constructing $p_{0} (x)$ within each $x$ stratum.
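The trimming logic can be sketched in a few lines. This is an illustrative implementation without covariates, assuming selection is weakly higher under treatment; the function name, the rounding rule for the trimmed count and the toy data are assumptions of the sketch.

```python
def lee_bounds(y_treated_selected, n_treated, y_control_selected, n_control):
    """Trimming bounds on the effect for the always-selected.

    y_*_selected: observed outcomes (S = 1) in each arm.
    n_*: total number of units in each arm, including those with S = 0.
    """
    share_sel_treated = len(y_treated_selected) / n_treated
    share_sel_control = len(y_control_selected) / n_control
    p0 = share_sel_control / share_sel_treated  # always-selected share among treated, S = 1
    if p0 > 1:
        raise ValueError("monotonicity S(1) >= S(0) looks violated: p0 > 1")

    ordered = sorted(y_treated_selected)
    k = max(1, round(p0 * len(ordered)))  # number of treated outcomes kept
    control_mean = sum(y_control_selected) / len(y_control_selected)

    lower = sum(ordered[:k]) / k - control_mean   # keep the bottom p0 share
    upper = sum(ordered[-k:]) / k - control_mean  # keep the top p0 share
    return lower, upper


# Hypothetical toy data: 10 units per arm, 8 vs 6 with observed outcomes.
beta_lower, beta_upper = lee_bounds(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 10,
    [2.0, 3.0, 4.0, 5.0, 6.0, 4.0], 10,
)
```

Here $p_{0} = 0.6 / 0.8 = 0.75$ , so six of the eight treated outcomes are kept on each side. Applying the same function within each covariate stratum gives the conditional version mentioned above.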

Regression Discontinuity Design

Setup

Treatment ( $D$ ) changes discontinuously at some particular value $x_{0}$ in $x$ [and nothing else does], so

D_{i} = 1_{x_{i} \geq x_{0}}

Standard identification assumptions are violated by construction: unconfoundedness holds trivially because $D_{i} = 1_{x_{i} \geq c}$ is a deterministic function of $x_{i}$ (with cutoff $c : = x_{0}$ ), but for the same reason overlap never holds. We need to invoke continuity to do causal inference.

Identified at $x = c$ , i.e. $τ_{c} = μ_{(1)} (c) - μ_{(0)} (c)$ via

τ_{c} : = E [Y_{1} - Y_{0} ∣ X = c] = \lim_{x ↓ c} E [Y ∣ X = x] - \lim_{x ↑ c} E [Y ∣ X = x]

\lim_{x ↓ c} E [y ∣ X = x] - \lim_{x ↑ c} E [y ∣ X = x] = τ_{SRDD} + (\lim_{x ↓ c} E [u ∣ X = x] - \lim_{x ↑ c} E [u ∣ X = x]) \approx τ_{SRDD}

This requires two continuity conditions:

Conditional mean function $E [u ∣ X]$ is continuous at $c$
Mean treatment effect function $E [τ_{i} ∣ X]$ is right continuous at $c$

Estimators

Normalise running variable $c : = x_{0}$ . Then, the linear regression implementation is the following:

Y = α_{l} + τ D + β_{l} f (X - c) + (β_{r} - β_{l}) \times D \times g (X - c) + ϵ

where $f$ and $g$ are local or global polynomials. Since identification comes from the limit at a single point (the cutoff), the choice of polynomial / functional form matters a lot.

Calonico, Cattaneo, Titiunik (2014) recommend local-linear regressions. Older literature relies on global higher-order polynomials, which often yield erratic estimates.

\hat{τ}_{c} = \arg\min_{a, τ, β_{(0)}, β_{(1)}} \sum_{i = 1}^{n} K (\frac{∣ X_{i} - c ∣}{h_{n}}) (Y_{i} - a - τ D_{i} - β_{(0)} (X_{i} - c)_{-} - β_{(1)} (X_{i} - c)_{+})^{2}

where $K (\cdot)$ is a kernel function. Common choices are the window function $K (x) = 1_{∣ x ∣ \leq 1}$ or the triangular kernel $K (x) = (1 - ∣ x ∣)_{+}$ .
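A minimal sketch of the local linear RD estimator with a triangular kernel follows. Because the joint regression has its own intercept shift and slope on each side of the cutoff, fitting a separate kernel-weighted line on each side and differencing the intercepts gives the same $τ$ . The bandwidth is taken as given (no data-driven selection), and all names and data are hypothetical.

```python
def triangular_kernel(u):
    """K(u) = (1 - |u|)_+ ."""
    return max(0.0, 1.0 - abs(u))


def weighted_intercept(xs, ys, ws):
    """Intercept of a weighted least squares line of y on x."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    return ybar - (sxy / sxx) * xbar


def rd_local_linear(x, y, cutoff, bandwidth):
    """Jump in the fitted conditional mean at the cutoff."""
    left, right = ([], [], []), ([], [], [])
    for xi, yi in zip(x, y):
        wi = triangular_kernel((xi - cutoff) / bandwidth)
        if wi <= 0.0:
            continue  # outside the bandwidth
        side = right if xi >= cutoff else left
        side[0].append(xi - cutoff)  # centre the running variable at the cutoff
        side[1].append(yi)
        side[2].append(wi)
    return weighted_intercept(*right) - weighted_intercept(*left)


# Hypothetical noiseless data with a jump of 2 at x = 0.
x = [i / 10 for i in range(-10, 11)]
y = [1.0 + 0.5 * xi + (2.0 if xi >= 0 else 0.0) for xi in x]
tau_hat = rd_local_linear(x, y, cutoff=0.0, bandwidth=0.8)
```

On exactly linear data the estimator recovers the jump up to floating point error; with curvature, the bias depends on the bandwidth as discussed in the assumptions for the local linear estimator.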

Assumptions for Local Linear Estimator

Loosely, we need CEFs $μ_{(w)}$ to be smooth. More precisely, we need $μ_{(w)} (x)$ to be twice-differentiable with uniformly bounded second derivative.

∣ \frac{d^{2}}{d x^{2}} μ_{(w)} (x) ∣ \leq B, \forall x \in R, w \in {0, 1}

Taking a Taylor expansion around $c$ , we can write the CEFs as

μ_{(w)} (x) = a_{(w)} + β_{(w)} (x - c) + \frac{1}{2} ρ_{(w)} (x - c)^{2}, ρ_{(w)} (x) \leq B

with $τ_{c} = a_{(1)} - a_{(0)}$ . The local linear regression with a window kernel can be solved in closed form

\hat{a}_{(1)} = \sum_{c \leq X_{i} \leq c + h_{n}} γ_{i} Y_{i}

γ_{i} = \frac{\hat{E}_{(1)} [( X_{i} - c )^{2}] - \hat{E}_{(1)} [ X_{i} - c ] \cdot ( X_{i} - c )}{\hat{E}_{(1)} [( X_{i} - c )^{2}] - \hat{E}_{(1)} [ X_{i} - c ]^{2}}

where $\hat{E}_{(1)} [\cdot]$ denotes sample averages over the regression window. Then, the estimation error can be written as

\hat{a}_{(1)} = a_{(1)} + \sum_{c \leq X_{i} \leq c + h_{n}} γ_{i} ρ_{(1)} (X_{i} - c) + \sum_{c \leq X_{i} \leq c + h_{n}} γ_{i} (Y_{i} - μ_{(1)} (X_{i}))

Curvature bias is bounded by $B h_{n}^{2}$ .

\hat{τ}_{c} = τ_{c} + O_{P} (n^{- 2/5}), with h_{n} \sim n^{- 1/5}

This rate is a consequence of working with the 2nd derivative. In general, if we assume $μ_{(w)} (\cdot)$ has a bounded $k -$ th derivative, we can achieve an $n^{- k / (2 k + 1)}$ rate using local polynomial regression of order $k - 1$ with a bandwidth scaling as $h_{n} \sim n^{- 1/ (2 k + 1)}$ .

The local linear regression estimator for $τ_{c}$ (writing $Z_{i}$ for the running variable and $W_{i}$ for treatment)

\hat{τ}_{c} = \arg\min_{a, τ, β_{(0)}, β_{(1)}} \sum_{i = 1}^{n} K (\frac{∣ Z_{i} - c ∣}{h_{n}}) (Y_{i} - a - τ W_{i} - β_{(0)} (Z_{i} - c)_{-} - β_{(1)} (Z_{i} - c)_{+})^{2}

can be written as a linear estimator $\hat{τ}_{c} = \sum_{i = 1}^{n} γ_{i} Y_{i}$ where the weights $γ_{i}$ depend only on the running variable $Z$ . It has been shown that local linear regression is not the best estimator in this class.

Under the assumption that $∣ μ_{(w)}^{''} (z) ∣ \leq B$ , the minimax linear estimator minimises the bound on the conditional MSE, $MSE (\hat{τ}_{c} (γ) ∣ {Z_{1}, \dots, Z_{n}}) \leq σ^{2} ∥ γ ∥_{2}^{2} + I_{B}^{2} (γ)$ , where $I_{B} (γ)$ denotes the worst-case bias, and is given by

\hat{τ}_{c} (γ^{B}) = \sum_{i = 1}^{n} γ_{i}^{B} Y_{i}, γ^{B} = \arg\min_{γ} {σ^{2} ∥ γ ∥_{2}^{2} + I_{B}^{2} (γ)}

These weights can be solved for using quadratic programming.

Fuzzy RD

Discontinuity doesn’t deterministically change treatment, but affects probability of treatment. Analogue of IV with one-sided non-compliance.

P [D_{i} = 1∣ x_{i}] = g_{0} (x_{i}) 1_{x_{i} < x_{0}} + g_{1} (x_{i}) 1_{x_{i} \geq x_{0}}

$g_{0} (x_{i}) \neq g_{1} (x_{i})$ . Assuming $g_{1} (x_{0}) > g_{0} (x_{0})$ , the probability of treatment relates to $x_{i}$ via:

$E [D_{i} ∣ x_{i}] = P [D_{i} = 1∣ x_{i}] = g_{0} (x_{i}) + [g_{1} (x_{i}) - g_{0} (x_{i})] T_{i}$ where $T_{i} = 1_{x_{i} \geq x_{0}}$ indicates that unit $i$ is at or above the point of discontinuity $x_{0}$ .

Regression Kink Design

First-derivative version of the fuzzy RD. Continuous treatment, where the treatments are a function of the running variable $X$ with kink at $x_{0}$ . This implies that the first derivative $\frac{\partial D}{\partial X}$ of continuous treatment D is discontinuous at the threshold.

The marginal treatment effect at the threshold is defined as

Δ_{X = x_{0}} (d_{0}) = \frac{\partial E [ Y ( d _{0} ) ∣ X = x _{0} ]}{\partial D} = \frac{lim _{ε \to 0} \frac{\partial E [ Y ∣ X = x _{0} + ε ]}{\partial X} - lim _{ε \to 0} \frac{\partial E [ Y ∣ X = x _{0} - ε ]}{\partial X}}{lim _{ε \to 0} \frac{\partial E [ D ∣ X = x _{0} + ε ]}{\partial X} - lim _{ε \to 0} \frac{\partial E [ D ∣ X = x _{0} - ε ]}{\partial X}}

Differences-in-Differences

DiD with 2 periods

Binary treatment $d \in {0, 1}$ , 2 time periods $t \in {0, 1}$ .

Potential outcomes denoted $Y_{t}^{d}$ .

ATT in the $2$ nd period.

τ_{A TT} : = E [Y_{1}^{1} - Y_{1}^{0} ∣ D = 1]

$E [Y_{1}^{0} ∣ D = 1]$ not observed, so must be imputed.

Naive Estimation Strategies

Before-After Comparison: $τ = E [Y_{1}^{1} ∣ D = 1] - E [Y_{0}^{0} ∣ D = 1]$

Assumes $E [Y_{1}^{0} ∣ D = 1] = E [Y_{0}^{0} ∣ D = 1]$ (No trending)

Post Treatment-Control Comparison: $τ = E [Y_{1}^{1} ∣ D = 1] - E [Y_{1}^{0} ∣ D = 0]$

Assumes $E [Y_{1}^{0} ∣ D = 1] = E [Y_{1}^{0} ∣ D = 0]$ (Random Assignment in the 2nd period)

Both are typically untenable in practice, so we instead assume parallel trends:

E [Y_{1}^{0} - Y_{0}^{0} ∣ D = 1] = E [Y_{1}^{0} - Y_{0}^{0} ∣ D = 0]

Parallel trends is often justified using a figure [with transformed $y$ if necessary], or by controlling for time trends [which relies on a strong functional form assumption], or by a clear falsification test [on a placebo group].

Under parallel trends, impute $E [Y_{1}^{0} ∣ D = 1]$ with

E [Y_{1}^{0} ∣ D = 1] = E [Y_{0}^{0} ∣ D = 1] + (E [Y_{1}^{0} ∣ D = 0] - E [Y_{0}^{0} ∣ D = 0])

so that the ATT equals the difference-in-differences, which is estimated by its sample analogue:

Δ_{D = 1} : = (E [Y_{1}^{1} ∣ D = 1] - E [Y_{0}^{0} ∣ D = 1]) - (E [Y_{1}^{0} ∣ D = 0] - E [Y_{0}^{0} ∣ D = 0])

If $E [Y_{0}^{0} ∣ D = 1] = E [Y_{0}^{0} ∣ D = 0]$ , this collapses to a selection-on-observables assumption in period 2.

E [Y_{1}^{0} ∣ D = 1] = E [Y_{1}^{0} ∣ D = 0]

For a two-period difference, we can also write the standard OLS exogeneity condition in differences form

E [Δ x^{'} Δ ϵ] = 0

E [x_{2}^{'} ϵ_{2}] + E [x_{1}^{'} ϵ_{1}] - (E [x_{1}^{'} ϵ_{2}] + E [x_{2}^{'} ϵ_{1}]) = 0

This makes a direct link with the strict exogeneity assumption in panel data models, which asserts that $ϵ_{t} ⊥ ⊥ x_{1}, \dots, x_{T}$ .

Regression Estimator

We typically prefer the following regression estimator (for automatic standard errors etc).

Y_{i t} = α + γ Treat_{i} + λ Post_{t} + τ (Treat_{i} \times Post_{t}) + ε_{i t}
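In the saturated two-group, two-period case, the OLS coefficient $τ$ on the interaction equals the difference-in-differences of the four cell means. The sketch below computes it that way; the row layout `(treat, post, y)` and the toy data are hypothetical.

```python
def cell_mean(rows, treat, post):
    """Mean outcome in the (treat, post) cell of rows of (treat, post, y)."""
    values = [y for d, t, y in rows if d == treat and t == post]
    return sum(values) / len(values)


def did_2x2(rows):
    """(Treated post - treated pre) - (control post - control pre)."""
    treated_change = cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0)
    control_change = cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)
    return treated_change - control_change


# Hypothetical toy data: (treat, post, y).
toy_panel = [
    (1, 0, 1.5), (1, 0, 2.5),  # treated, pre:  mean 2
    (1, 1, 4.5), (1, 1, 5.5),  # treated, post: mean 5
    (0, 0, 0.5), (0, 0, 1.5),  # control, pre:  mean 1
    (0, 1, 2.0), (0, 1, 2.0),  # control, post: mean 2
]
tau_did = did_2x2(toy_panel)
```

In regression terms these cell means correspond to $α = 1$ , $γ = 1$ , $λ = 1$ and $τ = 2$ ; the regression form is preferred in practice mainly because it delivers standard errors and accommodates covariates.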

Triple Differences (DDD) Estimator

Regular Diff-in-Diff estimate - Diff-in-diff estimate for placebo group.

Nonparametric Identification Assumptions with Covariates

Estimand:

τ_{A TT} : = E [Y_{t}^{1} - Y_{t}^{0} ∣ D = 1] = E [θ_{t} (x) ∣ D = 1] = E_{X ∣ D = 1} [θ_{t} (x)]

Identification Assumptions:

SUTVA $Y_{t} = D Y_{t}^{1} + (1 - D) Y_{t}^{0}, t \in {0, 1}$

Covariate exogeneity

$X^{1} = X^{0} = X, \forall x \in X$

No effect before treatment

$θ_{0} (x) = 0; \forall x \in X$

Common Trend

(parallel trends within $x$ strata)

E [Y_{1}^{0} ∣ X = x, D = 1] - E [Y_{0}^{0} ∣ X = x, D = 1] = E [Y_{1}^{0} ∣ X = x, D = 0] - E [Y_{0}^{0} ∣ X = x, D = 0] = E [Y_{1}^{0} ∣ X = x] - E [Y_{0}^{0} ∣ X = x]

Common support

Pr (T = 1, D = 1∣ X = x, (T, D) \in {(t, d), (1, 1)}) < 1, \forall (t, d) \in {(0, 1), (0, 0), (1, 0)}, x \in X

This allows us to estimate the conditional ATT as the standard DiD within each $X$ stratum.

E [Y_{1} ∣ D = 1, X] - E [Y_{0} ∣ D = 1, X] - E [Y_{1} ∣ D = 0, X] + E [Y_{0} ∣ D = 0, X]

Averaging these over $d X$ gives us the ATT

τ_{1}^{ATT} = E [{μ_{1} (1, X) - μ_{1} (0, X)} - {μ_{0} (1, X) - μ_{0} (0, X)} ∣ D = 1, T = 1]

where regression functions $μ_{d} (t, x)$ denote conditional expectations for treatment $d$ at time $t$ given covariates $x$ .

Denote potential outcomes under treatment and control for unit $i$ as $Y_{i t}^{1}$ and $Y_{i t}^{0}$ . For some observed covariates $X_{i}$ , we are interested in the CATT

$τ_{0} (X_{i}) : = E [Y_{i 1}^{1} - Y_{i 1}^{0} ∣ X_{i}, D_{i} = 1]$

For identification, we need

Conditional parallel trends: $E [Y_{i 1}^{0} - Y_{i 0}^{0} ∣ D_{i} = 1, X_{i}] = E [Y_{i 1}^{0} - Y_{i 0}^{0} ∣ D_{i} = 0, X_{i}]$

Overlap: $\exists c > 0$ such that $c < p (X_{i}) < 1 - c$ , where $p (X_{i}) : = Pr (D_{i} = 1∣ X_{i})$ is the propensity score

The Abadie estimand can be defined as

E [Y_{i 1}^{1} - Y_{i 1}^{0} ∣ X_{i}, D_{i} = 1] = E [\frac{D_{i} - p (X_{i})}{p (X_{i}) (1 - p (X_{i}))} (Y_{i 1} - Y_{i 0}) ∣ X_{i}]

Defining $Δ Y_{i} : = Y_{i 1} - Y_{i 0}$ , this splits into a treated and a control term

E [Y_{i 1}^{1} - Y_{i 1}^{0} ∣ X_{i}, D_{i} = 1] = E [\frac{D_{i} Δ Y_{i}}{p (X_{i})} ∣ X_{i}] - E [\frac{(1 - D_{i}) Δ Y_{i}}{1 - p (X_{i})} ∣ X_{i}]

This is an IPW Estimator.

Integrating this over $d P (X ∣ D = 1)$ gives us the ATT

E [Y_{1}^{1} - Y_{1}^{0} ∣ D = 1] = E [\frac{Y_{1} - Y_{0}}{Pr ( D = 1 )} \cdot \frac{D - Pr ( D = 1∣ X_{i} )}{1 - Pr ( D = 1∣ X_{i} )}]

The full IPW estimator can be written

Δ_{D = 1, T = 1} = E [Y \cdot {\frac{D T}{Π} - \frac{D ( 1 - T ) ρ _{1, 1} ( X )}{ρ _{1, 0} ( X ) Π} - (\frac{( 1 - D ) T ρ _{1, 1} ( X )}{ρ _{0, 1} ( X ) Π} - \frac{( 1 - D ) ( 1 - T ) ρ _{1, 1} ( X )}{ρ _{0, 0} ( X ) Π})}]

where $Π = Pr (D = 1, T = 1)$ is the unconditional probability of being treated in the post-treatment period, and $ρ_{d, t} (X) = Pr (D = d, T = t ∣ X)$ are conditional probabilities of specific treatment-group and period combinations.

Double-robust version - Zimmert (2020)

Δ_{D = 1, T = 1} = E [{\frac{D T}{Π} - \frac{D ( 1 - T ) ρ _{1, 1} ( X )}{ρ _{1, 0} ( X ) Π} - (\frac{( 1 - D ) T ρ _{1, 1} ( X )}{ρ _{0, 1} ( X ) Π} - \frac{( 1 - D ) ( 1 - T ) ρ _{1, 1} ( X )}{ρ _{0, 0} ( X ) Π})} \times (Y - μ_{d} (T, X)) + \frac{D T}{Π} ((μ_{1} (1, X) - μ_{1} (0, X)) - (μ_{0} (1, X) - μ_{0} (0, X)))]

Panel Data

Setup: We observe a sample of $i = 1, \dots, N$ cross-sectional units for $t = 1, \dots, T$ time periods $⟹ Data : {(y_{i t}, x_{i t}^{'}) : t = 1, \dots, T}_{i = 1}^{N}$

One-way fixed effects and Random effects both use the form

y_{i t} = x_{i t}^{'} β + \underbrace{θ_{i} + ϵ_{i t}}_{e_{i t}}

although they make different assumptions about the error.

Error assumptions for panel regressions

(1) FE: $E [ϵ_{i t} ∣ x_{i}, θ_{i}] = 0$ , while $θ_{i}$ is allowed to be correlated with $x_{i}$ .

(2) RE: (1) and $E [e_{i t} ∣ x_{i}] = 0$ [absorb the unobserved unit effect into the error term and impose orthogonality on it] $⟹ θ_{i} ⊥ ⊥ x_{i}$ . Equivalent to Pooled OLS with FGLS.

Fixed Effects Regression

Identification Assumption

Strict Exogeneity: errors are uncorrelated with lags and leads of $x$ , $E [ϵ_{i t} ∣ x_{i}] = E [ϵ_{i t} ∣ x_{i 1}, \dots x_{i T}] = 0 ⟺ E [x_{i s}^{'} ϵ_{i t}] = 0 \forall s, t = 1, \dots T$

- Equivalent statement for $y_{i t}$ is $E [y_{i t} ∣ x_{i 1}, \dots, x_{i T}] = E [y_{i t} ∣ x_{i t}] = x_{i t}^{'} β$
- Rules out feedback loops, i.e. $x_{i t}$ correlated with $ϵ_{i, t - 1}$ because $X$ s are set in response to prior errors, e.g. policing and crime.

Regressors vary over time for at least some $i$ .

Setup

Define an individual fixed effect for individual $i$

A_{i} = 1_{the observation involves unit i}

and define the same for each time period for panel data.

If $D_{i t}$ is as good as randomly assigned conditional on $A_{i}$ :

$E [Y_{0 i t} ∣ A_{i}, X_{i t}, t, D_{i t}] = E [Y_{0 i t} ∣ A_{i}, X_{i t}, t]$

Then, assuming $A_{i}$ enter linearly,

E [Y_{0 i t} ∣ A_{i}, X_{i t}, t, D_{i t}] = α + λ_{t} + A_{i}^{'} γ + X_{i t}^{'} β

Assuming the causal effect of the treatment is additive and constant,

$E [Y_{1 i t} ∣ A_{i}, X_{i t}, t] = E [Y_{0 i t} ∣ A_{i}, X_{i t}, t] + ρ$

where $ρ$ is the causal effect of interest.

Then, we can write:

Y_{i t} = α_{i} + λ_{t} + ρ D_{i t} + X_{i t}^{'} β + ϵ_{i t}

ϵ_{i t} = Y_{0 i t} - E [Y_{0 i t} ∣ A_{i}, X_{i t}, t]

α_{i} = α + A_{i} γ

Restrictions

Linear
Additive functional form
Variation in $D_{i t}$ , over time, for $i$ , must be as good as random

Estimate the specification

$\ddot{y}_{i} = \ddot{x}_{i}^{'} β + \ddot{ϵ}_{i}$

where $\ddot{k}_{i} = M_{i} k_{i}$ are the individual-demeaned values obtained by pre-multiplying every component of the general model above by the individual-specific demeaning operator $M_{i} : = I_{i} - 1_{i} (1_{i}^{'} 1_{i})^{- 1} 1_{i}^{'}$ , which removes the fixed effect $θ_{i}$ .

Lagging the general model one period and subtracting gives

$Δ y_{i t} = Δ x_{i t}^{'} β + Δ ϵ_{i t}$

where $Δ y_{i t} = y_{i t} - y_{i, t - 1}$ and so on. This naturally eliminates the time-invariant fixed effect $θ_{i}$ . The pooled OLS estimation of $β$ in the above regression is called the first differences (FD) estimator $β_{F D}$ .

FE estimator is more efficient under the assumption that $ϵ_{i t}$ are serially uncorrelated [ $E [e_{i} e_{i}^{'} ∣ x_{i}, θ_{i}] = σ_{e}^{2} I_{T}$ ]

FD more efficient when $ϵ_{i t}$ follows a random-walk.

For Individual Fixed Effects/Within estimation, using the regression anatomy formula, write:

\hat{ρ}_{FE} = \frac{Cov ( \ddot{Y}_{i t} , \ddot{D}_{i t} )}{Var ( \ddot{D}_{i t} )}

With $T = 2$ , $\overline{Y}_{i} = Y_{i 1} + \frac{Δ Y_{i}}{2}$ and $\overline{D}_{i} = D_{i 1} + \frac{Δ D_{i}}{2}$ , so the demeaned values are $\ddot{Y}_{i t} = \pm \frac{Δ Y_{i}}{2}$ and $\ddot{D}_{i t} = \pm \frac{Δ D_{i}}{2}$ , and

\hat{ρ}_{FE} = \frac{Cov ( \ddot{Y}_{i t} , \ddot{D}_{i t} )}{Var ( \ddot{D}_{i t} )} = \frac{Cov ( Δ Y_{i} , Δ D_{i} )}{Var ( Δ D_{i} )} = \hat{ρ}_{FD}
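The equivalence can be checked numerically. The sketch below compares the within estimator and the first-difference estimator on a two-period panel, in the simplest case without time effects (so both are ratios of raw cross-products); the function names and toy data are hypothetical.

```python
def within_estimator(y1, y2, d1, d2):
    """Slope from regressing unit-demeaned Y on unit-demeaned D (T = 2)."""
    num = den = 0.0
    for ya, yb, da, db in zip(y1, y2, d1, d2):
        ybar, dbar = (ya + yb) / 2, (da + db) / 2
        for y_t, d_t in ((ya, da), (yb, db)):
            num += (y_t - ybar) * (d_t - dbar)
            den += (d_t - dbar) ** 2
    return num / den


def first_difference_estimator(y1, y2, d1, d2):
    """Slope from regressing the change in Y on the change in D."""
    num = sum((yb - ya) * (db - da) for ya, yb, da, db in zip(y1, y2, d1, d2))
    den = sum((db - da) ** 2 for da, db in zip(d1, d2))
    return num / den


# Hypothetical toy panel: three units, two periods.
y_period1, y_period2 = [1.0, 2.0, 3.0], [2.0, 5.0, 3.0]
d_period1, d_period2 = [0, 0, 1], [1, 1, 0]

rho_fe = within_estimator(y_period1, y_period2, d_period1, d_period2)
rho_fd = first_difference_estimator(y_period1, y_period2, d_period1, d_period2)
```

Each demeaned observation is plus or minus half the first difference, so the factors of one half cancel between numerator and denominator. With more than two periods the two estimators generally differ.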

Random Effects

Identification Assumption

Assume $θ_{i} ⊥ ⊥ X_{i} ⟹ E [θ_{i} ∣ x_{i}] = E [θ_{i}] = 0$ (a strong assumption).

In other words, the entire error term $e_{i t} = ν_{i t} + θ_{i}$ is independent of $X$ . Under this assumption pooled OLS is consistent but inefficient; because the assumption is so strong, RE is of limited use in observational settings.

When the errors are correlated over time within a unit (as the shared $θ_{i}$ implies), GLS estimates can be obtained by running OLS on quasi-differenced data. This also allows us to estimate the effects of time-invariant characteristics (assuming the independence condition is met).

$y_{i t} - λ \overline{y}_{i} = (x_{i t} - λ \overline{x}_{i}) β + (1 - λ) θ_{i} + ν_{i t} - λ \overline{ν}_{i}$

where $λ = 1 - [\frac{σ _{ν}^{2}}{σ _{ν}^{2} + T σ _{θ}^{2}}]^{\frac{1}{2}}$
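The quasi-demeaning parameter interpolates between pooled OLS and the within transformation. A small sketch, taking the variance components as known inputs (in practice they are estimated); the function names are hypothetical.

```python
import math


def re_lambda(sigma2_nu, sigma2_theta, n_periods):
    """lambda = 1 - sqrt(sigma2_nu / (sigma2_nu + T * sigma2_theta))."""
    return 1.0 - math.sqrt(sigma2_nu / (sigma2_nu + n_periods * sigma2_theta))


def quasi_demean(series, lam):
    """Subtract lam times the unit mean from one unit's time series."""
    unit_mean = sum(series) / len(series)
    return [value - lam * unit_mean for value in series]


lam_no_unit_effect = re_lambda(1.0, 0.0, 5)      # no unit heterogeneity
lam_balanced = re_lambda(1.0, 1.0, 3)            # sqrt(1/4) = 0.5
lam_large_unit_effect = re_lambda(1e-9, 1.0, 5)  # unit effect dominates
```

With $λ = 0$ the data are left untouched (pooled OLS); with $λ = 1$ the transformation is full demeaning, as in the within estimator.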

Idiosyncratic errors $ν_{i t}$ have constant finite variance: $E [ν_{i t}^{2}] = σ_{ν}^{2}$

Idiosyncratic errors $ν_{i t}$ are serially uncorrelated: $E [ν_{i t} ν_{i s}] = 0 \forall t \neq s$ .

$E [θ_{i}^{2} ∣ x_{i}] = σ_{θ}^{2}$

Under these assumptions, the FGLS matrix $Ω$ takes a special form

$Ω = σ_{ν}^{2} I_{T} + σ_{θ}^{2} j_{T} j_{T}^{'}$

where $j_{T} j_{T}^{'}$ is a $T \times T$ matrix of $1$ s. Estimators for the variance components are in [c 10, pp 260-61]. A robust estimator of $Ω$ is constructed using pooled OLS residuals $\hat{v}_{i}$

$\hat{Ω} = \frac{1}{n} \sum_{i = 1}^{n} \hat{v}_{i} \hat{v}_{i}^{'}$

With this, we can apply the FGLS estimator

β_{RE} = (X^{'} \hat{Ω}^{- 1} X)^{- 1} X^{'} \hat{Ω}^{- 1} y

Hausman Test: Choosing between FE and RE

$β_{FE}$ is assumed to be consistent. Oft-abused test as a result.

H0: $β_{FE} - β_{RE} = 0$
H1: $β_{FE} - β_{RE} \neq 0$

H = (β_{FE} - β_{RE})^{'} [Var (β_{FE}) - Var (β_{RE})]^{- 1} (β_{FE} - β_{RE}) \overset{d}{\to} χ_{k}^{2}

If the error component $θ$ is correlated with $x$ , RE estimates are not consistent. Perform Hausman test for random vs fixed effects (where under the null, $Cov (θ_{i}, x_{i t}) = 0$ )
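A minimal sketch of the test statistic, assuming coefficient vectors and covariance matrices from FE and RE fits are already available (the function name and the toy inputs are hypothetical). In finite samples the variance difference need not be positive definite, in which case the statistic is unreliable.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, V_fe, V_re):
    """H = d' [V_fe - V_re]^{-1} d with d = b_fe - b_re."""
    d = np.asarray(b_fe, float) - np.asarray(b_re, float)
    V_diff = np.asarray(V_fe, float) - np.asarray(V_re, float)
    return float(d @ np.linalg.solve(V_diff, d))

# Toy inputs, purely illustrative.
H = hausman_statistic([1.0, 2.0], [0.5, 1.0],
                      np.diag([0.5, 1.0]), np.diag([0.25, 0.5]))
print(H)  # compare with a chi-squared critical value with k = 2 degrees of freedom
```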

When the idiosyncratic error variance $\hat{σ}_{ν}^{2}$ is large relative to $T \hat{σ}_{θ}^{2}$ , $λ \to 0$ and $\hat{β}_{RE} \approx \hat{β}_{pool}$ . In words, the individual effect is relatively small, so Pooled OLS is suitable.
When the idiosyncratic error variance $\hat{σ}_{ν}^{2}$ is small relative to $T \hat{σ}_{θ}^{2}$ , $λ \to 1$ and $\hat{β}_{RE} \approx \hat{β}_{FE}$ . Individual effects are relatively large, so FE is suitable.
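The two limits can be seen directly from the formula for $λ$. The helper names below are illustrative, and the variance components are taken as given rather than estimated.

```python
import numpy as np

def re_lambda(sigma2_nu, sigma2_theta, T):
    """Quasi-demeaning weight: 1 - sqrt(s2_nu / (s2_nu + T * s2_theta))."""
    return 1.0 - np.sqrt(sigma2_nu / (sigma2_nu + T * sigma2_theta))

def quasi_demean(z, lam):
    """Subtract lam times the unit mean from each row of an (n, T) array."""
    z = np.asarray(z, float)
    return z - lam * z.mean(axis=1, keepdims=True)

lam_pooled = re_lambda(1.0, 0.0, T=5)     # no unit effect: lambda = 0, pooled OLS
lam_within = re_lambda(1e-12, 1.0, T=5)   # dominant unit effect: lambda near 1, FE
print(lam_pooled, lam_within)
```

Running OLS on `quasi_demean(y, lam)` against `quasi_demean(x, lam)` then gives the RE estimate for that value of $λ$.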

Time Trends

Linear Time Trend

y_{i t} = x_{i t} β + c_{i} + t + ε_{i t}, t = 1, 2, \dots, T

Time Fixed Effects (a.k.a. Two-way Fixed Effects)

y_{i t} = x_{i t} β + c_{i} + t_{t} + ε_{i t}, t = 1, 2, \dots, T

Unit Specific Time Trends

y_{i t} = x_{i t} β + c_{i} + g_{i} \cdot t + t_{t} + ε_{i t}, t = 1, 2, \dots, T

Distributed Lag

Define switching indicator $D_{i t}$ as 1 if $i$ switched from control to treatment between $t - 1$ and $t$ .

Y_{i s t} = γ_{s} + λ_{t} + \sum_{τ = 0}^{m} δ_{- τ} D_{s, t - τ} + \sum_{τ = 1}^{q} δ_{+ τ} D_{s, t + τ} + X_{i s t}^{'} β + ϵ_{i s t}

where the sums on the RHS allow for $m$ lags (post-treatment effects) and $q$ leads (pre-treatment effects). The lead coefficients should be close to 0.

Staggered Adoption

Let $T$ denote multiple time periods such that $t \in {0, 1, \dots, T}$ , with nobody treated at $t = 0$ and staggered adoption. Let $G_{t}$ be a dummy that is equal to one if a subject experiences treatment introduction in period $t$ (e.g. $G_{2} = 1$ implies the treatment is introduced in period $2$ in said group).

Under parallel trends for the untreated potential outcomes, $Y_{g, t} (0)$ , the treatment effect $β_{FE}$ in the vanilla two-way fixed effects regression

Y_{g, t} = β_{FE} D_{g, t} + γ_{g} + ψ_{t} + ε_{g t}

can be decomposed as

E [β_{FE}] = E \left[ \sum_{(g, t) : D_{g, t} \neq 0} W_{g, t} Δ_{g, t} \right]

where $Δ_{g, t} = Y_{g, t} (1) - Y_{g, t} (0)$ .

The weights $W_{g, t}$ sum to one and are proportional to and the same sign as

N_{g, t} (D_{g, t} - D_{g, \cdot} - D_{\cdot, t} + D_{\cdot, \cdot})

where $D_{g, \cdot}$ is the average treatment of group $g$ across periods (share of periods treated), $D_{\cdot, t}$ is the average treatment at period $t$ across groups, and $D_{\cdot, \cdot}$ is the grand mean of the treatment indicator. These weights can be negative.
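To see the negative weights concretely, the sketch below computes them for a toy two-group staggered design. It assumes equal cell sizes $N_{g, t}$ and a noise-free outcome that satisfies parallel trends; all names and numbers are illustrative.

```python
import numpy as np

def double_demean(M):
    """M_gt - M_g. - M_.t + M_.. for a (groups, periods) array."""
    M = np.asarray(M, float)
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

def twfe_weights(D):
    """Weights on treated cells, assuming equal cell sizes."""
    D = np.asarray(D, float)
    w = double_demean(D) * D
    return w / w.sum()

# Group 0 adopts in period 1, group 1 adopts in period 3.
D = np.array([[0, 1, 1, 1],
              [0, 0, 0, 1]])
W = twfe_weights(D)

# Heterogeneous effects that grow with exposure; Y(0) is additive in group and period.
Delta = np.array([[0.0, 1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0, 1.0]])
gamma = np.array([[0.0], [2.0]])
psi = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = gamma + psi + Delta * D

# TWFE coefficient via Frisch-Waugh: regress Y on the double-demeaned treatment.
D_res = double_demean(D)
beta_fe = float((D_res * Y).sum() / (D_res * D).sum())
weighted_sum = float((W * Delta).sum())
print(W)
print(beta_fe, weighted_sum)
```

Every cell-level effect here is at least 1, yet the TWFE coefficient is 0.5, because the early adopter's last period receives weight -0.5.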

This means that $β_{FE}$ is biased for the ATT because $W_{g, t}$ is not (only) proportional to $N_{g, t}$ . $β_{FE}$ is only unbiased when

the treatment is binary, AND the treatment is staggered and absorbing (i.e. groups get treated once and stay treated), AND there is no variation in treatment timing.

Under these conditions, the pesky weight is constant across treated units, so the weights are proportional to $N_{g, t}$ .

OR, $β_{FE}$ is also unbiased if $(D_{g, t} - D_{g, \cdot} - D_{\cdot, t} + D_{\cdot, \cdot})$ is uncorrelated with the treatment effects $Δ_{g, t}$ . This is only plausible when treatment has been randomly staggered, otherwise, it is entirely plausible that groups with larger treatment effects selected into treatment early, and so on.

Consider a dataset comprising $K$ timing groups ordered by the time at which they first receive treatment and a maximum of one never-treated group $U$ . The OLS estimate from a two-way fixed effects regression is

\hat{β}_{DD} = \sum_{k \neq U} s_{k U} \hat{β}_{k U}^{DD} + \sum_{k \neq U} \sum_{j > k} (s_{kj} \hat{β}_{kj}^{DD} + s_{jk} \hat{β}_{jk}^{DD})

where weights depend on sample size and variance of treatment within each DD. This maximises the weights of groups treated in the middle of the panel. The Late vs Early comparison is particularly problematic (and is typically incorrect when treatment effects are heterogeneous in time).

Visually, this involves decomposing a staggered setup (first figure) into its constituent two-way comparisons (second figure).

Figure: Some staggered difference-in-differences data.

Figure: Constituent two-way difference-in-differences comparisons.

Estimand: Group-time average treatment effect

ATT (g, t) = E [Y_{t} (g) - Y_{t} (\infty) ∣ G_{g} = 1], \forall t \geq g

where $Y_{t} (g)$ is the potential outcome for group treated at $g$ .

Separate (1) identification, (2) estimation and inference, and (3) aggregation.

A1: No anticipation: $\forall i, t$ with $t < g, g^{'}$ , $Y_{i t} (g) = Y_{i t} (g^{'})$

A2: Parallel trends based on the ‘never treated’ group: $\forall t \in \{2, \dots, T\}$ , $g \in G$ s.t. $t \geq g$ , $\underbrace{E [Y_{t} (0) - Y_{t - 1} (0) ∣ G_{g} = 1]}_{\text{trend in group treated at } g} = \underbrace{E [Y_{t} (0) - Y_{t - 1} (0) ∣ C = 1]}_{\text{trend in never treated}}$

Estimators for Group-time ATEs

ATT_{unc}^{never} (g, t) = E [Y_{t} - Y_{g - 1} ∣ G_{g} = 1] - E [Y_{t} - Y_{g - 1} ∣ C = 1]

ATT_{unc}^{notyet} (g, t) = E [Y_{t} - Y_{g - 1} ∣ G_{g} = 1] - E [Y_{t} - Y_{g - 1} ∣ D_{t} = 0, G_{g} = 0]

Aggregation: event-study type estimand.

θ_{D} (e) = \sum_{g = 2}^{T} 1_{\{g + e \leq T\}} ATT (g, g + e) Pr (G = g ∣ G + e \leq T, C \neq 1)

Implemented in did and DRDID.

The negative weighting problem with 2WFE under staggered adoption can be remedied by the following procedure, known as imputation, which nests several related estimators.

1. Fit a model for $Y_{i t}^{(0)}$ using only untreated observations (including the untreated periods of units that are eventually treated).
2. Impute $Y_{i t}^{(0)}$ for treated units in treated time periods.
3. Compute $τ_{i t} = Y_{i t} - \hat{Y}_{i t}^{(0)}$ for all $i, t$ where $W_{i t} = 1$.
4. Average the $τ_{i t}$ (equal weighting) for the ATT, or average by period for an event study.

This works well when the outcome model for $Y_{i t}^{(0)}$ is good, i.e. when the fixed effects or latent factors are well estimated. It will not work well for short panels.
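The steps above can be sketched with a two-way fixed effects model for the untreated outcome. This is a minimal illustration on a balanced, noise-free panel, not the `did` or any other package implementation; the function name and the simulated design are assumptions.

```python
import numpy as np

def imputation_att(Y, W):
    """Y, W: (n, T) arrays; W[i, t] = 1 if unit i is treated in period t.
    Fits unit + period effects on untreated cells and imputes Y(0) elsewhere."""
    n, T = Y.shape
    unit = np.repeat(np.arange(n), T)
    time = np.tile(np.arange(T), n)
    # Design: unit dummies plus period dummies (first period omitted).
    X = np.zeros((n * T, n + T - 1))
    X[np.arange(n * T), unit] = 1.0
    later = time > 0
    X[np.where(later)[0], n + time[later] - 1] = 1.0
    y, w = Y.ravel(), W.ravel().astype(bool)
    coef, *_ = np.linalg.lstsq(X[~w], y[~w], rcond=None)   # step 1
    tau = np.full(n * T, np.nan)
    tau[w] = y[w] - X[w] @ coef                            # steps 2 and 3
    return float(np.nanmean(tau)), tau.reshape(n, T)       # step 4 (equal weights)

rng = np.random.default_rng(1)
n, T = 6, 5
adopt = np.array([2, 2, 3, 4, T, T])          # adoption period; T means never treated
W = (np.arange(T)[None, :] >= adopt[:, None]).astype(int)
effect = 1.0 + 0.5 * (np.arange(T)[None, :] - adopt[:, None])  # grows with exposure
Y = rng.normal(size=(n, 1)) + rng.normal(size=(1, T)) + effect * W
att, tau = imputation_att(Y, W)
print(att, effect[W == 1].mean())
```

Averaging the columns of `tau` by periods since adoption instead of overall would give the event-study version.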

Changes-in-Changes

Given a continuous outcome $Y$ and a monotonicity in unobserved heterogeneity, CiC allows us to identify both the ATT and Quantile effect on the treated (QTT).

Assume the following about untreated potential outcomes

$Y_{T}^{0} = H (U, T), \quad U ⊥ ⊥ T ∣ D$

where $U$ is a scalar unobservable or an index of unobservables. $H (u, t)$ is a general function assumed to be strictly monotonically increasing in $u$ for periods $t \in \{0, 1\}$ . The conditional independence assumption requires that the distribution of unobserved heterogeneity is constant over time within treatment groups.

Denote by $F_{Y (d) ∣ d t} (y) = Pr (Y (d) \leq y ∣ D = d, T = t)$ the conditional CDF of the potential outcome $Y (d)$ , and by $F_{d t} (y) = Pr (Y \leq y ∣ D = d, T = t)$ the corresponding CDF of the observed outcome. The conditional outcome distributions $F_{01}, F_{00}, F_{10}$ are observed. The inverse of $F_{d t}$ is the conditional quantile function $F_{d t}^{- 1} (τ)$ . The unobserved counterfactual CDF is identified as

$F_{Y (0) ∣11} (y) = F_{10} (F_{00}^{- 1} (F_{01} (y)))$

The QTT at quantile $τ$ is then identified as

Δ_{D = 1} (τ) = F_{11}^{- 1} (τ) - F_{01}^{- 1} (F_{00} (F_{10}^{- 1} (τ)))

and the ATT is identified as

Δ_{D = 1} = E [Y ∣ D = 1, T = 1] - E [F_{01}^{- 1} (F_{00} (Y_{10}))]

Implemented in qte::CiC.
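As a rough illustration of the ATT formula (not the `qte::CiC` implementation), the sketch below plugs empirical CDFs and quantile functions into $F_{01}^{- 1} (F_{00} (Y_{10}))$. The helper names and the toy data, generated from $H (u, 0) = u$ and $H (u, 1) = 2 u + 1$ with a constant effect of 3, are assumptions.

```python
import numpy as np

def ecdf(sample, y):
    """Empirical CDF of `sample` evaluated at the points y."""
    s = np.sort(np.asarray(sample, float))
    return np.searchsorted(s, y, side="right") / s.size

def quantile(sample, q):
    """Generalised inverse of the empirical CDF: inf{y : F(y) >= q}."""
    s = np.sort(np.asarray(sample, float))
    idx = np.ceil(np.asarray(q) * s.size - 1e-9).astype(int) - 1
    return s[np.clip(idx, 0, s.size - 1)]

def cic_att(y00, y01, y10, y11):
    """y_dt: outcomes for group d in period t (d = 1 treated, t = 1 post)."""
    y10_cf = quantile(y01, ecdf(y00, y10))   # F01^{-1}(F00(Y10))
    return float(np.mean(y11) - np.mean(y10_cf))

u_control = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y00 = u_control                    # control group, pre-period
y01 = 2 * u_control + 1            # control group, post-period
y10 = np.array([2.0, 4.0])         # treated group, pre-period
y11 = (2 * y10 + 1) + 3.0          # treated group, post-period, true effect 3
att = cic_att(y00, y01, y10, y11)
print(att)
```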

Synthetic Control

Original setup.

Observe $n_{0} + 1$ units in periods $t = 1, \dots, T$ . Unit 1 is treated starting from period $T_{0} + 1$ , while units $2, \dots, n_{0} + 1$ are never treated and form the donor pool.

Y_{i t}^{obs} = Y_{i t} (D_{i t}) = (1 - D_{i t}) Y_{i t} (0) + D_{i t} Y_{i t} (1)

Since there is only one treated unit, the effect of interest is $τ_{t} : = Y_{1 t} (1) - Y_{1 t} (0), t = T_{0} + 1, \dots, T$

Observed data matrix

Y^{o b s} : = (Y_{i t}^{o b s})_{t = 1, \dots, T, i = 1, \dots, n_{0} + 1}

The fundamental problem of causal inference (FPCI) applies; the potential outcome matrices are:

Y (0) : = (Y_{i t} (0))_{t = 1, \dots, T, i = 1, \dots, n_{0} + 1}

Y (1) : = (Y_{i t} (1))_{t = 1, \dots, T, i = 1, \dots, n_{0} + 1}

Let $X_{treat}$ be a $p$-vector of pre-intervention characteristics of the treated unit, and $X_{c}$ a $p \times n_{0}$ matrix containing the same characteristics for the control units. This typically includes pre-treatment outcomes, in which case $p = T_{0}$ , but other predictors (even time-invariant ones, $Z_{i}$ ) are usually available.

X_{treat} : = (Y_{1, 1}^{obs}, Y_{1, 2}^{obs}, \dots, Y_{1, T_{0}}^{obs}, Z_{1})^{⊤}

For some $p \times p$ PSD matrix $V$ , define $∥ X ∥_{V} = \sqrt{X^{'} V X}$ , where $V$ is typically diagonal. Consider weights $ω = (ω_{2}, \dots, ω_{n_{0} + 1})$ satisfying

ω_{i} \geq 0, i = 2, \dots, n_{0} + 1

\sum_{i \geq 2} ω_{i} = 1

These constraints force interpolation, i.e. the counterfactual cannot take a value greater than the maximal value or smaller than the minimal value observed among the control units. The synthetic control solution $ω^{*}$ solves

ω^{*} = \arg \min_{ω} ∥ X_{treat} - X_{c} ω ∥_{V}^{2} \quad \text{s.t. the non-negativity and sum-to-one constraints}

The Synthetic Control Estimator is then

$τ_{t} : = Y_{1, t}^{o b s} - \sum_{i = 2}^{n_{0} + 1} ω_{i}^{*} Y_{i t}^{o b s}$

In contrast, a simple difference-in-differences estimator gives

τ_{t}^{DID} : = Y_{1, t}^{obs} - \left( Y_{1, T_{0}}^{obs} + \frac{1}{n_{0}} \sum_{i = 2}^{n_{0} + 1} (Y_{i t}^{obs} - Y_{i, T_{0}}^{obs}) \right)

Choose $V = diag (v_{1}, \dots, v_{p})$ using a nested minimisation of the Mean Square Prediction Error (MSPE) over the pre-treatment period

MSPE (V) : = \sum_{t = 1}^{T_{0}} \left( Y_{1, t}^{obs} - \sum_{i = 2}^{n_{0} + 1} ω_{i} (V) Y_{i t}^{obs} \right)^{2}
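For a fixed diagonal $V$, the inner problem is a least-squares problem on the simplex. A minimal sketch using projected gradient descent follows; the solver choice, function names and toy data are assumptions (packaged implementations typically use a quadratic-programming routine), and the outer search over $V$ is omitted.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def sc_weights(x_treat, X_c, v=None, n_iter=5000):
    """Minimise ||x_treat - X_c w||_V^2 over the simplex for diagonal V = diag(v)."""
    p, n0 = X_c.shape
    v = np.ones(p) if v is None else np.asarray(v, float)
    A = X_c * np.sqrt(v)[:, None]
    b = x_treat * np.sqrt(v)
    step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of the gradient
    w = np.full(n0, 1.0 / n0)
    for _ in range(n_iter):
        w = project_simplex(w - step * (A.T @ (A @ w - b)))
    return w

# Toy check: the treated unit is an exact convex combination of the donors.
rng = np.random.default_rng(0)
X_c = rng.normal(size=(8, 3))
w_true = np.array([0.7, 0.3, 0.0])
x_treat = X_c @ w_true
w_hat = sc_weights(x_treat, X_c)
print(np.round(w_hat, 4))
```

The estimated effect in period $t$ is then the treated outcome minus `w_hat @ Y_controls[:, t]`.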

Setup:

Y^{obs} = \begin{pmatrix} Y_{t, post}^{obs} & Y_{c, post}^{obs} \\ Y_{t, pre}^{obs} & Y_{c, pre}^{obs} \end{pmatrix} = \begin{pmatrix} Y_{t, post} (1) & Y_{c, post} (0) \\ Y_{t, pre} (0) & Y_{c, pre} (0) \end{pmatrix}

Y (0) = \begin{pmatrix} ? & Y_{c, post} (0) \\ Y_{t, pre} (0) & Y_{c, pre} (0) \end{pmatrix}

Here rows index periods and columns index units. The relative magnitudes of $T$ and $N$ might dictate whether we impute the missing potential outcome $?$ using a comparison across units or across time.

Many Units and Multiple Periods: $N >> T_{0}$ , $Y (0)$ is ‘fat’, and estimating the dependence structure over time becomes challenging. So matching methods are attractive.

$T_{0} >> N$ , $Y (0)$ is ‘tall’, and matching becomes infeasible. So it might be easier to estimate the dependence structure.

Finally, if $T_{0} \approx N$ , a regularization strategy for limiting the number of control units that enter into the estimation of $Y_{0, T_{0} + 1} (0)$ may be important

Focus on last period for now: $τ_{0, T} = Y_{0, T} (1) - Y_{0, T} (0) = Y_{0, T}^{obs} - Y_{0, T} (0)$

Many estimators impute $Y_{0, T} (0)$ with the linear structure $Y_{0, T} (0) = μ + \sum_{i = 1}^{n} ω_{i} \cdot Y_{i, T}^{obs}$

Methods differ in how $μ$ and $ω$ are chosen as a function of $Y_{c, post}^{obs}, Y_{t, pre}^{obs}, Y_{c, pre}^{obs}$

Impose four constraints

1. No Intercept: $μ = 0$ . Stronger than Parallel trends in DiD.

2. Adding up: $\sum_{i = 1}^{n} ω_{i} = 1$ . Common to DiD, SC.

3. Non-negativity: $ω_{i} \geq 0 \ \forall i$ . Ensures uniqueness via ‘coarse’ regularisation + precision control. Negative weights may improve out-of-sample prediction.

4. Constant Weights: $ω_{i} = \overline{ω} \ \forall i$

DiD imposes 2-4.

ADH(2010, 2014) impose 1-3

1 + 2 imply ‘No Extrapolation’.

Relaxing these assumptions:

Negative weights

If treated units are outliers on important covariates, negative weights might improve fit

Bias reduction - negative weights increase bias-reduction rate

When $N >> T_{0}$ , (1-3) alone might not result in a unique solution. Choose by

Matching on pre-treatment outcomes : one good control unit is better than synthetic one comprised of disparate units

Constant weights - implicit in DiD

Given many pairs of $(μ, ω)$

prefer values s.t. synthetic control unit is similar to treated units in terms of lagged outcomes

low dispersion of weights

few control units with non-zero weights

Optimisation Problem

Ingredients of the objective function

Balance: the difference between pre-treatment outcomes for the treated unit and a linear combination of pre-treatment outcomes for the controls, $∥ Y_{t, pre} - μ - ω^{⊤} Y_{c, pre} ∥_{2}^{2} = (Y_{t, pre} - μ - ω^{⊤} Y_{c, pre})^{⊤} (Y_{t, pre} - μ - ω^{⊤} Y_{c, pre})$

Sparse and small weights: sparsity via $∥ ω ∥_{1}$ , magnitude via $∥ ω ∥_{2}$

(μ^{en} (λ, α), ω^{en} (λ, α)) = \arg \min_{μ, ω} Q (μ, ω ∣ Y_{t, pre}, Y_{c, pre}; λ, α) .

Q (μ, ω ∣ Y_{t, pre}, Y_{c, pre}; λ, α) = ∥ Y_{t, pre} - μ - ω^{⊤} Y_{c, pre} ∥_{2}^{2} + λ \left( \frac{1 - α}{2} ∥ ω ∥_{2}^{2} + α ∥ ω ∥_{1} \right) .

Tailored Regularisation

We don’t want to scale the covariates $Y_{c, pre}$ , in order to preserve the interpretability of the weights. Instead, treat each control unit $j$ as a ‘pseudo-treated’ unit and compute

\hat{Y}_{j, T} (0) = μ^{en} (j; α, λ) + \sum_{i \neq 0, j} ω_{i}^{en} (j; α, λ) \cdot Y_{i, T}^{obs} .

where

(μ^{en} (j; α, λ), ω^{en} (j; α, λ)) = \arg \min_{μ, ω} \sum_{t = 1}^{T_{0}} \left( Y_{j, t} - μ - \sum_{i \neq 0, j} ω_{i} Y_{i, t} \right)^{2} + λ \left( \frac{1 - α}{2} ∥ ω ∥_{2}^{2} + α ∥ ω ∥_{1} \right) .

Pick the value of the tuning parameters $(α_{opt}^{en}, λ_{opt}^{en})$ that minimises

CV^{en} (α, λ) = \frac{1}{N} \sum_{j = 1}^{N} \left( Y_{j, T} - μ^{en} (j; α, λ) - \sum_{i \neq 0, j} ω_{i}^{en} (j; α, λ) \cdot Y_{i, T} \right)^{2} .

Difference in Differences

Assume (2-4). There is no unique $(μ, ω)$ solution for $T = 2$ , so fix $ω = \frac{1}{N}$

ω_{i}^{did} = \frac{1}{N} \ \forall i \in \{1, \dots, N\} .

μ^{did} = \frac{1}{T_{0}} \sum_{s = 1}^{T_{0}} Y_{0, s} - \frac{1}{N T_{0}} \sum_{s = 1}^{T_{0}} \sum_{i = 1}^{N} Y_{i, s} .
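A quick numerical check that these choices of $(μ, ω)$ reproduce the familiar difference-in-differences contrast for the last period. The function name and the random toy data are illustrative.

```python
import numpy as np

def did_counterfactual(y_treated, Y_controls, T0):
    """Impute the treated unit's last-period Y(0) as mu_did + sum_i omega_i * Y_{i,T}.
    y_treated: (T,) array; Y_controls: (N, T) array; the first T0 periods are pre-treatment."""
    N = Y_controls.shape[0]
    omega = np.full(N, 1.0 / N)                              # constant weights
    mu = y_treated[:T0].mean() - Y_controls[:, :T0].mean()   # intercept mu_did
    return float(mu + omega @ Y_controls[:, -1])

rng = np.random.default_rng(0)
Y_controls = rng.normal(size=(4, 6))
y_treated = rng.normal(size=6)
T0 = 5

tau_hat = y_treated[-1] - did_counterfactual(y_treated, Y_controls, T0)
# Classic DiD: (treated post - treated pre) - (control post - control pre).
tau_did = (y_treated[-1] - y_treated[:T0].mean()) - (
    Y_controls[:, -1].mean() - Y_controls[:, :T0].mean())
print(tau_hat, tau_did)
```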

Best Subset; One-to-one Matching

$(μ^{S}, ω^{S}) = \arg \min_{μ, ω} Q (\cdot; λ = 0, α)$ subject to $\sum_{i = 1}^{N} 1_{\{ω_{i} \neq 0\}} \leq k$ ( $k = 1$ for one-to-one matching)

Synthetic Control

Assume (1-3) (i.e. $μ = 0$ ). For an $M \times M$ PSD diagonal matrix $V$

(ω (V), μ (V)) = \arg \min_{ω, μ} (X_{t} - μ - ω^{⊤} X)^{⊤} V (X_{t} - μ - ω^{⊤} X) .

\hat{V} = \arg \min_{V = diag (v_{1}, \dots, v_{M})} (Y_{t, pre} - ω (V)^{⊤} Y_{c, pre})^{⊤} (Y_{t, pre} - ω (V)^{⊤} Y_{c, pre}) .

Constrained regression: when $X_{i} = (Y_{i, t})_{1 \leq t \leq T_{0}}$ (lagged outcomes only), $V = I_{N}$ and $λ = 0$

Consider a balanced panel with $N$ units and $T$ time periods, where the first $N_{co}$ units are never treated, while $N_{tr} = N - N_{co}$ treated units are exposed after time $T_{pre}$ . We seek to solve for sdid weights $ω^{sdid}$ that align pre-exposure trends in outcomes of unexposed units with those for exposed units

\sum_{i = 1}^{N_{co}} ω_{i}^{sdid} Y_{i t} \approx N_{tr}^{- 1} \sum_{i = N_{co} + 1}^{N} Y_{i t}

We also look for time weights $λ_{t}^{sdid}$ that balance pre-exposure time periods with post-exposure time periods for unexposed units.

Weights are solved using the following optimisation problems

(\hat{ω}_{0}, \hat{ω}^{sdid}) = \arg \min_{ω_{0} \in R, ω \in Ω} ℓ_{unit} (ω_{0}, ω) .

ℓ_{unit} (ω_{0}, ω) = \sum_{t = 1}^{T_{pre}} \left( ω_{0} + \sum_{i = 1}^{N_{co}} ω_{i} Y_{i t} - \frac{1}{N_{tr}} \sum_{i = N_{co} + 1}^{N} Y_{i t} \right)^{2} + ζ^{2} T_{pre} ∥ ω ∥_{2}^{2} .

Ω = \left\{ ω \in R_{+}^{N} : \sum_{i = 1}^{N_{co}} ω_{i} = 1, ω_{i} = N_{tr}^{- 1} \text{ for all } i = N_{co} + 1, \dots, N \right\} .

where $R_{+}$ denotes the positive real line. We set the regularization parameter $ζ$ as

ζ = (N_{tr} T_{post})^{1/4} \hat{σ}, \quad \hat{σ}^{2} = \frac{1}{N_{co} (T_{pre} - 1)} \sum_{i = 1}^{N_{co}} \sum_{t = 1}^{T_{pre} - 1} (Δ_{i t} - \overline{Δ})^{2} .

Δ_{i t} = Y_{i (t + 1)} - Y_{i t}, \quad \overline{Δ} = \frac{1}{N_{co} (T_{pre} - 1)} \sum_{i = 1}^{N_{co}} \sum_{t = 1}^{T_{pre} - 1} Δ_{i t} .

We implement this for the time weights $\hat{λ}^{sdid}$ by solving

(\hat{λ}_{0}, \hat{λ}^{sdid}) = \arg \min_{λ_{0} \in R, λ \in Λ} ℓ_{time} (λ_{0}, λ) .

ℓ_{time} (λ_{0}, λ) = \sum_{i = 1}^{N_{co}} \left( λ_{0} + \sum_{t = 1}^{T_{pre}} λ_{t} Y_{i t} - \frac{1}{T_{post}} \sum_{t = T_{pre} + 1}^{T} Y_{i t} \right)^{2} .

Λ = \left\{ λ \in R_{+}^{T} : \sum_{t = 1}^{T_{pre}} λ_{t} = 1, λ_{t} = T_{post}^{- 1} \text{ for all } t = T_{pre} + 1, \dots, T \right\} .

1. Compute the regularisation parameter $ζ$ .
2. Compute the unit weights $ω^{sdid}$ .
3. Compute the time weights $λ^{sdid}$ .
4. Compute the SDID estimator using the following weighted DID regression.

(τ^{sdid}, μ, α, β) = \arg \min_{τ, μ, α, β} \left\{ \sum_{i = 1}^{N} \sum_{t = 1}^{T} (Y_{i t} - μ - α_{i} - β_{t} - D_{i t} τ)^{2} ω_{i}^{sdid} λ_{t}^{sdid} \right\} .

implemented in synthdid::synthdid_estimate

Y_{i t} = δ_{i t} D_{i t} + x_{i t}^{'} β + λ_{i}^{'} f_{t} + ε_{i t} .

Where $D_{i t}$ is the treatment, $δ_{i t}$ is the heterogeneous treatment effect for unit $i$ at time $t$ , and $x_{i t}$ is a $p$-vector of time-varying controls. $f_{t} = [f_{1 t}, \dots, f_{r t}]^{'}$ is an $r \times 1$ vector of unknown common factors, and $λ_{i} = [λ_{i 1}, \dots, λ_{i r}]^{'}$ is an $r \times 1$ vector of unknown factor loadings. This factor component nests standard functional forms

\underbrace{U_{i t}}_{\text{confounders}} = \underbrace{λ_{i}}_{\text{loadings}} \times \underbrace{f_{t}}_{\text{factors}} .

$f_{t} = 1 ⟹ λ_{i} \times 1 = λ_{i}$ unit FEs

$λ_{i} = 1 ⟹ 1 \times f_{t} = f_{t}$ time FEs

$f_{1 t} = 1, f_{2 t} = ξ_{t}, λ_{i 1} = α_{i}, λ_{i 2} = 1 ⟹ f_{t} \times λ_{i} = α_{i} + ξ_{t}$ two-way FEs.

$f_{t} = t ⟹ λ_{i} \times f_{t} = λ_{i} \times t$ Unit-specific linear time trends

$λ_{i} = y_{i 0}, f_{t} = α_{t} ⟹ λ_{i} \times f_{t} = α y_{i, t - 1} - ν_{i t}$ Lagged dependent variable

Steps

Get initial value of $β$ using within estimator

Estimate $λ_{i}, f_{t}$ using $β$

Re-estimate $β$ using $λ_{i}^{'} f_{t}$

Iterate

Drawback - constant effect

With $N_{CO}$ control units and $N_{TR}$ treated units, write the DGP for an individual unit as

Y_{i} = D_{i} \circ δ_{i} + X_{i} β + F λ_{i} + ε_{i}, i \in \{1, 2, \dots, N\} .

Where $Y_{i} = [y_{i 1}, y_{i 2}, \dots, y_{i T}]^{'}$ , $D_{i} = [D_{i 1}, \dots, D_{i T}]^{'}$ , $X_{i} = [x_{i 1}, \dots, x_{i T}]^{'}$ is $T \times p$ , $F = [f_{1}, \dots, f_{T}]^{'}$ is $T \times r$ .

Stacking the control units together gives

\underbrace{Y_{CO}}_{T \times N_{CO}} = \underbrace{X_{CO}}_{T \times N_{CO} \times p} \underbrace{β}_{p \times 1} + \underbrace{F Λ_{CO}^{'}}_{T \times N_{CO}} + ε_{CO} .

GSC for treatment effects is an out-of-sample prediction method: the treatment effect for unit $i$ at time $t$ is the difference between the actual outcome and its estimated counterfactual $δ_{i t} = Y_{i t} (1) - Y_{i t} (0)$ , where $Y_{i t} (0)$ is imputed in three steps.

Estimate an IFE model using only the control group data and estimate $β, F, Λ_{CO}$

(β, F, Λ_{CO}) = \arg \min_{β, F, Λ} \sum_{i \in C} (Y_{i} - X_{i} β - F Λ_{i})^{'} (Y_{i} - X_{i} β - F Λ_{i}) .

\text{s.t. } F^{'} F = I_{r}, \quad Λ_{CO}^{'} Λ_{CO} = \text{diagonal} .

Estimate the factor loadings for each treated unit by minimising the mean-squared error of the predicted treated outcome in pretreatment periods

Λ_{i} = \arg \min_{Λ_{i}} (Y_{i}^{0} - X_{i}^{0} β - F^{0} Λ_{i})^{'} (Y_{i}^{0} - X_{i}^{0} β - F^{0} Λ_{i}) .

Λ_{i} = (F^{0 '} F^{0})^{- 1} F^{0 '} (Y_{i}^{0} - X_{i}^{0} β), i \in T .

where $0$ superscripts denote the pretreatment periods.

Calculate Treated Counterfactuals based on $β, F, Λ_{i}$

Y_{i t} (0) = x_{i t}^{'} β + λ_{i}^{'} f_{t}, i \in T, t > T_{0} .

Choose the number of factors $r$ by cross-validation. Implemented in gsynth.
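The three steps can be sketched for the special case without covariates (so $β$ drops out) and with $r$ treated as known. This is an illustration using a plain SVD for the factors, not the `gsynth` implementation; names and the noise-free toy data are assumptions.

```python
import numpy as np

def gsc_counterfactual(Y_co, Y_tr, T0, r):
    """Y_co: (T, N_co) control outcomes; Y_tr: (T, N_tr) treated outcomes.
    Returns imputed Y(0) for the treated units, shape (T, N_tr)."""
    # Step 1: factors from control units only (principal components, F'F = I_r).
    U, _, _ = np.linalg.svd(Y_co, full_matrices=False)
    F = U[:, :r]
    # Step 2: loadings of treated units from pre-treatment periods (OLS on F^0).
    Lam_tr, *_ = np.linalg.lstsq(F[:T0], Y_tr[:T0], rcond=None)
    # Step 3: counterfactuals for all periods.
    return F @ Lam_tr

rng = np.random.default_rng(0)
T, T0, r, N_co, N_tr = 12, 8, 2, 10, 3
F_true = rng.normal(size=(T, r))
Y_co = F_true @ rng.normal(size=(r, N_co))       # noise-free factor structure
Y_tr = F_true @ rng.normal(size=(r, N_tr))
effect = 2.0
Y_tr[T0:] += effect                              # constant effect after T0

Y0_hat = gsc_counterfactual(Y_co, Y_tr, T0, r)
tau_hat = Y_tr[T0:] - Y0_hat[T0:]
print(np.round(tau_hat, 6))
```

With noise, the recovered effects are only approximate, and $r$ would be chosen by cross-validation as described above.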

Dynamic Treatment Effects

We may want to estimate the effects of treatment sequences (‘time-varying exposures’), as in medical settings (Robins 1986).

2 period example

Consider a setting with $t = 1, 2$ and corresponding outcomes $Y_{t}$ and treatments $D_{t}$ , where the treatment takes on values $d_{1}, d_{2} \in {0, 1, \dots, J}$ , and baseline covariates $X_{0}$ and covariates at the end of the first period $X_{1}$ .

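Let $d_{2} : = (d_{1}, d_{2}) \in \{0, 1, \dots, J\} \times \{0, 1, \dots, J\}$ denote a treatment sequence (the symbol $d_{2}$ is overloaded: as an argument of $Y_{2}$ it denotes the full sequence, elsewhere its second element). Accordingly, $Y_{2} (d_{2})$ is the potential outcome realised when treatment is set to sequence $d_{2}$ . The ATE (contrast) between two distinct treatment sequences $d_{2}$ and $d_{2}^{'}$ is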

Δ (d_{2}, d_{2}^{'}) : = E [Y_{2} (d_{2}) - Y_{2} (d_{2}^{'})] .

Estimating this quantity requires a sequential selection on observables assumption

Y_{2} (d_{2}) ⊥ ⊥ D_{1} ∣ X_{0}, \quad Y_{2} (d_{2}) ⊥ ⊥ D_{2} ∣ D_{1}, X_{0}, X_{1}, \quad \text{for } d_{1}, d_{2} \in \{0, 1, \dots, J\} .

Pr (D_{1} = d_{1} ∣ X_{0}) > 0, Pr (D_{2} = d_{2} ∣ D_{1}, X_{0}, X_{1}) > 0.

Under these assumptions, dynamic treatment effects can be estimated based on nested conditional means regressions

Δ^{snmm} (d_{2}, d_{2}^{'}) : = E [E [E [Y_{2} ∣ D_{2} = d_{2}, X_{0}, X_{1}] ∣ D_{1} = d_{1}, X_{0}] - E [E [Y_{2} ∣ D_{2} = d_{2}^{'}, X_{0}, X_{1}] ∣ D_{1} = d_{1}^{'}, X_{0}]] .

where $d_{2} = (d_{1}, d_{2})$ and $d_{2}^{'} = (d_{1}^{'}, d_{2}^{'})$ denote distinct treatment sequences.

or an IPW estimator

Δ^{ipw} (d_{2}, d_{2}^{'}) : = E [\frac{Y \cdot 1 _{D_{1} = d_{1}} 1 _{D_{2} = d_{2}}}{p ^{d_{1}} ( X _{0} ) p ^{d_{2}} ( D _{1} , X _{0} , X _{1} )} - \frac{Y \cdot 1 _{D_{1} = d_{1}^{'}} 1 _{D_{2} = d_{2}^{'}}}{p ^{d_{1}^{'}} ( X _{0} ) p ^{d_{2}^{'}} ( D _{1} , X _{0} , X _{1} )}] .

where $p^{d_{1}} (X_{0})$ and $p^{d_{2}}$ are propensity scores in the two periods.

Finally, a double robust estimator is

Δ^{dr} (d_{2}, d_{2}^{'}) = E [ψ^{d_{2}} - ψ^{d_{2}^{'}}] .

ψ^{d_{2}} = \frac{1_{D_{1} = d_{1}} \cdot 1_{D_{2} = d_{2}} \cdot (Y_{2} - μ^{Y_{2}} (d_{2}, X_{0}, X_{1}))}{p^{d_{1}} (X_{0}) p^{d_{2}} (D_{1}, X_{0}, X_{1})} + \frac{1_{D_{1} = d_{1}} \cdot (μ^{Y_{2}} (d_{2}, X_{0}, X_{1}) - ν^{Y_{2}} (d_{2}, X_{0}))}{p^{d_{1}} (X_{0})} + ν^{Y_{2}} (d_{2}, X_{0}) .

where

μ^{Y_{2}} (d_{2}, X_{0}, X_{1}) = E [Y_{2} ∣ D_{2} = d_{2}, X_{0}, X_{1}] .

ν^{Y_{2}} (d_{2}, X_{0}) = E [E [Y_{2} ∣ D_{2} = d_{2}, X_{0}, X_{1}] ∣ D_{1} = d_{1}, X_{0}] .

are (nested) conditional mean outcomes.

If we assume that $D_{2}$ is conditionally independent of potential outcomes given pre-treatment covariates $X_{0}$ and $D_{1}$ , then post-treatment covariates $X_{1}$ are not required to control for confounders jointly affecting the second treatment and the outcome. In this case, the second part of the SOO assumption can be strengthened to $Y_{2} (d_{2}) ⊥ ⊥ D_{2} ∣ D_{1}, X_{0}$ . This simplifies the score to

ψ^{d_{2}} = \frac{1 _{D_{1} = d_{1}} \cdot 1 _{D_{2} = d_{2}} \cdot ( Y _{2} - μ ^{Y_{2}} ( d _{2} , X _{0} ) )}{p ^{d_{1}} ( X _{0} ) p ^{d_{2}} ( d _{1} , X _{0} )} + μ^{Y_{2}} (d_{2}, X_{0}) .

implemented in causalweight::dyntreatDML.

Generalisation to arbitrary panels

Let $D_{i t}$ denote treatment status at time $t$ , and collect them into a $T$-vector for each unit to form a Treatment History $D_{i} : = (D_{i 1}, D_{i 2}, \dots, D_{i T})$ . A partial treatment history up to time $t$ is denoted $D_{i, 1 : t}$ . Time-varying covariates are arranged analogously: $X_{i t}, X_{i}, X_{i, 1 : t}$ .

Potential outcomes are defined on treatment histories and rely on the standard consistency assumption / SUTVA, which states that the observed outcome equals the potential outcome under the observed history: $Y_{i t} = Y_{i t} (d_{1 : t})$ when $D_{i, 1 : t} = d_{1 : t}$ . With a binary treatment, this generates $2^{t}$ potential outcomes for the outcome in period $t$ , which permits many hypothetical comparisons.

The estimand typically of interest is the average causal effect of a treatment history

τ (d_{1 : t}, d_{1 : t}^{'}) : = E [Y_{i t} (d_{1 : t}) - Y_{i t} (d_{1 : t}^{'})] .

Define potential outcomes just intervening on the last $j$ periods as $Y_{i t} (D_{i, 1 : t - j - 1}, d_{t - j : t})$ , which is the ‘marginal’ potential outcome if the treatment history runs its natural course up to $t - j - 1$ and sets the last $j$ lags to $d_{t - j : t}$ .

This allows us to define a contemporaneous treatment effect (CET)

τ_{c} (t) = E [Y_{i t} (D_{i, 1 : t - 1}, 1) - Y_{i t} (D_{i, 1 : t - 1}, 0)] = E [Y_{i t} (1) - Y_{i t} (0)] .

The $j -$ step lagged effect is defined analogously

τ_{l} (t, j) : = E [Y_{i t} (D_{i, 1 : t - j - 1}, 1, 0_{j}) - Y_{i t} (D_{i, 1 : t - j - 1}, 0, 0_{j})] .

and the step response function (SRF) describes how this effect varies by time period and distance between the shift and the outcome

τ_{s} (t, j) = E [Y_{i t} (1_{j}) - Y_{i t} (0_{j})] .

These effects are (clunkily) parametrised in autoregressive distributed-lag (ADL) models of the form

Y_{i t} = β_{0} + α Y_{i, t - 1} + β_{1} D_{i t} + β_{2} D_{i, t - 1} + ε_{i t} .

with assumption $ε_{i t} ⊥ ⊥ D_{i, s} \forall t, s$ . This implies the following form for potential outcomes

Y_{i t} (d_{1 : t}) = β_{0} + α Y_{i, t - 1} (d_{1 : t - 1}) + β_{1} d_{i t} + β_{2} d_{i, t - 1} + ε_{i t} .

hence, changes in $d_{t - 1}$ can have both a direct and indirect effect on $Y_{i t}$ .

{Y_{i t} (d_{1 : t}) : t = 1, \dots, T} ⊥ ⊥ D_{i, 1 : t} ∣ X_{i, 0} .

This relates to linear panel models of the form

Y_{i t} = β_{0} + β_{1} D_{i t} + β_{2} D_{i, t - 1} + η_{i t} .

where strict exogeneity $E [η_{i t} ∣ D_{i, 1 : T}] = E [η_{i t}] = 0$ is assumed.

For every treatment history $d_{1 : T}$ and period $t$ ,

{Y_{i s} (d_{1 : s}) : s = 1, \dots, T} ⊥ ⊥ D_{i, 1 : t} ∣ V_{i t} .

where $V_{i t}$ is a set of covariates such as ${Y_{i, t - 1}, D_{i, t - 1}, X_{i t}}$ .

This relates to sequential exogeneity in panel models

E [ε_{i t} ∣ D_{i, 1 : t}, X_{i, 1 : t}, Y_{i, 1 : t - 1}] = E [ε_{i t} ∣ D_{i t}, V_{i t}] = 0.

Under sequential ignorability, an ADL approach would be to write the outcome regression with time-varying covariates

Y_{i t} = β_{0} + α Y_{i, t - 1} + β_{1} D_{i t} + β_{2} D_{i, t - 1} + X_{i t}^{'} δ + ε_{i t} .

This generates post-treatment bias because $X_{i t}$ may be affected by $D_{i, 1 : t - 1}$ .

Define the impulse response functions (‘blip-down’ functions) as

b_{t} (d_{1 : t}, j) : = E [Y_{i t} (d_{1 : t - j}, 0_{j}) - Y_{i t} (d_{1 : t - j - 1}, 0_{j + 1}) ∣ D_{1 : t - j} = d_{1 : t - j}] .

which is the effect of a change from $0$ to $d_{t - j}$ in terms of the treatment on the outcome at time $t$ , conditional on treatment history up to time $t - j$ .

These functions are parametrised as a function of lag length

b_{t} (d_{1 : t}, j; γ) = γ_{1 j} d_{t - j} + γ_{2 j} d_{t - j} d_{t - j - 1} + \dots

This then allows us to construct blipped-down / demediated outcomes

Y_{i t}^{j} = Y_{i t} - \sum_{s = 0}^{j - 1} γ_{s} D_{i, t - s} .

Intuitively, this transformation subtracts off the effects of $j$ lags of treatment, creating an estimate of the counterfactual level of the outcome at time $t$ if the treatment had been set to $0$ for $j$ periods before $t$ . Under sequential ignorability, the transformed outcome $Y_{i t}^{j}$ has the same expectation as the counterfactual $Y_{i t} (d_{1 : t - j}, 0_{j})$ , and can be used to construct $Y_{i t}^{j + 1}$ by modelling the relationship between $Y_{i t}^{j}$ and $D_{i, t - j}$ to estimate the lagged effect for $j + 1$ . This is recursive, hence the ‘nested’.

Sequential g-estimation can be used to estimate effects. Suppose we’re interested in the contemporaneous effect and the first-lagged effect and we adopt an impulse response function $b_{t} (d_{1 : t}, j; γ) = γ_{j} d_{t - j}$ for both these effects. We assume sequential ignorability conditional on $V_{i t} : = \{D_{i, t - 1}, Y_{i, t - 1}, X_{i t}\}$ . Sequential g-estimation proceeds as follows

1. For $j = 0$ , regress the un-transformed outcome on $\{D_{i t}, D_{i, t - 1}, Y_{i, t - 1}, X_{i t}\}$ as in an ADL model. If this is correctly specified, we estimate the blip-down parameter $γ_{0}$ (contemporaneous effect) correctly.
2. Use $γ_{0}$ to construct the one-lag blipped-down outcome $Y_{i t}^{1} = Y_{i t} - γ_{0} D_{i t}$ .
3. Regress this blipped-down outcome on $\{D_{i, t - 1}, D_{i, t - 2}, Y_{i, t - 2}, X_{i, t - 1}\}$ to estimate the next blip-down parameter $γ_{1}$ (the first lagged effect).
4. Repeat for further lags; standard errors are estimated via block-bootstrap.

To specify a marginal structural model, we choose a potential outcome lag length and write a model for the marginal model of those potential outcomes in terms of treatment history

E [Y_{i t} (d_{1 : t})] = g (d_{1 : t}; β) .

for example, for a contemporaneous and two lagged effects, we write $E [Y_{i t} (d_{t - 2 : t})] = g (d_{t - 2 : t}; β)$ , marginalising over further lags and covariates.

The average causal effect is then

τ^{msm} : = g (d_{1 : t}; β) - g (d_{1 : t}^{'}; β) .

This motivates an IPW approach where weights are constructed as

SW_{i t} : = \prod_{s = 1}^{t} \frac{P (D_{i s} ∣ D_{i, s - 1}, γ)}{P (D_{i s} ∣ X_{i s}, Y_{i, s - 1}, D_{i, s - 1}, α)} .

where the denominator of each term is the predicted probability of observing unit $i$’s observed treatment status conditional on covariates that satisfy conditional ignorability. Multiplying these over time produces the probability of seeing this unit’s treatment history conditional on the past.

These weights can be used in a regression of the form

g (d_{t - 2 : t}; β) = β_{0} + β_{1} d_{t} + β_{2} d_{t - 1} + β_{3} d_{t - 2} .

Decomposition Methods

Basic idea of decomposition

F_{M} (y) - F_{F} (y) = \int F_{M} (y ∣ x) f_{M} (x) d x - \int F_{F} (y ∣ x) f_{F} (x) d x .

F_{M} (y) - F_{F} (y) = \int [F_{M} (y ∣ x) - F_{F} (y ∣ x)] f_{M} (x) d x + \int F_{F} (y ∣ x) [f_{M} (x) - f_{F} (x)] d x .

Oaxaca-Blinder Decomposition

\overline{y}_{A} - \overline{y}_{B} = \overline{x}_{A}^{'} β_{A} - \overline{x}_{B}^{'} β_{B} = \overline{x}_{B}^{'} (β_{A} - β_{B}) + (\overline{x}_{A} - \overline{x}_{B})^{'} β_{A} .

We consider two groups, $A$ and $B$ , and an outcome $Y$ , and a vector of predictors $x$ . Main question for decomposition is how much of the mean outcome difference [or another summary statistic / quantile of CDF] is accounted for by group differences in the predictors $x$ . The Oaxaca-Blinder decomposition refers to the following decompositions:

R : = E [Y_{A}] - E [Y_{B}] = E [x_{A}]^{'} β_{A} - E [x_{B}]^{'} β_{B} .

R = \underbrace{(E [x_{A}] - E [x_{B}])^{'} β_{A}}_{\text{explained}} + \underbrace{E [x_{B}]^{'} (β_{A} - β_{B})}_{\text{unexplained}} .

R = \underbrace{(E [x_{A}] - E [x_{B}])^{'} β_{B}}_{\text{endowments}} + \underbrace{E [x_{B}]^{'} (β_{A} - β_{B})}_{\text{discrimination}} + \underbrace{(E [x_{A}] - E [x_{B}])^{'} (β_{A} - β_{B})}_{\text{interaction}} .

R = \underbrace{(E [x_{A}] - E [x_{B}])^{'} β^{*}}_{\text{explained}} + \underbrace{E [x_{A}]^{'} (β_{A} - β^{*}) + E [x_{B}]^{'} (β^{*} - β_{B})}_{\text{unexplained}} .

δ_{A} : = β_{A} - β^{*}, δ_{B} : = β_{B} - β^{*} .

Figure: Oaxaca decomposition, where $D_{1}$ is the ‘discrimination’ piece. $D_{1} \neq D_{2}$ generically unless the two groups have the same slope (which is practically never the case).
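The two-fold and three-fold decompositions above can be computed from group-wise OLS fits. In the sketch below the function name and the simulated data are illustrative assumptions; the design matrices must include a constant column so that each group mean satisfies $\overline{y} = \overline{x}^{'} \hat{β}$ exactly.

```python
import numpy as np

def oaxaca_blinder(XA, yA, XB, yB):
    """Two-fold (reference coefficients beta_A) and three-fold decompositions."""
    bA = np.linalg.lstsq(XA, yA, rcond=None)[0]
    bB = np.linalg.lstsq(XB, yB, rcond=None)[0]
    xA, xB = XA.mean(axis=0), XB.mean(axis=0)
    gap = float(yA.mean() - yB.mean())
    twofold = {"explained": float((xA - xB) @ bA),
               "unexplained": float(xB @ (bA - bB))}
    threefold = {"endowments": float((xA - xB) @ bB),
                 "coefficients": float(xB @ (bA - bB)),
                 "interaction": float((xA - xB) @ (bA - bB))}
    return gap, twofold, threefold

rng = np.random.default_rng(0)

def simulate(n, shift, beta):
    # Constant plus one predictor whose mean differs across groups.
    X = np.column_stack([np.ones(n), rng.normal(loc=shift, size=n)])
    return X, X @ beta + rng.normal(size=n)

XA, yA = simulate(400, 1.0, np.array([1.0, 2.0]))
XB, yB = simulate(300, 0.0, np.array([0.5, 1.5]))
gap, twofold, threefold = oaxaca_blinder(XA, yA, XB, yB)
print(gap, twofold, threefold)
```

Both sets of components add up to the raw gap; which piece absorbs the interaction term depends on the choice of reference coefficients.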

Detailed Decomposition

To examine the ‘contribution’ of each variable to the observed gap, estimate

y_{i} = \sum_{j = 1}^{k} x_{ji} β_{j} + \sum_{j = 1}^{k} d_{i} x_{ji} δ_{j} + ε_{i}, \quad d_{i} : = \begin{cases} 1 & i \in B \\ 0 & \text{otherwise} \end{cases} .

so, $β_{j}$ is the coefficient for group $A$ , and $β_{j} + δ_{j}$ is the coefficient for group $B$ . A t-test for $δ_{j}$ is used to establish whether a variable is a source of the observed gap. The contribution of each variable to the explained part is

c_{k}^{*} = \frac{( x _{k}^{A} - x _{k}^{B} ) β _{k}^{A}}{( x ^{A} - x ^{B} ) β ^{A}} .

Let outcome models be linear: $Y_{i} = X_{i} β_{1} + ν_{1 i}$ if $W_{i} = 1$ and $Y_{i} = X_{i} β_{0} + ν_{0 i}$ if $W_{i} = 0$ , where $E [ν_{1 i}] = E [ν_{0 i}] = 0$ .

The difference in means decomposition is

E [Y ∣ W = 1] - E [Y ∣ W = 0] = E [X ∣ W = 1] (β_{1} - β_{0}) + (E [X ∣ W = 1] - E [X ∣ W = 0]) β_{0} .

E [Y ∣ W = 1] - E [Y ∣ W = 0] = E [Y^{1} - Y^{0} ∣ W = 1] + (E [Y^{0} ∣ W = 1] - E [Y^{0} ∣ W = 0]) = τ_{PATT} + (E [Y^{0} ∣ W = 1] - E [Y^{0} ∣ W = 0]) .

Sloczynski: SATT can be estimated by running the following regression:

$Y_{i} = α + τ W_{i} + X_{i}^{'} β + ψ W_{i} (X_{i} - \overline{X}_{1}) + ε_{i}$

Kline (2011) shows that this is ‘doubly robust’ and equivalent to a reweighting estimator based on the weights

w (x) : = \frac{d F _{X ∣ W = 1} ( x )}{d F _{X ∣ W = 0} ( x )} = \frac{1 - ρ}{ρ} \frac{e ( x )}{1 - e ( x )} .

where $ρ : = Pr (W_{i} = 1)$ is the treated share.

Distributional Regression

Section based on counterfactual distribution decomposition methods.

Let $F_{X_{k}}$ denote the distribution of job-relevant characteristics (education, experience, etc.) for men when $k = 0$ and for women when $k = 1$ . Let $F_{Y_{j} ∣ X_{j}}$ denote the conditional distribution of wages given job-relevant characteristics for group $j \in \{0, 1\}$ , which describes the stochastic wage schedule that a given group faces. Using these distributions, we can construct $F_{⟨ j ∣ k ⟩}$ , the distribution of wages for group $k$ facing group $j$’s wage schedule as

$F_{⟨ j ∣ k ⟩} (y) = \int F_{Y_{j} ∣ X_{j}} (y ∣ x) d F_{X_{k}} (x), y \in Y_{j}$

For example, indexing men by $0$ and women by $1$, $F_{⟨ 0 ∣ 0 ⟩}$ is the distribution of wages for men who face men’s wage schedule, and $F_{⟨ 1 ∣ 1 ⟩}$ is the distribution of wages for women who face women’s wage schedule; both are observed distributions. We can also study $F_{⟨ 0 ∣ 1 ⟩}$, the counterfactual distribution of wages for women if they faced the men’s wage schedule $F_{Y_{0} ∣ X_{0}}$:

$F_{Y ⟨ 0 ∣ 1 ⟩} (y) \equiv \int_{\mathcal{X}_{1}} F_{Y_{0} ∣ X_{0}} (y ∣ x) d F_{X_{1}} (x)$

This is the counterfactual distribution constructed by integrating the conditional distribution of wages for men with respect to the distribution of characteristics for women.

We can interpret $F_{Y ⟨ 0 ∣ 1 ⟩}$ as the distribution of wages for women in the absence of gender discrimination, although it is predictive and cannot be interpreted as causal without further (strong) assumptions.

The observed gap between the two quantile functions decomposes as

$F_{Y ⟨ 1 ∣ 1 ⟩}^{\leftarrow} - F_{Y ⟨ 0 ∣ 0 ⟩}^{\leftarrow} = \underbrace{[F_{Y ⟨ 1 ∣ 1 ⟩}^{\leftarrow} - F_{Y ⟨ 0 ∣ 1 ⟩}^{\leftarrow}]}_{\text{structure}} + \underbrace{[F_{Y ⟨ 0 ∣ 1 ⟩}^{\leftarrow} - F_{Y ⟨ 0 ∣ 0 ⟩}^{\leftarrow}]}_{\text{composition}}$

Assumptions for Causal Interpretation

Under conditional exogeneity (selection on observables), counterfactual effects (CE) can be interpreted as causal effects. Sec 2.3 in the ECTA 2013 paper spells this out in detail. Let $(Y_{j}^{*} : j \in \mathcal{J})$ be the vector of potential outcomes for the values of a policy $j \in \mathcal{J}$, and let $X$ be a vector of covariates. Let $J$ denote the random variable that describes the realised policy and let $Y := Y_{J}^{*}$ denote the realised outcome. When $J$ is not randomly assigned, the distribution of $Y ∣ J = j$ may differ from the distribution of $Y_{j}^{*}$. However, under conditional exogeneity, the distributions of $Y ∣ X, J = j$ and $Y_{j}^{*} ∣ X$ agree, so the observed conditional distributions have a causal interpretation, and so do counterfactual distributions generated from these conditionals by integrating out $X$.

Let $F_{Y_{j}^{*} ∣ J} (y ∣ k)$ denote the distribution of the potential outcome $Y_{j}^{*}$ in the population with $J = k \in \mathcal{J}$. The causal effect of exogenously changing the policy from $l$ to $j$ on the distribution of the potential outcome in the population with the realised policy $J = k$ is $F_{Y_{j}^{*} ∣ J} (y ∣ k) - F_{Y_{l}^{*} ∣ J} (y ∣ k)$. Under conditional exogeneity, for any $j, k \in \mathcal{J}$, the counterfactual distribution $F_{Y ⟨ j ∣ k ⟩} (y)$ corresponds exactly to $F_{Y_{j}^{*} ∣ J} (y ∣ k)$. Hence the causal effect of exogenously changing the policy from $l$ to $j$ in the population with $J = k$ corresponds to the CE of changing the conditional distribution from $l$ to $j$, that is

$F_{Y_{j}^{*} ∣ J} (y ∣ k) - F_{Y_{l}^{*} ∣ J} (y ∣ k) = F_{Y ⟨ j ∣ k ⟩} (y) - F_{Y ⟨ l ∣ k ⟩} (y)$

Conditional exogeneity assumption for this section:

$(Y_{j}^{*} : j \in \mathcal{J}) ⊥ ⊥ J ∣ X$

Consider a set of populations (groups) $\mathcal{K}$ that partition the sample. For each population $k \in \mathcal{K}$ there is a covariate vector $X_{k} \in \mathbb{R}^{d_{x}}$ and an outcome $Y_{k}$. The covariate vector is observable in all populations, but the outcome is only observable in populations $j \in \mathcal{J} \subset \mathcal{K}$. Let $F_{X_{k}}$ denote the covariate distribution in population $k \in \mathcal{K}$, and let $F_{Y_{j} ∣ X_{j}}$ and $Q_{Y_{j} ∣ X_{j}}$ denote the conditional distribution and quantile functions in population $j \in \mathcal{J}$. We denote the support of $X_{k}$ by $\mathcal{X}_{k} \subset \mathbb{R}^{d_{x}}$ and the region of interest for $Y_{j}$ by $\mathcal{Y}_{j} \subseteq \mathbb{R}$. We refer to $j$ as the reference population and $k$ as the counterfactual population.

The reference and counterfactual populations in the wage example correspond to different groups. We can also generate counterfactual populations by artificially transforming a reference population. We can think of $X_{k}$ as being created through a known transformation of $X_{j}$: $X_{k} = g_{k} (X_{j})$, where $g_{k} : \mathcal{X}_{j} \to \mathcal{X}_{k}$.

Counterfactual distribution and quantile functions are formed by combining the conditional distribution in population $j$ with the covariate distribution in population $k$ , namely:

$F_{Y ⟨ j ∣ k ⟩} (y) \equiv \int_{\mathcal{X}_{k}} F_{Y_{j} ∣ X_{j}} (y ∣ x) d F_{X_{k}} (x), \quad y \in \mathcal{Y}_{j}$

$Q_{Y ⟨ j ∣ k ⟩} (τ) \equiv F_{Y ⟨ j ∣ k ⟩}^{\leftarrow} (τ), \quad τ \in (0, 1)$

where $(j, k) \in \mathcal{J} \mathcal{K}$ and $F_{Y ⟨ j ∣ k ⟩}^{\leftarrow} (τ) = \inf \{y \in \mathcal{Y}_{j} : F_{Y ⟨ j ∣ k ⟩} (y) \geq τ\}$ is the left-inverse function of $F_{Y ⟨ j ∣ k ⟩}$.

The main interest lies in the quantile effect (QE) function, defined as the difference of the two counterfactual quantile functions over a set of quantile indexes $\mathcal{T} \subset (0, 1)$:

$Δ (τ) = Q_{Y ⟨ j ∣ k ⟩} (τ) - Q_{Y ⟨ j ∣ j ⟩} (τ), \quad τ \in \mathcal{T}$
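The construction of counterfactual distributions, their left-inverses, and the quantile effect can be sketched with a discrete covariate, where the integral over $x$ reduces to a weighted sum over cells. Everything here is a simplifying assumption for illustration: the simulated groups, the cell-wise empirical conditional CDF used in place of a regression model, and the function names `counterfactual_cdf` and `left_inverse`.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n, cell_probs, slope):
    """Hypothetical group: discrete covariate x in {0, 1, 2}, outcome linear in x."""
    x = rng.choice(3, size=n, p=cell_probs)
    y = 1.0 + slope * x + rng.normal(size=n)
    return x, y

x0, y0 = simulate(4000, [0.2, 0.3, 0.5], slope=1.0)   # group 0
x1, y1 = simulate(4000, [0.5, 0.3, 0.2], slope=0.6)   # group 1

def counterfactual_cdf(y_grid, x_j, y_j, x_k):
    """F<j|k>(y) = sum_x F_{Y_j|X_j}(y | x) * Pr(X_k = x) for discrete x."""
    cdf = np.zeros_like(y_grid, dtype=float)
    for v in np.unique(x_k):
        cond_cdf = np.mean(y_j[x_j == v][:, None] <= y_grid[None, :], axis=0)
        cdf += cond_cdf * np.mean(x_k == v)
    return cdf

def left_inverse(tau, y_grid, cdf):
    """inf{y in grid : F(y) >= tau}, evaluated on the grid."""
    return y_grid[np.searchsorted(cdf, tau, side="left")]

lo = min(y0.min(), y1.min()) - 0.1
hi = max(y0.max(), y1.max()) + 0.1
grid = np.linspace(lo, hi, 500)
taus = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

F00 = counterfactual_cdf(grid, x0, y0, x0)   # observed, group 0
F11 = counterfactual_cdf(grid, x1, y1, x1)   # observed, group 1
F01 = counterfactual_cdf(grid, x0, y0, x1)   # group 1 covariates, group 0 schedule

Q00, Q11, Q01 = (left_inverse(taus, grid, F) for F in (F00, F11, F01))
qe = Q01 - Q00               # quantile effect: Delta(tau) with j = 0, k = 1
structure = Q11 - Q01
composition = Q01 - Q00
total = Q11 - Q00
print(np.round(qe, 3))
```

When $j = k$ the weighted sum collapses to the unconditional empirical CDF of that group, which is a convenient check that the weights are applied correctly.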

Estimation of Conditional distribution

The conditional distribution can be recovered from the conditional quantile function:

$F_{Y_{j} ∣ X_{j}} (y ∣ x) \equiv \int_{(0, 1)} 1 \{Q_{Y_{j} ∣ X_{j}} (u ∣ x) \leq y\} d u$

method = "qr" (the default) implements

$\hat{F}_{Y_{j} ∣ X_{j}} (y ∣ x) = ε + \int_{(ε, 1 - ε)} 1 \{x^{'} \hat{β}_{j} (u) \leq y\} d u$

where $ε$ is a small constant that avoids estimation of tail quantiles, and $\hat{β}_{j} (u)$ is the quantile regression estimator

$\hat{β}_{j} (u) = \arg\min_{b \in \mathbb{R}^{d_{x}}} \sum_{i=1}^{n_{j}} (u - 1 \{Y_{ji} \leq X_{ji}^{'} b\}) (Y_{ji} - X_{ji}^{'} b)$
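To see what the quantile regression objective does, consider the simplest case with only an intercept, where the minimiser should be a sample quantile. The sketch below is an illustration under that assumption, not the estimator used by any package: it evaluates the check loss at each observed value and picks the minimiser by brute force. The function name `check_loss` and the simulated sample are hypothetical.

```python
import numpy as np

def check_loss(b, y, u):
    """Sum over i of (u - 1{y_i <= b}) * (y_i - b): the objective with an intercept only."""
    resid = y - b
    return np.sum((u - (resid <= 0)) * resid)

rng = np.random.default_rng(3)
y = rng.normal(size=101)     # hypothetical sample
u = 0.25                     # quantile index

# With an intercept only, a minimiser can always be found among the observations.
candidates = np.sort(y)
losses = np.array([check_loss(b, y, u) for b in candidates])
b_hat = candidates[np.argmin(losses)]

k = int(np.ceil(len(y) * u))          # rank of the minimising order statistic
print(b_hat, candidates[k - 1])
```

With covariates the same loss is minimised over a coefficient vector, typically by linear programming rather than by enumeration.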

method = "logit" implements the distribution regression estimator of the conditional distribution with the logistic link function

$\hat{F}_{Y_{j} ∣ X_{j}} (y ∣ x) = Λ (x^{'} \hat{β} (y))$

where $Λ$ is the standard logistic CDF and $\hat{β} (y)$ is the distribution regression estimator

$\hat{β} (y) \equiv \arg\max_{b \in \mathbb{R}^{d_{x}}} \sum_{i=1}^{n_{j}} \left( 1 \{Y_{ji} \leq y\} \log Λ (X_{ji}^{'} b) + 1 \{Y_{ji} > y\} \log Λ (- X_{ji}^{'} b) \right)$
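Distribution regression amounts to fitting one binary logit per threshold $y$, with the indicator $1 \{Y \leq y\}$ as the dependent variable. The following sketch does this with a hand-written Newton-Raphson routine on simulated data; the helper `fit_logit`, the data-generating process (logistic noise, so the logit link is exactly right), and the choice of thresholds are all assumptions made for illustration.

```python
import numpy as np

def fit_logit(X, z, n_iter=50):
    """Logit maximum likelihood by Newton-Raphson; X includes an intercept column."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (z - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
y = x + rng.logistic(size=n)          # implies Pr(Y <= t | x) = Lambda(t - x)
X = np.column_stack([np.ones(n), x])

thresholds = np.quantile(y, [0.25, 0.5, 0.75])
betas = np.array([fit_logit(X, (y <= t).astype(float)) for t in thresholds])

def cond_cdf(x_row, beta):
    """Estimated F(y | x) = Lambda(x' beta(y)) at one threshold."""
    return 1.0 / (1.0 + np.exp(-(x_row @ beta)))

# Averaging the fitted conditional CDF over the sample gives the marginal CDF.
fitted_marginal = np.array([cond_cdf(X, b).mean() for b in betas])
print(np.round(betas, 3), np.round(fitted_marginal, 3))
```

Because each logit includes an intercept, the average fitted probability at a threshold equals the empirical CDF at that threshold, which is why the last line prints values close to 0.25, 0.5, and 0.75.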

Causal Directed Acyclic Graphs

Based on Pearl (2009), Morgan and Winship (2014), and Cunningham (2020).

For an undirected graph between $X$, $Y$, and $Z$, there are four possible directed graphs:

$X \to Y \to Z$ (a chain)

$X \leftarrow Y \leftarrow Z$ (another chain)

$X \leftarrow Y \to Z$ (a fork on Y)

$X \to Y \leftarrow Z$ (collision on Y)

With the fork or either chain, we have $X ⊥ ⊥ Z ∣ Y$. With a collider, however, $X \not\perp\!\!\!\perp Z ∣ Y$: conditioning on $Y$ induces dependence between $X$ and $Z$.
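These conditional (in)dependence statements can be illustrated by simulation. The sketch assumes linear Gaussian structural equations with unit coefficients, in which case a zero partial correlation is equivalent to conditional independence; the helper `partial_corr` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def partial_corr(x, z, y):
    """Correlation between x and z after linearly partialling out y and a constant."""
    Y = np.column_stack([np.ones_like(y), y])
    rx = x - Y @ np.linalg.lstsq(Y, x, rcond=None)[0]
    rz = z - Y @ np.linalg.lstsq(Y, z, rcond=None)[0]
    return np.corrcoef(rx, rz)[0, 1]

# Chain: X -> Y -> Z
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
chain = partial_corr(x, z, y)

# Fork: X <- Y -> Z
y = rng.normal(size=n)
x = y + rng.normal(size=n)
z = y + rng.normal(size=n)
fork = partial_corr(x, z, y)

# Collider: X -> Y <- Z, with X and Z marginally independent
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + z + rng.normal(size=n)
collider_marginal = np.corrcoef(x, z)[0, 1]
collider = partial_corr(x, z, y)

print(f"chain={chain:.3f} fork={fork:.3f} "
      f"collider marginal={collider_marginal:.3f} conditional={collider:.3f}")
```

In this design the population partial correlation under the collider is $-0.5$, even though $X$ and $Z$ are independent before conditioning.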

The causal effect of $X$ on $Y$ is written $Pr (Y ∣ do (X = x))$. The basic idea is to condition on an adequate set of controls (i.e. not every observed control). Here, controlling for $U$ is unnecessary and would bias the estimate of $Pr (Y ∣ do (X = x))$.

Basics / Terminology

A backdoor path is a non-causal path from $A$ to $Y$. Such paths are called ‘backdoor’ because they flow backwards out of $A$: each of them begins with an arrow pointing into $A$.

Here, $A \leftarrow U \to Y$ , where $U$ is a common cause for treatment and the outcome. So, $U$ is a confounder.

A worse problem arises with the following DAG, where dotted lines indicate that $U$ is unobserved. Because $U$ is unobserved, this backdoor path is open.

Colliders, when left alone, always close a backdoor path. Conditioning on them, however, opens a backdoor path, and yields biased estimates of the causal effect of $A$ on $Y$ .

Common colliders are post-treatment controls: $A \to C \leftarrow Y$.

Another insidious type of collider is of the form $A \leftarrow \dots \to C \leftarrow \dots \to Y$ , where $C$ is typically a lagged outcome.

A vector of measured controls $S$ satisfies the backdoor criterion if (i) $S$ blocks every path from $A$ to $Y$ that has an arrow into $A$ (i.e. blocks the back door) and (ii) no node in $S$ is a descendant of $A$. Then,

$Pr (Y ∣ do (A = a)) = \sum_{s} Pr (Y ∣ A = a, S = s) Pr (S = s)$

which is the same as the subclassification estimator. The conditional expectation $E [Y ∣ A = a, S = s]$ can be computed using a nonparametric regression or ML algorithm of choice.
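The backdoor adjustment formula can be evaluated exactly on a small discrete example. The sketch below assumes a hypothetical DAG with a single binary confounder $S$ ($S \to A$, $S \to Y$, $A \to Y$) and made-up conditional probabilities, then compares the adjusted contrast with the naive conditional contrast.

```python
import numpy as np

# Hypothetical structural probabilities for binary S, A, Y.
p_s = np.array([0.5, 0.5])                    # Pr(S = s)
p_a1_given_s = np.array([0.2, 0.8])           # Pr(A = 1 | S = s)
p_y1_given_as = np.array([[0.1, 0.5],         # Pr(Y = 1 | A = 0, S = s)
                          [0.4, 0.8]])        # Pr(Y = 1 | A = 1, S = s)

# Joint Pr(A = a, S = s), indexed [a, s].
p_a_given_s = np.vstack([1 - p_a1_given_s, p_a1_given_s])
p_as = p_a_given_s * p_s

def backdoor(a):
    """sum_s Pr(Y = 1 | A = a, S = s) Pr(S = s)."""
    return np.sum(p_y1_given_as[a] * p_s)

def naive(a):
    """Pr(Y = 1 | A = a) = sum_s Pr(Y = 1 | A = a, S = s) Pr(S = s | A = a)."""
    p_s_given_a = p_as[a] / p_as[a].sum()
    return np.sum(p_y1_given_as[a] * p_s_given_a)

effect_adjusted = backdoor(1) - backdoor(0)
effect_naive = naive(1) - naive(0)
print(effect_adjusted, effect_naive)
```

The only difference between the two functions is the weighting over $s$: the adjustment uses the marginal $Pr (S = s)$, while the naive contrast uses $Pr (S = s ∣ A = a)$, which is where the confounding enters.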

$M$ satisfies the frontdoor criterion if (i) $M$ blocks all directed paths from $A$ to $Y$ , (ii) there are no unblocked back-door paths from $A$ to $M$ , and (iii) $A$ blocks all backdoor paths from $M$ to $Y$ .

Then,

$Pr (Y ∣ do (A = a)) = \sum_{m} \underbrace{Pr (M = m ∣ A = a)}_{Pr (M = m ∣ do (A = a))} \underbrace{\sum_{a^{'}} Pr (Y ∣ A = a^{'}, M = m) Pr (A = a^{'})}_{Pr (Y ∣ do (M = m))}$
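The front-door formula can be verified exactly on a discrete model in which the confounder $U$ is known to the simulator but excluded from the observable joint distribution. The structure assumed here is $U \to A$, $A \to M$, $M \to Y$, $U \to Y$, and all probability values are hypothetical.

```python
import numpy as np

p_u = np.array([0.6, 0.4])                      # Pr(U = u)
p_a1_u = np.array([0.3, 0.8])                   # Pr(A = 1 | U = u)
p_m1_a = np.array([0.2, 0.9])                   # Pr(M = 1 | A = a)
p_y1_mu = np.array([[0.1, 0.5], [0.6, 0.9]])    # Pr(Y = 1 | M = m, U = u), indexed [m, u]

def with_complement(p1):
    """Stack Pr(. = 0) and Pr(. = 1) along a new last axis."""
    return np.stack([1 - p1, p1], axis=-1)

p_a_u = with_complement(p_a1_u)     # [u, a]
p_m_a = with_complement(p_m1_a)     # [a, m]
p_y_mu = with_complement(p_y1_mu)   # [m, u, y]

# Full joint Pr(U, A, M, Y), then the observable joint Pr(A, M, Y) with U summed out.
joint = np.einsum("u,ua,am,muy->uamy", p_u, p_a_u, p_m_a, p_y_mu)
p_amy = joint.sum(axis=0)

p_a = p_amy.sum(axis=(1, 2))                          # Pr(A = a)
p_m_given_a = p_amy.sum(axis=2) / p_a[:, None]        # Pr(M = m | A = a), [a, m]
p_y1_given_am = p_amy[:, :, 1] / p_amy.sum(axis=2)    # Pr(Y = 1 | A = a, M = m), [a, m]

def front_door(a):
    """sum_m Pr(m | a) sum_a' Pr(Y = 1 | a', m) Pr(a'), from observables only."""
    inner = (p_y1_given_am * p_a[:, None]).sum(axis=0)    # indexed by m
    return np.sum(p_m_given_a[a] * inner)

def truth(a):
    """Pr(Y = 1 | do(A = a)) computed from the structural model, using U."""
    return np.einsum("u,m,mu->", p_u, p_m_a[a], p_y1_mu)

def naive(a):
    """Pr(Y = 1 | A = a), which is confounded by U."""
    return p_amy[a, :, 1].sum() / p_a[a]

print(front_door(1), truth(1), naive(1))
```

The agreement is exact because $M$ is independent of $U$ given $A$ in this model, so the inner sum over $a^{'}$ averages $U$ out with its marginal distribution.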

The above DAG in words

The only way $A$ influences $Y$ is through $M$, so there is no arrow bypassing $M$ between $A$ and $Y$. In other words, $M$ intercepts all directed paths from $A$ to $Y$.

Relationship between $A$ and $M$ is not confounded by unobservables - i.e. no back-door paths between A and M.

Conditional on $A$ , the relationship between $M$ and $Y$ is not confounded, i.e. every backdoor path between $M$ and $Y$ has to be blocked by $A$ .

With a single mediator $M$ that is not caused by $U$ , the ATE can be estimated by multiplying estimates $γ \times δ$ .

The FDC estimates the ATE because it decomposes a reduced-form relationship that is not causally identified into two causally identified relationships.

Implementation through linear regressions:

$M_{i} = κ + γ A_{i} + ω_{i}$

$Y_{i} = α + δ M_{i} + ψ A_{i} + ν_{i}$

Since $γ$, the effect of $A$ on $M$, is identified, $Cov (ω_{i}, A_{i}) = 0$ in the first-stage equation, and $Cov (M_{i}, ν_{i}) = 0$ in the second-stage equation. Assume $ψ = 0$. Then, write

$τ_{FDC} = E [Y ∣ do (A)] = δ \times γ$
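A simulated linear example shows the two regressions at work. This is a sketch under assumed values: the structural coefficients, the strength of the unobserved confounder $U$, and the helper `ols` are hypothetical. The structural direct effect of $A$ on $Y$ is set to zero, in line with $ψ = 0$; the fitted coefficient on $A$ in the second regression is not zero, because it absorbs the confounding through $U$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
gamma, delta = 0.8, 1.5                        # hypothetical structural coefficients

u = rng.normal(size=n)                         # unobserved confounder of A and Y
a = u + rng.normal(size=n)                     # treatment
m = gamma * a + rng.normal(size=n)             # mediator, not caused by U
y = delta * m + 2.0 * u + rng.normal(size=n)   # no direct A -> Y arrow

def ols(columns, target):
    """OLS coefficients with an intercept prepended to the listed regressors."""
    X = np.column_stack([np.ones(len(target))] + list(columns))
    return np.linalg.lstsq(X, target, rcond=None)[0]

gamma_hat = ols([a], m)[1]        # first stage: M on A
delta_hat = ols([m, a], y)[1]     # second stage: Y on M, controlling for A
tau_fdc = gamma_hat * delta_hat   # front-door estimate
tau_naive = ols([a], y)[1]        # confounded regression of Y on A

print(f"true={gamma * delta:.3f} fdc={tau_fdc:.3f} naive={tau_naive:.3f}")
```

In this design the naive slope converges to 2.2 rather than the true 1.2, while the product of the two front-door coefficients recovers the true effect.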

Mediation Analysis

Pearl (2001), Robins (2003).

Consider a simple random sample (SRS) in which we observe $(D_{i}, M_{i}, X_{i}, Y_{i})$, where $D_{i}$ is a treatment indicator, $M_{i}$ is a mediator, $X_{i}$ is a vector of pre-treatment controls, and $Y_{i}$ is the outcome. The supports of $M_{i}$, $X_{i}$, and $Y_{i}$ are $\mathcal{M}$, $\mathcal{X}$, and $\mathcal{Y}$ respectively. The $X$s are partialled out.

Let $M_{i} (d)$ denote potential value for the mediator under treatment status $D_{i} = d$ . The outcome $Y_{i} (d, m)$ is the potential outcome for unit $i$ when $D_{i} = d, M_{i} = m$ . The observed variables can be written as $M_{i} = M_{i} (D_{i}), Y_{i} = Y_{i} (D_{i}, M_{i} (D_{i}))$ .

$d$ and $a$ are used interchangeably for treatment.

The identifying assumptions are

$\{Y_{i} (d^{'}, m), M_{i} (d)\} ⊥ ⊥ D_{i} ∣ X_{i} = x$

$Y_{i} (d^{'}, m) ⊥ ⊥ M_{i} (d) ∣ D_{i} = d, X_{i} = x$

for all $d^{'}, d \in \{0, 1\}$ and $(m, x) \in \mathcal{M} \times \mathcal{X}$.

The first condition requires the treatment to be conditionally independent of the potential mediator states and outcomes given $X$, ruling out unobserved confounders jointly affecting the treatment on the one hand and the mediator and/or the outcome on the other hand, conditional on the covariates. The second condition postulates independence between the counterfactual outcome and mediator values ‘across-worlds’.

Effectively, $M$ needs to be (approximately) randomly assigned.

$NIE_{i} (d) \equiv δ_{i} (d) : = Y_{i} (d, M_{i} (1)) - Y_{i} (d, M_{i} (0))$

This is the difference in $Y$ holding treatment status constant and varying the mediator. Its average is the Average Causal Mediation Effect (ACME):

$\overline{δ} (d) := E [δ_{i} (d)] = E [Y_{i} (d, M_{i} (1)) - Y_{i} (d, M_{i} (0))]$

$NDE_{i} (d) \equiv θ_{i} (d) := Y_{i} (1, M_{i} (d)) - Y_{i} (0, M_{i} (d))$

This is the difference in $Y$ holding the mediator constant and varying the treatment.

The total effect is

$τ_{i} = Y_{i} (1, M_{i} (1)) - Y_{i} (0, M_{i} (0))$

and it can be decomposed in two ways:

$τ_{i} = (Y_{i} (1, M_{i} (1)) - Y_{i} (0, M_{i} (1))) + (Y_{i} (0, M_{i} (1)) - Y_{i} (0, M_{i} (0))) = θ_{i} (1) + δ_{i} (0)$

$τ_{i} = (Y_{i} (1, M_{i} (0)) - Y_{i} (0, M_{i} (0))) + (Y_{i} (1, M_{i} (1)) - Y_{i} (1, M_{i} (0))) = θ_{i} (0) + δ_{i} (1)$

In general,

$τ_{i} = δ_{i} (d) + θ_{i} (1 - d)$
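The decomposition of the total effect is an accounting identity at the unit level, which is easy to confirm when both potential mediators are available, as they are in a simulation. The potential-outcome function `y_pot`, its coefficients, and the treatment-mediator interaction below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# Hypothetical potential mediators M_i(0), M_i(1) and a unit-level noise term.
m0 = rng.normal(size=n)
m1 = m0 + 1.0 + rng.normal(scale=0.5, size=n)
noise = rng.normal(size=n)
m_pot = {0: m0, 1: m1}

def y_pot(d, m):
    """Potential outcome Y_i(d, m) for every unit, with a D x M interaction."""
    return 1.0 + 2.0 * d + 1.5 * m + 0.7 * d * m + noise

def nie(d):
    """delta_i(d) = Y_i(d, M_i(1)) - Y_i(d, M_i(0))."""
    return y_pot(d, m1) - y_pot(d, m0)

def nde(d):
    """theta_i(d) = Y_i(1, M_i(d)) - Y_i(0, M_i(d))."""
    return y_pot(1, m_pot[d]) - y_pot(0, m_pot[d])

tau = y_pot(1, m1) - y_pot(0, m0)    # unit-level total effect

print(np.mean(tau), np.mean(nie(0)), np.mean(nde(1)), np.mean(nie(1)), np.mean(nde(0)))
```

Because of the interaction term, $δ_{i} (1)$ and $δ_{i} (0)$ differ here, so the two decompositions assign different shares to the direct and indirect channels even though both sum to the same total.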

The NDE fixes the mediator at its potential value $M_{i} (d)$. For the controlled direct effect (CDE), we set the mediator at a prescribed value $m$.

$CDE_{i} (d, d^{'}, m) = Y_{i} (d, m) - Y_{i} (d^{'}, m), \quad m \in \mathcal{M}$

Difference between NDE and CDE is what value mediator is fixed at. Restated:

$ψ_{i} (d, d^{'}, m) = Y_{i} (d, m) - Y_{i} (d^{'}, m)$

Effect of changing the treatment while fixing the value of the mediator at some level $m$ .

$\overline{ψ} (d, d^{'}, m) = E [Y_{i} (d, m) - Y_{i} (d^{'}, m)]$

Decomposing total effect with binary mediator

$τ (d, d^{'}) = ACDE (d, d^{'}, 0) + ANIE (d, d^{'}) + E [M (d^{'}) [CDE (d, d^{'}, 1) - CDE (d, d^{'}, 0)]]$

Assume linear models for the mediator, $M = ψ D + U_{M}$, and for the outcome, $Y = β D + γ M + U_{Y}$.

Then fit the following regressions

$Y_{i} = α_{1} + τ D_{i} + ε_{i 1}$

$M_{i} = α_{2} + ψ D_{i} + ε_{i 2}$

$Y_{i} = α_{3} + β D_{i} + γ M_{i} + ε_{i 3}$

Baron and Kenny (1986) suggest testing the nulls $τ = 0$, $ψ = 0$, and $γ = 0$. If all three are rejected, the mediation effect is $\overline{δ} = ψ γ$. Equivalently, the mediation effect is $τ - β = ψ \times γ$. Estimate the variance using the bootstrap or the delta method.
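The equivalence between the product $ψ γ$ and the difference $τ - β$ holds exactly for OLS estimates, which the sketch below confirms on simulated data before computing a nonparametric bootstrap standard error for the product. The data-generating values, the number of bootstrap draws, and the helper `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
d = rng.binomial(1, 0.5, size=n).astype(float)      # randomised binary treatment
m = 0.5 + 1.2 * d + rng.normal(size=n)              # mediator equation, psi = 1.2
y = 1.0 + 0.8 * d + 2.0 * m + rng.normal(size=n)    # outcome equation, beta = 0.8, gamma = 2.0

def ols(columns, target):
    """OLS coefficients with an intercept prepended to the listed regressors."""
    X = np.column_stack([np.ones(len(target))] + list(columns))
    return np.linalg.lstsq(X, target, rcond=None)[0]

tau_hat = ols([d], y)[1]                   # total effect
psi_hat = ols([d], m)[1]                   # treatment -> mediator
_, beta_hat, gamma_hat = ols([d, m], y)    # direct effect and mediator effect

mediation_product = psi_hat * gamma_hat
mediation_difference = tau_hat - beta_hat

# Nonparametric bootstrap for the standard error of the product estimator.
boot = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    psi_b = ols([d[idx]], m[idx])[1]
    gamma_b = ols([d[idx], m[idx]], y[idx])[2]
    boot.append(psi_b * gamma_b)
se_boot = np.std(boot, ddof=1)

print(f"product={mediation_product:.3f} difference={mediation_difference:.3f} se={se_boot:.3f}")
```

The exact agreement is the in-sample omitted-variable formula; it relies on all three regressions being linear and fitted on the same observations.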

Assume selection on observables w.r.t. D, M.

Huber (2014)

Average direct effect identified by

$θ (d) = E \left[ \left( \frac{Y \cdot D}{Pr (D = 1 ∣ M, X)} - \frac{Y \cdot (1 - D)}{1 - Pr (D = 1 ∣ M, X)} \right) \cdot \frac{Pr (D = d ∣ M, X)}{Pr (D = d ∣ X)} \right]$

Average indirect effect identified by

$δ (d) = E \left[ \frac{Y \cdot 1 \{D = d\}}{Pr (D = d ∣ M, X)} \left( \frac{Pr (D = 1 ∣ M, X)}{Pr (D = 1 ∣ X)} - \frac{1 - Pr (D = 1 ∣ M, X)}{1 - Pr (D = 1 ∣ X)} \right) \right]$

Implemented in causalweight::medweight.
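The two weighting expressions can be tried out on simulated data with binary $M$ and $X$, where both propensity scores can be estimated by cell frequencies. This is only a sketch of the formulas as written above, not a reproduction of the R implementation: the data-generating process, the cell-frequency propensity estimates, and the function names are assumptions. In this design the true direct effect is 2 and the true indirect effect is 1.5.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
x = rng.binomial(1, 0.5, size=n)                   # pre-treatment control
d = rng.binomial(1, 0.3 + 0.4 * x)                 # treatment depends on X
m = rng.binomial(1, 0.2 + 0.5 * d + 0.1 * x)       # mediator depends on D and X
y = 1.0 + 2.0 * d + 3.0 * m + 1.0 * x + rng.normal(size=n)

# Propensity scores estimated by cell frequencies (feasible because M and X are binary).
p_x = np.empty(n)       # Pr(D = 1 | X)
p_mx = np.empty(n)      # Pr(D = 1 | M, X)
for xv in (0, 1):
    in_x = x == xv
    p_x[in_x] = d[in_x].mean()
    for mv in (0, 1):
        cell = in_x & (m == mv)
        p_mx[cell] = d[cell].mean()

def direct_effect(dv):
    """theta(d): weighting expression for the average direct effect."""
    pd_mx = p_mx if dv == 1 else 1 - p_mx
    pd_x = p_x if dv == 1 else 1 - p_x
    contrast = y * d / p_mx - y * (1 - d) / (1 - p_mx)
    return np.mean(contrast * pd_mx / pd_x)

def indirect_effect(dv):
    """delta(d): weighting expression for the average indirect effect."""
    pd_mx = p_mx if dv == 1 else 1 - p_mx
    weight = p_mx / p_x - (1 - p_mx) / (1 - p_x)
    return np.mean(y * (d == dv) / pd_mx * weight)

ipw_total = np.mean(y * d / p_x - y * (1 - d) / (1 - p_x))   # IPW estimate of the total effect

print(direct_effect(1), indirect_effect(0), ipw_total)
```

Whatever propensity estimates are plugged in, $θ (1) + δ (0)$ and $θ (0) + δ (1)$ both reduce algebraically to the inverse-probability-weighted total effect, which gives a quick internal consistency check.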
