This note is a high-fidelity Markdown migration of the Causal Inference chapter from the LaTeX source.

Parent map: index

Prerequisites: probability-and-mathstats, linear-regression

Concept map

flowchart TD
  A[Potential Outcomes] --> B[Randomized Experiments]
  A --> C[Selection on Observables]
  C --> D[Regression Adjustment]
  C --> E[Matching / IPW / AIPW]
  A --> F[Instrumental Variables]
  F --> G[LATE / Compliers]
  F --> H[MTE]
  A --> I[RD / RDD]
  A --> J[DiD / Panel]
  J --> K[Staggered Adoption]
  J --> L[Synthetic Control]
  A --> M[Decomposition Methods]
  A --> N[Causal DAGs]

Foundations, Experiments

Potential Outcomes

Exposition from

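$Y_i$ is the observed outcome, $D_i$ is the treatment with levels $\{0, 1\}$,

potential outcomes denoted $Y_i(d)$ or $Y_{di}$ (interchangeably).

$$Y_i = \begin{cases} Y_{1i} & \text{if } D_i = 1 \\ Y_{0i} & \text{if } D_i = 0 \end{cases}$$

Equivalently, we have the switching equation

$$Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i}$$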

This encodes what is known as the causal-consistency assumption (/ SUTVA).

Generally, define a potential outcome

where is a vector of observed covariates and is a vector of unobservables, and is an unknown measurable function. Typically, we are interested in non-parametric identification of or some features of it.

Given a population of units, the assignment mechanism is a row-exchangeable function taking values on and satisfying

A unit level assignment probability for unit is

A finite population propensity score is

where is the number of units in each stratum defined by .

is a row-exchangeable function of potential outcomes, treatment assignment, and covariates.

are vectors of potential outcomes, is a covariate matrix, and is an assignment vector.

The most intuitive estimand is the vector of unit-level effects $\tau_i = Y_{1i} - Y_{0i}$. This is impossible to estimate because of the fundamental problem of causal inference (FPCI), so we instead use summaries, such as its sample average or subgroup averages.

We never see both potential outcomes for any given unit.

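Decompositions of Observed Differences:

$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_{1i} - Y_{0i}]}_{\text{ATE}} + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}} + \underbrace{(1 - \pi)(\text{ATT} - \text{ATU})}_{\text{heterogeneous effect bias}}$$

where $\pi$ is the share of the sample treated.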

This is a Missing Completely at Random (MCAR) assumption on potential outcomes.

Writing outcomes as generated by the switching equation assumes that potential outcomes for any unit do not vary with the treatment assigned to other units. In practice, this is equivalent to a no-spillovers assumption.

Equivalently, let $\mathbf{D}$ denote a treatment vector for $N$ units, and $Y_i(\mathbf{D})$ be the potential outcome that would be observed for unit $i$ under allocation $\mathbf{D}$. Then, SUTVA requires that for allocations $\mathbf{D}, \mathbf{D}'$,

$$Y_i(\mathbf{D}) = Y_i(\mathbf{D}') \quad \text{if } D_i = D_i'$$

Intuitively, SUTVA ensures that the ‘science table’ (Imbens & Rubin 2015) has 2 columns for the two potential outcomes as opposed to $2^N$ (number of potential outcomes with arbitrary interference).

Treatment Effects

Estimands

Under randomisation, $\text{ATE} = \text{ATT}$, since the treated are a random sample of the population. Under the weaker assumption $Y_{0i} \perp D_i$, only the ATT is identified.

Difference in Means

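Variance of the difference-in-means estimator is given by

$$V(\hat\tau) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_{\tau}^2}{N}$$

where $S_1^2, S_0^2$ are sample variances of $Y_{1i}, Y_{0i}$ respectively, and $S_{\tau}^2$ is the variance of the unit level treatment effect

$$S_{\tau}^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(Y_{1i} - Y_{0i} - \tau\right)^2$$

This is not identifiable because of the last term. If the treatment effect is constant in the population, the last term is zero.

A (conservative) variance estimator is given by

$$\hat V(\hat\tau) = \frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}$$

where

$$s_d^2 = \frac{1}{N_d - 1} \sum_{i: D_i = d} \left(Y_i - \bar Y_d\right)^2, \quad d \in \{0, 1\}$$

These variance estimates can be used to construct 95% confidence intervals

$$\hat\tau \pm 1.96 \sqrt{\hat V(\hat\tau)}$$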
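As a concrete illustration, here is a minimal NumPy sketch of the difference-in-means estimator with the conservative variance estimator and a normal-approximation interval. The function name `diff_in_means` and the simulated data are assumptions made for this example only.

```python
import numpy as np

def diff_in_means(y, d, z=1.96):
    """Difference in means with the conservative (Neyman) variance estimator."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=int)
    y1, y0 = y[d == 1], y[d == 0]
    tau_hat = y1.mean() - y0.mean()
    # s_1^2 / N_1 + s_0^2 / N_0: drops the unidentified treatment-effect variance term
    var_hat = y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size
    se = np.sqrt(var_hat)
    return tau_hat, se, (tau_hat - z * se, tau_hat + z * se)

# Simulated completely randomised experiment (illustrative data, true ATE = 2)
rng = np.random.default_rng(0)
n = 2000
d = rng.permutation(np.repeat([0, 1], n // 2))
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 2.0 + rng.normal(0.0, 0.5, n)   # heterogeneous unit-level effects
y = np.where(d == 1, y1, y0)              # switching equation
tau_hat, se, ci = diff_in_means(y, d)
```

Because the effects here are heterogeneous, the reported standard error is (weakly) too large on average, which is the sense in which the estimator is conservative.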

Regression Adjustment

  • [extra terms above come from allowing for heterogeneous TEs]

Selection bias:

Suppose percent of the population gets the treatment. Let . Then,

Generalise to fraction treated

VCV under homoscedasticity

VCV under heteroskedasticity

Including controls:

Corrects for chance covariate imbalances, improves precision by removing variation in outcome accounted for by pre-treatment characteristics.

Freedman (2008) Critique

Regression of the form

is consistent for ATE but has small sample bias (unless model is true); bias is on the order of

precision does not improve through the inclusion of controls; including controls is harmful to precision if more than units are assigned to one treatment condition

Recommends fitting

Where the two potential outcomes are stipulated to follow

which has same small sample bias, but cannot hurt asymptotic precision even if the model is incorrect and will likely increase precision if covariates are predictive of the outcomes.

Randomisation Inference

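Sharp null: $H_0: Y_{1i} = Y_{0i} \; \forall i$. Implies $\tau_i = 0$ for every unit.

To test the sharp null, set $Y_{1i} = Y_{0i} = Y_i$ for all units and re-randomize treatment. Under complete randomisation of $N$ units with $N_1$ treated, there are $\binom{N}{N_1}$ assignment vectors. The p-value can be as small as $1 / \binom{N}{N_1}$.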

is the full set of randomisation realisations, and is an element in the set (drawn either under complete randomization or binomial randomization), with associated probability

One sided P-value :
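A minimal sketch of this test under complete randomisation follows. It approximates the one-sided p-value by Monte Carlo draws of the assignment vector rather than enumerating every assignment; the function name `ri_pvalue`, the number of draws, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ri_pvalue(y, d, n_draws=5000, seed=0):
    """One-sided randomisation p-value for the difference in means under the sharp null."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=int)
    rng = np.random.default_rng(seed)

    def diff(assign):
        return y[assign == 1].mean() - y[assign == 0].mean()

    observed = diff(d)
    # Under the sharp null the outcomes are fixed; only the assignment is re-drawn.
    # Permuting d keeps the number of treated units fixed (complete randomisation).
    draws = np.array([diff(rng.permutation(d)) for _ in range(n_draws)])
    return observed, float(np.mean(draws >= observed))

rng = np.random.default_rng(42)
d = rng.permutation(np.repeat([0, 1], 20))
y = 3.0 * d + rng.normal(size=40)          # illustrative data with a large effect
observed, p_value = ri_pvalue(y, d)
```

With a small experiment the full set of assignments can be enumerated exactly; sampling is used here only to keep the sketch short.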

Blocking

Stratify randomisation to ensure that groups start out with identical observable characteristics on blocked factors.

if where and are errors from specification omitting and including block dummies respectively.

For blocks,

Point estimate

Variance Randomisations within each block are independent, so the variances are simple means (with squared weights).
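A small sketch of the blocked estimator, assuming block weights equal to block shares $N_j / N$ and the conservative within-block variance. The helper names and the simulated two-block design are assumptions for this example.

```python
import numpy as np

def neyman(y1, y0):
    """Difference in means and its conservative variance for one group of units."""
    return y1.mean() - y0.mean(), y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size

def blocked_estimate(y, d, block):
    y, d, block = np.asarray(y, float), np.asarray(d, int), np.asarray(block)
    tau, var = 0.0, 0.0
    for b in np.unique(block):
        in_b = block == b
        share = in_b.mean()                       # N_j / N
        tau_b, var_b = neyman(y[in_b & (d == 1)], y[in_b & (d == 0)])
        tau += share * tau_b                      # weighted mean of block effects
        var += share**2 * var_b                   # independent blocks: squared weights
    return tau, np.sqrt(var)

rng = np.random.default_rng(7)
block = np.repeat([0, 1], 200)
# Randomise separately within each block (100 treated, 100 control per block)
d = np.concatenate([rng.permutation(np.repeat([0, 1], 100)) for _ in range(2)])
y = 5.0 * block + 1.0 * d + rng.normal(size=400)  # strong block effect, true effect = 1
tau_block, se_block = blocked_estimate(y, d, block)
_, var_pooled = neyman(y[d == 1], y[d == 0])
se_pooled = np.sqrt(var_pooled)                   # ignores blocking, for comparison
```

In this simulation the block indicator explains most of the outcome variance, so the blocked standard error is much smaller than the pooled one.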

Regression Formulation

If treatment probabilities vary by block, then weight by

Efficiency Gains from Blocking

Complete Randomisation: $Y_i = \alpha + \tau_{CR} D_i + \epsilon_i$

Block Randomisation:

Where is the fit from regressing on all dummies. Since by randomisation,

Power Calculations

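Basic idea: With large enough samples, $\hat\tau \sim N\left(\tau, \frac{\sigma_1^2}{pN} + \frac{\sigma_0^2}{(1 - p)N}\right)$ [where $p$ is the share of sample treated]. Set $p$ to minimise overall variance. Yields $p^* = \frac{\sigma_1}{\sigma_1 + \sigma_0}$. With homoskedasticity, this is 50% treatment, 50% control.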

(effect size)

Test for .

For common variance ,

General formula for Power with unequal variances

This yields

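Common variance (assumed)

$$\text{MDE} = M \sqrt{\frac{\sigma^2}{N p (1 - p)}}$$

where $M = t_{\alpha/2} + t_{1 - \kappa}$ = critical t-value to reject the null + t-value for the alternative (where $\kappa$ is power).

MDES (Minimum Detectable Effect Size in Standard Deviation Units):

$$\text{MDES} = \frac{\text{MDE}}{\sigma} = M \sqrt{\frac{1}{N p (1 - p)}}$$

With $\alpha = 0.05$ and power $0.8$, the multiplier simplifies to $M \approx 1.96 + 0.84 = 2.80$.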

Rearrange to get necessary sample size for any given hypothesised MDE and expected variance.
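The rearrangement can be sketched with the standard library alone. This assumes the normal approximation to the t-quantiles and a common outcome variance; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def multiplier(alpha=0.05, power=0.8):
    """Critical value for a two-sided test plus the quantile for the target power."""
    z = NormalDist().inv_cdf
    return z(1 - alpha / 2) + z(power)

def mde(n, sigma=1.0, p=0.5, alpha=0.05, power=0.8):
    """Minimum detectable effect for total sample size n and treated share p."""
    return multiplier(alpha, power) * sigma * math.sqrt(1.0 / (n * p * (1 - p)))

def required_n(effect, sigma=1.0, p=0.5, alpha=0.05, power=0.8):
    """Smallest total sample size whose MDE does not exceed `effect`."""
    return math.ceil((multiplier(alpha, power) * sigma / effect) ** 2 / (p * (1 - p)))

m = multiplier()               # roughly 1.96 + 0.84
n_needed = required_n(0.2)     # detect a 0.2 SD effect with a 50/50 split
```

Moving away from a 50/50 split (for example `p=0.2`) raises the required sample size because $p(1-p)$ shrinks.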

MDES for Blocking

where is the R-squared from regressing on block dummies.

To test against the alternative, we look at the T Statistic

Inverting this for size gives us a required sample size

typically, , , so by substitution:

Selection On Observables

typology

  • Regression estimators: rely on consistent estimation of $\mu_d(x) = E[Y \mid D = d, X = x]$
  • Matching estimators
  • Propensity score estimators: rely on estimation of $e(x) = P(D = 1 \mid X = x)$
  • Combination methods (augmented IPW, bias-corrected matching, etc.)

Regression Anatomy / FWL

where is the residual from a regression of on all other covariates.

If structural (long) equation is , with vector of unobserved, and we estimate short , then we can write the specification as

equivalently,

Coefficient in short regression = coefficient in long regression + (effect of omitted variable) × (coefficient from regression of omitted on included). This bias can be arbitrarily large.
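Both the regression-anatomy (FWL) result and the omitted variable bias formula hold exactly in sample, which a short simulation can confirm. The data-generating process and variable names below are assumptions for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 5000
a = rng.normal(size=n)                        # variable omitted from the short regression
d = 0.8 * a + rng.normal(size=n)              # included regressor, correlated with a
y = 1.0 + 2.0 * d + 3.0 * a + rng.normal(size=n)
ones = np.ones(n)

beta_long = ols(np.column_stack([ones, d, a]), y)    # [const, d, a]
beta_short = ols(np.column_stack([ones, d]), y)      # [const, d]
delta = ols(np.column_stack([ones, d]), a)[1]        # omitted regressed on included

# OVB formula: short = long + (effect of omitted) * (omitted on included)
implied_short = beta_long[1] + beta_long[2] * delta

# FWL: regress y on d after residualising d on the other covariates
controls = np.column_stack([ones, a])
d_resid = d - controls @ ols(controls, d)
beta_fwl = (d_resid @ y) / (d_resid @ d_resid)
```

Here the short-regression coefficient is close to 3.5 even though the long-regression coefficient is close to 2.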

Identification of Treatment Effects under Unconfoundedness

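  • Unconfoundedness / Selection on Observables / Ignorability / Conditional Independence Assumption: $(Y_{1i}, Y_{0i}) \perp D_i \mid X_i$. In terms of densities, this is equivalent to the validity of the following density factorisation: $f(y_1, y_0, d \mid x) = f(y_1, y_0 \mid x)\, f(d \mid x)$
  • Common support: $0 < P(D_i = 1 \mid X_i = x) < 1$ for all $x$

Under these two assumptions,

$$E[Y_{1i} - Y_{0i} \mid X_i] = E[Y_{1i} \mid D_i = 1, X_i] - E[Y_{0i} \mid D_i = 0, X_i] = E[Y_i \mid D_i = 1, X_i] - E[Y_i \mid D_i = 0, X_i]$$

The third quantity is estimable using observed data.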

Estimators:

Discrete Case: has finite values indexed by with generic entry

Multi-valued and Continuous Treatments

Treatment values: finite if multi-valued / for continuous, with corresponding dose-responses . We are interested in dose-response function , and contrasts.

First define Generalised propensity score :

Assumptions: Weak unconfoundedness: Conditional density overlap:

Bias removal using the generalised propensity score: Estimate the conditional expectation of the outcome as a function of treatment level and GPS as Estimate the dose-response function of the treatment by averaging the conditional expectation at that particular level of treatment

Then compute contrasts to get first derivative (MTE)

Estimators of

which can be used to construct estimators of ATE(), ATT(, and other estimands. reference: , David Childers’ lecture notes.

Regression Adjustment

Estimate by a nonparametric regression estimator .

Since, for least-squares fits with an intercept estimated separately within each arm, the average predicted outcome in an arm equals the average observed outcome in that arm, one ATE form is

SATT only requires imputation of one potential outcome:

Inverse Propensity Weighting

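Estimate the propensity score by $\hat e(x)$. Then

$$\hat\tau_{IPW} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{D_i Y_i}{\hat e(X_i)} - \frac{(1 - D_i) Y_i}{1 - \hat e(X_i)} \right)$$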

Augmented Inverse Propensity Weighting / Combination Methods

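Estimate $\hat\mu_d(x)$ and $\hat e(x)$, then average

$$\hat\tau_{AIPW} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat\mu_1(X_i) - \hat\mu_0(X_i) + \frac{D_i \left(Y_i - \hat\mu_1(X_i)\right)}{\hat e(X_i)} - \frac{(1 - D_i)\left(Y_i - \hat\mu_0(X_i)\right)}{1 - \hat e(X_i)} \right]$$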

Normalized outcome regression:

Subclassification / Blocking

Weighted combination of subclasses of covariate values, which partition the population

Regression Adjustment

A single regression with controls is potentially problematic because of Simpson’s paradox. To account for this in a parametric setup, assume that for a set of i.i.d. subjects $i = 1, \dots, n$ we observe a tuple $(X_i, Y_i, D_i)$, comprised of

  • feature vector $X_i \in \mathbb{R}^p$
  • response $Y_i \in \mathbb{R}$
  • treatment assignment $D_i \in \{0, 1\}$

Define conditional response surfaces as

First pass regression adjustment estimator (using OLS)

where is obtained via OLS. This generically doesn’t work for regularised regression.

With known propensity score (as in case of regression), an efficient estimator weights all estimated treatment effects by the propensity score:

Additional Assumptions for consistent estimate of ATE from OLS:

  1. Constant treatment effects

  2. Outcomes linear in X

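Under 1 and 2, OLS will provide unbiased and consistent estimates of ATE.

  • If 2 fails, the OLS coefficient is the Best Linear Approximation of the average causal response function.
  • If 1 fails, the OLS coefficient is a conditional-variance-weighted average of the underlying stratum-level effects $\tau_x$.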

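Pretend there are discrete strata of $X$. Then, OLS estimates

$$\tau_{OLS} = \frac{\sum_x \tau_x w_x}{\sum_x w_x}$$

where the weight

$$w_x = P(D = 1 \mid X = x)\left(1 - P(D = 1 \mid X = x)\right) P(X = x)$$

weighs up groups where the sizes of the treated and untreated populations are roughly equal, and weighs down groups with large imbalances in the size of these two groups.

$\tau_{OLS}$ is the true effect iff constant treatment effects holds.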

Matching

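Regression estimators impute missing potential outcomes using $\hat\mu_d(X_i)$. Matching estimators proceed by imputing the missing potential outcome using the observed outcomes of the ‘closest’ units in the opposite treatment group.

Define $j_m(i)$ as the index $j$ with $D_j = 1 - D_i$ that satisfies

$$\sum_{l: D_l = 1 - D_i} \mathbb{1}\left\{ \|X_l - X_i\| \leq \|X_j - X_i\| \right\} = m$$

So, $j_m(i)$ is the index of the unit in the opposite treatment group that is the $m$th closest to unit $i$ in terms of covariate values under the norm $\|\cdot\|$. Let $\mathcal{J}_M(i) = \{j_1(i), \dots, j_M(i)\}$ denote the indices of the first $M$ matches for unit $i$. Then, impute potential outcomes as

$$\hat Y_{0i} = \begin{cases} Y_i & \text{if } D_i = 0 \\ \frac{1}{M} \sum_{j \in \mathcal{J}_M(i)} Y_j & \text{if } D_i = 1 \end{cases} \qquad \hat Y_{1i} = \begin{cases} \frac{1}{M} \sum_{j \in \mathcal{J}_M(i)} Y_j & \text{if } D_i = 0 \\ Y_i & \text{if } D_i = 1 \end{cases}$$

then, the simple matching (with replacement) estimator for ATE is

$$\hat\tau_M = \frac{1}{N} \sum_{i=1}^{N} \left( \hat Y_{1i} - \hat Y_{0i} \right)$$

and corresponding ATT

$$\hat\tau_M^{ATT} = \frac{1}{N_1} \sum_{i: D_i = 1} \left( Y_i - \hat Y_{0i} \right)$$

where $M = 1$ corresponds with one-to-one matching and $M > 1$ is many-to-one. The matching estimator is in general not $\sqrt{N}$-consistent (Abadie and Imbens (2006)): it has a bias of order $O(N^{-1/k})$, where $k$ is the number of continuous covariates.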
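A minimal sketch of nearest-neighbour matching with replacement for the ATT. It assumes a Euclidean metric on covariates scaled by their standard deviations and no bias correction; `match_att` and the simulated data are hypothetical names and inputs for this example.

```python
import numpy as np

def match_att(y, d, X, m=1):
    """ATT by m-nearest-neighbour matching with replacement (no bias correction)."""
    y, d, X = np.asarray(y, float), np.asarray(d, int), np.asarray(X, float)
    Xs = X / X.std(axis=0, ddof=1)                 # put covariates on a common scale
    treated, control = np.flatnonzero(d == 1), np.flatnonzero(d == 0)
    # Pairwise distances: rows are treated units, columns are control units
    diff = Xs[treated][:, None, :] - Xs[control][None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :m]      # indices of the m closest controls
    y0_hat = y[control][nearest].mean(axis=1)      # imputed untreated outcome
    return float(np.mean(y[treated] - y0_hat))

rng = np.random.default_rng(11)
n = 1000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-X[:, 0]))                     # selection on the first covariate
d = (rng.uniform(size=n) < p).astype(int)
y = X[:, 0] + X[:, 1] + 1.0 * d + rng.normal(scale=0.5, size=n)   # true ATT = 1
att_hat = match_att(y, d, X, m=1)
naive = y[d == 1].mean() - y[d == 0].mean()        # confounded comparison
```

The dense distance matrix is fine for a few thousand units; larger problems would typically use a tree-based neighbour search instead.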

Bias-corrected (Abadie-Imbens)

Where is the regression function under the control.

Metrics

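  • Euclidean distance: $\|X_i - X_j\| = \sqrt{(X_i - X_j)'(X_i - X_j)}$
  • Stata diagonal distance: $\sqrt{(X_i - X_j)' \, \text{diag}(\hat\Sigma)^{-1} (X_i - X_j)}$, where the normalisation factor is the diagonal element of $\hat{\Sigma}$, the estimated variance-covariance matrix.
  • Mahalanobis distance (scale-invariant): $\sqrt{(X_i - X_j)' \hat\Sigma^{-1} (X_i - X_j)}$, where $\hat\Sigma$ is the variance-covariance matrix.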

Matching estimators have a normal distribution in large samples provided that bias is small.

For matching without replacement,

For matching with replacement,

where $K_M(i)$ is the number of times observation $i$ is used as a match, and the last error variance term is also estimated by matching. The standard bootstrap is not valid for matching estimators.

PScore is a balancing score: conditioning on propensity score is equivalent to conditioning on covariates.

defines the semiparametric Efficiency Bound for ATE: the asymptotic variance of any regular estimator of of the population ATE obeys

where

and for PATE ()

where , , and .

Any regular estimator whose asymptotic variance achieves this efficiency bound is equal to , where

is the Efficient Influence Function for estimating .

shows that

Estimators in this section try to attain the SPEB.

and

The counterfactual mean can be identified as

where .

Normalise both pieces using a Hajek-style adjustment, since extreme values of $\hat e(X_i)$ make the variance explode. It is often advisable to trim or use Hajek weights, which introduce limited bias in exchange for large decreases in variance.
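The difference between the Horvitz-Thompson and Hajek forms is one line of code. The sketch below assumes a logistic propensity score fitted by Newton-Raphson in plain NumPy; `fit_logit`, `ipw_ate`, and the simulated data are illustrative names, not part of any package mentioned in this note.

```python
import numpy as np

def fit_logit(X, d, iters=25):
    """Logistic regression coefficients via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (d - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def ipw_ate(y, d, e, hajek=True):
    w1, w0 = d / e, (1 - d) / (1 - e)
    if hajek:
        # Hajek: normalise each arm by the sum of its weights
        return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
    # Horvitz-Thompson: divide by N
    return np.mean(w1 * y) - np.mean(w0 * y)

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-0.5 * x))
d = (rng.uniform(size=n) < e_true).astype(float)
y = 1.0 + 2.0 * d + x + rng.normal(size=n)        # true ATE = 2
X = np.column_stack([np.ones(n), x])
e_hat = 1 / (1 + np.exp(-X @ fit_logit(X, d)))
tau_ht = ipw_ate(y, d, e_hat, hajek=False)
tau_hajek = ipw_ate(y, d, e_hat, hajek=True)
```

Unlike the Horvitz-Thompson form, the Hajek form is invariant to adding a constant to every outcome, which is one reason it tends to be more stable.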

Similarly, for the effect on the treated

Horvitz-Thompson Estimator as Regression

with IPW weights

define the Weighted ATE (WATE) as

where is a weighting function. ATT is constructed when

the corresponding estimator is

Sample drawn from , and can represent a target population as where is the tilting function.

Define , which gives

For a given tilting function, to estimate , weight

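| Target population | Tilting function $h(x)$ | Estimand | Weights $(w_1, w_0)$ |
|---|---|---|---|
| Combined | $1$ | ATE [IPW] | $\left(\frac{1}{e(x)}, \frac{1}{1 - e(x)}\right)$ |
| Treated | $e(x)$ | ATT | $\left(1, \frac{e(x)}{1 - e(x)}\right)$ |
| Control | $1 - e(x)$ | ATC | $\left(\frac{1 - e(x)}{e(x)}, 1\right)$ |
| Overlap | $e(x)(1 - e(x))$ | ATO | $\left(1 - e(x), e(x)\right)$ |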

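Overlap weights are defined by choosing $h(x) = e(x)(1 - e(x))$, which minimises the asymptotic variance of $\hat\tau_h$. When the propensity score is estimated by logistic regression, they achieve exact balance on the covariates included in the propensity score model.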

can be interpreted as treatment effect among population that have good balance on observables.

Implemented in PSweight.

Entropy weights for each control unit are chosen by a reweighting scheme

subject to balance/moment-condition and normalising constraints

The above problem is convex but has dimensionality of (nonnegativity) + (moment conditions) + (normalisation). The dual, on the other hand, only has dimensionality and unconstrained, which is considerably easier to solve using Newton-Raphson.

propose CBPS, which is a method that involves modifying an initial propensity score estimate (e.g. by changing coefficients from a logistic model) iteratively until a balance criterion is reached.

Their basic insight is that when we use a logistic regression to estimate a propensity score, we assert that the pscore takes the form , and maximise the bernoulli log likelihood

which is then solved by the corresponding score

this score balances a particular function of covariates: . Alternatively, we could choose that function by specifying a moment condition

Analogously for ATT, this moment condition is

When this balance condition is solved independently, the problem is just-identified. When it is used in conjunction with the conventional bernoulli likelihood, the problem is over-identified. Implemented in CBPS::CBPS as well as balance.

The estimand is (with defined analogously). The estimator for this quantity is written

where the weights are chosen to satisfy the sample balance property

for any bounded function .

in words: for every function , the weighting function equates weighted averages of over the treated units to unweighted averages over the study population.

The weights are solved by solving an optimisation problem to trade off imbalance and some measure of complexity

with convex functions.

A common imbalance measure is

for

Hybrid Estimators

A doubly-robust estimator is consistent if one gets either the propensity score or the regression right.

Oracle AIPW

Feasible AIPW

This is the Augmented-Inverse-Propensity Weighting Estimator (AIPW) introduced by . Additional overviews:. General double-robustness property also shared by targeted maximum-likelihood estimators(TMLE) - due to .

Similarly, analogous estimator for ATT

where and is its empirical analogue.

The Cross-fit version can be stated as

where is a mapping that takes an observation and puts it into one of the folds. is an estimator excluding the fold.

Define individual treatment effect score as

Then,

We can form level- CIs :

grf has a forest-based implementation of AIPW

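library(grf)
cf = causal_forest(X, Y, D)  # X: covariate matrix, Y: outcome, D: treatment
ate_hat = average_treatment_effect(cf)  # doubly robust ATE estimate and standard error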
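For intuition about the cross-fit AIPW estimator and its confidence interval, here is a minimal NumPy sketch. It is not how grf works internally: it swaps the forests for linear outcome regressions and a Newton-Raphson logit, and the clipping bounds, function names, and simulated data are all assumptions for this example.

```python
import numpy as np

def fit_logit(X, d, iters=25):
    """Logistic regression coefficients via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (d - p))
    return beta

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def aipw_crossfit(y, d, X, n_folds=2, seed=0):
    n = y.size
    Z = np.column_stack([np.ones(n), X])
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    score = np.empty(n)
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Nuisance functions are fitted on the other folds only
        b1 = ols(Z[train & (d == 1)], y[train & (d == 1)])
        b0 = ols(Z[train & (d == 0)], y[train & (d == 0)])
        e = 1 / (1 + np.exp(-Z[test] @ fit_logit(Z[train], d[train])))
        e = np.clip(e, 0.01, 0.99)                 # assumed trimming bounds
        mu1, mu0 = Z[test] @ b1, Z[test] @ b0
        score[test] = (mu1 - mu0
                       + d[test] * (y[test] - mu1) / e
                       - (1 - d[test]) * (y[test] - mu0) / (1 - e))
    tau = score.mean()
    se = score.std(ddof=1) / np.sqrt(n)            # plug-in standard error from the scores
    return tau, se, (tau - 1.96 * se, tau + 1.96 * se)

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 2))
e_true = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1])))
d = (rng.uniform(size=n) < e_true).astype(int)
y = X[:, 0] + X[:, 1] + d * (2.0 + X[:, 0]) + rng.normal(size=n)   # true ATE = 2
tau_hat, se, ci = aipw_crossfit(y, d, X)
```

Any learner can replace the linear and logit fits, provided each observation's nuisance predictions come from a model that did not see that observation.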

partially-linear setup

where is a scalar treatment indicator. Observations are independent but not necessarily identically distributed. We are interested in inference about that is robust to mistakes in model-selection.

Approximate and with linear combinations of control terms , which may contain interactions and non-linear transformations.

Assume approximate sparsity (there are only a small number of relevant controls, and the coefficients on the irrelevant controls are small with high probability).

Naive (incorrect) approach: use LASSO on an eqn of the form

where the treatment is not penalised. This will mean we drop any control that is highly correlated with the treatment if the control is moderately correlated with the outcome. Then, if we use a post-LASSO selection to estimate the treatment effect, the effect will be contaminated with an omitted variable bias.

recommended two-step approach

Estimate with LASSO, select predictive variables (i.e. those with nonzero coefficients) in

Estimate with LASSO, select predictive variables (i.e. those with nonzero coefficients) in

Estimate where [i.e. control for variables that are selected in either the first or second regression]

Let be the indices of the selected controls for the outcome and treatment respectively.

The post-double-selection estimator is

Can use plugin estimator for variance based on residuals

where

Implemented in hdm::rlassoEffect(., ’double selection’)

Let the target parameter solve the equation for known score function , vector of observables , and nuisance parameter . In fully parametric models, is simply the score function [derivative of the log-likelihood]. For ATE, .

In naive double-ML settings, . So, we replace with the Neyman-orthogonal score s.t.

which yields the Orthogonalised Moment Condition for some real-valued condition .

Using a Neyman-orthogonal score eliminates first-order biases arising from the replacement of with .

Reference:

Consider data with

Partial linear setup .

Score function is

$$\psi(W; \theta, \eta) = \left(Y - l(X) - \theta (D - m(X))\right) (D - m(X))$$

Partially Linear IV

Score is

Interactive Regression

Here, the estimands are

The score function for ATE (Hahn (1998))

The nuisance parameter true value is . For ATET,

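  1. Take a $K$-fold random partition $(I_k)_{k=1}^{K}$ of observation indices $\{1, \dots, N\}$ s.t. each fold has size $n = N / K$. For each $k$, define $I_k^c$ as the complement / auxiliary sample.

  2. For each $k$, construct a ML estimator $\hat\eta_{0,k}$ of $\eta_0$ using only the auxiliary sample $I_k^c$.

  3. For each $k$, using the main sample $I_k$, construct the estimator $\check\theta_{0,k}$ as the solution of $\frac{1}{n} \sum_{i \in I_k} \psi(W_i; \check\theta_{0,k}, \hat\eta_{0,k}) = 0$.

  4. Aggregate the estimators on each main sample: $\tilde\theta_0 = \frac{1}{K} \sum_{k=1}^{K} \check\theta_{0,k}$.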

Simple implementation of Cross-fitting for Treatment effects

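  1. Partition the data in two, $I$ and $I^c$, such that each fold has size $n = N / 2$.

  2. Using only sample $I^c$, construct a ML estimator of $E[Y \mid D, X]$ and $P(D = 1 \mid X)$, e.g. a feedforward nnet of $Y$ on $(D, X)$, denoted as $\hat g$, and logit-lasso of $D$ on $X$, denoted by $\hat m$.

  3. Use the estimators on the hold-out sample $I$ to compute the treatment effect $\check\theta(I, I^c)$.

  4. Repeat (2, 3) swapping the roles of $I$ and $I^c$ to get $\check\theta(I^c, I)$.

  5. Aggregate the estimators: $\tilde\theta = \frac{1}{2}\left(\check\theta(I, I^c) + \check\theta(I^c, I)\right)$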
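The same split, fit, swap, and aggregate recipe can be sketched for the partially linear score, where the estimator reduces to a residual-on-residual regression. To stay dependency-free, this sketch swaps the ML learners for least squares on a quadratic basis, which is an assumption that only works because the simulated nuisance functions lie in that basis; `dml_plr` is a hypothetical name.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def dml_plr(y, d, x, seed=0):
    """Two-fold cross-fit estimate of theta in Y = theta*D + g(X) + e."""
    n = y.size
    F = np.column_stack([np.ones(n), x, x**2])     # assumed basis for l(X) and m(X)
    idx = np.random.default_rng(seed).permutation(n)
    halves = [idx[: n // 2], idx[n // 2:]]
    thetas = []
    for main, aux in (halves, halves[::-1]):       # second pass swaps the roles
        l_hat = F[main] @ ols(F[aux], y[aux])      # E[Y | X], fitted on the auxiliary half
        m_hat = F[main] @ ols(F[aux], d[aux])      # E[D | X], fitted on the auxiliary half
        ry, rd = y[main] - l_hat, d[main] - m_hat
        thetas.append((rd @ ry) / (rd @ rd))       # solves the orthogonal moment condition
    return float(np.mean(thetas))                  # aggregate across folds

rng = np.random.default_rng(9)
n = 4000
x = rng.normal(size=n)
d = 0.5 * x + rng.normal(size=n)
y = 1.5 * d + x**2 + rng.normal(size=n)            # true theta = 1.5
theta_hat = dml_plr(y, d, x)
```

With flexible learners in place of the quadratic basis, the cross-fitting structure stays exactly the same.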

Implemented in DoubleML

Augmented Balancing

Loosely: AIPW without the (potentially fraught) inversion of the propensity score step. Exposition based on Bruns-Smith et al (2023)

Setup:

  • Covariates $X$, outcome $Y$
  • Two populations $P$ and $Q$ that are distributions over $(X, Y)$: $P$ is ‘source’, $Q$ is ‘target’ (e.g. treatment group and overall sample)
  • Estimand is $E_Q[Y]$

Identification Assumptions:

  • Conditional Mean Ignorability: $E_P[Y \mid X] = E_Q[Y \mid X]$
  • Population Overlap: $Q$ is absolutely continuous w.r.t. $P$

Effect Functionals

Regression Functional

Weighting Functional

Doubly-Robust Functional

Balancing Weights: Rationale

is difficult to estimate using plug-in estimation Alternative: weighting for balance automatic estimation of the Riesz representer

Weighting to minimise covariate imbalance

Direct estimation of the density ratio

Minimum variance weights that balance are also guaranteed to balance all other measurable functions in .

In linear setting, relevant imbalance is captured entirely by feature mean imbalance are iid draws from , are m draws from Define feature map ; construct gram matrices Let Let denote dual norm [, ]

Three Equivalent Representations

Penalised form

Constrained form

Automatic form

OLS is equivalent to a weighting estimator that exactly balances the feature means. Let be the linear regression fit on (source sample). Then,

Analogue for Ridge

, and any linear balancing weight estimator with estimated coefficients , , and

In words: when both outcome and weighting models are linear, the augmented estimator is equivalent to a linear model with coefficients that are element-wise affine combinations of base learner and coefs from regressing on

Heterogeneous Treatment Effects with selection on observables

Conditional Average Treatment Effects (CATEs) () are often of great policy interest for targeting those who have largest potential gains. However, conventional methods are prone to a severe risk of fishing from researchers (cf ‘conditional effects’ in most published work in the social sciences).

Instead, recent work proposes to use nonparametric estimators to find subgroups, use sample-splitting for honesty.

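  1. Transformed outcome regression: use the outcome transformed with the propensity score, $Y_i^* = Y_i \frac{D_i - e(X_i)}{e(X_i)(1 - e(X_i))}$, since $E[Y_i^* \mid X_i = x] = \tau(x)$ under SOO.

  2. Conditional mean regression: use the fact that under SOO $\tau(x) = E[Y \mid D = 1, X = x] - E[Y \mid D = 0, X = x]$.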

(1) typically inefficient because of pscore in denominator, so most focus is on (2). Random forests are a flexible method that is widely liked.

Consider a model for where

where for some pre-determined set of basis functions . We allow for non-parametric relationships between , but the treatment effect function itself is parametrised by . showed that under unconfoundedness, we can rewrite the semiparametric setup above as

where

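The oracle algorithm for estimating $\beta$ is (1) define the residualised outcome and the residualised treatment (interacted with the basis functions) using the true nuisance functions, then (2) estimate $\beta$ by a residual-on-residual regression. This procedure is $\sqrt{n}$-consistent and asymptotically normal.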

Use cross-fitting to emulate the Oracle.

Run non-parametric regressions and to get

define transformed features ,

Estimate by regressing

To define R-Loss , under more general setup restate unconfoundedness as follows

where

and follow Robinson’s approach to write

R-loss is then written

Define and The R-learner consists of the following steps

Use any method to estimate the response functions

Minimise R-loss using cross-fitting for nuisance components

where is a regulariser.

Causal forest as implemented by grf starts by fitting two separate regression forests to estimate $m(x) = E[Y \mid X = x]$ and $e(x) = P(D = 1 \mid X = x)$, makes out-of-bag predictions [using cross-fitting] using the two first-stage forests, then grows the causal forest via

where

are the learned adaptive weights.

  1. Draw a subsample of size $s$ from the sample without replacement and divide it into two disjoint sets $\mathcal{I}, \mathcal{J}$.

  2. Grow a tree via recursive partitioning, with splits chosen using the $\mathcal{J}$ sample (i.e. without using outcome observations from the $\mathcal{I}$ sample).

  3. Estimate leaf responses using only the $\mathcal{I}$ sample.

Finally, aggregate all trees over subsamples of size $s$

where summarises randomness in the selection of the variable when growing the tree, is shorthand for a training sample.

where the base learner

the ‘honesty’ property is making independent of , i.e. do not use the same data to select partition (splits) and make predictions.

Implemented in causalForest and grf.

Multi-action policy learning

$N$ units, to be assigned to actions $a \in \{1, \dots, K\}$, which have corresponding rewards $Y_i(a)$. Each observation has covariate $X_i$. Define a policy function

A given policy assigns each unit to a treatment level. Each policy has a corresponding value function

An optimal policy is defined as

Deviations from this optimum are called regret

Define a CEF as

The first-best optimal rule is

In the binary action case, this simplifies to which is the conditional empirical success (CES) rule of Manski (2004).

Under unconfoundedness and Overlap, we can estimate s and construct an empirical analogue of the value function for a policy using the following familiar estimators

A -convergent estimator of the value function is the Cross-fit Augmented Inverse Probability Weighted Learning (CAIPWL) estimator of , which is constructed as a cross-fit analogue of the AIPW estimator.

Sensitivity Analysis

Check balance by computing SDiff for observable confounders

Three valued treatment indicator: corresponding with ineligibles, eligible nonparticipants, and participants. We can test unconfoundedness by comparing ineligibles with eligible nonparticipants, i.e. test

Placebo Outcomes

Covariates included lagged outcomes . Test e.g. Earnings in 1975 in Lalonde

is a nuisance parameter.

Where , and . .

Propensity score is Logistic:

indicates strength of relationship between and .

Y is conditionally normal

indicates strength of relationship between and .

MLE setup

Construct grid of and calculate the MLE for by maximising over .

Use 2 partial s:

  • : Residual variation in outcome explained by (after partialling out ).

  • : Residual variation in treatment assignment explained by (after partialling out ).

Draw threshold contours, should expect most covariates to be clustered around origin.

Rosenbaum (2002)[]

Tuning parameter $\Gamma$ that measures departure from zero hidden bias.

For any two observations $i$ and $j$ with identical covariate values $X_i = X_j$, under unconfoundedness, the probability of assignment into treatment should be identical: $\pi_i = \pi_j$.

Treatment assignment probability may differ due to an unobserved binary confounder $U$. We can bound this by the ratio:

$$\frac{1}{\Gamma} \leq \frac{\pi_i (1 - \pi_j)}{\pi_j (1 - \pi_i)} \leq \Gamma$$

$\Gamma = 1$: no bias. $\Gamma = 2$: $i$ can have up to twice the odds of being treated as $j$ despite identical $X$.

is assumed to satisfy

For any given candidate , estimates of the treatment effect can be computed. Implemented in rbounds::hlsens.

Altonji, Elder, Taber (2005)

Only informative if selection on observables is informative about selection on unobservables.

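How much does the treatment effect move when controls are added? Estimate the model with and without controls:

$$Y_i = \alpha^R + \beta^R D_i + \epsilon_i^R \qquad Y_i = \alpha^F + \beta^F D_i + X_i'\gamma + \epsilon_i^F$$

AET ratio:

$$\frac{\hat\beta^F}{\hat\beta^R - \hat\beta^F}$$

Want the ratio to be as big as possible (i.e. $\hat\beta^R \approx \hat\beta^F$, as expected under unconfoundedness).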

Define proportional selection coefficient

Then,

where

are from a univariate regression of on

are from a regression including controls

is maximum achievable

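True model is $Y = \hat\tau D + X\hat\beta + \hat\gamma Z + \hat\epsilon_{full}$, but we don’t observe $Z$. We would like to quantify how biased the coefficient $\hat\tau_{res}$ from the short regression (omitting $Z$) is for the long regression coefficient $\hat\tau$. From the OVB formula, we know $\hat\tau_{res} = \hat\tau + \hat\gamma \hat\delta$, where $\hat\gamma$ is the conditional association between the omitted $Z$ and $Y$ (‘impact’) and $\hat\delta$ is the coefficient from regressing $Z$ on $D$, partialling out $X$ (‘imbalance’).

The bias from this omission is

$$|\widehat{\text{bias}}| = |\hat\gamma \hat\delta| = se(\hat\tau_{res}) \sqrt{\frac{R^2_{Y \sim Z \mid D, X} \; R^2_{D \sim Z \mid X}}{1 - R^2_{D \sim Z \mid X}} \, df}$$

They then define the robustness value

$$RV_q = \frac{1}{2} \left( \sqrt{f_q^4 + 4 f_q^2} - f_q^2 \right)$$

where $f_q = q \, |f_{Y \sim D \mid X}|$, $f_{Y \sim D \mid X}$ is the partial Cohen’s $f$ of the treatment with the outcome, and $q$ is the proportion of reduction on the treatment coefficient that would be deemed problematic.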

Partial Identification

the ATE can be decomposed as

The terms in red are counterfactual outcomes for which the data contains no information. Bounding approaches involve estimators for these missing quantities.

Suppose all we know is

w.l.o.g. given bounded support $[y_{min}, y_{max}]$, we can always min-max rescale the outcome to $[0, 1]$

Width of possible interval learnable from data is at largest, at smallest, so worst case interval always contains 0. Need theory/assumptions to even get the sign right.

Assume bounded support for the outcome. Replace missing values with maximum () or minimum () of support. These are worst-case bounds and yield intervals that are basically uninformative.
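A minimal sketch of the worst-case bounds for an outcome rescaled to $[0, 1]$: each unobserved mean is replaced by the minimum or maximum of the support. The function name `manski_bounds` and the toy data are assumptions for this example.

```python
import numpy as np

def manski_bounds(y, d, y_min=0.0, y_max=1.0):
    """No-assumption bounds on the ATE for an outcome with bounded support."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=int)
    pi = d.mean()                                  # share treated
    m1, m0 = y[d == 1].mean(), y[d == 0].mean()    # observed arm means
    # E[Y1]: observed for the treated, bounded by the support for the controls
    ey1 = (pi * m1 + (1 - pi) * y_min, pi * m1 + (1 - pi) * y_max)
    # E[Y0]: observed for the controls, bounded by the support for the treated
    ey0 = ((1 - pi) * m0 + pi * y_min, (1 - pi) * m0 + pi * y_max)
    return ey1[0] - ey0[1], ey1[1] - ey0[0]

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])
d = np.array([1, 1, 1, 1, 0, 0, 0, 0])
lower, upper = manski_bounds(y, d)   # (-0.25, 0.75)
```

The interval always has width $y_{max} - y_{min}$ and always contains zero, which is why further assumptions are needed even to sign the effect.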

And denote

Monotone Treatment Response: assume mean potential outcome under treatment cannot be lower than under control . Then

Monotone Treatment Selection: subjects select themselves into treatment in a way the mean potential outcomes of the treatment and control groups can be ordered. Positive MTS implies and . This implies and

Let denote the treatment effect and denote its distribution, and let denote the distributions of outcomes for the two potential outcomes. Then, where

Instrumental Variables

If SOO fails (e.g. because of OVB), then the OLS estimate is no longer consistent. Use $Z$ as an instrument for $D$, which isolates variation in $D$ unrelated to the omitted variable.

Traditional IV Framework (Constant Treatment Effects)

Setup

Second Stage: $Y = \alpha_0 + \alpha_1 D + u_2$

First Stage: $D = \pi_0 + \pi_1 Z + u_1$

Reduced Form:

$$\begin{aligned} Y & = \gamma_0 + \gamma_1 Z + u_3 \\ & = \alpha_0 + \alpha_1 ( \pi_0 + \pi_1 Z + u_1) + u_2 \\ & = (\alpha_0 + \alpha_1 \pi_0) + \underbrace{(\alpha_1 \pi_1)}_{\gamma_1} Z + (\alpha_1 u_1 + u_2) \end{aligned}$$

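Exogeneity (as good as random conditional on covariates): $Cov(Z, u_1) = 0$

Exclusion Restriction: $Cov(Z, u_2) = 0$, i.e. $Z$ has no effect on $Y$ except through $D$.

Relevance: $Z$ affects $D$, i.e. $\pi_1 \neq 0$

With the above assumptions, we can write

$$\alpha_1 = \frac{\gamma_1}{\pi_1} = \frac{Cov(Y, Z) / V(Z)}{Cov(D, Z) / V(Z)}$$

This is equivalent to

$$\alpha_1 = \frac{Cov(Y, Z)}{Cov(D, Z)}$$

With binary treatment and binary instrument, one can write the IV effect as

$$\alpha_1 = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}$$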

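With multiple instruments or endogenous variables, the $k$-class estimator is

$$\hat\beta_k = \left( X'(I - k M_Z) X \right)^{-1} X'(I - k M_Z) Y, \qquad M_Z = I - Z(Z'Z)^{-1}Z'$$

where $Z(Z'Z)^{-1}Z'X$ is $X$ projected in the column space of $Z$,

which nests OLS, 2SLS, LIML, and Fuller’s estimator as special cases. Specifically,

  • $k = 0$ is OLS
  • $k = 1$ is 2SLS
  • $k = k_{LIML}$ is LIML
  • $k = k_{LIML} - b / (n - L)$ is Fuller’s estimator, for a chosen constant $b$ and $L$ instruments (and exogenous covariates)

here, $k_{LIML}$ is the minimum value of $k$ that satisfies

$$\det\left( W'W - k \, W' M_Z W \right) = 0, \qquad W = [Y, X]$$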

Implemented in ivmodel, which takes model fits from AER::ivreg and computes LIML / k-class estimates.

Asymptotically, all $k$-class estimators are consistent for $\beta$ when $k \to 1$.
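The nesting of OLS and 2SLS in the $k$-class family can be checked numerically. This sketch assumes the textbook form $\hat\beta_k = (X'(I - kM_Z)X)^{-1} X'(I - kM_Z)Y$ with the intercept included in both $X$ and $Z$; it does not compute the LIML or Fuller values of $k$, and `k_class` is a hypothetical helper, not the ivmodel API.

```python
import numpy as np

def k_class(y, X, Z, k):
    """k-class estimator, computed without forming any n-by-n matrix."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # P_Z X: projection onto col(Z)
    # X'(I - k M_Z)X = (1 - k) X'X + k X'P_Z X, and likewise for the X'y term
    A = (1 - k) * (X.T @ X) + k * (X_hat.T @ X)
    b = (1 - k) * (X.T @ y) + k * (X_hat.T @ y)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
d = z + u + rng.normal(size=n)              # endogenous regressor
y = 0.5 + 1.0 * d + 2.0 * u + rng.normal(size=n)   # true coefficient on d is 1
X = np.column_stack([np.ones(n), d])
Z = np.column_stack([np.ones(n), z])

beta_ols = k_class(y, X, Z, k=0.0)
beta_2sls = k_class(y, X, Z, k=1.0)
```

In this simulation the OLS slope is pulled towards roughly 1.67 by the confounder, while the 2SLS slope is close to 1.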

Inference

Under homoscedasticity,

Under heteroskedasticity,

Test statistic and null distribution

Equivalently, assuming the instrument is valid, we can test whether $D$ is endogenous by estimating the following regression

$$Y = \alpha_0 + \alpha_1 D + \rho \hat v + \epsilon$$

where $\hat v$ are the (fitted) residuals from estimating the first stage regression $D = \pi_0 + \pi_1 Z + v$. A standard t-test for $\rho = 0$ tests whether $D$ is exogenous assuming $Z$ is a valid set of instruments. [Because it relies on instrument validity, this test is not that useful in practice.]

Weak Instruments

Second term non-zero if instrument is not exogenous. Let and [variance of first stage error] and be F statistic of the first-stage. Then, bias in IV is

If the first stage is weak ($F \to 0$), the bias approaches that of OLS. As $F \to \infty$, the bias goes to zero.

When instruments are weak, AR confidence intervals are preferable to eyeballing F-statistics. Let be a matrix of , and let , (where is typically 0), and

be an estimator for the covariance matrix for the errors.

and let be two-dimensional vectors defined as

and

Define the scalars

Based on these scalars, two tests that are fully robust to weak instruments for testing $H_0: \beta = \beta_0$ are the Anderson-Rubin test (Anderson and Rubin 1949) and the Conditional Likelihood Ratio test (Moreira 2003).

IV with Heterogeneous Treatment Effects / LATE Theorem

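$Z_i \in \{0, 1\}$: binary instrument

$D_i \in \{0, 1\}$: binary treatment; $D_{zi}$ is potential treatment status given $Z_i = z$

potential outcomes: $Y_i(d, z)$ for $d, z \in \{0, 1\}$

heterogeneous treatment effects: $\tau_i = Y_{1i} - Y_{0i}$

Compliers: $D_{1i} > D_{0i}$, i.e. $D_{1i} = 1$, $D_{0i} = 0$

Always takers: $D_{1i} = D_{0i} = 1$

Never Takers: $D_{1i} = D_{0i} = 0$

Defiers: $D_{1i} < D_{0i}$

A1: Independence of Instrument: $\left( Y_i(d, z), D_{1i}, D_{0i} \right) \perp Z_i$

A2: Exclusion restriction: $Y_i(d, 0) = Y_i(d, 1) \equiv Y_{di}$

A3: First Stage: $E[D_{1i} - D_{0i}] \neq 0$

A4: Monotonicity / No defiers: $D_{1i} \geq D_{0i} \; \forall i$ or vice versa

Under A1-A4,

$$\frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]} = E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]$$

If A1-A4 are satisfied, the IV estimate is the Local Average Treatment Effect for the compliers.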

So, LATE is a weighted average for people with large ; i.e. treatment effect for those whose probability of treatment is most influenced by .

IV in Randomized Trials with one-sided noncompliance. Conditional on A1:A4 holding, and . Then,

Precision for LATE Estimation

Characterising Compliers

PO Model of IV allows for heterogeneous treatment effects but does not formally identify LATE conditional on X.

extends methods by allowing the treatment inducer to be randomized conditionally on the covariates and by allowing the outcome to depend on the covariates besides the treatment intake. The paper also provided semiparametric estimations of the probability of receiving the treatment inducement, which helps to identify the treatment effects in a more robust way.

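Need the following assumptions (all conditional on $X$):

  • Independence of instrument: $\left( Y(d, z), D_0, D_1 \right) \perp Z \mid X$, i.e. SOO w.r.t. the instrument.
  • Exclusion restriction: $P\left( Y(d, 0) = Y(d, 1) \mid X \right) = 1$ for $d \in \{0, 1\}$
  • Monotonicity: $P(D_1 \geq D_0 \mid X) = 1$
  • First Stage: $P(D_1 = 1 \mid X) > P(D_0 = 1 \mid X)$
  • Common Support: $0 < P(Z = 1 \mid X) < 1$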

Specifically, when the treatment inducer Z is as good as randomized after conditioning on covariates X, Abadie proposed a two-stage procedure to estimate treatment effects.

First, estimate the probability of receiving the treatment inducement (preferably using a semiparametric estimator) in order to provide a set of pseudo-weights.

Second, the pseudo-weights are used to estimate the local average response function (LARF) of the outcome conditional on the treatment and covariates.

The estimated coefficient for the treatment intake D reflects the conditional treatment effect.

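Given monotonicity, we can identify the proportion of compliers, never-takers, and always-takers respectively:

$$P(D_1 > D_0) = E[D \mid Z = 1] - E[D \mid Z = 0], \quad P(D_1 = D_0 = 0) = 1 - E[D \mid Z = 1], \quad P(D_1 = D_0 = 1) = E[D \mid Z = 0]$$

If nobody in the control group has access to the treatment (i.e. $E[D \mid Z = 0] = 0$), then LATE = ATT.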

By Bayes rule,

Suppose assumptions of LATE thm hold conditional on covariates . Let be any measurable real function of with finite expectation. We can show that the expectation of is a weighted sum of the expectation in the three groups

Rearranging terms gives us

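Then,

$$E[g(Y, D, X) \mid D_1 > D_0] = \frac{1}{P(D_1 > D_0)} E[\kappa \, g(Y, D, X)]$$

where

$$\kappa = 1 - \frac{D (1 - Z)}{1 - P(Z = 1 \mid X)} - \frac{(1 - D) Z}{P(Z = 1 \mid X)}$$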

This result can be applied to any characteristic or outcome and get its mean for compliers by removing the means for never and always takers. [p 181-183] provides overview of estimation. Trick is to construct a weighting scheme with positive weights so that , which is negative for always-takers and never-takers.

To compute $\kappa$, we need $P(Z = 1 \mid X)$, which can be estimated using a standard logit/probit or a power series.

Standard example: average covariate value among compliers:

is the weighted average of covariate using Kappa weights.
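The kappa-weighting idea above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Abadie's implementation: the data-generating process, the variable names, and the use of a constant $P(Z = 1 \mid X)$ (because the simulated instrument is fully randomised) are all illustrative choices. It uses the standard form $\kappa = 1 - \frac{D(1-Z)}{1 - P(Z=1 \mid X)} - \frac{(1-D)Z}{P(Z=1 \mid X)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Simulated data: every name and parameter here is an illustrative assumption.
x = rng.binomial(1, 0.4, size=n)        # binary covariate
z = rng.binomial(1, 0.5, size=n)        # instrument, randomised independently of x

# Compliance types depend on x: 0 = never-taker, 1 = complier, 2 = always-taker.
p_complier = np.where(x == 1, 0.6, 0.3)
p_always = 0.2
u = rng.uniform(size=n)
ctype = np.where(u < p_complier, 1, np.where(u < p_complier + p_always, 2, 0))
d = np.where(ctype == 1, z, (ctype == 2).astype(int))   # monotone take-up


def kappa_weights(d, z, pz):
    """kappa = 1 - D(1-Z)/(1-P(Z=1|X)) - (1-D)Z/P(Z=1|X)."""
    return 1.0 - d * (1 - z) / (1.0 - pz) - (1 - d) * z / pz


pz = np.full(n, z.mean())               # P(Z=1|X); constant because Z is randomised here
kappa = kappa_weights(d, z, pz)

# Kappa-weighted mean of the covariate estimates its mean among compliers.
complier_mean_x = np.sum(kappa * x) / np.sum(kappa)
true_complier_mean_x = x[ctype == 1].mean()   # only known because the data are simulated
complier_share = kappa.mean()                 # estimates P(complier)
```

In observational settings `pz` would instead come from a fitted model of the instrument given covariates, as described above.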

Likelihood that Complier has a given value of (Bernoulli distributed) characteristic X relative to the rest of the population is given by

Assume A1-A4 from LATE. Generalise to take values in the set; Let denote the potential (or latent) outcome for person for treatment level . Then,

where the weights

are non-negative and sum to 1.

CEF of for the subpopulation of compliers:

Estimate

Estimate in the whole population, weighting by

implemented in LARF::larf in R.

Treatment is . First define two additional quantities

  • $P_{A,C,i}$ is the conditional probability that unit $i$ is either a complier *or* an always-taker. Assume that this probability is a function of covariates $X_i$, with corresponding parameter vector $\theta_{A,C}$ and a CDF $F$ that transforms it to the probability scale [taken to be the normal CDF henceforth, but this can be relaxed].

  • $P_{A|A,C,i}$ is the conditional probability that unit $i$ is an always-taker *conditional* on being either a complier or an always-taker. Assume that this probability is a function of covariates $X_i$, with corresponding parameter vector $\theta_{A|A,C}$.

Next, they note that the probability of treatment for stratum can be written as

Using the two conditional probabilities defined above, this can be written as

which, for binary treatment lets us write a Bernoulli likelihood for an observation

Plugging in the definitions of the two conditional probabilities gives us the likelihood, and its argmax defines the estimates of the two parameter vectors. This is generically a difficult optimisation problem and improving its computation is a promising avenue for future research.

The maximum likelihood estimates of the two parameter vectors can be plugged into to compute individual compliance scores

The inverse compliance score weighted estimator for the ATE with weights is then

which is a weighted version of the familiar Wald estimator with a Hajek correction that normalises each expectation by the sum of weights in that treatment group.
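The weighted Wald ratio with Hajek normalisation can be sketched as follows. This is a minimal illustration, assuming the compliance scores have already been estimated; the function name, the toy data, and the compliance scores are hypothetical, and the maximum likelihood step described above is not implemented here.

```python
import numpy as np


def hajek_weighted_wald(y, d, z, w):
    """Weighted Wald ratio; each arm's mean is normalised by that arm's weight total."""
    y, d, z, w = (np.asarray(a, dtype=float) for a in (y, d, z, w))

    def arm_mean(v, arm):
        return np.sum(w * arm * v) / np.sum(w * arm)

    itt_y = arm_mean(y, z) - arm_mean(y, 1 - z)   # weighted reduced form
    itt_d = arm_mean(d, z) - arm_mean(d, 1 - z)   # weighted first stage
    return itt_y / itt_d


# Toy data (illustrative only).
z = np.array([1, 1, 1, 0, 0, 0])
d = np.array([1, 1, 0, 0, 0, 1])
y = np.array([3.0, 4.0, 1.0, 1.0, 2.0, 3.0])

# With unit weights this is the ordinary Wald estimator.
wald_unweighted = hajek_weighted_wald(y, d, z, np.ones_like(y))

# Hypothetical estimated compliance scores; weights are their inverses.
compliance_score = np.array([0.5, 0.8, 0.5, 0.8, 0.5, 0.8])
icsw_estimate = hajek_weighted_wald(y, d, z, 1.0 / compliance_score)
```

Units with low compliance scores receive large weights, so in practice very small estimated scores may need trimming or inspection.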

Shift Share / Bartik Instruments

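SSIV setting from [notation and exposition from PGP’s slides]. We want to estimate the causal effect or structural parameter $\beta$ in

$$y_l = D_l \rho + x_l \beta + \epsilon_l$$

where $E[x_l \epsilon_l] \neq 0$ because the ‘treatment’ $x_l$ is typically a change in an economic quantity (e.g. employment) that is correlated with unobserved shocks to the outcome (e.g. wages). $l$ indexes locations.

An accounting identity that decomposes the treatment is

$$x_l = \sum_k z_{lk} g_{lk}$$

where $k$ indexes industries, $z_{lk}$ is the share of industry $k$ in location $l$, and $g_{lk}$ is the location-industry shift. A second accounting identity for location-industry shifts is

$$g_{lk} = g_k + \tilde{g}_{lk}$$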

As a GMM system

denotes exogenous controls and fixed effects.
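The two accounting identities suggest the usual shift-share construction, sketched below on simulated inputs. This is an illustration under assumed notation: $z_{lk}$ for location-industry shares, $g_k$ for the common industry shift, and $B_l = \sum_k z_{lk} g_k$ for the instrument. All array names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_loc, n_ind = 50, 8

# Hypothetical inputs: baseline industry shares z_lk (each row sums to 1) and growth rates.
shares = rng.dirichlet(np.ones(n_ind), size=n_loc)       # z_lk
g_common = rng.normal(0.02, 0.05, size=n_ind)            # g_k: industry-level shift
g_idio = rng.normal(0.0, 0.03, size=(n_loc, n_ind))      # location-specific deviation
g_local = g_common + g_idio                              # g_lk = g_k + deviation

# First accounting identity: the treatment is a share-weighted sum of local shifts.
x = np.sum(shares * g_local, axis=1)

# Shift-share instrument: replace local shifts with the common industry shifts.
bartik = shares @ g_common

# The part of x not captured by the instrument is the share-weighted idiosyncratic component.
residual_component = x - bartik
```

In applied work the common shift is often estimated as a leave-one-out average over other locations rather than taken as known, which this sketch does not do.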

Under constant , need Exogeneity Relevance

  • ‘shares’: focus on exogeneity of the shares $z_{lk}$. Analogy to DiD: changes in industry composition.

  • ‘shifts’: focus on exogeneity of the shocks $g_k$. Requires an argument for why shocks are as good as randomly assigned.

with Rotemberg weight

Marginal Treatment Effects: Treatment effects under self selection

propose the marginal treatment effect (MTE) setup that generalises the IV approach for continuous instruments and nests many estimands (and is a generalisation of the Roy (1951) model). It also has a clearer treatment of self-selection.

Exposition based on . Define potential outcomes

where is the conditional mean function and captures deviations, with .

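Treatment assignment assumes a weakly separable choice model

$$D^* = \mu_D(X, Z) - V, \qquad D = \mathbb{1}\{D^* \geq 0\}$$

where $D^*$ is the latent propensity to take the treatment, and is interpreted as the net gain from treatment since treatment is only taken up if $D^* \geq 0$. $Z$ is an instrument. $V$ enters the selection equation negatively, and thus represents latent resistance to treatment.

The condition $D^* \geq 0$ can be rewritten as $V \leq \mu_D(X, Z)$. Applying the CDF of $V$ to both sides yields

$$F_V(V) \leq F_V(\mu_D(X, Z))$$

Define $U_D := F_V(V)$ and $P(X, Z) := F_V(\mu_D(X, Z))$, the propensity score.

Both the LHS and the RHS take values in $[0, 1]$. The treatment decision can now be written as

$$D = \mathbb{1}\{U_D \leq P(X, Z)\}$$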

Now, we define treatment effects

Aggregating over different parts of the covariate distribution yields different estimates.

Integrating these over yields the conventional estimators. With self-selection based on , typically ATT > ATE > ATU.
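The ordering ATT > ATE > ATU under selection on gains can be checked in a small simulation. This is a minimal sketch, assuming a linear latent-index model in which individual gains fall with latent resistance; the coefficients and variable names are illustrative and not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

z = rng.normal(size=n)                     # instrument shifting take-up
v = rng.normal(size=n)                     # latent resistance to treatment
# Individual gain Y1 - Y0: lower for units with high resistance (selection on gains).
gain = 1.0 - 0.8 * v + rng.normal(scale=0.5, size=n)
d = (0.5 * z - v >= 0).astype(int)         # treated when the latent index is non-negative

ate = gain.mean()
att = gain[d == 1].mean()                  # treated units have low resistance, hence high gains
atu = gain[d == 0].mean()                  # untreated units have high resistance, hence low gains
```

Flipping the sign of the coefficient on `v` in `gain` reverses the ordering, which corresponds to negative selection on gains.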

The covariate-specific Wald estimator is

Under the standard A1-A4 from AIR96,

These can be aggregated using the ‘saturate and weight’ theorem (Angrist and Imbens)

with weights

For a continuous instrument, for a pair of instrument values , .

MTE is defined as a continuum of treatment effects along the distribution of .

Define two marginal treatment response (MTR) functions

Many useful parameters are identified using the following expression

with weights specified in the figure below.

MTE weights from

Parametric Model: Assuming joint normality for ,

where is the correlation , and .

yields MTE estimator

Let . Write

where is a random effect that captures treatment effect heterogeneity . We can rewrite this and by demeaning .

where captures the ATE at means of , which is the unconditional ATE under the linear specification.

Write the selection equation

with .

Assumptions

: Conventional selection bias.

: unobservable part of treatment effect depends linearly on the unobservables that affect treatment selection.

Including and in the control-function outcome equation yields a consistent estimate of the ATE: .

High Dimensional IV selection

setup:

where

is a vector of exogenous controls, including a constant.

is a vector of instruments

is an endogenous variable

and

Run (post)LASSO of on to obtain

Run (post)LASSO of on to get .

Run (post)LASSO of on to get .

Construct , and .

Estimate by using standard IV regression of on with as instrument. Perform inference using score statistics or conventional heteroskedasticity-robust SEs.

implemented in hdm::rlassoIV(., select.X = T, select.Z = T).

Discussion in .

Principal Stratification

Treatment comparisons often need to be adjusted for post-treatment variables.

Binary treatment $Z$, post-treatment intermediate variable $S$, outcome $Y$. For each individual, the treatment assumes a single value, so only one of the two potential intermediate values $S(0), S(1)$ is observed. Based on the joint potential values of the intermediate variable, $G = S(0)S(1)$, we have 4 strata:

  • $G = 00$: never-takers.

  • $G = 10$: defiers.

  • $G = 01$: compliers.

  • $G = 11$: always-takers.

The basic principal stratification w.r.t. the post-treatment variable $S$ is the partition of units such that, within any set of the partition, all units have the same vector $(S_i(0), S_i(1))$. The principal stratum $G_i$ to which unit $i$ belongs is not affected by treatment assignment for any principal stratification, so it can be considered a pre-treatment variable.

Treatment Ignorability implies (i.e. treatment and control units can be compared conditional on stratum)

Principal Causal Effect (PCE)

A common example is the

Complier Average Causal Effect (CACE) = Causal Effect on Principal Stratum of Compliers (AIR96)

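Recall that $G$ is the concatenation $S(0)S(1)$. So, AIR96 in PS terms:

  • Monotonicity: the stratum $G = 10$ must be empty: no defiers.

  • Exclusion: the PCE is zero in the strata $G = 00$ and $G = 11$, where assignment does not change $S$.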

Estimation under principal ignorability

  • Treatment ignorability

  • Monotonicity: $G = 10$ is not allowed

  • Principal ignorability

|       | S = 0        | S = 1        |
|-------|--------------|--------------|
| Z = 0 | G = 00 or 01 | G = 11       |
| Z = 1 | G = 00       | G = 11 or 01 |

Disentangle mixture distribution within strata by assuming same conditional expectation across mixture components (complier, never taker, always taker).

Define nuisance functions:

Treatment probability: Principal Score: identified by

where . Outcome mean: .

Treatment Probability and Principal Score

Treatment Probability and Outcome Mean

Principal Score and Outcome Mean

Direct and Indirect Effects via Principal Stratification

Direct effect of conditional on exists if there is a causal effect of on for observations for whom the treatment does not affect selection , i.e. principal strata . This is a zero-first-stage sample in IV-terms.

The Indirect Effect is mediated through .

Attrition as Selection Bias

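Let $S$ denote a binary selection indicator for whether $Y$ is observed. Let $S(1), S(0)$ denote potential selection states under treatment and nontreatment.

  • $S(1) = S(0) = 0$: never-selected

  • $S(1) = S(0) = 1$: always-selected

  • $S(1) = 1, S(0) = 0$: selection compliers

  • $S(1) = 0, S(0) = 1$: selection defiers (ruled out by the monotonicity assumption behind Lee bounds)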

Dominance assumption: and . The average potential outcome of the always selected dominates that of compliers under either treatment state.

Then, Zhang and Rubin (2003) bounds are

where is chosen such that the lowest outcomes among those with correspond to the share of compliers among those with are smaller than this value.

Assuming

randomisation: monotonicity:

Lee (2009) focuses on the ATE among the always observed

The second quantity: is point identified. In contrast, the outcome in the treatment group can be either an always-selected’s outcome or a selection complier’s outcome.

Always selected share among the treated is

In the best case, the always-selected comprise the top quantile of the treatment outcomes. Then the largest possible value of is

The smallest possible one is

this can be implemented conditional on covariates by constructing within each stratum.
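The trimming logic behind the Lee bounds can be sketched as below. This is a minimal illustration, assuming randomised treatment and that treatment weakly increases selection; the function name and the simulated data are hypothetical, and no covariate tightening or inference is included.

```python
import numpy as np


def lee_bounds(y, d, s):
    """Trimming bounds on the effect among the always-selected.

    Assumes treatment weakly increases selection, so the treated arm is trimmed.
    """
    y, d, s = np.asarray(y, dtype=float), np.asarray(d), np.asarray(s)
    q = s[d == 0].mean() / s[d == 1].mean()       # always-selected share among selected treated
    if q > 1:
        raise ValueError("selection is higher in the control arm; trim that arm instead")
    y1 = np.sort(y[(d == 1) & (s == 1)])
    y0_mean = y[(d == 0) & (s == 1)].mean()       # point identified
    k = int(np.floor(q * y1.size))                # treated outcomes kept after trimming
    lower = y1[:k].mean() - y0_mean               # always-selected hold the lowest outcomes
    upper = y1[-k:].mean() - y0_mean              # always-selected hold the highest outcomes
    return lower, upper


# Simulated example (illustrative): 60% always-selected, 20% selection compliers.
rng = np.random.default_rng(3)
n = 100_000
d = rng.binomial(1, 0.5, size=n)
u = rng.uniform(size=n)
always = u < 0.6
complier = (u >= 0.6) & (u < 0.8)
s = (always | (complier & (d == 1))).astype(int)
y0 = rng.normal(size=n) - 0.5 * complier          # compliers have lower baseline outcomes
y = np.where(s == 1, y0 + 1.0 * d, np.nan)        # true effect is 1.0; outcome missing if s == 0

lower, upper = lee_bounds(y, d, s)
```

The bounds collapse to a point when selection rates are equal in the two arms, since then nothing is trimmed.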

Regression Discontinuity Design

Setup

Treatment () changes discontinuously at some particular value in [and nothing else does], so

Standard identification assumptions are violated by construction: unconfoundedness holds trivially because treatment is a deterministic function of the running variable, but for the same reason overlap is always violated. We need to invoke continuity to do causal inference.

Identified at , i.e. via

  • Conditional mean function is continuous at

  • Mean Treatment effect function is right continuous at

Estimators

Normalise running variable . Then, the linear regression implementation is the following:

where and are local or global polynomials. Since the design relies on identification at infinity (i.e. at the cutoff), choice of polynomial / functional form matters a lot.

Calonico, Cattaneo, Titiunik (2014) recommend local-linear regressions. Older literature relies on global higher-order polynomials, which often yield strange estimates.

Where is a kernel function. Common choices are the window function or the triangular kernel
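A local linear fit with a triangular kernel can be sketched with weighted least squares. This is a minimal illustration for a sharp design, assuming a user-chosen bandwidth `h`; the function name and simulated data are hypothetical, and the sketch does no bandwidth selection or bias-corrected inference.

```python
import numpy as np


def local_linear_rd(y, x, h, cutoff=0.0):
    """Sharp RD jump at `cutoff` from a triangular-kernel local linear regression."""
    xc = np.asarray(x, dtype=float) - cutoff              # normalised running variable
    y = np.asarray(y, dtype=float)
    w = np.clip(1.0 - np.abs(xc) / h, 0.0, None)          # triangular kernel weights
    keep = w > 0
    xc, y, w = xc[keep], y[keep], w[keep]
    t = (xc >= 0).astype(float)                           # treatment indicator
    X = np.column_stack([np.ones_like(xc), t, xc, t * xc])  # separate slope on each side
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[1]                                        # coefficient on the treatment indicator


# Simulated example (illustrative): smooth CEF plus a jump of 2 at the cutoff.
rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)
y = 0.5 * x + 0.3 * x**2 + 2.0 * (x >= 0) + rng.normal(scale=0.5, size=n)

tau_hat = local_linear_rd(y, x, h=0.3)
```

Replacing the triangular weights with an indicator for `abs(xc) <= h` gives the window-kernel version.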

Assumptions for Local Linear Estimator

Loosely, we need CEFs to be smooth. More precisely, we need to be twice-differentiable with uniformly bounded second derivative.

Taking a Taylor expansion around the cutoff, we can write the CEFs as

with . The local linear regression with a window kernel can be solved in closed form

where denote sample averages over the regression window. Then, the error term can be written as

Curvature bias is bounded by .

This rate is a consequence of working with the 2nd derivative. In general, if we assume the CEFs have a bounded $k$-th derivative, we can achieve an $n^{-k/(2k+1)}$ rate using local polynomial regression of order $k - 1$ with a bandwidth scaling as $h_n \sim n^{-1/(2k+1)}$.

The local linear regression estimator for

which can be written as a local linear estimator where weights only depend on the running variable . show that local linear regression is not the best estimator in this class.

Under an assumption that , the minimax linear estimator is the one that minimises the MSE and is given by

These weights can be solved for using quadratic programming.

Fuzzy RD

Discontinuity doesn’t deterministically change treatment, but affects probability of treatment. Analogue of IV with one-sided non-compliance.

. Assuming , the probability of treatment relates to via:

where := point of discontinuity

Regression Kink Design

First-derivative version of the fuzzy RD. Continuous treatment, where the treatments are a function of the running variable with kink at . This implies that the first derivative of continuous treatment D is discontinuous at the threshold.

The marginal treatment effect at the threshold is defined as

Differences-in-Differences

DiD with 2 periods

Binary treatment , 2 time periods .

Potential outcomes denoted .

ATT in the 2nd period.

The untreated potential outcome of the treated group in the 2nd period is not observed, so it must be imputed.

Naive Estimation Strategies

Before-After Comparison: $\tau = \mathbb{E}[Y_1^1 \mid D = 1] - \mathbb{E}[Y_0^0 \mid D = 1]$

assumes (No trending)

Post Treatment-Control Comparison:

Assumes (Random Assignment in the 2nd period)

Both typically untenable in practice, so we need parallel trends.

Sample analogue of

Impute with

Often justified using a figure [with transformed if necessary], or control for time trends [which relies on a strong functional form assumption], or a clear falsification test [on a placebo group].

If , this collapses to a selection-on-observables assumption in period 2.

For a two-period difference, we can also write the standard OLS exogeneity condition in differences form

Which makes a direct link with the strong exogeneity assumption in panel data models that asserts that .

Regression Estimator

We typically prefer the following regression estimator (for automatic standard errors etc).
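In the two-group, two-period case, the regression version is a saturated model in the group indicator, the period indicator, and their interaction. The sketch below, on simulated data with illustrative names and coefficients, checks that the interaction coefficient coincides with the difference of the two before-after differences in cell means.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4_000
d = rng.binomial(1, 0.5, size=n)           # treatment-group indicator
t = rng.binomial(1, 0.5, size=n)           # post-period indicator
y = 1.0 + 0.5 * d + 0.3 * t + 2.0 * d * t + rng.normal(size=n)   # true DiD effect is 2


def cell_mean(dv, tv):
    """Mean outcome in the (group, period) cell."""
    return y[(d == dv) & (t == tv)].mean()


# Difference of the two before-after differences in cell means.
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Saturated regression y ~ 1 + d + t + d*t; the interaction coefficient is the DiD estimate.
X = np.column_stack([np.ones(n), d, t, d * t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_ols = beta[3]
```

The regression form is convenient mainly because standard software then reports (clustered) standard errors for the interaction term directly.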

Triple Differences (DDD) Estimator

Regular Diff-in-Diff estimate - Diff-in-diff estimate for placebo group.

Nonparametric Identification Assumptions with Covariates

Estimand:

Identification Assumptions:

SUTVA

Covariate exogeneity

No effect before treatment

Common Trend

(parallel trends within strata)

Common support

This allows us to estimate the conditional ATT as the standard DiD within each stratum.

Averaging these over gives us the ATT

where regression functions denote conditional expectations for treatment at time given covariates .

Denote potential outcomes under treatment and control for unit as and . For some observed covariates , we are interested in the CATT

For identification, we need Conditional parallel trends: Overlap: such that and

The Abadie estimand can be defined as

Defining , we then have

This is an IPW Estimator.

Integrating this over gives us the ATT

The full IPW estimator can be written

where is the unconditional probability of being treated in the post-treatment period, and are conditional probabilities of specific treatment-group combinations.

Double-robust version - Zimmert (2020)

Panel Data

Setup: We observe a sample of cross-sectional units for time periods

One-way fixed effects and Random effects both use the form

although they make different assumptions about the error.

Error assumptions for panel regressions

(1) FE: .

(2) RE: (1) plus orthogonality between the unit effect and the regressors [absorb the unobserved unit effect into the error term and impose orthogonality on it]. Equivalent to pooled OLS with FGLS.

Fixed Effects Regression

Identification Assumption

  • Strict Exogeneity - errors are uncorrelated with lags and leads of x

    Equivalent statement for is

    • Rules out feedback loops, i.e. future regressors being correlated with current errors because regressors are set in response to prior errors, e.g. policing and crime.
  • regressors vary over time for at least some .

Setup

Define an individual fixed effect for individual

and define the same for each time period for panel data.

If is as good as randomly assigned conditional on :

Then, assuming enter linearly,

Assuming the causal effect of the treatment is additive and constant,

where is the causal effect of interest.

Then, we can write:

Restrictions

  • Linear

  • Additive functional form

  • Variation in , over time, for , must be as good as random

Estimate the specification

where the individual-demeaned values come from pre-multiplying every component of the general model above by the individual-specific demeaning operator, which removes the fixed effect.

Lagging the general model one period and subtracting gives

where and so on. This naturally eliminates the time-invariant fixed effect . The pooled OLS estimation of in the above regression is called the first differences (FD) estimator .
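The within and first-difference transformations can be compared numerically. This is a minimal sketch on a simulated two-period panel with a unit effect that is correlated with the regressor; all names and parameter values are illustrative. With exactly two periods and no time effects the two estimators coincide.

```python
import numpy as np

rng = np.random.default_rng(6)
n_units, n_periods = 500, 2
alpha = rng.normal(size=(n_units, 1))                         # unit fixed effect
x = 0.8 * alpha + rng.normal(size=(n_units, n_periods))       # regressor correlated with alpha
y = 1.5 * x + alpha + rng.normal(scale=0.5, size=(n_units, n_periods))

# Within (FE) estimator: demean x and y by unit, then pooled OLS without intercept.
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
beta_fe = np.sum(x_dm * y_dm) / np.sum(x_dm**2)

# First-difference (FD) estimator: difference adjacent periods, then pooled OLS.
dx = np.diff(x, axis=1)
dy = np.diff(y, axis=1)
beta_fd = np.sum(dx * dy) / np.sum(dx**2)

# Pooled OLS that ignores alpha, for contrast (biased upward in this design).
xc, yc = x - x.mean(), y - y.mean()
beta_pooled = np.sum(xc * yc) / np.sum(xc**2)
```

With more than two periods the two estimators generally differ, and their relative efficiency depends on the serial correlation of the idiosyncratic errors.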

The FE estimator is more efficient under the assumption that the idiosyncratic errors are serially uncorrelated.

The FD estimator is more efficient when the idiosyncratic error follows a random walk.

For Individual Fixed Effects/Within estimation, using the regression anatomy formula, write:

Since , and

Random Effects

Identification Assumption

Assume - strong assumption

In other words, entire error term is independent of . This assumes OLS is consistent but inefficient, which is why it is of limited use in observational settings.

When there is autocorrelation in time series (i.e. s are correlated over time ), GLS estimates can be obtained by estimating OLS on quasi-differenced data. This allows us to estimate the effects of time-invariant characteristics (assuming the independence condition is met).

where

Idiosyncratic errors have constant finite variance:

Idiosyncratic errors are serially uncorrelated: .

Under these assumptions, the FGLS matrix takes a special form

where is a matrix of s. Estimators for the variance components are in [c 10, pp 260-61]. A robust estimator of is constructed using pooled OLS residuals

With this, we can apply the FGLS estimator

Hausman Test: Choosing between FE and RE

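The FE estimator is assumed to be consistent. Oft-abused test as a result.

  • H0: the unit effect is uncorrelated with the regressors (RE and FE are both consistent, and RE is efficient)

  • H1: the unit effect is correlated with the regressors (only FE is consistent)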

If the error component is correlated with , RE estimates are not consistent. Perform Hausman test for random vs fixed effects (where under the null, )

  • When the idiosyncratic error variance is large relative to , and . In words, the individual effect is relatively small, so Pooled OLS is suitable.

  • When the idiosyncratic error variance is small relative to , and . Individual effects are relatively large, so FE is suitable.

Linear Time Trend

Time Fixed Effects (a.k.a. Two-way Fixed Effects)

Unit Specific Time Trends

Distributed Lag

Define switching indicator as 1 if switched from control to treatment between and .

where the sums on the RHS allow for m lags / post-treatment effects, and q leads / pre-treatment effects. Leads should be close to 0.

Staggered Adoption

Let denote multiple time periods such that , with nobody treated at and staggered adoption. Let be a dummy that is equal to one if a subject experiences treatment introduction in period (e.g. implies the treatment is introduced in period in said group).

Under parallel trends for the untreated potential outcomes, , the treatment effect in the vanilla two-way fixed effects regression

can be decomposed as

where .

The weights sum to one and are proportional to and the same sign as

where is the average treatment of group across periods (share of periods treated), is the average treatment at period across groups, and is the grand mean of the treatment indicator. These weights can be negative.

This means that the TWFE coefficient is biased for the ATT because the weight on each treated group-period cell is not (only) proportional to the cell size. It is only unbiased when

  • the treatment is binary, AND

  • the treatment is staggered and absorbing (i.e. groups get treated once and stay treated), AND

  • there is no variation in treatment timing

Under these conditions, the pesky weight is constant across treated units, so the weights are proportional to .

OR, is also unbiased if is uncorrelated with the treatment effects . This is only plausible when treatment has been randomly staggered, otherwise, it is entirely plausible that groups with larger treatment effects selected into treatment early, and so on.

Consider a dataset comprising timing groups ordered by the time at which they first receive treatment and a maximum of one never-treated group . The OLS estimate from a two-way fixed effects regression is

where weights depend on sample size and variance of treatment within each DD. This maximises the weights of groups treated in the middle of the panel. The Late vs Early comparison is particularly problematic (and is typically incorrect when treatment effects are heterogeneous in time).

Visually, this involves decomposing the setup in the first figure below into its constituent two-way parts in the second figure.

Some Staggered Difference in Differences data

Constituent 2-way Differences in Differences Comparisons

Estimand: Group-time average treatment effect

where is the potential outcome for group treated at .

Separate (1) identification, (2) estimation and inference, and (3) aggregation.

A1: No anticipation ,

A2: Parallel trends based on ‘never treated’ group: , s.t. ,

Estimators for Group-time ATEs

Aggregation: event-study type estimand.

Implemented in did and DRDID.

The negative weighting problem with 2WFE under staggered adoption can be remedied easily by using the following procedure, which is termed Imputation by . This nests the procedures in etc.

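(1) Fit a model for $Y_{it}(0)$ using only untreated observations (never-treated units, plus the untreated periods of units that eventually get treated).

(2) Impute $\hat{Y}_{it}(0)$ for treated units in treated time periods.

(3) Compute $\hat{\tau}_{it} = Y_{it} - \hat{Y}_{it}(0)$.

(4) Average $\hat{\tau}_{it}$ over treated cells (equal weighting) for the ATT, or average by time since treatment for an event study.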

This works well when the outcome model for is good, i.e. when the fixed effects or latent factors are well estimated. This will not work well for short panels.
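The imputation procedure can be sketched with a two-way fixed effects model for the untreated outcome. This is a minimal illustration on a simulated balanced panel, assuming additive unit and period effects; the adoption dates, effect sizes, and variable names are hypothetical, and no standard errors are computed.

```python
import numpy as np

rng = np.random.default_rng(7)
n_units, n_periods = 60, 10
periods = np.arange(n_periods)
unit_fe = rng.normal(size=n_units)
time_fe = np.linspace(0.0, 2.0, n_periods)

# Hypothetical adoption dates: period 4, period 7, or never (np.inf).
adopt = rng.choice([4, 7, np.inf], size=n_units)
treated = periods[None, :] >= adopt[:, None]

# Effects grow with time since adoption, so they differ across treated cells.
adopt_finite = np.where(np.isinf(adopt), 0, adopt)
tau = np.where(treated, 1.0 + 0.5 * (periods[None, :] - adopt_finite[:, None]), 0.0)
y = unit_fe[:, None] + time_fe[None, :] + tau + rng.normal(scale=0.1, size=(n_units, n_periods))

# Step 1: fit unit + period effects on untreated cells only (dummy-variable least squares).
ui, ti = np.nonzero(~treated)
U = np.zeros((ui.size, n_units))
U[np.arange(ui.size), ui] = 1.0
T = np.zeros((ti.size, n_periods))
T[np.arange(ti.size), ti] = 1.0
coef, *_ = np.linalg.lstsq(np.hstack([U, T]), y[~treated], rcond=None)
a_hat, b_hat = coef[:n_units], coef[n_units:]

# Steps 2-4: impute Y(0), take differences on treated cells, and average.
y0_hat = a_hat[:, None] + b_hat[None, :]
att_hat = (y - y0_hat)[treated].mean()
att_true = tau[treated].mean()               # known only because the data are simulated
```

The dummy design is rank deficient by one, but the fitted sums `a_hat[i] + b_hat[t]` are still uniquely determined because every unit has untreated periods and never-treated units cover every period.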

Changes-in-Changes

Given a continuous outcome and a monotonicity in unobserved heterogeneity, CiC allows us to identify both the ATT and Quantile effect on the treated (QTT).

Assume the following about untreated potential outcomes

$$Y_i(0) = h(U_i, T_i)$$

where $U$ is a scalar unobservable or an index of unobservables. $h$ is a general function assumed to be strictly monotonically increasing in values of $U$ for periods $T \in \{0, 1\}$. The conditional independence assumption $U \perp T \mid G$ requires that the distribution of the unobserved heterogeneity is constant over time within treatment groups $G$.

Denote the conditional CDF of potential outcome , and corresponding CDF for observed outcome. Conditional outcome distributions are observed. The inverse of the latter is , the conditional quantile function. The unobserved CDF is identified as

The QTT at quantile is then identified as

and the ATT is identified as

Implemented in qte::CiC.
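The counterfactual mapping can be sketched with empirical CDFs and quantiles. This is a minimal illustration for repeated cross-sections, assuming a continuous outcome; the function name, the argument naming (`y_gt` for group `g` in period `t`), and the simulated outcome function are hypothetical, and the sketch is not the `qte::CiC` implementation.

```python
import numpy as np


def cic_att(y00, y01, y10, y11):
    """Changes-in-changes ATT.

    y_gt holds outcomes for group g (0 = control, 1 = treated) in period t (0 = pre, 1 = post).
    Treated pre-period outcomes are mapped through the control group's pre-to-post
    quantile-quantile transformation to impute the untreated post-period distribution.
    """
    y00 = np.sort(np.asarray(y00, dtype=float))
    ranks = np.searchsorted(y00, y10, side="right") / y00.size   # empirical CDF of control-pre
    y_cf = np.quantile(y01, np.clip(ranks, 0.0, 1.0))            # control-post quantile function
    return np.mean(y11) - np.mean(y_cf)


def h(u, t):
    """Illustrative untreated outcome, strictly increasing in u in both periods."""
    return u + t * (1.0 + 0.5 * u)


rng = np.random.default_rng(8)
n = 20_000
y00 = h(rng.normal(0.0, 1.0, n), 0)
y01 = h(rng.normal(0.0, 1.0, n), 1)
y10 = h(rng.normal(0.5, 1.0, n), 0)
y11 = h(rng.normal(0.5, 1.0, n), 1) + 2.0        # true effect on the treated is 2

att_cic = cic_att(y00, y01, y10, y11)
att_did = (y11.mean() - y10.mean()) - (y01.mean() - y00.mean())   # biased in this design
```

In this design the time effect depends on the unobservable, so parallel trends in means fails (the DiD contrast converges to 2.25) while the CiC mapping still recovers the effect.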

Synthetic Control

Original setup.

Observe $J + 1$ units in periods $t = 1, \dots, T$. Unit 1 is treated starting from period $T_0 + 1$, while units $2, \dots, J + 1$ are never treated, and are therefore called the donor pool.

Since there is only 1 treated unit, the effect of interest

Observed data matrix ()

FPCI applies; potential outcome matrices are:

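Let $X_1$ be a $k \times 1$ vector of pre-intervention characteristics of the treated unit, and $X_0$ a $k \times J$ matrix containing the same variables for the $J$ control units. This typically includes pre-treatment outcomes, but other predictors (even time-invariant ones) are usually available.

For some PSD matrix $V$, define $\|x\|_V = \sqrt{x' V x}$, where $V$ is typically diagonal. Consider weights $W = (w_2, \dots, w_{J+1})'$ on the control units satisfying

$$w_j \geq 0 \;\; \forall j, \qquad \sum_{j=2}^{J+1} w_j = 1$$

This forces interpolation, i.e. the counterfactual cannot take a value greater than the maximal value or smaller than the minimal value of $X$ for a control unit. The synthetic control solution solves

$$W^*(V) = \arg\min_W \|X_1 - X_0 W\|_V$$

The Synthetic Control Estimator is then

$$\hat{\tau}_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w_j^* Y_{jt} \qquad \text{for post-treatment periods } t$$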

In contrast, a simple difference-in-differences estimator gives

choose using a nested-minimisation of the Mean Square Prediction Error (MSPE) over the pre-treatment period
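The inner weight problem (for a fixed $V$) is a least-squares problem over the simplex. The sketch below solves it by projected gradient descent with $V$ set to the identity. This is one simple solver chosen for illustration, not the nested procedure described above, and the function names and simulated donor data are hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def synth_weights(x1, X0, n_iter=5_000):
    """Minimise ||x1 - X0 w||^2 over the simplex by projected gradient descent (V = identity)."""
    n_donors = X0.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)
    step = 1.0 / (2.0 * np.linalg.norm(X0, 2) ** 2)    # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X0.T @ (X0 @ w - x1)
        w = project_simplex(w - step * grad)
    return w


# Simulated example (illustrative): the treated unit is an exact convex combination of donors.
rng = np.random.default_rng(10)
X0 = rng.normal(size=(30, 5))                  # 30 pre-period predictors for 5 donors
w_true = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
x1 = X0 @ w_true
w_hat = synth_weights(x1, X0)

# Post-period gaps between the treated unit and its synthetic control.
Y0_post = rng.normal(size=(4, 5))
y1_post = Y0_post @ w_true + 3.0               # true effect is 3 in every post period
effect = y1_post - Y0_post @ w_hat
```

A general diagonal $V$ can be handled by rescaling the rows of `X0` and `x1` by the square roots of its diagonal entries before calling `synth_weights`.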

Setup:

relative magnitudes of and might dictate whether we impute the missing potential outcome using or comparison

Many Units and Multiple Periods: , is ‘fat’, and comparison becomes challenging relative to . So matching methods are attractive.

, is ‘tall’, and matching becomes infeasible. So it might be easier to estimate dependence structure.

Finally, if , regularization strategy for limiting the number of control units that enter into the estimation of may be important

Focus on last period for now:

Many estimators impute with the linear structure

Methods differ in how and are chosen as a function of

Impose four constraints

No Intercept: . Stronger than Parallel trends in DiD.

Adding up : . Common to DiD, SC.

Non-negativity: . Ensures uniqueness via ‘coarse’ regularisation + precision control. Negative weights may improve out-of-sample prediction.

Constant Weights:

DiD imposes 2-4.

ADH(2010, 2014) impose 1-3

1 + 2 imply ‘No Extrapolation’.

Relaxing these assumptions:

Negative weights

If treated units are outliers on important covariates, negative weights might improve fit

Bias reduction - negative weights increase bias-reduction rate

When , (1-3) alone might not result in a unique solution. Choose by

Matching on pre-treatment outcomes : one good control unit is better than synthetic one comprised of disparate units

Constant weights - implicit in DiD

Given many pairs of

prefer values s.t. synthetic control unit is similar to treated units in terms of lagged outcomes

low dispersion of weights

few control units with non-zero weights

Optimisation Problem

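Ingredients of the objective function:

  • Balance: difference between pre-treatment outcomes for the treated unit and a linear combination of pre-treatment outcomes for the controls.

  • Sparse and small weights: sparsity via an $\ell_1$ penalty on the weights; magnitude via an $\ell_2$ penalty on the weights.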

Tailored Regularisation

don’t want to scale covariates to preserve interpretability of weights. Instead, treat each control unit as a ‘pseudo-treated’ unit and compute

where

pick the value of the tuning parameters that minimises

Difference in Differences

assume (2-4) No unique solution for , so fix

Best Subset; One-to-one Matching

with (=1 for OtO)

Synthetic Control

assume (1-3) (i.e. ) For PSD diagonal matrix

Constrained regression: When (Lagged Outcomes only) and

Consider a balanced panel with units and time periods, where the first units are never treated, while treated units are exposed after time . We seek to solve for sdid weights that align pre-exposure trends in outcomes of unexposed units with those for exposed units

we also look for time weights that balance pre-exposure time periods with post-exposure time periods for unexposed units.

Weights are solved using the following optimisation problems

where denotes the positive real line. We set the regularization parameter as

We implement this for the time weights by solving

(1) Compute the regularisation parameter.

(2) Compute the unit weights.

(3) Compute the time weights.

(4) Compute the SDID estimator using the following weighted DID regression.

implemented in synthdid::synthdid_estimate

Where is the treatment, is the heterogeneous treatment effect for unit at time , is a vector of time-varying controls. is a vector of unknown common factors, is a vector of unknown factor loadings. This factor component nests standard functional forms

unit FEs

time FEs

two-way FEs.

Unit-specific linear time trends

Lagged dependent variable

Steps

Get initial value of using within estimator

Estimate using

Re-estimate using

Iterate

Drawback - constant effect

With $N_{co}$ control units and $N_{tr}$ treated units, write the DGP for an individual unit as

Where , , is , is .

Stack controls together gives

GSC for treatment effects is an out-of-sample prediction method: the treatment effect for unit at time is the difference between the actual outcome and its estimated counterfactual , where is imputed in three steps.

Estimate an IFE model using only the control group data and estimate

Estimate Factor loadings for each treated unit by minimising mean-squared error of the predicted treated outcome in pretreatment periods

where superscripts denote the pretreatment periods.

Calculate Treated Counterfactuals based on

Choose the number of factors by cross-validation. Implemented in gsynth.

Dynamic Treatment Effects

We may want to estimate the effects of treatment sequences (‘time-varying exposures’), as in medical settings (Robins 1986, ).

2 period example

Consider a setting with and corresponding outcomes and treatments , where the treatment takes on values , and baseline covariates and covariates at the end of the first period .

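Let $\underline{d}_2 = (d_1, d_2)$. Accordingly, $Y_2(\underline{d}_2)$ is the potential outcome realised when treatment is set to sequence $\underline{d}_2$. The ATE contrasting two distinct treatment sequences $\underline{d}_2$ vs $\underline{d}_2^*$ is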

Estimating this quantity requires a sequential selection on observables assumption

Under these assumptions, dynamic treatment effects can be estimated based on nested conditional means regressions

where and denote distinct treatment sequences.

or an IPW estimator

where and are propensity scores in the two periods.

Finally, a double robust estimator is

where

are (nested) conditional mean outcomes.

Suppose that the second-period treatment is conditionally independent of potential outcomes given the pre-treatment covariates and the first-period treatment (implying that post-treatment covariates aren’t required to control for confounders jointly affecting the second treatment and the outcome). In this case, the second part of the first SOO assumption can be strengthened so that it conditions only on these pre-treatment quantities. This simplifies

implemented in causalweight::dyntreatDML.

Generalisation to arbitrary panels

Let denote treatment status at time , and collect them into a vector for each unit to form a Treatment History . A partial treatment history up to time is denoted . Time varying covariates are arranged analogously .

Potential outcomes are defined on treatment histories and rely on the standard consistency assumption / SUTVA, which assumes that the observed outcome equals the potential outcome for the observed history, i.e. $Y_{it} = Y_{it}(d_{1:t})$ when $D_{i,1:t} = d_{1:t}$. With a binary treatment, this generates $2^t$ potential outcomes for the outcome in period $t$, which permits many hypothetical comparisons.

The estimand typically of interest is the average causal effect of a treatment history

Define potential outcomes just intervening on the last periods as , which is the ‘marginal’ potential outcome if the treatment history runs its natural course up to and sets the last lags to .

This allows us to define a contemporaneous treatment effect (CET)

The step lagged effect is defined analogously

and the step response function (SRF) describes how this effect varies by time period and distance between the shift and the outcome

These effects are (clunkily) parametrised in autoregressive distributed-lag (ADL) models of the form

with assumption . This implies the following form for potential outcomes

hence, changes in can have both a direct and indirect effect on .

This relates to linear panel models of the form

where strict exogeneity is assumed.

For every treatment history and period ,

where is a set of covariates such as .

This relates to sequential exogeneity in panel models

Under sequential ignorability, an ADL approach would be to write the outcome regression with time-varying covariates

This generates post-treatment bias because may be affected by .

Define the impulse response functions (‘blip-down’ functions) as

which is the effect of a change from to in terms of the treatment on the outcome at time , conditional on treatment history up to time .

These functions are parametrised as a function of lag length

This then allows us to construct blipped-down / demediated outcomes

Intuitively, this transformation subtracts off the effects of lags of treatment, creating an estimate of the counterfactual level of the outcome at time if the treatment had been set to for periods before . Under sequential ignorability, the transformed outcome has the same expectation as the counterfactual , and can be used to construct by modelling the relationship between and to estimate the lagged effect for . This is recursive, hence the ‘nested’.

Sequential g-estimation can be used to estimate effects. Suppose we’re interested in the contemporaneous effect and the first-lagged effect and we adopt an impulse response function for both these effects. We assume sequential ignorability conditional on . Sequential g-estimation proceeds as follows

(1) For the contemporaneous effect, regress the un-transformed outcome on the current treatment and the conditioning covariates, as in an ADL model. If this is correctly specified, we estimate the blip-down parameter (contemporaneous effect) correctly.

(2) Use the estimated contemporaneous effect to construct the one-lag blipped-down outcome.

(3) Regress this blipped-down outcome on the lagged treatment and its conditioning covariates to estimate the next blip-down parameter (the first lagged effect).

(4) Repeat for further lags. Standard errors are estimated via block bootstrap.

To specify a marginal structural model, we choose a potential outcome lag length and write a model for the marginal model of those potential outcomes in terms of treatment history

for example, for a contemporaneous and two lagged effects, we write , marginalising over further lags and covariates.

The average causal effect is then

This motivates an IPW approach where weights are constructed as

where the denominator of each term is the predicted probability of observing unit $i$’s observed treatment status in that period, conditional on covariates that satisfy conditional ignorability. Multiplying these terms over time produces the probability of seeing this unit’s treatment history conditional on the past.

These weights can be used in a regression of the form

Decomposition Methods

Basic idea of decomposition

Oaxaca-Blinder Decomposition

We consider two groups, $A$ and $B$, an outcome $Y$, and a vector of predictors $X$. The main question for the decomposition is how much of the mean outcome difference [or the difference in another summary statistic / quantile of the CDF] is accounted for by group differences in the predictors $X$. The Oaxaca-Blinder decomposition refers to the following decompositions:

Oaxaca decomposition where D_1 is the 'discrimination' piece . D_1 \neq D_2 generically unless two groups have the same slope (which is practically never the case)
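For concreteness, one common way to write the two twofold decompositions for groups labelled $A$ and $B$ is given below. The labels $E_1, D_1, E_2, D_2$, and the choice of which group's coefficients serve as the reference in each version, are assumptions made for this illustration and may not match the chapter's own labelling:

$$
\bar{Y}_A - \bar{Y}_B
= \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_B}_{E_1} + \underbrace{\bar{X}_A'(\hat{\beta}_A - \hat{\beta}_B)}_{D_1}
= \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_A}_{E_2} + \underbrace{\bar{X}_B'(\hat{\beta}_A - \hat{\beta}_B)}_{D_2}
$$

A small simulated check (all data and names hypothetical) shows that both versions add up to the raw gap, while the unexplained pieces differ when slopes differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def simulate(mean_x, beta):
    """Design matrix with an intercept column, and a linear outcome."""
    X = np.column_stack([np.ones(n), rng.normal(mean_x, 1.0, size=n)])
    y = X @ beta + rng.normal(size=n)
    return X, y

X_a, y_a = simulate(1.0, np.array([1.0, 2.0]))  # group A: higher mean X, steeper slope
X_b, y_b = simulate(0.0, np.array([0.5, 1.5]))  # group B

# Group-specific OLS fits and covariate means (including the intercept).
beta_a = np.linalg.lstsq(X_a, y_a, rcond=None)[0]
beta_b = np.linalg.lstsq(X_b, y_b, rcond=None)[0]
xbar_a, xbar_b = X_a.mean(axis=0), X_b.mean(axis=0)

gap = y_a.mean() - y_b.mean()

# Version 1: group B coefficients price the covariate gap.
E1 = (xbar_a - xbar_b) @ beta_b
D1 = xbar_a @ (beta_a - beta_b)

# Version 2: group A coefficients price the covariate gap.
E2 = (xbar_a - xbar_b) @ beta_a
D2 = xbar_b @ (beta_a - beta_b)

print(gap, E1 + D1, E2 + D2, D1 - D2)
```

Because each group regression includes an intercept, the fitted mean equals the observed mean in each group, so both sums reproduce the gap up to floating-point error. The difference $D_1 - D_2$ equals $(\bar{X}_A - \bar{X}_B)'(\hat{\beta}_A - \hat{\beta}_B)$, which vanishes only if the covariate means or the slopes coincide.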

Detailed Decomposition

To examine the ‘contribution’ of each variable to the observed gap, estimate

so, is the coefficient for group , and is the coefficient for group . A t-test for is used to establish whether a variable is a source of the observed gap. The contribution of each variable to the explained part is

Let outcome models be linear and where .

The difference in means decomposition is

Sloczynski: SATT can be estimated by running the following regression:

Kline (2011) shows that this is ‘doubly robust’ and equivalent to a reweighting estimator based on the weights

where is the treated share.
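The link between the unexplained component and the SATT can be illustrated with a regression-imputation sketch: fit the outcome model on controls only, predict the untreated outcome for treated units, and subtract the average prediction from the treated mean. This is an assumed, simplified implementation on simulated data with hypothetical names; it is not the regression or the weights referred to above, only the imputation form of the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))  # selection on the observable x
y0 = 1.0 + 2.0 * x + rng.normal(size=n)        # untreated potential outcome (linear in x)
y1 = y0 + 1.0 + 0.5 * x                        # treated potential outcome
y = np.where(d == 1, y1, y0)                   # observed outcome

X = np.column_stack([np.ones(n), x])

# Outcome model fitted on controls only.
beta0 = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)[0]

# Treated mean minus the imputed untreated mean for the treated.
att_hat = y[d == 1].mean() - (X[d == 1] @ beta0).mean()

satt = (y1 - y0)[d == 1].mean()                 # in-sample truth, known only in simulation
naive = y[d == 1].mean() - y[d == 0].mean()     # raw difference in means
print(att_hat, satt, naive)
```

The raw difference in means mixes the treatment effect with the difference in covariate distributions; the imputation estimate removes the latter as long as the control outcome model is correctly specified.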

Distributional Regression

Section based on counterfactual distribution decomposition methods.

Let $F_{X_k}$ denote the distribution of job-relevant characteristics (education, experience, etc.) for men when $k = m$ and for women when $k = w$. Let $F_{Y_j \mid X_j}$ denote the conditional distribution of wages given job-relevant characteristics for group $j \in \{m, w\}$, which describes the stochastic wage schedule that a given group faces. Using these distributions, we can construct $F_{Y\langle j \mid k \rangle}$, the distribution of wages for group $k$ facing group $j$'s wage schedule, as

$$
F_{Y\langle j \mid k \rangle}(y) := \int_{\mathcal{X}_k} F_{Y_j \mid X_j}(y \mid x) \, dF_{X_k}(x),
$$

where $\mathcal{X}_k$ is the support of the characteristics in group $k$.

For example, $F_{Y\langle m \mid m \rangle}$ is the distribution of wages for men who face men's wage schedule, and $F_{Y\langle w \mid w \rangle}$ is the distribution of wages for women who face women's wage schedule, which are both observed distributions. We can also study $F_{Y\langle m \mid w \rangle}$, the counterfactual distribution of wages for women if they faced the men's wage schedule $F_{Y_m \mid X_m}$. Here

$$
F_{Y\langle m \mid w \rangle}(y) = \int_{\mathcal{X}_w} F_{Y_m \mid X_m}(y \mid x) \, dF_{X_w}(x)
$$

is the counterfactual distribution constructed by integrating the conditional distribution of wages for men with respect to the distribution of characteristics for women.

We can interpret $F_{Y\langle m \mid w \rangle}$ as the distribution of wages for women in the absence of gender discrimination, although it is predictive and cannot be interpreted as causal without further (strong) assumptions.

Assumptions for Causal Interpretation

Under conditional exogeneity (selection on observables), counterfactual effects (CE) can be interpreted as causal effects. Sec. 2.3 of the ECTA 2013 paper spells this out in detail. Let $(Y_j^* : j \in \mathcal{J})$ be the vector of potential outcomes for various values of a policy $j \in \mathcal{J}$, and let $X$ be a vector of covariates. Let $J$ denote the random variable that describes the realised policy and let $Y := Y_J^*$ denote the realised outcome variable. When $J$ is not randomly assigned, the distribution of the observed outcome $Y$ conditional on $J = j$ may differ from the distribution of $Y_j^*$. However, under conditional exogeneity, the distributions of $Y_j^* \mid X$ and $Y \mid X, J = j$ agree, so the observed conditional distributions have a causal interpretation, and so do counterfactual distributions generated from these conditionals by integrating out $X$.

Let $F_{Y_j^* \mid J}(y \mid k)$ denote the distribution of the potential outcome $Y_j^*$ in the population with $J = k$. The causal effect of exogenously changing the policy from $l$ to $j$ on the distribution of the potential outcome in the population with the realised policy $J = k$ is $F_{Y_j^* \mid J}(y \mid k) - F_{Y_l^* \mid J}(y \mid k)$. Under conditional exogeneity, for any $j, k \in \mathcal{J}$, the counterfactual distribution $F_{Y\langle j \mid k \rangle}$ exactly corresponds to $F_{Y_j^* \mid J}(\cdot \mid k)$, and hence the causal effect of exogenously changing the policy from $l$ to $j$ in the population with $J = k$ corresponds to the CE of changing the conditional distribution from $l$ to $j$, that is

$$
F_{Y_j^* \mid J}(y \mid k) - F_{Y_l^* \mid J}(y \mid k) = F_{Y\langle j \mid k \rangle}(y) - F_{Y\langle l \mid k \rangle}(y).
$$

Conditional exogeneity assumption for this section:

$$
(Y_j^* : j \in \mathcal{J}) \perp\!\!\!\perp J \mid X.
$$

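Consider groups (populations), indexed by $k \in \mathcal{K}$, that partition the sample. For each population $k$ there is a covariate vector $X_k$ and an outcome $Y_k$. The covariate vector is observable in all populations, but the outcome is only observable in populations $j \in \mathcal{J} \subseteq \mathcal{K}$. Let $F_{X_k}$ denote the covariate distribution in population $k$, and let $F_{Y_j \mid X_j}$ and $Q_{Y_j \mid X_j}$ denote the conditional distribution and quantile functions in population $j$. We denote the support of $X_k$ by $\mathcal{X}_k$ and the region of interest for $Y_j$ by $\mathcal{Y}_j$. We refer to $j$ as the reference population and $k$ as the counterfactual population.

The reference and counterfactual populations in the wage example correspond to different groups. We can also generate counterfactual populations by artificially transforming a reference population. We can think of $X_k$ as being created through a known transformation of $X_j$:

$$
X_k = g_k(X_j), \quad g_k : \mathcal{X}_j \to \mathcal{X}_k.
$$

Counterfactual distribution and quantile functions are formed by combining the conditional distribution in population $j$ with the covariate distribution in population $k$, namely:

$$
F_{Y\langle j \mid k \rangle}(y) := \int_{\mathcal{X}_k} F_{Y_j \mid X_j}(y \mid x) \, dF_{X_k}(x), \quad y \in \mathcal{Y}_j,
$$

$$
Q_{Y\langle j \mid k \rangle}(\tau) := F^{\leftarrow}_{Y\langle j \mid k \rangle}(\tau), \quad \tau \in (0, 1),
$$

where $(j, k) \in \mathcal{J} \times \mathcal{K}$ and $F^{\leftarrow}_{Y\langle j \mid k \rangle}(\tau) = \inf\{y : F_{Y\langle j \mid k \rangle}(y) \geq \tau\}$ is the left-inverse function of $F_{Y\langle j \mid k \rangle}$.

The main interest lies in the quantile effect (QE) function, defined as the difference of the two counterfactual quantile functions over a set of quantile indexes $\mathcal{T} \subset (0, 1)$:

$$
\Delta(\tau) = Q_{Y\langle j \mid k \rangle}(\tau) - Q_{Y\langle j \mid j \rangle}(\tau), \quad \tau \in \mathcal{T}.
$$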
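The construction can be sketched with a discrete covariate, where the integral over the covariate distribution becomes a weighted sum over cells. Everything below is an assumption made for illustration: the simulated wage data, the group labels `m` and `w`, the use of within-cell empirical CDFs as the conditional distribution estimator, and the particular quantile effect computed (counterfactual minus reference quantiles under the men's schedule).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, probs, intercept):
    """Covariate with three levels and a wage that is linear in it."""
    x = rng.choice(3, size=n, p=probs)
    y = intercept + 2.0 * x + rng.normal(size=n)
    return x, y

x_m, y_m = simulate(4000, [0.2, 0.3, 0.5], 10.0)  # men: more mass on high x
x_w, y_w = simulate(4000, [0.5, 0.3, 0.2], 9.0)   # women: more mass on low x

grid = np.linspace(5, 18, 400)  # wage values at which CDFs are evaluated

def cond_cdf(x, y, grid):
    """Rows index covariate cells, columns index grid points: F(y | x)."""
    return np.array([(y[x == c][:, None] <= grid).mean(axis=0) for c in range(3)])

def shares(x):
    """Empirical covariate distribution over the three cells."""
    return np.bincount(x, minlength=3) / len(x)

def left_inverse(F, grid, tau):
    """Smallest grid point y with F(y) >= tau (F must be nondecreasing)."""
    idx = np.searchsorted(F, tau, side="left")
    return grid[np.minimum(idx, len(grid) - 1)]

F_m = cond_cdf(x_m, y_m, grid)   # men's wage schedule
F_mm = shares(x_m) @ F_m         # men's schedule, men's characteristics
F_mw = shares(x_w) @ F_m         # men's schedule, women's characteristics

taus = np.array([0.25, 0.5, 0.75])
qe = left_inverse(F_mw, grid, taus) - left_inverse(F_mm, grid, taus)
print(qe)
```

With empirical cell CDFs and empirical cell shares, `F_mm` coincides with the men's unconditional empirical CDF, which is a useful sanity check. The quantile effects are negative here because the simulated women's characteristics put more mass on low-wage cells.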

Estimation of Conditional distribution

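  • method = "qr" (the default) implements

$$
\hat{F}_{Y_j \mid X_j}(y \mid x) = \varepsilon + \int_{\varepsilon}^{1 - \varepsilon} 1\{x'\hat{\beta}(u) \leq y\} \, du,
$$

where $\varepsilon$ is a small constant that avoids estimation of tail quantiles, and $\hat{\beta}(u)$ is the quantile regression estimator

$$
\hat{\beta}(u) = \arg\min_{b} \sum_{i=1}^{n_j} \left[u - 1\{Y_i \leq X_i'b\}\right]\left[Y_i - X_i'b\right].
$$

  • method = "logit" implements the distribution regression estimator of the conditional distribution with the logistic link function

$$
\hat{F}_{Y_j \mid X_j}(y \mid x) = \Lambda(x'\hat{\beta}(y)),
$$

where $\Lambda$ is the standard logistic CDF and $\hat{\beta}(y)$ is the distribution regression estimator

$$
\hat{\beta}(y) = \arg\max_{b} \sum_{i=1}^{n_j} \left[1\{Y_i \leq y\} \log \Lambda(X_i'b) + 1\{Y_i > y\} \log\left(1 - \Lambda(X_i'b)\right)\right].
$$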
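A minimal sketch of distribution regression with a logistic link follows: for each threshold, fit a logit of the indicator that the outcome is at or below the threshold on the covariates, then evaluate the fitted probability at the covariate value of interest. The Newton-Raphson fitting routine, the simulated data, and all names are assumptions for this example; it does not reproduce the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
# Logistic errors make the logit link correctly specified in this simulation:
# P(Y <= y0 | x) = Lambda(y0 - 1 - 2x).
y = 1.0 + 2.0 * x + rng.logistic(size=n)
X = np.column_stack([np.ones(n), x])

def logistic_cdf(v):
    return 1.0 / (1.0 + np.exp(-v))

def fit_logit(X, t, iters=25):
    """Logit maximum likelihood via Newton-Raphson for a 0/1 target t."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = logistic_cdf(X @ b)
        grad = X.T @ (t - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        b = b + np.linalg.solve(hess, grad)
    return b

def dr_cdf(X, y, thresholds, x_new):
    """Estimated F(y0 | x_new) = Lambda(x_new' beta_hat(y0)) for each threshold y0."""
    return np.array([
        logistic_cdf(x_new @ fit_logit(X, (y <= y0).astype(float)))
        for y0 in thresholds
    ])

thresholds = np.array([0.0, 1.0, 2.0])
x_new = np.array([1.0, 0.5])                # intercept and x = 0.5
F_hat = dr_cdf(X, y, thresholds, x_new)
F_true = logistic_cdf(thresholds - 1.0 - 2.0 * 0.5)
b_at_2 = fit_logit(X, (y <= 2.0).astype(float))  # should be near (1, -2)
print(F_hat, F_true, b_at_2)
```

Each threshold gets its own coefficient vector, which is what lets the covariates shift different parts of the conditional distribution differently when the model is applied to real data.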

Causal Directed Acyclic Graphs

Based on Pearl (2009), Morgan and Winship (2014), and Cunningham (2020).

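For an undirected graph $X - Y - Z$ between three variables, there are four possible directed graphs:

  • $X \rightarrow Y \rightarrow Z$ (a chain)
  • $X \leftarrow Y \leftarrow Z$ (another chain)
  • $X \leftarrow Y \rightarrow Z$ (a fork on $Y$)
  • $X \rightarrow Y \leftarrow Z$ (collision on $Y$)

With the fork or either chain, we have $X \perp\!\!\!\perp Z \mid Y$. However, with a collider, $X \not\perp\!\!\!\perp Z \mid Y$.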
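The contrast between a chain and a collider can be checked numerically. The sketch below uses linear Gaussian simulations (an assumption for simplicity, with hypothetical variable names) and measures conditional dependence by the partial correlation after linearly residualising on the middle variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

def partial_corr(a, b, c):
    """Correlation of a and b after linearly residualising both on c."""
    C = np.column_stack([np.ones(len(c)), c])
    ra = a - C @ np.linalg.lstsq(C, a, rcond=None)[0]
    rb = b - C @ np.linalg.lstsq(C, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

# Chain: X -> Y -> Z
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
chain = (np.corrcoef(x, z)[0, 1], partial_corr(x, z, y))

# Collider: X -> Y <- Z
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + z + rng.normal(size=n)
collider = (np.corrcoef(x, z)[0, 1], partial_corr(x, z, y))

print("chain    (marginal, given Y):", chain)
print("collider (marginal, given Y):", collider)
```

In the chain, the endpoints are correlated marginally and the correlation vanishes once the middle variable is held fixed. In the collider, the pattern reverses: the endpoints are uncorrelated marginally and become negatively correlated after conditioning.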

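The causal effect of $A$ on $Y$ is written $P(Y \mid do(A = a))$. The basic idea is to condition on adequate controls (i.e. not every observed control). In the example DAG here, one of the observed controls is unnecessary, and controlling for it would bias the estimate of the effect.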

Basics / Terminology

A backdoor path is a non-causal path from $A$ to $Y$. Such paths are 'backdoor' because they flow backwards out of $A$: all of them point into $A$.

Here, $A \leftarrow X \rightarrow Y$, where $X$ is a common cause of the treatment and the outcome. So $X$ is a confounder.

A worse problem arises with the following DAG, where dotted lines indicate that $U$ is unobserved. Because $U$ is unobserved, this backdoor path is open.

Colliders, when left alone, always close a backdoor path. Conditioning on them, however, opens the backdoor path and yields biased estimates of the causal effect of $A$ on $Y$.

Common colliders are post-treatment controls.

Another insidious type of collider is of the form , where is typically a lagged outcome.

A vector of measured controls $X$ satisfies the backdoor criterion if (i) $X$ blocks every path from $A$ to $Y$ that has an arrow into $A$ (i.e. blocks the back door) and (ii) no node in $X$ is a descendant of $A$. Then,

$$
P(Y \mid do(A = a)) = \sum_{x} P(Y \mid A = a, X = x) \, P(X = x),
$$

which is the same as the subclassification estimator. The conditional expectation $E[Y \mid A = a, X = x]$ can be computed using a nonparametric regression / ML algorithm of choice.
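With a discrete control, the adjustment can be computed by subclassification: take the treated-minus-control difference in means within each cell of the control and average those differences using the cell shares. The simulated data and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.integers(0, 3, size=n)                # discrete confounder with levels 0, 1, 2
a = rng.binomial(1, 0.2 + 0.3 * x)            # treatment is more likely when x is high
y = 2.0 * a + 3.0 * x + rng.normal(size=n)    # true effect of a on y is 2

# Raw comparison, confounded by x.
naive = y[a == 1].mean() - y[a == 0].mean()

# Backdoor adjustment: within-cell differences weighted by P(X = x).
ate_hat = 0.0
for v in np.unique(x):
    cell = x == v
    diff = y[cell & (a == 1)].mean() - y[cell & (a == 0)].mean()
    ate_hat += diff * cell.mean()

print(naive, ate_hat)
```

Weighting by the overall cell shares targets the average effect in the full population; weighting by the cell shares among the treated would target the effect on the treated instead.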

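$M$ satisfies the frontdoor criterion if (i) $M$ blocks all directed paths from $A$ to $Y$, (ii) there are no unblocked back-door paths from $A$ to $M$, and (iii) $A$ blocks all backdoor paths from $M$ to $Y$.

Then,

$$
P(Y \mid do(A = a)) = \sum_{m} P(M = m \mid A = a) \sum_{a'} P(Y \mid M = m, A = a') \, P(A = a').
$$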

The above DAG in words

The only way $A$ influences $Y$ is through $M$, so there is no arrow bypassing $M$ between $A$ and $Y$. In other words, $M$ intercepts all directed paths from $A$ to $Y$.

The relationship between $A$ and $M$ is not confounded by unobservables, i.e. there are no back-door paths between $A$ and $M$.

Conditional on $A$, the relationship between $M$ and $Y$ is not confounded, i.e. every backdoor path between $M$ and $Y$ has to be blocked by $A$.

With a single mediator $M$ that is not caused by the unobserved confounder $U$, the ATE can be estimated by multiplying the estimated effect of $A$ on $M$ by the estimated effect of $M$ on $Y$.

The FDC estimates the ATE because it decomposes a reduced-form relationship that is not causally identified into two causally identified relationships.
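A linear version of this idea can be sketched with two regressions: the mediator on the treatment, and the outcome on the mediator controlling for the treatment. The product of the two slopes is the front-door estimate. The simulated data, the coefficient names `gamma_hat` and `delta_hat`, and the linear functional forms are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
u = rng.normal(size=n)                        # unobserved confounder of A and Y
a = u + rng.normal(size=n)                    # treatment
m = 0.8 * a + rng.normal(size=n)              # mediator: caused by A only
y = 1.5 * m + 2.0 * u + rng.normal(size=n)    # A affects Y only through M

def ols(y, *cols):
    """OLS coefficients; the intercept is in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

gamma_hat = ols(m, a)[1]       # effect of A on M (no backdoor path into M)
delta_hat = ols(y, m, a)[1]    # effect of M on Y, with A blocking the backdoor path
front_door = gamma_hat * delta_hat

naive = ols(y, a)[1]           # reduced form, confounded by u
print(front_door, naive)
```

The true effect in this simulation is 0.8 times 1.5, i.e. 1.2. The reduced-form slope is far from it because the unobserved confounder is never controlled for, while each of the two component regressions is unconfounded.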

Implementation through linear regressions:

Since is identified, in the first-stage equation and in the second-stage equation. Assume . Then, write

Mediation Analysis

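Pearl (2001), Robins (2003)

Consider a simple random sample (SRS) where we observe $(D_i, M_i, X_i, Y_i)$, where $D_i$ is a treatment indicator, $M_i$ is a mediator, $X_i$ is a vector of pre-treatment controls, and $Y_i$ is the outcome. The supports are $\mathcal{D}, \mathcal{M}, \mathcal{X}, \mathcal{Y}$ respectively. The $X$s are partialled out.

Let $M_i(d)$ denote the potential value of the mediator under treatment status $d$. The outcome $Y_i(d, m)$ is the potential outcome for unit $i$ when $D_i = d$ and $M_i = m$. The observed variables can be written as $M_i = M_i(D_i)$ and $Y_i = Y_i(D_i, M_i(D_i))$.

$d$ and $a$ are used interchangeably for treatment.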

This requires the treatment to be conditionally independent of the potential mediator states and outcomes given X, ruling out unobserved confounders jointly affecting the treatment on the one hand and the mediator and/or the outcome on the other hand conditional on the covariates. (5) postulates independence between the counterfactual outcome and mediator values ‘across-worlds’.

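Effectively, both the treatment and the mediator need to be (approximately) randomly assigned.

Natural indirect effect: the difference in potential outcomes holding treatment status constant and varying the mediator, $Y_i(d, M_i(1)) - Y_i(d, M_i(0))$. Its sample average is the Average Causal Mediation Effect (ACME).

Natural direct effect (NDE): the difference in potential outcomes holding the mediator constant and varying the treatment, $Y_i(1, M_i(d)) - Y_i(0, M_i(d))$.

The NDE holds the mediator at its potential value $M_i(d)$. For the controlled direct effect (CDE), we set the mediator at a prescribed value $m$.

The difference between the NDE and the CDE is the value at which the mediator is fixed. Restated:

The CDE is the effect of changing the treatment while fixing the value of the mediator at some level $m$: $Y_i(1, m) - Y_i(0, m)$.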

Decomposing total effect with binary mediator

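Assume linear models for the mediator and the outcome.

Then fit the following regressions:

$$
\begin{aligned}
Y_i &= \alpha_1 + \beta_1 D_i + \varepsilon_{1i}, \\
M_i &= \alpha_2 + \beta_2 D_i + \varepsilon_{2i}, \\
Y_i &= \alpha_3 + \beta_3 D_i + \gamma M_i + \varepsilon_{3i}.
\end{aligned}
$$

Baron and Kenny (1986) suggest testing $\beta_1 = 0$, $\beta_2 = 0$, and $\gamma = 0$. If all nulls are rejected, the mediation effect is $\beta_2 \gamma$. Equivalently, the mediation effect is $\beta_1 - \beta_3$. Estimate its variance using the bootstrap or the delta method.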
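The equivalence between the product-of-coefficients and difference-in-coefficients forms can be verified numerically. In the sketch below the three regressions are: outcome on treatment, mediator on treatment, and outcome on treatment and mediator. The coefficient names (`beta1`, `beta2`, `beta3`, `gamma`), the simulated data, and the randomised treatment are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
d = rng.binomial(1, 0.5, size=n).astype(float)   # randomised treatment
m = 0.8 * d + rng.normal(size=n)                 # mediator model
y = 1.0 * d + 1.5 * m + rng.normal(size=n)       # outcome model

def ols(y, *cols):
    """OLS coefficients; the intercept is in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta1 = ols(y, d)[1]              # total effect of D on Y
beta2 = ols(m, d)[1]              # effect of D on M
_, beta3, gamma = ols(y, d, m)    # direct effect of D, and effect of M on Y

mediation_product = beta2 * gamma
mediation_difference = beta1 - beta3
print(mediation_product, mediation_difference)
```

When all three regressions are fitted by OLS on the same sample, the two expressions agree exactly (up to floating-point error), which follows from the omitted-variable formula. The true mediation effect in this simulation is 0.8 times 1.5, i.e. 1.2.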

Assume selection on observables w.r.t. D, M.

Huber (2014)

Average direct effect identified by

Average Indirect Effect identified by

Implemented in causalweight::medweight.