Chapter 02: Linear Regression

This note is a high-fidelity Markdown migration of the Linear Regression chapter from the LaTeX source.

Parent map: index

Prerequisites: probability-and-mathstats

Concept map

flowchart TD
  A[Simple Linear Regression] --> B[OLS Estimator]
  B --> C[Finite-sample Properties]
  C --> D[Gauss-Markov BLUE]
  C --> E[Prediction / Intervals]
  B --> F[Projection Geometry]
  F --> G[FWL / Partitioned Regression]
  C --> H[Robust SEs]
  H --> I[Hypothesis Testing]
  I --> J[Multiple Testing]
  B --> K[Quantile Regression]
  B --> L[Measurement Error]
  B --> M[Bootstrap / Delta Method]
  B --> N[GMM]
  N --> O[Empirical Likelihood]
  N --> P[M-estimation]

1. Simple Linear Regression

Assume

Y_{i} = β_{0} + β_{1} X_{i} + ε_{i},

with

E [ε_{i} ∣ X_{i}] = 0, Var (ε_{i} ∣ X_{i}) = σ^{2} .

1.1 OLS in summation form

Population best linear predictor coefficients:

β_{1} = \frac{Cov ( X , Y )}{Var ( X )}, β_{0} = E [Y] - β_{1} E [X] .

Sample analogues:

\hat{β}_{1} = \frac{\sum_{i} (X_{i} - \bar{X}) (Y_{i} - \bar{Y})}{\sum_{i} (X_{i} - \bar{X})^{2}} = \frac{\overline{XY} - \bar{X} \bar{Y}}{\overline{X^{2}} - \bar{X}^{2}},

\hat{β}_{0} = \bar{Y} - \hat{β}_{1} \bar{X} .

Residual variance estimate in simple regression:

\hat{σ}^{2} = \frac{\sum_{i} \hat{ε}_{i}^{2}}{n - 2} .

Variance decomposition from the fitted line $\hat{Y} = \hat{β}_{0} + \hat{β}_{1} X$ and residual $U = Y - \hat{Y}$ :

Var (\hat{Y}) = ρ_{X Y}^{2} σ_{Y}^{2}, Var (U) = (1 - ρ_{X Y}^{2}) σ_{Y}^{2} .
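As a small numerical illustration of the summation formulas above, the following sketch computes the slope, intercept, and residual variance on a toy data set. The function name `simple_ols` and the data values are illustrative assumptions, not part of the source notes.

```python
import numpy as np

def simple_ols(x, y):
    """Return (beta0_hat, beta1_hat, sigma2_hat) for y = b0 + b1 * x + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    s_xx = np.sum((x - x_bar) ** 2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / s_xx   # centered-sum form
    beta0 = y_bar - beta1 * x_bar
    resid = y - beta0 - beta1 * x
    sigma2 = np.sum(resid ** 2) / (n - 2)              # two estimated coefficients
    return beta0, beta1, sigma2

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b0, b1, s2 = simple_ols(x, y)

# Moment form of the slope: (mean(XY) - mean(X) mean(Y)) / (mean(X^2) - mean(X)^2).
b1_moment = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
```

Both slope expressions agree because numerator and denominator are each divided by the same $n$ .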

1.2 Properties of least-squares estimators

Let $\hat{β} = (\hat{β}_{0}, \hat{β}_{1})^{'}$ . Conditionally on regressors:

E [\hat{β} ∣ X] = \begin{pmatrix} β_{0} \\ β_{1} \end{pmatrix},

Var (\hat{β} ∣ X) = \frac{σ^{2}}{n s_{X}^{2}} \begin{pmatrix} \frac{1}{n} \sum_{i} X_{i}^{2} & - \bar{X} \\ - \bar{X} & 1 \end{pmatrix},

where

s_{X}^{2} = \frac{1}{n} \sum_{i} (X_{i} - \bar{X})^{2}, S_{xx} = \sum_{i} (X_{i} - \bar{X})^{2} .

Component formulas:

Var (\hat{β}_{0}) = σ^{2} (\frac{1}{n} + \frac{\bar{X}^{2}}{S_{xx}}),

Var (\hat{β}_{1}) = \frac{σ ^{2}}{S _{xx}} = \frac{σ ^{2}}{n Var ( X )} .

Heteroskedastic scalar-regression asymptotic expression:

Var (\hat{β}_{1}) \approx \frac{Var ((X_{i} - \bar{X}) u_{i})}{n [Var (X_{i} - \bar{X})]^{2}} .

With multiple regressors, a useful form is

Var (\hat{β}_{j}) = \frac{σ ^{2}}{TSS _{j} ( 1 - R _{j}^{2} )},

where $R_{j}^{2}$ is from regressing regressor $X_{j}$ on the other regressors and an intercept. This denominator gives the variance-inflation interpretation.

1.3 Prediction

For new covariate value $x_{0}$ :

\hat{y}_{0} = \hat{β}_{0} + \hat{β}_{1} x_{0} .

Prediction-mean variance:

Var (\hat{y}_{0}) = Var (\hat{β}_{0}) + x_{0}^{2} Var (\hat{β}_{1}) + 2 x_{0} Cov (\hat{β}_{0}, \hat{β}_{1}) = σ^{2} (\frac{1}{n} + \frac{(x_{0} - \bar{X})^{2}}{S_{xx}}) .

Forecast error $e_{f} = y_{0} - \hat{y}_{0}$ has

Var (e_{f}) = σ^{2} + Var (\hat{y}_{0}) .

Estimated prediction variance:

\hat{ξ}^{2} = \hat{σ}^{2} (1 + \frac{1}{n} + \frac{(x_{0} - \bar{X})^{2}}{S_{xx}}) .

1.4 Simple regression in matrix form

With $X = [1, x]$ ,

X^{'} X = \begin{pmatrix} 1^{'} 1 & 1^{'} x \\ x^{'} 1 & x^{'} x \end{pmatrix} .

Using block inversion, one can recover the standard simple-regression closed forms and show

\hat{β} = (X^{'} X)^{- 1} X^{'} y .

2. Classical Linear Model

2.1 Assumptions

A common stack of assumptions:

Linearity: $Y = Xβ + ε$ .
Exogeneity (strict/conditional): $E [ε ∣ X] = 0$ .
Spherical errors: $Var (ε ∣ X) = σ^{2} I_{n}$ .
Full column rank: $rank (X) = k$ .
Normal errors (for exact finite-sample normality): $ε ∣ X \sim N (0, σ^{2} I_{n})$ .
I.I.D. sampling of $(Y_{i}, X_{i})$ when using random-design asymptotics.

Notes:

Assumptions 1-4 give unbiasedness and Gauss-Markov efficiency statements.
Replacing homoskedasticity with $Var (ε ∣ X) = Ω$ yields $\hat{β} ∣ X \sim N (β, (X^{'} X)^{- 1} X^{'} Ω X (X^{'} X)^{- 1})$ under normality.

2.2 Optimization derivation of OLS

OLS solves

\min_{β \in R^{k}} (y - Xβ)^{'} (y - Xβ) .

FOC:

- 2 X^{'} (y - Xβ) = 0 \Rightarrow X^{'} X \hat{β} = X^{'} y \Rightarrow \hat{β} = (X^{'} X)^{- 1} X^{'} y .

With fixed regressors and homoskedasticity:

Var (\hat{β}) = σ^{2} (X^{'} X)^{- 1}, \hat{σ}^{2} = \frac{e^{'} e}{n - k}, e = y - X \hat{β} .
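The normal equations and the homoskedastic variance formula can be checked numerically. The sketch below uses simulated data (the seed, sample size, and coefficient values are arbitrary assumptions) and solves $X^{'} X \hat{β} = X^{'} y$ directly rather than forming the inverse for the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)          # simulated data, for illustration only
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# Solve the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals, unbiased variance estimate, and classical standard errors.
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - k)
var_hat = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(var_hat))

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The first-order condition shows up as `X.T @ e` being numerically zero: the residuals are orthogonal to every column of $X$ .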

3. Finite and Large Sample Properties of $\hat{β}$ and $\hat{σ}^{2}$

3.1 Finite-sample unbiasedness (conditional on $X$ )

Under linear model + exogeneity,

E [\hat{β} ∣ X] = E [(X^{'} X)^{- 1} X^{'} (Xβ + ε) ∣ X] = β + (X^{'} X)^{- 1} X^{'} E [ε ∣ X] = β .

A common caveat in the source notes: if you try to manipulate unconditional expectations through random matrix inverses directly, naive ratio-of-expectations shortcuts fail.

3.2 Finite-sample variance

Var (\hat{β} ∣ X) = (X^{'} X)^{- 1} X^{'} E [ε ε^{'} ∣ X] X (X^{'} X)^{- 1} = σ^{2} (X^{'} X)^{- 1}

under homoskedasticity.

With normality:

\hat{β} ∣ X \sim N (β, σ^{2} (X^{'} X)^{- 1}) .

3.3 Gauss-Markov (BLUE)

Within the class of linear unbiased estimators, OLS has minimum variance: for any linear unbiased $b$ and every vector $a$ ,

a^{'} (Var (\hat{β} ∣ X) - Var (b ∣ X)) a \leq 0,

that is, $Var (b ∣ X) - Var (\hat{β} ∣ X)$ is PSD.

3.4 Large-sample conditions and consistency

Typical conditions in the notes:

finite positive-definite $E [X_{i} X_{i}^{'}]$ ,
fourth moments of regressors and errors finite,
positive-definite $E [ε_{i}^{2} X_{i} X_{i}^{'}]$ .

Consistency decomposition:

\hat{β} - β = (\frac{1}{n} \sum_{i} X_{i} X_{i}^{'})^{- 1} (\frac{1}{n} \sum_{i} X_{i} ε_{i}) \xrightarrow{p} 0 .

3.5 Asymptotic normality and sandwich variance

\sqrt{n} (\hat{β} - β) \xrightarrow{d} N (0, A^{- 1} B A^{- 1}),

with

A = E [X_{i} X_{i}^{'}], B = E [ε_{i}^{2} X_{i} X_{i}^{'}] .

Sample robust (Huber-White) form:

\widehat{Var} (\hat{β}) = (X^{'} X)^{- 1} X^{'} \hat{Ω} X (X^{'} X)^{- 1}, \hat{Ω} = diag (\hat{e}_{1}^{2}, \dots, \hat{e}_{n}^{2}) .

In scalar simple regression this reduces to

Var (\hat{β}) \approx \frac{E [ ε ^{2} ( X - E X ) ^{2} ]}{n Var ( X ) ^{2}} .
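A minimal sketch of the Huber-White sandwich in its basic form (often labeled HC0, with no finite-sample correction) is shown below. The helper name `hc0_covariance` and the simulated heteroskedastic design are assumptions made for illustration.

```python
import numpy as np

def hc0_covariance(X, y):
    """OLS coefficients and the (X'X)^{-1} X' diag(e^2) X (X'X)^{-1} covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    # X' diag(e^2) X without building the n x n diagonal matrix.
    meat = (X * e[:, None] ** 2).T @ X
    return beta_hat, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(1)          # simulated data, for illustration only
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with |x|, so errors are heteroskedastic.
y = 1.0 + 2.0 * x + rng.normal(size=n) * (1.0 + np.abs(x))

beta_hat, V_hc0 = hc0_covariance(X, y)
robust_se = np.sqrt(np.diag(V_hc0))
```

Scaling the rows of $X$ by $\hat{e}_{i}^{2}$ avoids allocating the $n \times n$ matrix $\hat{Ω}$ , which matters for large samples.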

3.6 Unbiasedness of $\hat{σ}^{2}$

Using $e = M_{X} ε$ and $M_{X} = I - P_{X}$ ,

E [e^{'} e ∣ X] = E [ε^{'} M_{X} ε ∣ X] = σ^{2} tr (M_{X}) = σ^{2} (n - k) .

Hence $\hat{σ}^{2} = e^{'} e / (n - k)$ is unbiased conditionally on $X$ .

3.7 Polynomial approximation and sieve intuition

The Weierstrass approximation theorem supports approximating a continuous conditional mean function on a compact support by polynomials of sufficiently high order.

A polynomial sieve estimator of $m (x) = E [Y ∣ X = x]$ takes

\hat{m} (x) = \sum_{j = 0}^{J_{n}} \hat{β}_{j} x^{j},

where $J_{n} \to \infty$ but $J_{n} / n \to 0$ for consistency.

4. Geometry of OLS

Define

P_{X} = X (X^{'} X)^{- 1} X^{'} (hat matrix),

M_{X} = I_{n} - P_{X} (annihilator / residual-maker) .

Properties of $P_{X}$ and $M_{X}$ :

symmetric,
idempotent,
PSD,
$rank (P_{X}) = tr (P_{X}) = k$ ,
$rank (M_{X}) = tr (M_{X}) = n - k$ .

Fitted and residual vectors:

\hat{y} = P_{X} y, e = M_{X} y .

4.1 Frisch-Waugh-Lovell theorem

Partition $X = [X_{1}, X_{2}]$ . Let $M_{1} = I - P_{X_{1}}$ . Then

\hat{β}_{2} = (X_{2}^{'} M_{1} X_{2})^{- 1} X_{2}^{'} M_{1} y .

So coefficients on $X_{2}$ equal regression of residualized $y$ on residualized $X_{2}$ after partialling out $X_{1}$ .
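The FWL identity is easy to confirm numerically. In the sketch below (simulated data; the helper `residualize` is a hypothetical name), the coefficient on $X_{2}$ from the full regression is compared with the coefficient from regressing residualized $y$ on residualized $X_{2}$ .

```python
import numpy as np

rng = np.random.default_rng(2)          # simulated data, for illustration only
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)   # correlated with X1
y = X1 @ np.array([1.0, -1.0]) + 2.0 * X2[:, 0] + rng.normal(size=n)

# Full regression on [X1, X2].
X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

def residualize(A, B):
    """Return M_A B: the part of B orthogonal to the columns of A."""
    return B - A @ np.linalg.lstsq(A, B, rcond=None)[0]

# Partial out X1 from both y and X2, then regress residual on residual.
y_res = residualize(X1, y)
X2_res = residualize(X1, X2)
beta2_fwl = np.linalg.lstsq(X2_res, y_res, rcond=None)[0]
```

Because $M_{1}$ is idempotent, residualizing only $X_{2}$ (and leaving $y$ as is) would give the same coefficient; residualizing both also reproduces the full-regression residuals.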

4.2 Partitioned regression equations

Normal equations under partitioning:

\begin{pmatrix} X_{1}^{'} X_{1} & X_{1}^{'} X_{2} \\ X_{2}^{'} X_{1} & X_{2}^{'} X_{2} \end{pmatrix} \begin{pmatrix} β_{1} \\ β_{2} \end{pmatrix} = \begin{pmatrix} X_{1}^{'} y \\ X_{2}^{'} y \end{pmatrix} .

FWL gives the equivalent closed-form block solutions via $M_{1}$ and $M_{2}$ .

5. Relationships Between Exogeneity Assumptions

For $y_{i} = β_{0} + β_{1} x_{i} + u_{i}$ :

$E [u] = 0$ mostly normalizes intercept handling.
Consistency for slope uses $Cov (u, x) = 0$ .
Mean independence $E [u ∣ x] = E [u] = 0$ implies zero covariance.
Zero covariance does not imply mean independence.
Full independence $u ⊥ x$ is stronger than mean independence.

Common failures of $E [u ∣ x] = 0$ :

Omitted variable bias.
Measurement error.
Simultaneity / reverse causality.

6. Residuals and Diagnostics

6.1 Leverage

Because

\hat{y}_{i} = \sum_{j = 1}^{n} h_{ij} y_{j},

the diagonal element $h_{ii}$ of the hat matrix $H = P_{X}$ measures the influence of observation $i$ on its own fitted value.

6.2 Residual variance by leverage

From $\hat{ε} = (I - H) y$ ,

Var (\hat{ε}_{i} ∣ X) = σ^{2} (1 - h_{ii}) .
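The leverages and the projection properties from Section 4 can be inspected directly. This is a sketch on simulated data; the QR route to $h_{ii}$ is a standard numerical shortcut added here for illustration, not something taken from the source notes.

```python
import numpy as np

rng = np.random.default_rng(3)          # simulated data, for illustration only
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix P_X
h = np.diag(H)                          # leverages h_ii
y_hat = H @ y                           # fitted values as weighted sums of y
resid = y - y_hat

# Cheaper route to the leverages: squared row norms of Q from a thin QR of X.
Q, _ = np.linalg.qr(X)
h_qr = np.sum(Q ** 2, axis=1)
```

The leverages sum to $k$ , so their average is $k / n$ ; observations with $h_{ii}$ well above that average have residuals with visibly smaller variance $σ^{2} (1 - h_{ii})$ .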

6.3 Standardized and studentized residuals

e_{i}^{std} = \frac{y_{i} - \hat{y}_{i}}{s \sqrt{1 - h_{ii}}},

e_{i}^{stu} = \frac{y_{i} - \hat{y}_{i}}{s^{(- i)} \sqrt{1 - h_{ii}}},

where $s^{(- i)}$ omits observation $i$ .

6.4 Cook’s distance

Let $\hat{β}^{(- i)}$ omit observation $i$ and $d_{i} = \hat{β}^{(- i)} - \hat{β}$ . Then

D_{i} = \frac{d _{i}^{'} ( X ^{'} X ) d _{i}}{k s ^{2}} .

A common heuristic is that $D_{i} > 1$ flags an influential observation.

7. Other Least-Squares Estimators

7.1 Robust regression (Huber loss)

\tilde{β} = \arg \min_{b} \frac{1}{n} \sum_{i = 1}^{n} ρ (y_{i} - x_{i}^{'} b),

with Huber loss

ρ (u) = \begin{cases} u^{2}, & ∣ u ∣ < c, \\ 2 c ∣ u ∣ - c^{2}, & ∣ u ∣ \geq c . \end{cases}

This behaves like squared loss near zero and absolute loss in tails.

7.2 Weighted least squares (WLS)

\hat{β}_{WLS} = (X^{'} W X)^{- 1} X^{'} W y .

7.3 Generalized least squares (GLS)

If $Var (ε ∣ X) = Ω$ is known,

\hat{β}_{GLS} = (X^{'} Ω^{- 1} X)^{- 1} X^{'} Ω^{- 1} y,

Var (\hat{β}_{GLS} ∣ X) = (X^{'} Ω^{- 1} X)^{- 1} .

7.4 Restricted OLS

Under linear restrictions $Rβ = r$ , solve via Lagrangian

L (β, λ) = (y - Xβ)^{'} (y - Xβ) + 2 λ^{'} (Rβ - r) .
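Solving the first-order conditions of this Lagrangian gives a closed form. The derivation below is a sketch that assumes $R$ has full row rank $q$ ; the symbols $\hat{β}_{R}$ (restricted estimator) and $\hat{λ}$ are notation introduced here, and $\hat{β}$ is the unrestricted OLS estimator.

```latex
\begin{aligned}
\frac{\partial L}{\partial β} &= -2 X'(y - X \hat{β}_{R}) + 2 R' \hat{λ} = 0
  \;\Rightarrow\; \hat{β}_{R} = \hat{β} - (X'X)^{-1} R' \hat{λ}, \\
R \hat{β}_{R} = r &\;\Rightarrow\; \hat{λ} = [R (X'X)^{-1} R']^{-1} (R \hat{β} - r), \\
\hat{β}_{R} &= \hat{β} - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R \hat{β} - r).
\end{aligned}
```

The correction term vanishes exactly when the unrestricted estimate already satisfies $R \hat{β} = r$ .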

8. Goodness of Fit and Model Selection

Define:

Total sum of squares: $TSS = ∥ y - \bar{y} 1 ∥^{2}$ ,
Explained sum of squares: $ESS = ∥ \hat{y} - \bar{y} 1 ∥^{2}$ ,
Residual sum of squares: $RSS = ∥ e ∥^{2}$ .

Decomposition (valid when the regression includes an intercept):

TSS = ESS + RSS .

8.1 $R^{2}$ and adjusted $R^{2}$

R^{2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS},

\bar{R}^{2} = 1 - \frac{n - 1}{n - k} (1 - R^{2}) .

8.2 Information and subset criteria

Mallows $C_{p}$ (as written in source notes):

C_{p} = \frac{RSS + 2 (k + 1) \hat{σ}^{2}}{k} .

AIC:

AIC = \log (\frac{e^{'} e}{n}) + \frac{2 k}{n} .

BIC:

BIC = \log (\frac{e^{'} e}{n}) + \frac{k \log n}{n} .

8.3 F-statistic and Wald statistic

For $q$ joint linear restrictions $Rβ = r$ :

F = \frac{(R \hat{β} - r)^{'} (s^{2} R (X^{'} X)^{- 1} R^{'})^{- 1} (R \hat{β} - r)}{q} .

For the null that all $k - 1$ slope coefficients are zero, the equivalent model-comparison form is:

F = \frac{( TSS - RSS ) / ( k - 1 )}{RSS / ( n - k )} = \frac{R ^{2} / ( k - 1 )}{( 1 - R ^{2} ) / ( n - k )} \sim F_{k - 1, n - k} .
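The sketch below computes the fit measures from Section 8.1 and checks that the Wald form of the F-statistic, with $R$ selecting the $k - 1$ slopes and $r = 0$ , matches the $R^{2}$ form. The data are simulated and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)          # simulated data, for illustration only
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
rss = e @ e
r2 = 1 - rss / tss
r2_adj = 1 - (n - 1) / (n - k) * (1 - r2)

# Wald form: R picks out the k - 1 slope coefficients, r = 0.
R = np.eye(k)[1:]
q = R.shape[0]
s2 = rss / (n - k)
Rb = R @ beta_hat
f_wald = Rb @ np.linalg.solve(s2 * R @ XtX_inv @ R.T, Rb) / q

# Model-comparison form based on R^2.
f_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))
```

The two F values coincide up to floating-point error because the design includes an intercept, which is also what makes $TSS = ESS + RSS$ hold.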

Generic nonlinear Wald statistic:

W_{n} = n \, h (\hat{β}_{n})^{'} (\frac{\partial h (\hat{β}_{n})}{\partial β^{'}} \hat{V}_{n} \frac{\partial h (\hat{β}_{n})^{'}}{\partial β})^{- 1} h (\hat{β}_{n}),

with asymptotic $χ_{q}^{2}$ reference.

8.4 Generalization error and LOOCV

Generalization error:

G = E [(Y - \hat{m} (x))^{2}]

for a new draw $(x, Y)$ .

Training error:

T = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{m} (x_{i}))^{2},

often $T < G$ .

Leave-one-out CV:

LOOCV = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i}^{(- i)})^{2} = \frac{1}{n} \sum_{i = 1}^{n} (\frac{y_{i} - \hat{y}_{i}}{1 - h_{ii}})^{2} .

Using average leverage $γ = (p + 1) / n$ ,

LOOCV \approx \frac{1}{n} \sum_{i} (\frac{y_{i} - \hat{y}_{i}}{1 - γ})^{2} \approx \text{training error} + \frac{2 \hat{σ}^{2}}{n} (p + 1) .
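The leverage shortcut for LOOCV can be verified against $n$ explicit refits. The sketch below uses simulated data and illustrative variable names; it is meant only to confirm the identity for OLS, not as a general cross-validation routine.

```python
import numpy as np

rng = np.random.default_rng(6)          # simulated data, for illustration only
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# One fit on the full sample, plus the leverages h_ii.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
loocv_shortcut = np.mean((e / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out.
errs = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs[i] = y[i] - X[i] @ b
loocv_brute = np.mean(errs ** 2)

training_error = np.mean(e ** 2)
```

Since every $h_{ii}$ lies strictly between 0 and 1 here, each leave-one-out residual is larger in magnitude than its in-sample counterpart, so the LOOCV estimate exceeds the training error.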

For ridge with penalty $λ$ , effective degrees of freedom:

tr (H_{λ}) = \sum_{j = 1}^{J} \frac{λ_{j}}{λ_{j} + λ},

where $λ_{j}$ are eigenvalues of $X^{'} X$ .
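A quick numerical check of the effective degrees of freedom, assuming the usual ridge smoother $H_{λ} = X (X^{'} X + λ I)^{- 1} X^{'}$ (this explicit form is an assumption, since the notes do not spell it out). Data and penalty value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)         # simulated data, for illustration only
X = rng.normal(size=(80, 5))
lam = 3.0

# Ridge smoother matrix and its trace.
H_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
df_trace = np.trace(H_lam)

# Same quantity from the eigenvalues of X'X.
eig = np.linalg.eigvalsh(X.T @ X)
df_eig = np.sum(eig / (eig + lam))
```

With $λ = 0$ the sum equals the number of columns; increasing $λ$ shrinks every term toward zero.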

9. Multiple Testing Corrections

Testing many hypotheses with per-test size $α$ :

Probability of no false rejection under $k$ independent true nulls: $(1 - α)^{k}$ .
Probability of at least one false rejection: $1 - (1 - α)^{k}$ .

9.1 FWER

Let true-null index set be $M^{0}$ and rejected set be $R$ .

FWER = P (M^{0} \cap R \neq \emptyset) .

Equivalent with $N_{1} =$ number of type I errors:

FWER = P (N_{1} > 0) .

Methods:

Bonferroni: threshold $α / J$ for $J$ tests.
Holm-Bonferroni stepdown: sequentially compare ordered $p$ -values $p_{(j)}$ to $α / (J - j + 1)$ , stopping at the first non-rejection.
Resampling methods (Romano-Wolf, Westfall-Young) when dependence matters.
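The two closed-form corrections can be sketched in a few lines. The function names and the example $p$ -values below are illustrative assumptions; the resampling methods are not covered here.

```python
import numpy as np

def bonferroni_reject(pvals, alpha):
    """Reject H_j when p_j <= alpha / J."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def holm_reject(pvals, alpha):
    """Holm step-down: compare the j-th smallest p-value to alpha / (J - j + 1)."""
    p = np.asarray(pvals, dtype=float)
    J = p.size
    order = np.argsort(p)
    reject = np.zeros(J, dtype=bool)
    for j, idx in enumerate(order, start=1):
        if p[idx] <= alpha / (J - j + 1):
            reject[idx] = True
        else:
            break   # stop at the first failure; larger p-values are retained
    return reject

pvals = [0.010, 0.040, 0.020, 0.005]    # made-up p-values
bonf = bonferroni_reject(pvals, alpha=0.05)
holm = holm_reject(pvals, alpha=0.05)
```

On this example Bonferroni rejects only the two smallest $p$ -values, while Holm rejects all four: Holm's thresholds loosen as hypotheses are rejected, so it never rejects fewer hypotheses than Bonferroni.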

9.2 Joint confidence bands

Given the approximation

\hat{β} - β \overset{a}{\sim} N (0, V / n),

seek intervals

[a_{k}, b_{k}] = [\hat{β}_{k} - c \sqrt{V_{kk} / n}, \hat{β}_{k} + c \sqrt{V_{kk} / n}]

with joint coverage approaching $1 - α$ .

The critical value $c$ is calibrated from the distribution of the sup-norm of a standardized Gaussian draw (often via simulation plugging in $\hat{V}$ ).

9.3 FDP / FDR and Benjamini-Hochberg

Setup:

hypotheses $H_{1}, \dots, H_{m}$ ,
$p$ -values $p_{1}, \dots, p_{m}$ (not necessarily independent),
rejected set $\mathcal{R}$ with $R = ∣ \mathcal{R} ∣$ ,
false rejections $V = ∣ \mathcal{R} \cap \mathcal{H}_{0} ∣$ , where $\mathcal{H}_{0}$ is the set of true nulls.

Then, with $R \vee 1 = \max (R, 1)$ ,

FDP = \frac{V}{R \vee 1}, FDR = E [FDP] .

BH procedure at level $α$ :

Sort $p_{(1)} \leq \dots \leq p_{(m)}$ .
Reject up to $R = \max \{r : p_{(r)} \leq \frac{α r}{m}\} .$
Equivalent threshold rule: reject $H_{i}$ when $p_{i} \leq \frac{α R}{m} .$

Adjusted BH $q$ -value interpretation:

q_{i} = \min \{α : H_{i} \text{ rejected by BH} (α)\} .
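The step-up rule and the adjusted $q$ -values can be sketched as follows. The function names and the example $p$ -values are assumptions chosen to show the step-up behavior, where a $p$ -value that misses its own threshold is still rejected because a larger one passes.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Boolean rejection mask from the BH step-up rule at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        R = below[-1] + 1               # largest r with p_(r) <= alpha * r / m
        reject[order[:R]] = True
    return reject

def bh_qvalues(pvals):
    """Smallest alpha at which each hypothesis would be rejected by BH."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

pvals = [0.028, 0.5, 0.001, 0.035, 0.025]   # made-up p-values
reject = benjamini_hochberg(pvals, alpha=0.05)
q = bh_qvalues(pvals)
```

Here the second-smallest value, 0.025, exceeds its own threshold $0.05 \cdot 2 / 5 = 0.02$ but is rejected anyway, because the fourth-smallest value passes and $R = 4$ . Thresholding `q` at 0.05 reproduces the same rejection set.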

BH procedure visual

10. Quantile Regression

Based on Koenker-style setup.

10.1 Conditional quantile function

Define conditional CDF

F_{Y ∣ X} (y ∣ x) = P (Y \leq y ∣ X = x) .

Conditional quantile at level $τ$ :

Q_{τ} (Y ∣ X = x) = F_{Y ∣ X}^{- 1} (τ ∣ x) .

Linear quantile model:

Q_{τ} (Y ∣ X = x) = x^{'} β_{τ} .

10.2 Relation to heteroskedasticity

Y_{i} = x_{i}^{'} β + ε_{i}, ε_{i} ∣ x_{i} \sim N (0, σ^{2} (x_{i})),

then with $z_{τ}$ standard-normal quantile,

Q_{τ} (x_{i}) = x_{i}^{'} β + σ (x_{i}) z_{τ} .

Thus quantile effects vary with $τ$ when variance is covariate-dependent; under homoskedasticity, quantile slopes coincide across $τ$ .

10.3 Quantile-regression estimator

Check-loss objective:

\hat{β} (τ) = \arg \min_{β \in R^{p}} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - x_{i}^{'} β),

with

ρ_{τ} (u) = u (τ - 1 \{u \leq 0\}) .

Equivalent split form:

Q_{n} (β_{τ}) = \sum_{i : y_{i} \geq x_{i}^{'} β_{τ}} τ ∣ y_{i} - x_{i}^{'} β_{τ} ∣ + \sum_{i : y_{i} < x_{i}^{'} β_{τ}} (1 - τ) ∣ y_{i} - x_{i}^{'} β_{τ} ∣ .
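To make the check loss concrete, the sketch below evaluates both forms of the objective and minimizes it over a constant (an intercept-only model) by grid search. The grid search is purely illustrative and is not how quantile regression is fitted in practice, where linear programming is typical; the function names and data are assumptions.

```python
import numpy as np

def check_loss(u, tau):
    """rho_tau(u) = u * (tau - 1{u <= 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u <= 0))

def split_objective(y, fitted, tau):
    """Weighted absolute deviations: tau above the fit, (1 - tau) below."""
    r = y - fitted
    return tau * np.abs(r[r >= 0]).sum() + (1 - tau) * np.abs(r[r < 0]).sum()

y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])   # made-up outcomes
tau = 0.5

# Intercept-only model: minimize the summed check loss over a grid of constants.
grid = np.linspace(0.0, 12.0, 1201)
obj = np.array([check_loss(y - b, tau).sum() for b in grid])
b_star = grid[np.argmin(obj)]
```

With $τ = 0.5$ the minimizer is the sample median (4.0 here), which is the sense in which median regression generalizes the median.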

Asymptotic distribution in source notation (with $q$ denoting the quantile level and $N$ the sample size):

\sqrt{N} (\hat{β}_{q} - β_{q}) \xrightarrow{d} N (0, A^{- 1} B A^{- 1}),

where

A = plim \frac{1}{N} \sum_{i} f_{u_{q}} (0 ∣ x_{i}) x_{i} x_{i}^{'}, B = plim \frac{1}{N} \sum_{i} q (1 - q) x_{i} x_{i}^{'} .

10.4 Lehmann-Doksum quantile treatment effect

Let $F$ be CDF of $Y_{0}$ (control) and $G$ be CDF of $Y_{1}$ (treated). Define horizontal shift $Δ (x)$ via

F (x) = G (x + Δ (x)) .

Then

δ (τ) = G^{- 1} (τ) - F^{- 1} (τ)

is the QTE.

ATE from QTE:

\bar{δ} = \int_{0}^{1} δ (τ) d τ = μ (G) - μ (F) .

Empirical analogue:

\hat{δ} (τ) = \hat{G}^{- 1} (τ) - \hat{F}^{- 1} (τ) .

Regression analogue:

Q_{Y} (τ ∣ D_{i}) = α (τ) + δ (τ) D_{i} .

QTE visual

10.5 Interpreting transformed quantile models

If monotone transform $h$ yields

Q_{h (Y)} (τ ∣ X = x) = h (Q_{Y} (τ ∣ X = x)) = x^{'} β (τ),

then effects on $Q_{Y}$ follow via inverse-transform differentiation.

Example with log outcome: if the linear model is for $Q_{\log Y}$ , then $Q_{Y} (τ ∣ X = x) = \exp (x^{'} β (τ))$ , so marginal effects on $Q_{Y}$ are the coefficients scaled by $\exp (x^{'} β (τ))$ .

11. Measurement Error

11.1 Error in outcome variable

If observed $y = y^{*} + u_{y}$ and true model is $y^{*} = x β + ε$ , then observed regression is

y = x β + (ε + u_{y}) .

The OLS slope remains consistent if $u_{y}$ is uncorrelated with $x$ :

plim \hat{β} = \frac{Cov (y, x)}{Var (x)} = β + \frac{Cov (ε + u_{y}, x)}{Var (x)} = β .

The sampling variance of $\hat{β}$ increases because $u_{y}$ adds to the error variance.

11.2 Error in regressor (classical EIV)

If true regressor $X^{*}$ observed with error $X = X^{*} + V$ and

y = X^{*} β + U,

then observed equation has composite error correlated with $X$ , so OLS is inconsistent.

Scalar attenuation result:

plim \hat{β} = \frac{σ _{X^{*}}^{2}}{σ _{X^{*}}^{2} + σ _{V}^{2}} β,

shrinking toward zero.

If measurement error ratio is $s = σ_{V}^{2} / σ_{X^{*}}^{2}$ , this reliability factor equals $1/ (1 + s)$ .
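A short simulation illustrates the attenuation factor. All parameter values (true slope 2, $σ_{X^{*}} = 1$ , $σ_{V} = 0.5$ ) and the sample size are arbitrary assumptions; with them the reliability factor is $1 / 1.25 = 0.8$ .

```python
import numpy as np

rng = np.random.default_rng(7)          # simulated data, for illustration only
n = 200_000
beta = 2.0
sigma_xstar, sigma_v = 1.0, 0.5

x_star = rng.normal(scale=sigma_xstar, size=n)          # true regressor
x_obs = x_star + rng.normal(scale=sigma_v, size=n)      # observed with classical error
y = beta * x_star + rng.normal(size=n)

# OLS slope using the mismeasured regressor.
slope_obs = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Theoretical probability limit: reliability factor times the true slope.
reliability = sigma_xstar ** 2 / (sigma_xstar ** 2 + sigma_v ** 2)
plim_slope = reliability * beta
```

The estimated slope settles near 1.6 rather than 2, and collecting more data does not help: the gap is a property of the probability limit, not of sampling noise.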

11.3 Correlated regressors and measurement error

With correlated regressors, attenuation in one regressor can propagate bias into others and often worsens distortion.

11.4 IV remedy

If instrument $Z$ satisfies

Cov (Z, X^{*}) \neq 0, Cov (Z, m) = 0,

then in the bivariate case

plim \hat{β}_{IV} = \frac{Cov (Y, Z)}{Cov (X, Z)} = β .

12. Missing Data

Categories in source notes:

MAR (missing at random): missingness may depend on other observables but not on the missing value itself, conditional on observables.
MCAR (missing completely at random): observed sample is random subsample of full data.
NMAR (not missing at random): neither MAR nor MCAR.

13. Inference on Functions of Parameters

13.1 Bootstrap

Core principle: resample from empirical CDF and recompute statistic to approximate its sampling distribution.

Algorithm:

From data $x_{1}, \dots, x_{N}$ , draw bootstrap sample $x_{1}^{*}, \dots, x_{N}^{*}$ (with replacement).
Compute statistic $\hat{t}^{*} = T (x_{1}^{*}, \dots, x_{N}^{*})$ .
Repeat $B$ times.
Use empirical distribution of $\{\hat{t}_{b}^{*}\}_{b = 1}^{B}$ for SEs/quantiles/bias correction.

Bootstrap SE:

s_{\hat{θ}, boot}^{2} = \frac{1}{B - 1} \sum_{b = 1}^{B} (\hat{θ}_{b}^{*} - \bar{\hat{θ}}^{*})^{2}, \bar{\hat{θ}}^{*} = \frac{1}{B} \sum_{b = 1}^{B} \hat{θ}_{b}^{*} .

Bias logic:

E [\hat{t}] - t_{0} \approx E [\tilde{t}] - \hat{t} .
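The nonparametric bootstrap steps above can be sketched as a small helper. The function name `bootstrap_se`, the exponential sample, and the choice of $B = 2000$ are illustrative assumptions.

```python
import numpy as np

def bootstrap_se(data, statistic, B, rng):
    """Bootstrap standard error and the B replicates of `statistic`."""
    data = np.asarray(data)
    n = data.shape[0]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)    # resample indices with replacement
        reps[b] = statistic(data[idx])
    return reps.std(ddof=1), reps

rng = np.random.default_rng(8)              # simulated data, for illustration only
x = rng.exponential(scale=2.0, size=400)

# Standard error of the sample median, which has no simple closed form.
se_boot, reps = bootstrap_se(x, np.median, B=2000, rng=rng)

# Bootstrap bias estimate: mean of replicates minus the original estimate.
bias_est = reps.mean() - np.median(x)
```

For the sample mean, the same helper should return a value close to the textbook $s / \sqrt{n}$ , which is a convenient sanity check before applying it to statistics without a closed-form standard error.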

13.2 Edgeworth expansion intuition

For standardized mean,

P (\frac{\sqrt{n} (\bar{X} - μ)}{σ} \leq ω) = Φ (ω) + ϕ (ω) [- \frac{κ}{6 \sqrt{n}} (ω^{2} - 1) + R_{n}],

where $κ$ is the skewness of $X_{i}$ and $R_{n}$ is a remainder term of smaller order under regularity conditions.

13.3 Jackknife

Define leave-one-out estimates

\tilde{u}_{- i} = T (\hat{F}_{- i}), \tilde{u}_{(\cdot)} = \frac{1}{n} \sum_{i} \tilde{u}_{- i} .

A standard jackknife variance estimate is

\hat{σ}_{jack}^{2} = \frac{n - 1}{n} \sum_{i = 1}^{n} (\tilde{u}_{- i} - \tilde{u}_{(\cdot)})^{2} .
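A minimal sketch of the jackknife variance formula follows; the function name and the toy data are assumptions. For the sample mean the jackknife reproduces the usual $s^{2} / n$ exactly, which makes a convenient check.

```python
import numpy as np

def jackknife_variance(data, statistic):
    """(n - 1) / n times the sum of squared deviations of leave-one-out estimates."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])   # made-up sample
var_jack = jackknife_variance(x, np.mean)
var_classic = x.var(ddof=1) / x.size
```

The agreement for the mean follows because each leave-one-out mean deviates from their average by $(\bar{x} - x_{i}) / (n - 1)$ .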

13.4 Asymptotically pivotal statistics

A statistic is asymptotically pivotal if its limit distribution does not depend on unknown nuisance parameters.

13.5 Cluster wild bootstrap (few clusters)

Source algorithm (CGM style):

Estimate restricted model under null and get residuals $\tilde{u}_{i g}$ .
For each bootstrap draw, assign each cluster $g$ a Rademacher weight $d_{g} \in \{- 1, + 1\}$ .
Form pseudo-residuals $u_{i g}^{*} = d_{g} \tilde{u}_{i g}$ and pseudo-outcomes $y_{i g}^{*} = x_{i g}^{'} \tilde{β}_{H_{0}} + u_{i g}^{*} .$
Re-estimate unrestricted model on pseudo-data and compute test statistic $w_{b}^{*}$ .
Bootstrap $p$ -value from tail proportion of $∣ w_{b}^{*} ∣$ relative to observed $∣ w ∣$ .
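The steps above can be sketched for the null $β_{j} = 0$ as follows. This is a simplified illustration under stated assumptions: the cluster-robust variance omits finite-sample corrections, the function names are hypothetical, and the data are simulated with 10 clusters of 30 observations.

```python
import numpy as np

def cluster_t_stat(X, y, groups, j):
    """t-statistic for coefficient j with a cluster-robust sandwich variance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    scores = X * (y - X @ b)[:, None]
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        s = scores[groups == g].sum(axis=0)     # within-cluster score sum
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return b[j] / np.sqrt(V[j, j])

def wild_cluster_bootstrap_p(X, y, groups, j, B, rng):
    """Bootstrap p-value for H0: beta_j = 0 with the null imposed."""
    w_obs = cluster_t_stat(X, y, groups, j)
    X_r = np.delete(X, j, axis=1)               # restricted model drops column j
    fitted_r = X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]
    u_r = y - fitted_r                          # restricted residuals
    labels, g_idx = np.unique(groups, return_inverse=True)
    w_star = np.empty(B)
    for b in range(B):
        d = rng.choice([-1.0, 1.0], size=labels.size)   # one sign per cluster
        y_star = fitted_r + d[g_idx] * u_r
        w_star[b] = cluster_t_stat(X, y_star, groups, j)
    return np.mean(np.abs(w_star) >= np.abs(w_obs))

rng = np.random.default_rng(9)                  # simulated data, illustration only
G, m = 10, 30
groups = np.repeat(np.arange(G), m)
x = rng.normal(size=G * m)
cluster_effect = rng.normal(scale=0.5, size=G)[groups]
y = 1.0 + 1.0 * x + cluster_effect + rng.normal(size=G * m)
X = np.column_stack([np.ones(G * m), x])
p_value = wild_cluster_bootstrap_p(X, y, groups, j=1, B=999, rng=rng)
```

With $G$ clusters there are only $2^{G}$ distinct Rademacher sign patterns, so with very few clusters the attainable $p$ -values are coarse.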

13.6 Delta method / propagation of error

If quantity of interest is

\hat{θ} = f (\hat{ϕ}_{1}, \dots, \hat{ϕ}_{p}),

Taylor expansion gives variance propagation:

Var (\hat{θ}) \approx \sum_{i = 1}^{p} (f_{i}^{'} (\hat{ϕ}))^{2} Var (\hat{ϕ}_{i}) + 2 \sum_{i < j} f_{i}^{'} (\hat{ϕ}) f_{j}^{'} (\hat{ϕ}) Cov (\hat{ϕ}_{i}, \hat{ϕ}_{j}) .

General vector form with $τ = h (β)$ and $\sqrt{n} (\hat{β} - β) \xrightarrow{d} N (0, Σ)$ :

\sqrt{n} (h (\hat{β}) - h (β)) \xrightarrow{d} N (0, \nabla h (β)^{'} Σ \nabla h (β)) .

Scalar standardized form, for $τ = g (θ)$ :

\frac{\hat{τ}_{n} - τ}{se (\hat{τ}_{n})} \xrightarrow{d} N (0, 1),

with

se (\hat{τ}_{n}) = ∣ g^{'} (\hat{θ}_{n}) ∣ se (\hat{θ}_{n}) .
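As a worked instance of the propagation formula, the sketch below computes a delta-method standard error for a ratio $θ = ϕ_{1} / ϕ_{2}$ and for a scalar transform $g (θ) = \exp (θ)$ . The helper name, the estimates, and the covariance matrix are made-up values for illustration.

```python
import numpy as np

def delta_method_se(grad, cov):
    """sqrt(grad' cov grad) for a scalar function of the estimates."""
    grad = np.atleast_1d(np.asarray(grad, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    return float(np.sqrt(grad @ cov @ grad))

# Ratio theta = phi1 / phi2 with an assumed covariance of (phi1_hat, phi2_hat).
phi = np.array([2.0, 4.0])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
grad = np.array([1 / phi[1], -phi[0] / phi[1] ** 2])   # partial derivatives
se_ratio = delta_method_se(grad, cov)

# Scalar case g(theta) = exp(theta) at theta_hat = 0 with se(theta_hat) = 0.1.
se_exp = delta_method_se(np.exp(0.0), 0.1 ** 2)
```

The quadratic form `grad @ cov @ grad` expands to exactly the sum of squared-derivative variance terms plus twice the cross-covariance terms in the propagation formula.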

13.7 Parametric bootstrap

Assume asymptotic parameter distribution is valid and simulate:

β^{(m)} \sim N (β^{*}, I_{N} (β)^{- 1}), h^{(m)} = h (β^{(m)}),

for $m = 1, \dots, M$ , then summarize empirical distribution of $\{h^{(m)}\}$ .

14. Generalized Method of Moments (GMM)

14.1 Linear GMM setup

Data $(Y_{i}, X_{i}, Z_{i})_{i = 1}^{n}$ , with $Y_{i} \in R$ , $X_{i} \in R^{k}$ , instruments $Z_{i} \in R^{ℓ}$ , and $ℓ \geq k$ .

Model:

Y_{i} = X_{i}^{'} β + U_{i}, E [Z_{i} U_{i}] = 0.

If just-identified ( $k = ℓ$ ):

\hat{β} = (\sum_{i} Z_{i} X_{i}^{'})^{- 1} \sum_{i} Z_{i} Y_{i} .

Overidentified weighted criterion:

\hat{β} (W_{n}) = \arg \min_{b} (\frac{1}{n} \sum_{i} Z_{i} (Y_{i} - X_{i}^{'} b))^{'} W_{n} (\frac{1}{n} \sum_{i} Z_{i} (Y_{i} - X_{i}^{'} b)) .

14.2 General moment-condition formulation

Given moments

E [g (W_{i}, θ_{0})] = 0,

with $g \in R^{r}$ , $θ \in R^{q}$ , $r \geq q$ , define

g_{n} (θ) = \frac{1}{n} \sum_{i} g (W_{i}, θ) .

GMM objective:

\hat{θ} = \arg \min_{θ} Q_{n} (θ), Q_{n} (θ) = g_{n} (θ)^{'} W_{n} g_{n} (θ) .

Jacobian:

D (θ) = E [\frac{\partial g ( W _{i} , θ )}{\partial θ ^{'}}] .

Identification language:

underidentified if rank $D < q$ ,
just identified if rank $D = q$ and $r = q$ ,
overidentified if rank $D = q$ and $r > q$ .

Asymptotic normality:

\sqrt{n} (\hat{θ} - θ_{0}) \xrightarrow{d} N (0, V_{θ}),

V_{θ} = (D^{'} W D)^{- 1} (D^{'} W S W D) (D^{'} W D)^{- 1},

where

S = E [g_{i} (θ_{0}) g_{i} (θ_{0})^{'}] .

14.3 Technical regularity conditions (summary)

The notes list standard requirements:

compact parameter space,
global identification,
uniform LLN behavior of moments,
continuity/differentiability of moment functions,
finite moments,
$W_{n} \xrightarrow{p} W$ ,
interior true parameter.

14.4 Two-step efficient GMM

Variance-minimizing asymptotic weight is $W = S^{- 1}$ , giving

V_{θ} = (D^{'} S^{- 1} D)^{- 1} .

Practical two-step algorithm:

Initialize $W_{0} = I$ .
Compute preliminary $\hat{θ}^{(1)}$ .
Estimate $\hat{S}$ at $\hat{θ}^{(1)}$ .
Set $\hat{W} = \hat{S}^{- 1}$ .
Re-optimize to get efficient $\hat{θ}_{GMM}$ .
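For the linear model of Section 14.1 each step has a closed form, so the two-step algorithm can be sketched without a numerical optimizer. The function names, the simulated design with three instruments and one endogenous regressor, and the heteroskedasticity-robust choice of $\hat{S}$ are assumptions made for illustration.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Minimizer of g_n(b)' W g_n(b) with g_n(b) = Z'(y - X b) / n."""
    ZX, Zy = Z.T @ X, Z.T @ y
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

def two_step_gmm(y, X, Z):
    n = y.size
    b1 = linear_gmm(y, X, Z, np.eye(Z.shape[1]))    # step 1: W_0 = I
    u = y - X @ b1
    S_hat = (Z * u[:, None] ** 2).T @ Z / n         # estimate of E[u^2 Z Z']
    W_hat = np.linalg.inv(S_hat)
    b2 = linear_gmm(y, X, Z, W_hat)                 # step 2: efficient weight
    D = -(Z.T @ X) / n                              # sample Jacobian of moments
    V = np.linalg.inv(D.T @ W_hat @ D)              # (D' S^{-1} D)^{-1}
    return b2, V / n                                # variance of b2 itself

rng = np.random.default_rng(10)                     # simulated data, illustration only
n = 5000
Z = rng.normal(size=(n, 3))
common = rng.normal(size=n)                         # shared shock: makes x endogenous
x = Z @ np.array([1.0, 0.5, -0.5]) + common + rng.normal(size=n)
y = 1.5 * x + common + rng.normal(size=n)
X = x.reshape(-1, 1)

beta_gmm, V_gmm = two_step_gmm(y, X, Z)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # inconsistent benchmark
```

In this design OLS is pulled upward by the shared shock, while the GMM estimate stays near the true value of 1.5. In the just-identified case the weight matrix drops out and `linear_gmm` returns $(Z^{'} X)^{- 1} Z^{'} y$ for any positive-definite $W$ .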

14.5 Standard methods nested in GMM

OLS moments: $g_{i} (β) = x_{i} (y_{i} - x_{i}^{'} β)$ .
IV/2SLS moments: $g_{i} (β) = z_{i} (y_{i} - x_{i}^{'} β)$ .
MLE moments: score equations $g_{i} (θ) = \partial \log f (W_{i}; θ) / \partial θ$ .

15. Empirical Likelihood and Generalized Empirical Likelihood

15.1 Nonparametric likelihood

For IID $X_{1}, \dots, X_{n}$ with CDF $F$ , nonparametric likelihood:

L (F) = \prod_{i = 1}^{n} (F (X_{i}) - F (X_{i} -)) .

ECDF maximizes this criterion:

\hat{F} = \frac{1}{n} \sum_{i = 1}^{n} δ_{X_{i}} .

15.2 NPMLE as constrained optimization

Assign discrete masses $p_{i}$ on observed points and solve

\max_{p_{1}, \dots, p_{n}} \frac{1}{n} \sum_{i = 1}^{n} \log p_{i} \quad s.t. \quad \sum_{i} p_{i} = 1.

Solution is $\hat{p}_{i} = 1/ n$ .
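The uniform solution follows from a one-line Lagrangian argument. In the sketch below, $μ$ is a multiplier symbol introduced here for the adding-up constraint.

```latex
\begin{aligned}
\mathcal{L}(p, μ) &= \frac{1}{n} \sum_{i=1}^{n} \log p_i - μ \Big( \sum_{i=1}^{n} p_i - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial p_i} &= \frac{1}{n p_i} - μ = 0
  \;\Rightarrow\; p_i = \frac{1}{n μ}, \\
\sum_{i=1}^{n} p_i = 1 &\;\Rightarrow\; μ = 1 \;\Rightarrow\; \hat{p}_i = \frac{1}{n}.
\end{aligned}
```

Since the objective is strictly concave in $p$ , this stationary point is the unique maximizer.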

15.3 Empirical likelihood with moment restrictions

Given moments

E [m (W, θ_{0})] = 0,

solve

\max_{p} \frac{1}{n} \sum_{i} \log p_{i}

subject to

\sum_{i} p_{i} m (W_{i}, θ) = 0, \sum_{i} p_{i} = 1.

Lagrange multipliers $λ$ induce implicit equations for $(\hat{θ}, \hat{λ})$ :

\frac{1}{n} \sum_{i} \frac{m (W_{i}, \hat{θ})}{1 + \hat{λ}^{'} m (W_{i}, \hat{θ})} = 0,

plus corresponding score equation in $θ$ .

Dual saddlepoint form:

\max_{θ \in Θ} \min_{λ} \frac{1}{n} \sum_{i} [- \log (1 + λ^{'} m (W_{i}, θ))] .

15.4 Generalized empirical likelihood (GEL)

Replace log criterion with shape-constrained $ρ (v)$ satisfying normalizations at 0:

\min_{θ \in Θ} \sup_{λ \in Λ_{n}} \frac{1}{n} \sum_{i} ρ (λ^{'} m (W_{i}, θ)) .

Special cases in notes:

$ρ (v) = \log (1 - v)$ gives EL,
$ρ (v) = - \frac{1}{2} v^{2} - v$ gives CUE,
$ρ (v) = 1 - e^{v}$ gives exponential tilting.

Primal formulation uses Cressie-Read divergence family for weights.

16. M-estimation

M-estimator solves

\sum_{i = 1}^{n} ψ (O_{i}, \hat{θ}) = 0,

with i.i.d. observations $O_{i}$ and estimating function $ψ$ .

Asymptotic sandwich variance:

Var (\hat{θ}) \approx \frac{1}{n} B_{n}^{- 1} M_{n} (B_{n}^{- 1})^{'},

where

B_{n} = - \frac{1}{n} \sum_{i} \frac{\partial ψ (O_{i}, \hat{θ})}{\partial θ^{'}} (bread),

M_{n} = \frac{1}{n} \sum_{i} ψ (O_{i}, \hat{θ}) ψ (O_{i}, \hat{θ})^{'} (meat) .

This framework nests many estimators. For MLE, $ψ$ is the score.
