This note is a high-fidelity Markdown migration of the Maximum Likelihood and Machine Learning chapter from the LaTeX source.

Parent map: index Prerequisites: probability-and-mathstats, linear-regression, causal-inference

Concept map

flowchart TD
  A[Maximum Likelihood] --> B[Score / Fisher Information]
  A --> C[QMLE / Robust SEs]
  A --> D[Binary / Discrete Choice]
  D --> E[Ordered / Unordered Models]
  A --> F[Counts / Rates / Truncation]
  A --> G[GLM Theory]
  H[Machine Learning] --> I[Supervised Learning]
  I --> J[Regularization]
  I --> K[Classification Metrics]
  H --> L[Neural Networks]
  H --> M[Unsupervised Learning]

Maximum Likelihood

Setup

Let $\{Z_{i}\}_{i = 1}^{N}$ be a sequence of iid random variables with common CDF $F (z \mid θ^{*})$. We want to estimate $θ^{*} \in Θ \subset \mathbb{R}^{k}$.

Since the random variables are iid, the joint density factorises:

$$ f (y \mid θ^{*}) = \prod_{i = 1}^{N} f (y_{i} \mid θ^{*}) $$

The likelihood function $L : Θ \to \mathbb{R}$ treats this as a function of $θ$ for fixed data:

$$ L (θ \mid y) := \prod_{i = 1}^{N} f (y_{i} \mid θ) $$

We usually work with the log-likelihood $ℓ (θ \mid y) := \sum_{i = 1}^{N} \log f (y_{i} \mid θ)$, and drop the conditioning on $y$. With regressors we work with the conditional likelihood, though strictly speaking $f (y, X \mid θ) = f (y \mid X, θ) f (X \mid θ)$.

The maximum likelihood estimator (MLE) is the estimator that maximises the (conditional) log-likelihood:

$$ \hat{θ}_{MLE} := \arg\max_{θ \in Θ} L (θ) = \arg\max_{θ \in Θ} \sum_{i = 1}^{N} \log f (Z_{i} \mid θ) $$

It solves the first-order conditions

$$ \frac{1}{N} \frac{\partial ℓ (θ)}{\partial θ} = \frac{1}{N} \sum_{i} \frac{\partial ℓ (y_{i} \mid x_{i}, θ)}{\partial θ} = 0 $$

Conditional density (Gaussian linear model):

$$ f (y \mid X, β, σ^{2}) = \prod_{i = 1}^{N} \frac{1}{\sqrt{2 π σ^{2}}} \exp \left( - \frac{(y_{i} - x_{i}' β)^{2}}{2 σ^{2}} \right) $$

Log likelihood:

$$ ℓ (β, σ^{2}) = - \frac{N}{2} \log (2 π) - \frac{N}{2} \log σ^{2} - \frac{1}{2 σ^{2}} (y - Xβ)' (y - Xβ) = - \frac{1}{2 σ^{2}} RSS (β) - \frac{N}{2} \log (2 π σ^{2}) $$

Maximising this w.r.t. $β, σ^{2}$ yields $\hat{β} = (X' X)^{-1} X' y$, $\hat{σ}^{2} = \frac{\sum_{i = 1}^{N} \hat{u}_{i}^{2}}{N}$.
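As a small numerical check of the closed form above, the sketch below computes the Gaussian MLE for a model with an intercept and one regressor. The function name `gaussian_mle` and the toy data are illustrative assumptions, not part of the original notes; note that the variance estimate divides by $N$, not by $N - k$.

```python
def gaussian_mle(x, y):
    """Closed-form Gaussian MLE for y_i = b0 + b1 * x_i + u_i (hypothetical helper)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope: same as the OLS slope
    b0 = y_bar - b1 * x_bar        # intercept
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    sigma2 = sum(u * u for u in resid) / n   # MLE divides RSS by N
    return b0, b1, sigma2


# Toy data (assumed for illustration only)
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [1.0, 2.0, 4.0, 5.0]
b0_hat, b1_hat, sigma2_hat = gaussian_mle(x_toy, y_toy)
```

Because the MLE of $σ^{2}$ uses $N$ in the denominator, it is biased downwards in finite samples relative to the usual $RSS / (N - k)$ estimator.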

The score is the gradient vector $S (θ) := \nabla ℓ (θ) = \frac{\partial ℓ (θ)}{\partial θ}$. For a single observation $z$,

$$ S (z, θ) := \frac{\partial ℓ}{\partial θ} (z; θ) $$

Evaluated at $θ^{*}$, this is the efficient score.

The MLE is a local maximum that solves the FOCs $S (z, θ) = 0$, i.e.

$$ \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial \log f (Z_{i} \mid θ)}{\partial θ} = 0 $$

The Fisher information is

$$ I (θ) := E [S (θ) S (θ)'] = E \left[ \frac{\partial ℓ (θ)}{\partial θ} \frac{\partial ℓ (θ)}{\partial θ'} \right] = - E \left[ \frac{\partial^{2} ℓ (θ)}{\partial θ \partial θ'} \right] $$

Information matrix equality (per observation):

$$ A = B := V [s_{i} (θ^{*})] = E [s_{i} (θ^{*}) s_{i} (θ^{*})'] = - E \left[ \frac{\partial}{\partial θ'} s_{i} (θ^{*}) \right] $$

Variance estimate $V [\hat{θ}] = \frac{1}{n} A^{-1}$, with the information estimated as

$$ \hat{I} (θ) = - \frac{1}{N} \sum_{i = 1}^{N} \left. \frac{\partial^{2} ℓ_{i} (θ)}{\partial θ \partial θ'} \right|_{θ = \hat{θ}_{MLE}} $$

For OLS, the parameter vector is $θ = (β', σ^{2})' = (β', γ)'$.

Scores:

$$ \frac{\partial ℓ}{\partial β} = \frac{1}{γ} X' (y - Xβ) $$

$$ \frac{\partial ℓ}{\partial γ} = - \frac{n}{2 γ} + \frac{1}{2 γ^{2}} (y - Xβ)' (y - Xβ) $$

Setting the second score to zero gives $\hat{γ} = s / n$, where $s := (y - Xβ)' (y - Xβ) = \sum_{i = 1}^{n} (y_{i} - x_{i}' β)^{2}$.

Information Matrix / Variance

$$ I (θ) = \begin{pmatrix} \frac{1}{σ^{2}} X' X & 0 \\ 0' & \frac{n}{2 σ^{4}} \end{pmatrix} \implies CRLB := (I (θ))^{-1} = \begin{pmatrix} σ^{2} (X' X)^{-1} & 0 \\ 0' & \frac{2 σ^{4}}{n} \end{pmatrix} $$

Properties of Maximum Likelihood Estimators

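Consistency: as $N \to \infty$, the probability of missing the true parameter goes to zero.

$$ P (\lvert \hat{θ}_{n} - θ \rvert > ϵ) \to 0 \quad \forall ϵ > 0 $$

Asymptotic normality:

$$ \sqrt{N} (\hat{θ} - θ) \xrightarrow{d} N (0, I (θ)^{-1}) $$

Equivalently,

$$ \hat{θ}_{MLE} \overset{a}{\sim} N \left( θ, \left( - E \left[ \frac{\partial^{2} ℓ (θ)}{\partial θ \partial θ'} \right] \right)^{-1} \right) $$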

Conditions: (1) $θ \in Θ$, (2) $ℓ$ is twice-differentiable.

Efficiency: the variance of the MLE attains the Cramer-Rao bound; the asymptotic variance of the MLE is at least as small as that of any other consistent estimator.

Cramer-Rao bound: let the pdf of the r.v. $X$ be $f_{X} (x \mid θ_{0})$ for some $θ_{0} \in Θ$. Let $θ (X)$ be an unbiased estimator for $θ_{0}$. Suppose the derivative $\partial / \partial θ$ can be passed under the integrals $\int f (x \mid θ) d x$ and $\int θ (x) f (x \mid θ) d x$, and suppose the Fisher information

$$ I (θ) = - E \left[ \frac{\partial^{2} \log f}{\partial θ \partial θ'} (X \mid θ) \right] $$

is finite. Then,

$$ V [θ (X)] \geq I (θ_{0})^{-1} $$

If $X_{1}, X_{2}, \dots, X_{N}$ are iid with common density $f_{X} (x \mid θ)$, the implied bound on the variance is $N V [θ (X)] \geq I (θ_{0})^{-1}$.

Asymptotic efficiency: let $X_{1}, X_{2}, \dots$ be iid random variables with common density $f_{X} (x \mid θ)$. A sequence of estimators $θ_{N}$, each a function of $X_{1}, X_{2}, \dots, X_{N}$, that satisfies

$$ \sqrt{N} (θ_{N} - θ) \xrightarrow{d} N (0, I (θ)^{-1}) $$

whatever the true value of $θ \in Θ$ is, is said to be asymptotically efficient.

Semiparametric efficiency: suppose $X_{1}, X_{2}, \dots$ are iid with density $f (x \mid θ, h (\cdot))$, where $h (\cdot)$ is an unknown function. Next, we pretend to know the infinite-dimensional parameter $h (\cdot)$ up to a finite-dimensional parameter $γ$. This gives a fully parametric submodel with finite-dimensional parameter $(θ, γ)$, for which we can calculate the Cramer-Rao bound.

$$ f (x \mid θ, γ) = f (x \mid θ, h (γ)) $$

Partition the information matrix for $(θ', γ')'$ and its inverse as

$$ I (θ, γ) = \begin{bmatrix} I_{θ θ'} & I_{θ γ'} \\ I_{γ θ'} & I_{γ γ'} \end{bmatrix} \quad \text{and} \quad I (θ, γ)^{-1} = \begin{bmatrix} I^{θ θ'} & I^{θ γ'} \\ I^{γ θ'} & I^{γ γ'} \end{bmatrix} $$

The Cramer-Rao bound implies that

$$ ASV (\hat{θ}) \geq I^{θ θ'} = (I_{θ θ'} - I_{θ γ'} (I_{γ γ'})^{-1} I_{γ θ'})^{-1} $$

This is true for any parametrisation of the unknown function $h (\cdot)$. The lowest possible variance for any estimator for $θ$ that does not use knowledge of $h (\cdot)$ has to be at least as high as the lowest variance we can get if we know more, that is, the Cramer-Rao bound for any parametric submodel. So, the semiparametric efficiency bound is the largest lower bound we can get over all parametric submodels.

Suppose we have a candidate estimator $\hat{θ}$ and a given parametrisation $h (x; γ)$. Then,

$$ (I_{θ θ'} - I_{θ γ'} (I_{γ γ'})^{-1} I_{γ θ'})^{-1} \leq \text{Semiparametric Efficiency Bound} \leq ASV (\hat{θ}) $$

For any parametrisation we can calculate the left-hand side, and for any estimator we can calculate the right-hand side, so if we find an estimator and a parametrisation for which the two are equal, we have found the efficiency bound.

Invariance: let $τ = g (θ)$ where $g$ is bijective, continuous, and differentiable. Let $\hat{θ}_{n}$ be the MLE of $θ$. Then $\hat{τ}_{n} = g (\hat{θ}_{n})$ is the MLE of $τ$.

Marginal effects: for a regression of the form $E [y \mid x] = g (x' β)$, one can estimate multiple 'marginal effects'. For the special case where $E [y \mid x] = x' β$, $\frac{\partial E [y \mid x]}{\partial x} = β$, but this is not generically true.

Average Marginal Effect (AME): $\frac{1}{N} \sum_{i} \frac{\partial E [y_{i} \mid x_{i}]}{\partial x_{i}}$

Marginal Effect at Mean (MEM): $\left. \frac{\partial E [y \mid x]}{\partial x} \right|_{x = \bar{x}}$

Sufficient statistics: suppose $t (x)$ is a sufficient statistic for $θ$. Then,

$$ \prod_{i} f (x_{i} \mid θ) = g (t (x), θ) h (x) $$

so the MLE $\hat{θ} (x)$ depends on the data $x$ only through $t (x)$, the sufficient statistic.

QMLE / Misspecification / Information Theory

If the model is misspecified, $f (\cdot \mid x_{i}, θ) \neq p_{0} (\cdot \mid x_{i})$ for all $θ \in Θ$.

The MLE then converges to the best-fitting $θ$ for the population (the pseudo-true value)

$$ θ^{⋆} = \arg\max_{θ \in Θ} \operatorname{plim} \frac{1}{N} \sum_{i = 1}^{N} ℓ_{i} (θ) $$

For the linear exponential family, the quasi-MLE is consistent even when the density is partially misspecified.

Robust Standard Errors

Asymptotic distribution of QMLE

$$ \sqrt{n} (\hat{θ} - θ^{⋆}) \xrightarrow{d} N (0, A^{-1} B A^{-1}) $$

where $A$ and $B$ are the population Hessian and outer-product-of-scores matrices, estimated by their sample analogues:

$$ A = - E \left[ \frac{\partial S_{i} (θ^{⋆})}{\partial θ'} \right], \qquad \hat{A} = - \frac{1}{n} \sum_{i = 1}^{n} \left. \frac{\partial^{2} ℓ_{i} (θ)}{\partial θ \partial θ'} \right|_{\hat{θ}} $$

$$ B = E [S_{i} (θ^{⋆}) S_{i} (θ^{⋆})'], \qquad \hat{B} = \frac{1}{n} \sum_{i = 1}^{n} \left. \frac{\partial ℓ_{i}}{\partial θ} \frac{\partial ℓ_{i}}{\partial θ'} \right|_{\hat{θ}} $$

Let $f (y \mid θ)$ be the assumed joint density, and let $h (y)$ be the true density. Then,

$$ KL [h (\cdot) \,\|\, f (\cdot)] := E_{h} \left[ \log \left( \frac{h (y)}{f (y \mid θ)} \right) \right] = \int_{-\infty}^{\infty} h (t) \log \left( \frac{h (t)}{f (t \mid θ)} \right) d t $$

Minimised when $\exists θ_{0}$ s.t. $h (y) = f (y \mid θ_{0})$. QMLE minimises the distance between $f (y \mid θ)$ and $h (y)$. The notation $KL (h, f)$ denotes 'information lost when $f$ is used to approximate $h$'.

The discrete version illustrates the link to entropy:

$$ KL [p \,\|\, q] := \sum_{j = 1}^{J} p_{j} \log \frac{p_{j}}{q_{j}} = \sum_{j = 1}^{J} p_{j} \log p_{j} - \sum_{j = 1}^{J} p_{j} \log q_{j} = - H (p) + H (p, q) $$

where $H (p)$ is the entropy of $p$ and $H (p, q)$ is the cross entropy.

$K L (p ∣∣ q) \geq 0$ and with equality IFF $p = q$ .
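The discrete decomposition can be checked numerically. The sketch below is a minimal illustration using only the standard library; the function names and the two example distributions are assumptions made here, and the code assumes $q_{j} > 0$ wherever $p_{j} > 0$.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)


def cross_entropy(p, q):
    """Cross entropy H(p, q) in nats; assumes q_j > 0 wherever p_j > 0."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)


def kl_divergence(p, q):
    """KL[p || q] = sum_j p_j * log(p_j / q_j)."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


# Two example distributions on three points (assumed for illustration)
p_example = [0.5, 0.3, 0.2]
q_example = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p_example, q_example)
kl_qp = kl_divergence(q_example, p_example)
```

Here `kl_pq` equals `cross_entropy(p_example, q_example) - entropy(p_example)` up to floating-point error, and `kl_pq` differs from `kl_qp`, which is the directionality of the KL 'distance'.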

The $KL$ 'distance', unlike Euclidean distance, is not the same between $f, g$ as between $g, f$; i.e. it is directional. Akaike showed that using K-L model selection entails finding a good estimator for

$$ E_{y, h} [E_{x, h} [\log (f (x \mid \hat{θ} (y)))]] $$

where $x, y$ are independent random samples from the same distribution and expectations are taken w.r.t. the true distribution $h$. Estimating this quantity for each model $f_{i}$ by the maximised log-likelihood is biased upwards. An approximately unbiased estimator of the above target quantity is $\log L (\hat{θ} \mid y) - K$, where $K$ is the number of estimated parameters.

For a general class of maximum-likelihood models, multiplying by $-2$ gives

$$ AIC = - 2 \log L (\hat{θ} \mid y) + 2 K $$

For linear regression models, this simplifies to

$$ AIC = n \log \hat{σ}^{2} + 2 K; \quad \hat{σ}^{2} = \frac{\sum_{i = 1}^{n} \hat{ε}_{i}^{2}}{n} $$

$$ BIC = \ln \left( \frac{e' e}{n} \right) + \frac{k \ln (n)}{n} $$

Testing

To test the hypothesis $H_{0} : θ_{0} = 0$ against the alternative, there are three classical tests.

We partition the parameter $K$-vector $θ$ into two parts $(θ_{0}', θ_{1}')'$ whose dimensions satisfy $K_{0} + K_{1} = K$. $θ_{1}$ is a nuisance parameter: its value is not restricted under the null.

Let $\hat{θ}_{u} := (\hat{θ}_{0 u}, \hat{θ}_{1 u})$ be the unrestricted MLEs. If we are testing the restriction $θ_{0} = 0$, then the restricted parameter vector is $\hat{θ}_{R} := (0, \hat{θ}_{1 r})$. IOW, test $h (θ_{0}) = 0$.

Likelihood ratio (LR): if the null is true, $ℓ$ at the restricted estimate $(0, \hat{θ}_{1 r})$ should not be much smaller than $ℓ$ at the unrestricted estimate $(\hat{θ}_{0 u}, \hat{θ}_{1 u})$.

$$ LR := 2 \times (ℓ (\hat{θ}_{u}) - ℓ (\hat{θ}_{R})) $$

Under the null, $LR \sim χ_{K_{0}}^{2}$ (where $K_{0}$ is the number of restrictions being tested).

Lagrange multiplier / score (LM): if the limiting $ℓ$ is maximised at $θ_{0} = 0$, the derivative of $ℓ$ w.r.t. $θ_{0}$ at that point should be close to zero.

$$ LM := S (\hat{θ}_{R})' [I^{-1} (\hat{θ}_{R})] S (\hat{θ}_{R}) $$

Under the null, $LM \sim χ_{K_{0}}^{2}$.

Wald (W): unrestricted estimates of $θ_{0}$ should be close to zero.

$$ W := N \cdot \hat{θ}_{0 u}' (\hat{I}^{00})^{-1} \hat{θ}_{0 u} $$

where $\hat{I}^{00}$ is the top-left block of the inverse information matrix (corresponding to the restricted parameters). Under the null, $W \sim χ_{K_{0}}^{2}$.

Alternatively, $W = h (\hat{θ}_{u})' Ω^{-1} h (\hat{θ}_{u})$, where $h_{1} (θ), \dots, h_{K_{0}} (θ)$ are the restrictions and

$$ Ω = \left( \frac{\partial h (θ)}{\partial θ} \right)' V [\hat{θ}_{u}] \left( \frac{\partial h (θ)}{\partial θ} \right) $$

is evaluated at $\hat{θ}_{u}$.

McFadden's Pseudo-$R^{2}$

$$ R_{bin}^{2} := 1 - \frac{ℓ (\hat{β})}{ℓ (\bar{y})} $$

Binary Choice

Estimate the probability using OLS $Pr (y = 1∣ x) = X β$ . $V [y ∣ x] = X β (1 - X β)$ , so heteroskedasticity is mechanically present unless all coefficients are zero.

$$ \operatorname{Logit} (p) = \log \left( \frac{p}{1 - p} \right) $$

$$ \operatorname{Logistic} (x) = \frac{1}{1 + \exp (- x)} = \frac{e^{x}}{1 + e^{x}} $$

Logistic regression fits $\operatorname{logit} (p_{i}) = x_{i}' β$, where $\operatorname{logit}$ is the link function that maps the probability scale onto the scale of the linear index $x_{i}' β$. Alternatively, one can use the standard normal CDF $Φ$ (probit). For $Y_{i} \in \{0, 1\}$, assume the latent index model $Y_{i}^{*} = X_{i}' β + ϵ_{i}$; $Y_{i} := 1 \{Y_{i}^{*} > 0\}$. $Y_{i}$ is Bernoulli, so $\mathcal{L} = \prod_{i = 1}^{N} π_{i}^{Y_{i}} (1 - π_{i})^{1 - Y_{i}}$.

Symmetric CDFs. Let $π_{i} = E [Y_{i} \mid X_{i}] = Pr (Y_{i} = 1 \mid X_{i}) = F (X_{i}' β) = 1 - F (- X_{i}' β)$

Probit: $F (u) = Φ (u)$

Logit:

$F (u) = Λ (u) = \frac{1}{1 + \exp (- u)} = \frac{\exp (u)}{1 + \exp (u)}$

$f (u) = Λ' (u) = (1 + e^{- u})^{- 2} e^{- u}$

| Model | $π_{i} = Pr (y = 1 \mid x)$ | Marginal effect $\frac{\partial p}{\partial x_{j}}$ |
| --- | --- | --- |
| Logit | $Λ (x' β) = \frac{\exp (x' β)}{1 + \exp (x' β)}$ | $Λ (x' β) [1 - Λ (x' β)] β_{j}$ |
| Probit | $Φ (x' β)$ | $ϕ (x' β) β_{j}$ |
| Clog-log | $C (x' β) = 1 - \exp (- \exp (x' β))$ | $\exp (- \exp (x' β)) \exp (x' β) β_{j}$ |
| LPM | $x' β$ | $β_{j}$ |

$$ ℓ (β) = \frac{1}{n} \sum_{i} \left( Y_{i} \log F (X_{i}' β) + (1 - Y_{i}) \log (1 - F (X_{i}' β)) \right) $$

Let $f_{i} := f (x_{i}' β); F_{i} := F (x_{i}' β)$ be the density and CDF evaluated at $x_{i}' β$.

$$ s_{i} (θ) = \frac{f_{i} x_{i} [y_{i} - F_{i}]}{F_{i} (1 - F_{i})} $$

The sample score solves

$$ \sum_{i = 1}^{N} \left( \frac{y_{i}}{F_{i}} f_{i} x_{i} - \frac{1 - y_{i}}{1 - F_{i}} f_{i} x_{i} \right) = 0 $$

Variance:

$$ V [\hat{β}] = \left( \sum_{i = 1}^{N} \frac{f_{i}^{2} x_{i} x_{i}'}{F_{i} (1 - F_{i})} \right)^{-1} $$

$V [y_{i} \mid x_{i}] = F_{i} (1 - F_{i})$

Marginal effect:

$$ \frac{\partial Pr (y_{i} = 1 \mid x_{i})}{\partial x_{i}} = f (x_{i}' β) β $$

For the logit, the objective is

$$ Q (θ) = ℓ (θ) = \frac{1}{n} \sum_{i} \left( y_{i} \log Λ (x_{i}' θ) + (1 - y_{i}) \log [1 - Λ (x_{i}' θ)] \right) $$

Since for the logistic CDF, $Λ' (v) = λ (v) = Λ (v) (1 - Λ (v))$, the score and Hessian for observation $i$ can be written as

$$ S_{i} (θ) = [y_{i} - Λ (x_{i}' θ)] x_{i}, \qquad \nabla_{i}^{2} (θ) = - Λ (x_{i}' θ) [1 - Λ (x_{i}' θ)] x_{i} x_{i}' = - λ (x_{i}' θ) x_{i} x_{i}' $$
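The logit score and Hessian above are all that is needed for Newton-Raphson. The sketch below is a minimal illustration for an intercept plus one regressor, solving the 2x2 Newton step by hand; the function names, the toy data, and the fixed iteration count are assumptions made for this example, and it assumes the data are not perfectly separable.

```python
import math


def logistic(v):
    """Logistic CDF Lambda(v)."""
    return 1.0 / (1.0 + math.exp(-v))


def fit_logit(x, y, iters=25):
    """Newton-Raphson for Pr(y = 1 | x) = Lambda(b0 + b1 * x) (hypothetical helper)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # score: sum_i (y_i - Lambda_i) * (1, x_i)
        h00 = h01 = h11 = 0.0    # negative Hessian: sum_i lambda_i * (1, x_i)(1, x_i)'
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)    # lambda(v) = Lambda(v) * (1 - Lambda(v))
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: theta_new = theta + (-H)^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data with overlapping classes (assumed for illustration)
x_data = [-2.0, -1.0, 0.0, 1.0, 2.0, -1.0, 1.0, 0.0]
y_data = [0, 0, 0, 1, 1, 1, 0, 1]
b0_hat, b1_hat = fit_logit(x_data, y_data)
```

Because the logit log-likelihood is globally concave, Newton-Raphson from a zero start converges quickly on well-behaved data; at the solution the sample score is numerically zero.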

Discrete Choice

In many Additive Random Utility Models (ARUMs), $F (ϵ_{1} - ϵ_{0})$ is logistic, which extends to multivariate generalisations of the logit. This assumes that the errors themselves follow the Gumbel / type 1 extreme-value distribution

$$ f (ϵ) = \exp (- ϵ) \exp (- \exp (- ϵ)), \quad - \infty < ϵ < \infty $$

and $F (ϵ) = \exp (- \exp (- ϵ))$.

Ordered

Random utility with multiple cutoffs $ψ_{1}, \dots, ψ_{J}$, where $ψ_{1} = 0, ψ_{J} = \infty$.

Define $y_{i}^{*}$ as the latent variable, and

$$ y_{i} = \begin{cases} 0 & \text{if } - \infty \, (= ψ_{0}) < y_{i}^{*} \leq ψ_{1} \\ 1 & \text{if } ψ_{1} < y_{i}^{*} \leq ψ_{2} \\ \vdots & \vdots \\ J & \text{if } ψ_{J - 1} < y_{i}^{*} \leq \infty \, (= ψ_{J}) \end{cases} $$

which means, for the ordered logit,

$$ Pr (y_{i} \leq j \mid x_{i}) = \frac{\exp (ψ_{j} - x_{i}' β)}{1 + \exp (ψ_{j} - x_{i}' β)} $$

and for the ordered probit,

$$ Pr (y_{i} \leq j \mid x_{i}) = Φ (ψ_{j} - x_{i}' β) \iff Pr (y = k - 1 \mid x_{i}) = Φ (α_{k} - X β) - Φ (α_{k - 1} - X β) $$

Both specifications yield a likelihood that is simply the product of binary logit/probit models that switch between adjacent categories for each observation.

$$ ℓ (β, ψ \mid Y, X) = \sum_{i = 1}^{N} \sum_{j = 1}^{J} 1_{y_{i} = j} \log \left( F (ψ_{j} - X_{i}' β) - F (ψ_{j - 1} - X_{i}' β) \right) $$

Marginal effects (for regressor $x_{k}$, evaluated at the mean) are of the form

$$ \left. \frac{\partial Pr (Y = j)}{\partial x_{k}} \right|_{\bar{x}} = \hat{β}_{k} \left( f (\hat{ψ}_{j - 1} - \bar{x}' \hat{β}) - f (\hat{ψ}_{j} - \bar{x}' \hat{β}) \right) $$

Unordered

Multinomial distribution: $p (y_{i}) = \prod_{j = 1}^{J} π_{j}^{1_{y_{i} = j}}$

$$ ℓ (π \mid Y) = \sum_{i = 1}^{N} \sum_{j = 1}^{J} 1_{ij} \log π_{j} $$

where $1_{ij} := 1 \{y_{i} = j\}$. The multinomial logit (MNL) specifies

$$ π_{ij} = Pr (y_{i} = j \mid x_{i}) = \frac{\exp (x_{i}' β_{j})}{1 + \sum_{k = 1}^{J} \exp (x_{i}' β_{k})} = \frac{\exp (x_{i}' β_{j})}{\sum_{k = 0}^{J} \exp (x_{i}' β_{k})}, \quad β_{0} := 0 $$

where we adopt the normalisation $Pr (y = 0 \mid x_{i}) = \frac{1}{1 + \sum_{k = 1}^{J} \exp (x_{i}' β_{k})}$ for identification.

Coefficient interpretation:

$$ \frac{p_{j} (x_{i}, β)}{p_{0} (x_{i}, β)} = \exp (x_{i}' β_{j}) \implies \log \left[ \frac{p_{j} (x_{i}, β)}{p_{h} (x_{i}, β)} \right] = x_{i}' (β_{j} - β_{h}) \quad \forall j, h \in \{1, \dots, J\} $$

which implies that the log-odds ratio is linear in $x$.

The conditional logit permits incorporation of choice-varying predictors $X_{ij}$ and nests MNL.

$$ π_{ij} = Pr (Y_{i} = j \mid X_{ij}) = \frac{\exp (X_{ij}' β)}{\sum_{k = 1}^{J} \exp (X_{ik}' β)} $$

Log likelihood of the form

$$ ℓ = \sum_{i = 1}^{n} \left( \sum_{h = 1}^{M} 1_{ih} [x_{i}' β_{h}] - \log \left( \sum_{l = 1}^{M} \exp (x_{i}' β_{l}) \right) \right) $$

Independence of irrelevant alternatives (IIA): the relative risk $π_{ij} / π_{ik}$ is independent of other choices $\neg \{j, k\}$; choices are a series of pairwise comparisons. $p_{j} (x_{j}) / p_{h} (x_{h}) = \exp [(x_{j} - x_{h}) β]$. IoW $ϵ_{ij} \perp\!\!\!\perp ϵ_{ik}$ for $j \neq k$.

The multinomial probit relaxes this by assuming $ϵ_{i} \sim_{iid} MVN (0, Σ_{J})$:

$$ π_{ij} = \int_{- \infty}^{- \ddot{X}_{1 j}' β} \dots \int_{- \infty}^{- \ddot{X}_{J j}' β} ϕ (\ddot{ϵ}_{1 j}, \dots, \ddot{ϵ}_{J j}) \, d \ddot{ϵ}_{1 j} \dots d \ddot{ϵ}_{J j} $$

where $\ddot{X}_{k l} = X_{ik} - X_{i l}; \ddot{ϵ}_{k l} = ϵ_{ik} - ϵ_{i l}$

Counts and Rates

Counts

$f (y ∣ λ) = λ^{y} exp (- λ) / y!$

Poisson specification: $λ_{i} = \exp (x_{i}' β)$. Yields log density

$$ \log f (y_{i} \mid x_{i}, β) = y_{i} x_{i}' β - \exp (x_{i}' β) - \log y_{i}! $$

Score:

$$ s_{i} (β) = y_{i} x_{i} - \exp (x_{i}' β) x_{i} = x_{i} (y_{i} - \exp (x_{i}' β)) $$

so the MLE solves

$$ \sum_{i} x_{i} (y_{i} - \exp (x_{i}' β)) = 0 $$

Hessian

$$ \nabla_{i}^{2} (β) = \frac{\partial s_{i} (β)}{\partial β'} = - \exp (x_{i}' β) x_{i} x_{i}' \implies Avar (\hat{β}) = \left( \sum_{i = 1}^{n} \exp (x_{i}' β) x_{i} x_{i}' \right)^{-1} $$

Assumes $λ : = exp (x_{i}^{'} β) = E [Y ∣ X] = V [Y ∣ X]$ .

Marginal Effect: Since $E [y \mid x] = \exp (x' β)$ for Poisson, $\frac{\partial E [y \mid x]}{\partial x_{j}} = E [y \mid x] β_{j}$. Parameters can be interpreted as semi-elasticities, since

$$ β = \frac{\partial E [y \mid x]}{\partial x} \times \frac{1}{E [y \mid x]} = \frac{\partial \log E [y \mid x]}{\partial x} $$

Overdispersion:

$$ E (Y_{i} \mid X_{i}) = λ_{i} = \exp (X_{i}' β); \quad Var (Y_{i} \mid X_{i}) = V_{i} = σ^{2} λ_{i}; \quad σ^{2} > 1 $$

Zero-inflated Poisson: define a Bernoulli indicator $π_{i}$ ($= 1$ w.p. $θ_{i}$) for $y = 0$ observations, and specify separate models for zero and nonzero data, with potentially different covariates in $θ$ and $λ$. Yields the following (difficult to maximise) likelihood

$$ L = \prod_{i = 1}^{n} \left( θ_{i} + (1 - θ_{i}) \exp (- λ_{i}) \right)^{π_{i}} \left( (1 - θ_{i}) \frac{\exp (- λ_{i}) λ_{i}^{y_{i}}}{y_{i}!} \right)^{1 - π_{i}} $$

Negative binomial:

$$ p (y_{i}) = \frac{Γ \left( \frac{λ}{σ^{2} - 1} + y_{i} \right)}{y_{i}! \, Γ \left( \frac{λ}{σ^{2} - 1} \right)} \left( \frac{σ^{2} - 1}{σ^{2}} \right)^{y_{i}} (σ^{2})^{\frac{- λ}{σ^{2} - 1}} $$

$E [Y_{i}] = λ; V [Y_{i}] = λ σ^{2}$. Let $μ_{i} = \exp (x_{i}' β)$, $r_{i} = α / (α + μ_{i})$, $q_{i} = α μ_{i}^{2 - p}$

$$ ℓ = \sum_{i = 1}^{n} \left[ \log Γ (y_{i} + q_{i}) - \log Γ (q_{i}) - \log Γ (y_{i} + 1) + q_{i} \log r_{i} + y_{i} \log (1 - r_{i}) \right] $$

Rates

Survival: $S (y) : = 1 - F (y)$

Hazard: $λ (y) = h (y) : = \frac{f ( y )}{1 - F ( y )} = \frac{f ( y )}{S ( y )}$ ;

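Cumulative Hazard $Λ (y) := \int_{0}^{y} λ (s) d s = - \log S (y)$.

Kaplan-Meier estimator of the survival function, where $r_{k}$ is the number at risk and $d_{k}$ the number of events at time $t_{k}$:

$$ \hat{S} (t_{j}) = \prod_{k : t_{k} \leq t_{j}} (1 - \hat{λ} (t_{k})) = \prod_{k : t_{k} \leq t_{j}} \frac{r_{k} - d_{k}}{r_{k}} $$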
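The product formula above is the Kaplan-Meier (product-limit) estimator, which can be sketched in a few lines. The function name `kaplan_meier`, the toy durations, and the convention that `events[i] == 1` marks an observed event (0 marks right-censoring) are assumptions made for this illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct event time (hypothetical helper).

    times:  observed durations
    events: 1 if the event was observed, 0 if the spell is right-censored
    """
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # r: number still at risk just before t; d: events occurring at t
        r = sum(1 for ti in times if ti >= t)
        d = sum(1 for ti, di in zip(times, events) if ti == t and di == 1)
        surv *= (r - d) / r
        curve.append((t, surv))
    return curve


# Toy data (assumed): six spells, two of them right-censored
times_toy = [1, 2, 2, 3, 4, 5]
events_toy = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(times_toy, events_toy)
```

Censored spells contribute to the risk set $r_{k}$ until they drop out, but never to the event count $d_{k}$, which is how the estimator uses partial information from censored observations.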

Proportional Hazard Models

Conditional hazard rate $λ (t \mid x)$ can be factored as

$$ λ (t \mid x, β) = \underbrace{λ_{0} (t)}_{\text{baseline hazard}} \; \underbrace{ϕ (x, β)}_{\exp (x' β)} $$

The baseline hazard is $1$ for the exponential and $α y^{α - 1}$ for the Weibull.

| Parametric model | Hazard | Survival |
| --- | --- | --- |
| Exponential | $γ$ | $\exp (- γ t)$ |
| Weibull | $γ α t^{α - 1}$ | $\exp (- γ t^{α})$ |
| Generalised Weibull | $γ α t^{α - 1} S (t)^{- μ}$ | $[1 - μ γ t^{α}]^{1 / μ}$ |
| Gompertz | $γ \exp (α t)$ | $\exp (- (γ / α) (e^{α t} - 1))$ |

For survival models with censoring, the likelihood is often written as

$$ L (θ) = \prod_{i} f (t_{i} \mid θ)^{d_{i}} S (t_{i} \mid θ)^{1 - d_{i}} $$

where $d_{i}$ is an indicator equal to 1 if the spell is completed (uncensored) and 0 if it is right-censored, and $t_{i}$ is the observed time.

Weibull Density: $f (y) = γ α y^{α - 1} \exp (- γ y^{α}), \; y, α, γ > 0$. $E [y] = γ^{- 1 / α} Γ (α^{- 1} + 1)$. Specify $γ = \exp (x' β)$, so $E [y \mid x] = \exp (- x' β / α) Γ (α^{- 1} + 1)$. Then, the log-likelihood is

$$ ℓ (θ) = \frac{1}{N} \sum_{i} \left\{ x_{i}' β + \log α + (α - 1) \log y_{i} - \exp (x_{i}' β) y_{i}^{α} \right\} $$

FOCs (w.r.t. $β$ and $α$ respectively) are

$$ N^{-1} \sum_{i} \left\{ 1 - \exp (x_{i}' β) y_{i}^{α} \right\} x_{i} = 0 $$

$$ N^{-1} \sum_{i} \left\{ α^{-1} + \log y_{i} - \exp (x_{i}' β) y_{i}^{α} \log y_{i} \right\} = 0 $$

The model needs to be correctly specified for the estimator to be consistent, unlike OLS or Poisson.

Truncation and Censored Regressions

If a continuous r.v. $y \sim f (y)$ is truncated from below at $c$,

$$ f (y \mid y > c) = \frac{f (y)}{Pr (y > c)} = \frac{f (y)}{1 - F (c)} $$

For the truncated normal distribution where $y \sim N (μ_{0}, σ_{0}^{2})$ is truncated at $c$ ,

$E [y ∣ y > c] = μ_{0} + σ_{0} λ (v)$ , $V [y ∣ y > c] = σ_{0}^{2} {1 - λ (v) [λ (v) - v]}$

where $v = (c - μ_{0}) / σ_{0}$ and $λ (v) = \frac{ϕ ( v )}{1 - Φ ( v )}$ is the inverse Mills ratio / Hazard function.
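The truncated-normal moments can be evaluated with the standard library alone. The sketch below is illustrative: the function names are assumptions made here, and the standard normal CDF is built from `math.erf`.

```python
import math


def norm_pdf(v):
    """Standard normal density phi(v)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def norm_cdf(v):
    """Standard normal CDF Phi(v), via the error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def inverse_mills(v):
    """lambda(v) = phi(v) / (1 - Phi(v))."""
    return norm_pdf(v) / (1.0 - norm_cdf(v))


def truncated_normal_moments(mu0, sigma0, c):
    """Mean and variance of y ~ N(mu0, sigma0^2) conditional on y > c."""
    v = (c - mu0) / sigma0
    lam = inverse_mills(v)
    mean = mu0 + sigma0 * lam
    var = sigma0 ** 2 * (1.0 - lam * (lam - v))
    return mean, var


# Standard normal truncated from below at zero (a half-normal)
tn_mean, tn_var = truncated_normal_moments(0.0, 1.0, 0.0)
```

For the half-normal case the results can be verified analytically: the mean is $\sqrt{2 / π}$ and the variance is $1 - 2 / π$, so truncation raises the mean and shrinks the variance.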

Tobit Regression

Censored $Y_{i}$ s.t. $y_{i}^{*} = β' x_{i} + ϵ_{i}$, $ϵ_{i} \sim N (0, σ^{2})$, and

$$ y_{i} = \begin{cases} y_{i}^{*} & \text{if } y_{i}^{*} > c \\ c & \text{if } y_{i}^{*} \leq c \end{cases} $$

(i.e. $y$ is censored from below at $c$).

Truncated MLE maximises

$$ \log L_{n} (θ) = \sum_{i = 1}^{n} \left( \log f (y_{i} \mid x_{i}, θ) - \log [1 - F (c \mid x_{i}, θ)] \right) $$

with $f$ and $F$ denoting the density and distribution of $y^{*}$ respectively.

Type-I Tobit assumes $y^{*}$ is normally distributed, which (with $c = 0$) gives us the following likelihood, where the first product runs over censored observations and the second over uncensored ones:

$$ L = \prod_{y_{i} = 0} [1 - Φ (x_{i}' β / σ)] \prod_{y_{i} > 0} σ^{-1} ϕ [(y_{i} - x_{i}' β) / σ] $$

Censored Regression

Consider a model $y_{i} = x_{i}' β + ϵ_{i}; ϵ_{i} \mid x_{i} \sim N (0, σ^{2})$, where $y_{i}$ is observed only if $y_{i} > c$.

This yields the log-likelihood (up to an additive constant)

$$ ℓ (y_{i} \mid x_{i}; β, σ^{2}) = \left( - \frac{1}{2} \log (σ^{2}) - \frac{1}{2} \left( \frac{y_{i} - x_{i}' β}{σ} \right)^{2} \right) - \log \left( 1 - Φ \left( \frac{c - x_{i}' β}{σ} \right) \right) $$

Generalised Linear Models Theory

Semi-robust likelihoods belong to the Linear exponential Family of the following form:

Response observations $y_{i}$ are realisations of random variables $Y_{i}$ with densities of the form

$$ f (y \mid θ, ϕ) = \exp \left( \frac{y θ - b (θ)}{a (ϕ)} + c (y, ϕ) \right) $$

$θ \in \mathbb{R}$ is called the canonical / natural parameter, $ϕ \in \mathbb{R}^{+}$ is the dispersion parameter. $E [Y \mid θ, ϕ] = b' (θ)$, $V [Y \mid θ, ϕ] = a (ϕ) b'' (θ)$

For example, the Gaussian density can be written in this form:

$$ f (y_{i}) = \exp \left\{ \frac{y_{i} μ_{i} - \frac{1}{2} μ_{i}^{2}}{σ^{2}} - \frac{y_{i}^{2}}{2 σ^{2}} - \frac{1}{2} \log (2 π σ^{2}) \right\} $$

Linear predictor $η_{i} : = X_{i}^{'} β$ specifying the variation in $Y$ accounted for by known covariates. $g$ is a transformation of the mean that addresses scaling. It is so called because it links the expected value of the response variable $E [Y ∣ θ, ϕ] = μ_{i} = b^{'} (θ_{i})$ to the explanatory covariates.

g (μ_{i}) = η_{i} = X_{i}^{'} β ⟹ μ_{i} = g^{- 1} (X_{i}^{'} β)

Since $μ_{i} = b^{'} (θ_{i})$ , under a canonical link ( $g (μ_{i}) = θ_{i} (μ_{i})$ ), $θ_{i} = X_{i}^{'} β$ .

ML estimation

Log likelihood

$$ \log L (θ, ϕ \mid y) = \sum_{i = 1}^{N} \left( \frac{y_{i} θ_{i} - b (θ_{i})}{a_{i} (ϕ)} + c (y_{i}, ϕ) \right) $$

Score function

$$ S_{j} (β, y) = \sum_{i = 1}^{N} \frac{\partial ℓ_{i}}{\partial β_{j}} = \sum_{i = 1}^{N} \frac{\partial ℓ_{i}}{\partial θ_{i}} \frac{\partial θ_{i}}{\partial μ_{i}} \frac{\partial μ_{i}}{\partial η_{i}} \frac{\partial η_{i}}{\partial β_{j}} = \sum_{i = 1}^{N} \frac{y_{i} - μ_{i}}{a_{i} (ϕ)} \frac{1}{V [μ_{i}]} \frac{\partial g^{-1} (η_{i})}{\partial η_{i}} x_{ij} $$

The FoC can be written as

$$ \frac{\partial \log L (θ, ϕ \mid y)}{\partial β} = X' W^{-1} [y - \hat{y}] = 0 $$

where $W$ is a weight matrix (which depends on $β$). The fitted value is

$$ \hat{y} = \hat{m} (x) = E [y \mid X = x] = g^{-1} (x' \hat{β}) $$

By a first-order Taylor expansion, define the working response

$$ z = g (\hat{y}) + (y - \hat{y}) \nabla g (\hat{y}) $$

This gives us an update rule (iteratively reweighted least squares)

$$ β_{k + 1} = (X' W_{k}^{-1} X)^{-1} X' W_{k}^{-1} z_{k} $$

Repeat until convergence to $β_{\infty}$.
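The update rule can be sketched for a Poisson model with log link and two coefficients. This is a minimal illustration rather than a general GLM fitter: the function name `irls_poisson`, the toy data, and the hand-written 2x2 solve are assumptions made here. For the log link $\nabla g (μ) = 1 / μ$, and the working weights (playing the role of $W_{k}^{-1}$) are taken to be $μ_{i}$, the usual choice for the canonical Poisson link.

```python
import math


def irls_poisson(x, y, iters=25):
    """IRLS for E[y | x] = exp(b0 + b1 * x) (hypothetical helper)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(iters):
        a00 = a01 = a11 = 0.0   # entries of X' W^{-1} X
        c0 = c1 = 0.0           # entries of X' W^{-1} z
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            z = eta + (yi - mu) / mu   # working response: g(mu) + (y - mu) * g'(mu)
            w = mu                     # working weight for the log link
            a00 += w
            a01 += w * xi
            a11 += w * xi * xi
            c0 += w * z
            c1 += w * xi * z
        det = a00 * a11 - a01 * a01
        # Weighted least squares solve for the 2x2 system
        b0, b1 = (a11 * c0 - a01 * c1) / det, (a00 * c1 - a01 * c0) / det
    return b0, b1


# Toy count data (assumed for illustration)
x_counts = [0.0, 1.0, 2.0, 3.0, 4.0]
y_counts = [1, 2, 2, 5, 8]
b0_irls, b1_irls = irls_poisson(x_counts, y_counts)
mu_fit = [math.exp(b0_irls + b1_irls * xi) for xi in x_counts]
```

At convergence the fitted means satisfy the Poisson score equations $\sum_{i} x_{i} (y_{i} - μ_{i}) = 0$, so the residuals sum to zero and are orthogonal to the regressor.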

| Model | Density | Link |
| --- | --- | --- |
| OLS | Gaussian | Identity |
| Logistic | Binomial | Logistic |
| Probit | Binomial | Normal |
| Poisson | Poisson | Log |

Machine Learning

Supervised Learning

Every supervised ML algorithm essentially involves a function class $F$ and a regulariser $R (f)$ that expresses the complexity of the representation. Estimation then takes two steps.

First, conditional on a level of complexity, choose the best in-sample loss-minimising function:

$$ \min_{f \in F} \underbrace{\sum_{i = 1}^{n} L (f (x_{i}), y_{i})}_{\text{in-sample loss}} \quad \text{s.t.} \quad \underbrace{R (f) \leq c}_{\text{complexity restriction}} $$

Second, estimate the 'optimal' level of complexity using empirical tuning.

| Aspect | Discriminative model | Generative model |
| --- | --- | --- |
| Goal | Directly estimate $E [y \mid x]$ | Estimate $Pr (x \mid y)$ to deduce $Pr (y \mid x)$ |
| What is learned | Decision boundary | Probability distribution of the data |
| Examples | Regressions, SVM | GDA, Naive Bayes |

A loss function is a map $L : (z, y) \in \mathbb{R} \times Y \mapsto L (z, y) \in \mathbb{R}$ that takes as inputs the predicted value $z$ and the real data value $y$ and outputs how different they are.

Least Squares: $\frac{1}{2} (y - z)^{2}$

Logistic: $\log (1 + \exp (- yz))$

Hinge: $\max (0, 1 - yz)$

Cross Entropy: $- [y \log z + (1 - y) \log (1 - z)]$

| Predictor class | Examples | Regulariser / tuning parameters |
| --- | --- | --- |
| Global / parametric predictors | Linear $β' x$ and generalisations | Subset selection $\lVert β \rVert_{0} = \sum_{j = 1}^{k} 1_{β_{j} \neq 0}$; LASSO $\lVert β \rVert_{1} = \sum_{j = 1}^{k} \lvert β_{j} \rvert$; Ridge $\lVert β \rVert_{2}^{2} = \sum_{j = 1}^{k} β_{j}^{2}$; Elastic Net $α \lVert β \rVert_{1} + (1 - α) \lVert β \rVert_{2}^{2}$ |
| Local / nonparametric predictors | Decision / Regression trees; Random forest; Nearest neighbours; Kernel regression | Tree depth, number of nodes/leaves, minimal leaf size, information gain at splits; number of trees; number of variables used in each tree; bootstrap sample size; number of neighbours; kernel bandwidth |
| Mixed predictors | Neural networks (including deep and convolutional); splines | Number of layers, number of neurons per layer, connectivity between neurons; number of knots and order |
| Combining predictors | Bagging; Boosting; Ensemble methods | Number of draws, bootstrap size, individual tuning parameters; learning rate and number of iterations; ensemble weights |

Reference: Mullainathan and Spiess (2017), Table 2.

Curse of dimensionality: take a unit hypercube in dimension $p$ and put another hypercube within it that captures a fraction $r$ of the observations. Each edge will have length $e_{p} (r) = r^{1 / p}$. For moderately high dimensions, $p = 10$, $e_{10} (0.01) = 0.63; e_{10} (0.1) = 0.8$. We need to cover 80% of the range of each variable to capture 10% of the sample.

Define $d (p, N)$ as the median distance from the origin to the closest of $N$ points distributed uniformly in the $p$-dimensional unit ball. $N = 500, p = 10 \implies d = 0.52$ (the closest point is closer to the boundary than to the origin).

$$ d (p, N) = \left( 1 - \left( \frac{1}{2} \right)^{1 / N} \right)^{1 / p} $$
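The numbers quoted above are easy to reproduce. The two helper functions below are illustrative names introduced here, not from the source; they simply evaluate $e_{p} (r) = r^{1 / p}$ and the formula for $d (p, N)$.

```python
def edge_length(r, p):
    """Edge of the sub-cube capturing a fraction r of a unit hypercube in dimension p."""
    return r ** (1.0 / p)


def median_nearest_distance(p, n):
    """d(p, N): median distance from the origin to the nearest of n points."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)


e_one_percent = edge_length(0.01, 10)        # roughly 0.63
e_ten_percent = edge_length(0.10, 10)        # roughly 0.79, quoted as 0.8
d_example = median_nearest_distance(10, 500) # roughly 0.52
```

In one dimension, `edge_length(0.1, 1)` is just 0.1, so the required neighbourhood grows from 10% to about 80% of each axis as $p$ goes from 1 to 10, which is why local averaging breaks down in high dimensions.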

Regularised Regression

In general, we want to impose a penalty for model complexity in order to minimise MSE (trade off some bias for lower variance).

Ridge: estimate the following penalised regression

$$ f (β, X, y) = \sum_{i = 1}^{N} (y_{i} - X_{i}' β)^{2} + λ \sum_{j = 1}^{J} β_{j}^{2} $$

$$ \hat{β}^{Ridge} = (X' X + λ I_{k})^{-1} X' y $$

where $X$ is a standardized design matrix (s.t. all Xs have unit variance). When the columns of $X$ are orthonormal, this reduces to $\hat{β}_{j}^{Ridge} = \frac{\hat{β}_{j}^{OLS}}{1 + λ}$. Let $X = U D V'$ be the SVD of $X$.

Then, ridge coefficients can also be written as

$$ \hat{β}_{λ}^{Ridge} = V (D^{2} + λ I)^{-1} D U' y = \sum_{j = 1}^{p} \frac{d_{j}}{d_{j}^{2} + λ} ⟨ U_{j}, Y ⟩ V_{j} $$

This can be used to compute the ridge coefficient efficiently for a fine grid of $λ$s.

1. Compute the SVD of $X$ and save $U, D, V$.
2. Compute and store $w_{j} = \frac{1}{d_{j}} ⟨ U_{j}, Y ⟩ V_{j}$ for $j = 1, \dots, p$.
3. For each $λ_{m}, m \in [M]$: compute $γ_{j} = \frac{d_{j}^{2}}{d_{j}^{2} + λ_{m}}$, then compute $β_{λ_{m}} = \sum_{j = 1}^{p} γ_{j} w_{j}$.

The solution vector is ‘biased’ towards the leading right singular vectors of $X$ , which gives it the property of a ‘smoothed’ Principal Components regression. For Ridge regression,

$$ dof (λ) = \sum_{j} \frac{λ_{j}}{λ_{j} + λ} $$

where the $λ_{j}$s are the eigenvalues of the (unscaled) covariance matrix $X' X$, i.e. $λ_{j} = d_{j}^{2}$.
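A quick way to build intuition for the effective degrees of freedom is to evaluate the formula for a few penalty values. The helper `ridge_dof` and the example eigenvalues below are assumptions made for illustration.

```python
def ridge_dof(eigenvalues, lam):
    """Effective degrees of freedom: sum_j lambda_j / (lambda_j + lam)."""
    return sum(ev / (ev + lam) for ev in eigenvalues)


# Example eigenvalues of X'X (assumed): squared singular values 9, 4, 1
eigs_example = [9.0, 4.0, 1.0]

dof_no_penalty = ridge_dof(eigs_example, 0.0)    # equals the number of regressors
dof_penalised = ridge_dof(eigs_example, 1.0)     # 9/10 + 4/5 + 1/2 = 2.2
dof_heavy = ridge_dof(eigs_example, 100.0)       # shrinks towards zero
```

With no penalty the degrees of freedom equal the number of regressors, and directions with small eigenvalues lose their degrees of freedom first as $λ$ grows.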

More generally, for any smoother matrix $W$, $df (\hat{μ}) = tr (W)$, which may not be an integer for semi/non-parametric smoothers. In the special parametric case of OLS, $W = X (X' X)^{-1} X'$, so the DoF is simply $k$.

LASSO: consider the objective function

$$ J (β, X, y) = \sum_{i = 1}^{N} (y_{i} - x_{i}' β)^{2} + λ \sum_{j = 1}^{J} \lvert β_{j} \rvert $$

Fit using sequential coordinate descent. The coefficient vector is soft-thresholded:

$$ \hat{β}_{j}^{lasso} = sgn (\hat{β}_{j}) \max (\lvert \hat{β}_{j} \rvert - λ, 0) $$

where $X$ is a standardized design matrix (s.t. all Xs have unit variance), and $\lVert \cdot \rVert_{1}$ is the $ℓ_{1}$ norm. In both cases (ridge and LASSO), pick the tuning parameter $λ$ using cross-validation.
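The soft-thresholding operator, $sgn (b) \max (\lvert b \rvert - λ, 0)$, is simple enough to write out directly. The function name `soft_threshold` and the example coefficients are illustrative assumptions.

```python
def soft_threshold(b, lam):
    """Shrink b towards zero by lam; set it exactly to zero if |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


# Example unpenalised coefficients (assumed) and a penalty of 1.0
ols_coefs = [3.0, -3.0, 0.5, -0.2]
lasso_coefs = [soft_threshold(b, 1.0) for b in ols_coefs]
```

Large coefficients are shifted towards zero by $λ$, while coefficients smaller than $λ$ in absolute value become exactly zero, which is the source of the LASSO's variable selection. Ridge, by contrast, rescales coefficients proportionally and never sets them exactly to zero.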

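LASSO-penalised likelihood is the ML analogue to LASSO. Define

$$ \hat{θ} = \arg\min_{θ \in Θ} \left( - ℓ (θ \mid Y, X) + λ \lVert θ^{P} \rVert_{1} \right) $$

where

$$ \lVert θ^{P} \rVert_{1} = \sum_{k} \lvert θ_{k}^{P} \rvert $$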

The elastic net combines ridge regression and the lasso by adding an $ℓ_{2}$ penalty to the LASSO's objective function

$$ J (β) = \lVert y - X β \rVert_{2}^{2} + λ_{1} \lVert β \rVert_{1} + \frac{λ_{2}}{2} \lVert β \rVert_{2}^{2} $$

Generalise the $ℓ_{2}$ penalty to a class of penalty functions of the form

$$ β^{T} V Z V^{T} β $$

where $Z$ is a diagonal matrix whose diagonal elements are functions of the squared singular values.

Define the following objective function (pcLASSO)

$$ J (β) = \frac{1}{2} \lVert y - X β \rVert_{2}^{2} + λ \lVert β \rVert_{1} + \frac{θ}{2} β^{T} V D_{d_{1}^{2} - d_{j}^{2}} V^{T} β $$

where $D_{d_{1}^{2} - d_{j}^{2}}$ is an $m \times m$ diagonal matrix with diagonal entries equalling $d_{1}^{2} - d_{1}^{2}, d_{1}^{2} - d_{2}^{2}, \dots$. This penalty term gives no weight to the component of $β$ that aligns with the first right singular vector of $X$ (i.e. the first principal component). This gives it better predictive accuracy in some settings.

Comparing principal-coordinate predictions of ridge and pcLASSO:

$$ X \hat{β}_{Ridge} = \sum_{j = 1}^{m} \frac{d_{j}^{2}}{d_{j}^{2} + θ} u_{j} u_{j}^{T} y $$

$$ X \hat{β}_{pcL} = \sum_{j = 1}^{m} \frac{d_{j}^{2}}{d_{j}^{2} + θ (d_{1}^{2} - d_{j}^{2})} u_{j} u_{j}^{T} y $$

The latter corresponds to a more aggressive form of shrinkage towards the leading singular vectors.

Classification

Training sample $(x_{i}, y_{i})$ where $y \in Y := \{- 1, + 1\}$ (can relabel to Bernoulli). A predictor $m : X \to Y$, where the labels are produced by an (unknown) classifier $f$. Let $P$ be an (unknown) distribution on $X$. The error of $m$ w.r.t. $f$ is defined by

$$ R_{P, f} (m) = Pr (m (X) \neq f (X)) = P (\{x \in X : m (x) \neq f (x)\}) \quad \text{where } X \sim P $$

The empirical risk is defined as

$$ \hat{R} (m) = \frac{1}{n} \sum_{i = 1}^{n} 1_{m (x_{i}) \neq y_{i}} $$

A perfect classifier (in the sense that $R_{P, f} (m) = 0$) does not exist, so we aim for Probably Approximately Correct (PAC) learners that have $R_{P, f} (m) \leq ε$ w.p. $1 - δ$. The space of models $m$ is restricted to be in a finite set $M$. It can be shown that $\forall ε, δ, P, f$, if $n \geq ε^{-1} \log [δ^{-1} \lvert M \rvert]$, then $R_{P, f} (m^{*}) \leq ε$ w.p. $\geq 1 - δ$, where

$$ m^{*} \in \arg\min_{m \in M} \left[ \frac{1}{n} \sum_{i = 1}^{n} 1_{m (x_{i}) \neq y_{i}} \right] $$

The VC (Vapnik-Chervonenkis) dimension loosely represents the expressive capacity of a set of functions.

Consider $k$ points $\{x_{1}, \dots, x_{k}\}$ and the set

$$ E_{k} = \{(m (x_{1}), \dots, m (x_{k})) : m \in M\} \subseteq \{- 1, + 1\}^{k} $$

We say that $M$ shatters the points if $\lvert E_{k} \rvert = 2^{k}$, i.e. all combinations of labels are possible.

Linear functions can shatter 2 points.

The VC dimension of $M$ is

$$ VC (M) := \sup \{k \text{ s.t. } M \text{ shatters } \{x_{1}, \dots, x_{k}\}\} $$

Let $y \in \{- 1, 1\}$. A linear classifier can then be written as $h (x) = sgn (H (x))$ where

$$ H (x) = a_{0} + \sum_{i = 1}^{d} a_{i} x_{i} $$

Suppose $\exists$ a hyperplane $H (x)$ s.t. $Y_{i} H (x_{i}) \geq 1 \; \forall i$.

The hyperplane $\hat{H} (x) = \hat{a}_{0} + \sum_{i = 1}^{d} \hat{a}_{i} x_{i}$ that separates the data and maximises the 'margin' is given by minimising $\frac{1}{2} \sum_{j = 1}^{d} a_{j}^{2}$ subject to $Y_{i} H (x_{i}) \geq 1$.

Boosting: typically, the function space $M$ is large and complex, so a natural idea is to learn iteratively. Loosely, estimate a model $m_{1}$ for $y$ from $X$, which produces error $ε_{1}$. Next, estimate $m_{2}$ for $ε_{1}$ from $X$, which produces $ε_{2}$, and so on. So, after $k$ steps,

$$ m^{(k)} (\cdot) = \underbrace{m_{1} (\cdot)}_{\sim y} + \underbrace{m_{2} (\cdot)}_{\sim ε_{1}} + \dots + \underbrace{m_{k} (\cdot)}_{\sim ε_{k - 1}} $$

where the first error is $ε_{1} = y - m_{1} (x)$ and so on. The error can also be seen as the negative gradient associated with the quadratic loss function: $ε = - \nabla ℓ$ for $ℓ = \frac{1}{2} (y - m)^{2}$. So, an equivalent representation is

$$ m^{(k)} = m^{(k - 1)} + \arg\min_{h \in H} \left\{ \sum_{i = 1}^{n} ℓ \left( \underbrace{y_{i} - m^{(k - 1)} (x_{i})}_{ε_{k, i}}, h (x_{i}) \right) \right\} $$

where $H$ is a space of 'weak learners' (typically step functions). To ensure 'slow' learning, one typically applies a shrinkage parameter $α \in (0, 1)$, so that $ε_{1} = y - α m_{1} (x)$. Arthur Charpentier's series on probabilistic foundations of econometrics and machine learning covers this topic and includes a useful bibliography.
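The iterative scheme above can be sketched with step-function weak learners on a single regressor. This is a minimal illustration under assumptions made here, not a production implementation: the names `fit_stump` and `boost`, the toy data, the 50 steps, and the shrinkage value `alpha=0.1` are all hypothetical choices.

```python
def fit_stump(x, r):
    """Least-squares step function for residuals r: one split, a constant on each side."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs[:-1], xs[1:]):
        s = 0.5 * (lo + hi)   # candidate split halfway between adjacent x values
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - cl) ** 2 for ri in left) + sum((ri - cr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, cl, cr)
    _, s, cl, cr = best
    return lambda v: cl if v <= s else cr


def boost(x, y, steps=50, alpha=0.1):
    """Fit each weak learner to the current residuals and add a shrunken copy of it."""
    learners = []
    resid = list(y)
    for _ in range(steps):
        h = fit_stump(x, resid)
        learners.append(h)
        resid = [ri - alpha * h(xi) for xi, ri in zip(x, resid)]
    return lambda v: sum(alpha * h(v) for h in learners)


# Toy data (assumed): a hump-shaped relationship that no single stump can fit
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [0.0, 0.8, 0.9, 0.1, -0.8, -1.0]
model = boost(x_train, y_train)

mse_boost = sum((yi - model(xi)) ** 2 for xi, yi in zip(x_train, y_train)) / len(y_train)
y_mean = sum(y_train) / len(y_train)
mse_mean = sum((yi - y_mean) ** 2 for yi in y_train) / len(y_train)
```

Each step can only lower the in-sample squared error, so `mse_boost` ends up below the constant-prediction benchmark `mse_mean`; in practice the number of steps and `alpha` would be tuned by cross-validation to avoid overfitting.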

Goodness of Fit for Classification

calibration: Bin predicted probabilities $\hat{Y}$ into bins $\{g_{k}\}$, and within each compute $\overline{\hat{Y}}_{g_{k}}$ (average predicted probability) and $\overline{Y}_{g_{k}}$ (average observed outcome). Plot the two averages against each other. In a well calibrated model, the binned averages trace the identity line.
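A calibration table can be computed as follows. This is a sketch assuming equal-width bins on $[0, 1]$ and simulated data in which outcomes are drawn from the predicted probabilities, so the model is calibrated by construction; `calibration_table` is a hypothetical helper.

```python
import numpy as np

def calibration_table(y_true, p_hat, n_bins=10):
    """Rows of (mean predicted probability, mean observed outcome, bin count)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(p_hat, edges[1:-1])  # bin index 0..n_bins-1 for each prediction
    rows = []
    for k in range(n_bins):
        in_bin = bins == k
        if in_bin.any():                    # skip empty bins
            rows.append((p_hat[in_bin].mean(), y_true[in_bin].mean(), int(in_bin.sum())))
    return np.array(rows)

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=20000)
y_true = rng.binomial(1, p_hat)             # outcomes drawn from the predicted probabilities

table = calibration_table(y_true, p_hat)
max_gap = np.abs(table[:, 0] - table[:, 1]).max()
print(table)
```

Plotting the first column against the second gives the calibration plot; here the points should sit close to the identity line.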

discrimination: Discrimination is a measure of whether $Y = 1$ observations have high $\hat{Y}$, and correspondingly $Y = 0$ observations have low $\hat{Y}$. There are many measures, listed below.

| | Observed $Y = 1$ | Observed $Y = 0$ |
| --- | --- | --- |
| Predicted positive ($\hat{Y} > c$) | True Positive (TP) | False Positive (FP) |
| Predicted negative ($\hat{Y} < c$) | False Negative (FN) | True Negative (TN) |
| Total | Positive (P) | Negative (N) |

Accuracy = $(TP + TN) / (P + N)$ - Overall performance

Precision = $TP / (TP + FP)$ - How accurate positive predictions are

Sensitivity = Recall = True positive Rate = $TP / P$ - Coverage of actual positive sample

Specificity = True Negative Rate = $TN / N$ - Coverage of actual negative sample

Brier Score =

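$$\frac{1}{N} \sum_{i} (\hat{Y}_{i} - Y_{i})^{2} = \underbrace{\frac{1}{N} \sum_{k = 1}^{K} n_{k} (\hat{Y}_{k} - \bar{Y}_{k})^{2}}_{\text{Calibration}} + \underbrace{\frac{1}{N} \sum_{k = 1}^{K} n_{k} \bar{Y}_{k} (1 - \bar{Y}_{k})}_{\text{Refinement}}$$

where $n_{k}$ is the number of observations in bin $k$, $\hat{Y}_{k}$ is their common predicted probability and $\bar{Y}_{k}$ is their average observed outcome.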

F1 Score = $\frac{2 TP}{2 TP + FP + FN}$ : hybrid metric for unbalanced classes
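The measures above follow mechanically from the four confusion-matrix counts. A minimal sketch on a small hand-made sample (the data, the threshold $c = 0.5$ and the function name `classification_metrics` are illustrative assumptions; the function does not guard against empty classes or no predicted positives):

```python
import numpy as np

def classification_metrics(y_true, p_hat, c=0.5):
    """Confusion-matrix metrics at threshold c, plus the Brier score."""
    y_pred = (p_hat > c).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    p, n = tp + fn, fp + tn                  # total positives and negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall": tp / p,
        "specificity": tn / n,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "brier": float(np.mean((p_hat - y_true) ** 2)),
    }

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p_hat = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7])
metrics = classification_metrics(y_true, p_hat)
print(metrics)
```

In this sample TP = 3, FP = 1, FN = 1 and TN = 3, so accuracy, precision, recall, specificity and F1 all equal 0.75, and the Brier score is 0.125.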

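The ROC (receiver operating characteristic) curve is the plot of TPR vs FPR (where FPR = $FP / N$) obtained by varying the threshold $c$.

AUC is the area under the ROC curve. See also the Wikipedia table for the confusion matrix.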
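The curve and the area under it can be traced by sweeping the threshold over the observed scores. This is a sketch using the rule "predict positive when the score exceeds $c$" and the trapezoidal rule for the area; the function names and the small sample are assumptions for illustration.

```python
import numpy as np

def roc_points(y_true, p_hat):
    """FPR and TPR as the threshold c falls from the largest score to -infinity."""
    thresholds = np.concatenate((np.unique(p_hat)[::-1], [-np.inf]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    tpr = np.array([np.sum((p_hat > c) & (y_true == 1)) / n_pos for c in thresholds])
    fpr = np.array([np.sum((p_hat > c) & (y_true == 0)) / n_neg for c in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p_hat = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7])

fpr, tpr = roc_points(y_true, p_hat)
auc_value = auc(fpr, tpr)
print(auc_value)
```

Here 15 of the 16 (positive, negative) pairs are ranked correctly, which matches the area of 0.9375.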

Suppose we have a training set $\{(X_{i}, Y_{i}, D_{i})\}_{i = 1}^{N}$, a test point $x$, and a tree predictor

$$μ (x) = T (x; \{(X_{i}, Y_{i}, D_{i})\}_{i = 1}^{N})$$

Equivalently,

$$μ (x) = \sum_{i = 1}^{N} α_{i} (x) Y_{i} \quad \text{where} \quad α_{i} (x) = \frac{1_{X_{i} \in L (x)}}{\lvert \{i : X_{i} \in L (x)\} \rvert}$$

where the feature space $X$ is partitioned into leaves, with $L (x)$ the leaf containing $x$, and leaves are constructed to maximise heterogeneity between nodes. Do this until all leaves have $2 \times$ minimum leaf size observations. Regression trees overfit, so we need to use cross-validation + other tricks.

Random forests build and average many different trees $T^{*}$ by

Bagging / subsampling training set (Breiman)

Selecting the splitting variable at each step from $m$ out of $p$ randomly drawn features (Amit and Geman)

$$τ (x) = \frac{1}{B} \sum_{b = 1}^{B} T_{b}^{*} (x; \{(X_{i}, Y_{i}, D_{i})\}_{i = 1}^{N})$$
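The two ingredients (bootstrap resampling and random feature selection at each split) can be illustrated with a deliberately small forest. This sketch assumes depth-one trees (a single split each) to keep the code short, ignores the treatment indicator $D_{i}$, and uses simulated data; `fit_stump`, `fit_forest` and `predict_forest` are hypothetical names.

```python
import numpy as np

def fit_stump(X, Y, features):
    """Best single least-squares split over the candidate features."""
    best = None
    for j in features:
        for c in np.unique(X[:, j])[:-1]:    # keep both leaves non-empty
            left = X[:, j] <= c
            sse = ((Y[left] - Y[left].mean()) ** 2).sum() + ((Y[~left] - Y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, c, Y[left].mean(), Y[~left].mean())
    return best[1:]                          # (feature, threshold, left mean, right mean)

def predict_stump(stump, X):
    j, c, mu_left, mu_right = stump
    return np.where(X[:, j] <= c, mu_left, mu_right)

def fit_forest(X, Y, n_trees=50, m=2, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                # bootstrap resample (bagging)
        features = rng.choice(p, size=m, replace=False)  # m of the p features for this split
        forest.append(fit_stump(X[rows], Y[rows], features))
    return forest

def predict_forest(forest, X):
    """Average the B tree predictions."""
    return np.mean([predict_stump(s, X) for s in forest], axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
Y = 2.0 * (X[:, 0] > 0) + rng.normal(scale=0.1, size=300)

forest = fit_forest(X, Y)
mse_forest = np.mean((Y - predict_forest(forest, X)) ** 2)
mse_constant = np.mean((Y - Y.mean()) ** 2)
print(mse_forest, mse_constant)
```

Trees that happen not to draw the informative feature contribute little, which is why the averaged fit is smoother than any single tree.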

Neural networks are a generalised nonparametric regression with many ‘layers’, with components outlined in the referenced figure. For the $i^{t h}$ layer of the network and the $j^{t h}$ hidden unit of the layer, we have

$$z_{j}^{[i]} = w_{j}^{[i] T} x + b_{j}^{[i]}$$

where $w, b, z$ are the weight (coefficient), bias (intercept) and output respectively.

Figure: Neural Network Components

Activation functions are used at the end of a hidden layer to introduce non-linearities into the model. Common ones are shown in the figure.

Figure: Activation Functions

Networks frequently use the cross-entropy loss function.

The learning rate, denoted by $η$, is the pace at which the weights get updated. It can be fixed or adaptively changed, e.g. using Adam.

Back-propagation is a method to update the weights in the neural net by taking into account the actual output and desired output. The derivative with respect to weight $w$ is computed using the chain rule and is of the following form

$$\frac{\partial L (z, y)}{\partial w} = \frac{\partial L (z, y)}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w}$$

So the weight is updated as

$$w \leftarrow w - η \frac{\partial L (z, y)}{\partial w}$$

Training repeats the following steps:

1. Take a batch of training data.
2. Perform forward propagation to compute the corresponding loss.
3. Perform back-propagation to compute the gradients.
4. Use the gradients to update the weights over the network.
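The four steps can be written out for a very small network. This is a minimal sketch assuming one hidden layer with tanh activations, a sigmoid output, cross-entropy loss, a fixed learning rate `eta`, and the full sample as the only batch; the XOR data and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, n_hidden=8, eta=0.5, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(n_steps):
        # Steps 1-2: take the batch (here all rows) and run the forward pass.
        a1 = np.tanh(X @ W1 + b1)
        a2 = sigmoid(a1 @ W2 + b2).ravel()
        p = np.clip(a2, 1e-12, 1 - 1e-12)    # guard the logs against exact 0 or 1
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Step 3: back-propagation via the chain rule.
        dz2 = ((a2 - y) / n)[:, None]        # dL/dz at the output (sigmoid + cross-entropy)
        dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
        dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
        # Step 4: gradient step w <- w - eta * dL/dw.
        W1 -= eta * dW1
        b1 -= eta * db1
        W2 -= eta * dW2
        b2 -= eta * db2
    return losses

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)     # XOR: not linearly separable
losses = train(X, y)
print(losses[0], losses[-1])
```

The loss recorded at the last step should be below the loss at the first step; how close it gets to zero depends on the seed, the learning rate and the number of steps.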

Unsupervised Learning

There is no distinction between a label/outcome $y_{i}$ and predictor $X_{i}$ in a wide variety of problems. The goal of unsupervised methods is to characterise the joint distribution of the data $X$ using latent factors, clusters, etc.

Principal component analysis (PCA): the original data $x_{i}$ are in $R^{k}$. We approximate each $x_{i}$ using $L \leq k$ orthogonal unit vectors $w_{l} \in R^{k}$ and associated scores (weights $z_{i l}$) to minimise the reconstruction error

$$J (X, θ) = \frac{1}{n} \sum_{i = 1}^{n} \| x_{i} - \hat{x}_{i} \|^{2} = \frac{1}{n} \sum_{i = 1}^{n} \left\| x_{i} - \sum_{l = 1}^{L} z_{i l} w_{l} \right\|^{2}$$

where $\hat{x}_{i} = W z_{i}$, subject to the constraint that the smoother matrix $W$ is orthonormal. Equivalently, the objective function can be written as

$J (W, Z) = \| X^{T} - W Z^{T} \|_{F}^{2}$ where $X$ is $n \times k$ with $x_{i}$ in its rows and $Z$ is $n \times L$ with $z_{i}$ in its rows.

The optimal solution sets each $w_{l}$ to be the $l$-th eigenvector of the empirical covariance matrix. Equivalently, $W = V_{L}$, which contains the $L$ eigenvectors with the largest eigenvalues of the empirical covariance matrix $Σ = \frac{1}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{'}$ (for centred $x_{i}$). If we rank the singular values of the data matrix $X$, we can construct a rank $L$ approximation, the truncated SVD

$$X \approx U_{:, 1 : L} S_{1 : L, 1 : L} V_{:, 1 : L}^{'}$$

This is identical to the optimal reconstruction $\hat{X} = Z W^{'}$.

$$error (L) = \sum_{j = L + 1}^{k} λ_{j}$$

The error is the sum of the remaining eigenvalues of the covariance matrix. Total variance explained = (sum of included eigenvalues)/(sum of all eigenvalues)
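These claims are easy to check numerically. The sketch below assumes centred simulated data and the $1/n$ covariance convention; it compares the eigenvector solution with the truncated SVD and the reconstruction error with the sum of the discarded eigenvalues. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, L = 500, 5, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))  # correlated columns
X = X - X.mean(axis=0)                                 # centre the data

# Eigen-decomposition of the empirical covariance, sorted by decreasing eigenvalue.
Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :L]          # k x L loadings
Z = X @ W                   # n x L scores
X_hat = Z @ W.T             # rank-L reconstruction
recon_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Truncated SVD of the data matrix gives the same reconstruction.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :L] * S[:L]) @ Vt[:L, :]

explained = eigvals[:L].sum() / eigvals.sum()
print(recon_error, eigvals[L:].sum(), explained)
```

The first two printed numbers coincide, and `explained` is the share of total variance captured by the first $L$ components.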

