This note is a high-fidelity Markdown migration of the Maximum Likelihood and Machine Learning chapter from the LaTeX source.

Parent map: index Prerequisites: probability-and-mathstats, linear-regression, causal-inference

Concept map

flowchart TD
  A[Maximum Likelihood] --> B[Score / Fisher Information]
  A --> C[QMLE / Robust SEs]
  A --> D[Binary / Discrete Choice]
  D --> E[Ordered / Unordered Models]
  A --> F[Counts / Rates / Truncation]
  A --> G[GLM Theory]
  H[Machine Learning] --> I[Supervised Learning]
  I --> J[Regularization]
  I --> K[Classification Metrics]
  H --> L[Neural Networks]
  H --> M[Unsupervised Learning]

Maximum Likelihood

Setup

Let be a sequence of iid rv’s with common CDF . We want to estimate .

Since the rv’s are IID, we can write

L:

We usually work with the log of this , and drop the conditioning on though strictly speaking

is the estimator that maximises the conditional log-likelihood estimator.

solves the first-order conditions

conditional density:

Log likelihood:

Maximising this w.r.t. yields , .

gradient vector

Evaluated at , this is the efficient score.

local maximum that solves the FOCs , IOW

Variance estimate ;

Estimated as

For OLS, parameter vector .

Scores:

Information Matrix / Variance

Properties of Maximum Likelihood Estimators

, probability of missing the true parameter goes to zero.

Equivalently,

conditions: (1) , (2) is twice-differentiable Variance of MLE is the Cramer-Rao Bound; the asymptotic variance of the MLE is at least as small as that of any other consistent estimator. Let the pdf of the r.v. be for some . Let be an unbiased estimator for . Suppose the derivative can be passed under the integral and and suppose the fisher information

is finite. Then,

If are iid with common density , the implied bound on the variance is Let be iid random variables with common density . A sequence of estimates , a function that satisfies

whatever the true value of is, is said to be asymptotically efficient. Suppose are iid with density where is an unknown function. Next, we pretend to know the infinite dimensional parameter up to a finite dimensional parameter , in which we have a fully parametric finite-dimensional parameter , thus we can calculate the Cramer-Rao Bound.

Partitioning the information matrix for and its inverse in

The Cramer-Rao bound implies that

This is true for any parametrisation of the unknown function . The lowest possible variance for any estimator for that does not use knowledge of has to be at least as high as the lowest variance we can get if we know more, that is, the Cramer-Rao bound for any parametric submodel. So, the semiparametric efficiency bound is the largest lower-bound we can get for any parametric submodel.

Suppose we have a candidate Estimator and a given parametrisation . Then,

For any estimator we can calculate the left hand side, for any parametrization we can calculate the right hand side, so if we find an estimator and a parametrization that the two are equal we have found the efficiency bound. Let where is bijective, continuous, and differentiable. Let be the MLE of . Then, is the MLE of . For a regression of the form , one can estimate multiple ‘marginal effects’. For the special case where , , but this is not generically true.

Average Marginal Effect (AME):

Marginal Effect at Mean (MEM) Suppose is a sufficient statistic for . Then,

depends on the data only through , the sufficient statistic.

QMLE / Misspecification / Information Theory

If model is misspecified,

The MLE converges to the best fitting for the population (pseudo-true value)

For the linear exponential family, the quasi-MLE is consistent even when the density is partially misspecified.

Robust Standard Errors

Asymptotic distribution of QMLE

where

let be the assumed joint density, and let be the true density. Then,

Minimised when . QMLE minimises distance between and . Notation denotes ‘information lost when is used to approximate ‘.

Discrete version illustrates links to Entropy

and with equality IFF .

‘distance’, unlike Euclidian Distance, is not the same between as ; i.e. it is directional. Akaike showed that using K-L model selection entails finding a good estimator for

where are independent, random samples from the same distribution and expectations are taken w.r.t. the true distribution . Estimating this quantity for each model is biased upwards. An approximately unbiased estimator of the above target quantity is

For a general class of maximum-likelihood models,

For linear regression models, this simplifies to

Testing

To test the hypothesis against the alternative, there are three classical tests.

We partition the parameter -vector into two parts s.t. the dimensions of the two sub-vectors s.t. . is a nuisance parameter: its value is not restricted under the null.

Let be the unrestricted MLEs. If we are testing the restriction , then the restricted parameter vector is . IOW, test .

If null is true, at restricted model () should not be much smaller than at the unrestricted model (.

Under the null, (where is the number of restrictions being tested). If the limiting is maximised at , the derivative of the wrt at that point should be close to zero.

Under the null, (where is the number of restrictions being tested). Unrestricted estimates of should be close to zero.

Where is the top-left of the information matrix (corresponding with the restricted parameters). Under the null, .

alternatively,

where are restrictions,

evaluated at . McFadden’s Pseudo-

Binary Choice

Estimate the probability using OLS . , so heteroskedasticity is mechanically present unless all coefficients are zero.

Logistic regression fits , where is the link function that scales onto the probability scale. Alternatively, one can use . For , assume latent index model $Y_i^* = X_i’\beta

  • \epsilon_iY_i \defeq \Indic{Y_i^* > 0}Y_i\mathcal{L}=\prod_{i=1}^{N} \pi_{i}^{Y_{i}}\left(1-\pi_{i}\right)^{1-Y_{i}}$.

Symmetric CDFs. Let

Probit:

Logit:

=

ModelMarginal Effect
Logit
Probit
Clog-log
LPM

Let be the density and CDF evaluated at .

Sample Score solves

Variance:

Marginal effect:

Since for the logistic CDF, , the score and hessian can be written as

Discrete Choice

In many Additive Random Utility Models (ARUMs), is logistic for multivariate extensions to logit. This assumes that the errors themselves are distributed Gumbel/ type 1 extreme-value distribution

and .

Ordered

Random utility with multiple cutoffs , where .

Define latent variable, and

which means

specifications yield a likelihood that is simply the product of binary logit/probit models that switch between adjacent categories for each observation.

Marginal effects are of the form

Unordered

Multinomial distribution

where we adopt normalisation for identification.

Coefficient interpretation:

which implies that the log-odds ratio is linear in . Permits incorporation of choice-varying predictors , nests MNL.

Log likelihood of the form

Relative risk independent of other choices ; choices are series of pairwise comparisons. . IoW .

where

Counts and Rates

Counts

Poisson specification: . Yields log density

Score:

solves

Hessian

Assumes .

Marginal Effect: Since for poisson, . Parameters can be interpreted as semi elasticities, since

define a bernoulli for observation, and specify separate models for zero and nonzero data, with potentially different covariates on and . Yields the following (difficult to maximise) likelihood

. Let , ,

Rates

Survival:

Hazard: ;

Cumulative Hazard .

Proportional Hazard Models

Conditional hazard rate can be factored as

baseline hazard ( for exponential and for Weibull).

Parametric modelHazardSurvival
Exponential
Weibull
Generalised Weibull
Gompertz

For survival models with censoring, Likelihood is often written as

where is a right-censoring indicator and is the observed time.

Weibull Density: . . Specify , so . Then, the log-likelihood is

FOCs are

Model needs to be correctly specified to be consistent. Unlike OLS or poisson.

Truncation and Censored Regressions

If a c.r.v and is truncated at ,

For the truncated normal distribution where is truncated at ,

,

where and is the inverse Mills ratio / Hazard function.

Tobit Regression

Censored s.t. and (i.e. is censored from below at c).

Truncated MLE maximises

with and denoting the density and distribution of respectively.

Type-I Tobit assumes is normally distributed, which gives us the following likelihood

Censored Regression

Consider a model and is not observed if .

Yields log-likelihood

Generalised Linear Models Theory

Semi-robust likelihoods belong to the Linear exponential Family of the following form:

Response observations are realisations of random variables with densities of the form

is called the canonical / natural parameter, is the dispersion parameter. ,

Linear predictor specifying the variation in accounted for by known covariates. is a transformation of the mean that addresses scaling. It is so called because it links the expected value of the response variable to the explanatory covariates.

Since , under a canonical link ( ), .

ML estimation

log likelihood

Score function

The FoC can be written as

where is a weight matrix (which depends on ). The fitted value

By a first-order taylor expansion, define

This gives us an update rule

repeat until convergence .

Model Density Link


OLS Gaussian Identity Logistic Binomial Logistic Logistic Binomial Normal Poisson Poisson Log

Machine Learning

Supervised Learning

Every Supervised ML algorithm essentially involves a function class and a regulariser that expresses the complexity of the representation. Then, two steps

conditional on a level of complexity, choose best in-sample loss-minimising function

Estimate the ‘optimal’ level of complexity using empirical tuning

AspectDiscriminative modelGenerative model
GoalDirectly estimate Estimate to deduce
What is learnedDecision boundaryProbability distribution of the data
ExamplesRegressions, SVMGDA, Naive Bayes

that takes as inputs predicted value and real data value and outputs how different they are.

Least Squares:

Logistic:

Hinge:

Cross Entropy:

Predictor classExamplesRegulariser / tuning parameters
Global / parametric predictorsLinear and generalisationsSubset selection ; LASSO ; Ridge ; Elastic Net
Local / nonparametric predictorsDecision / Regression trees; Random forest; Nearest neighbours; Kernel regressionTree depth, number of nodes/leaves, minimal leaf size, information gain at splits; number of trees; number of variables used in each tree; bootstrap sample size; number of neighbours; kernel bandwidth
Mixed predictorsNeural networks (including deep and convolutional); splinesNumber of layers, number of neurons per layer, connectivity between neurons; number of knots and order
Combining predictorsBagging; Boosting; Ensemble methodsNumber of draws, bootstrap size, individual tuning parameters; learning rate and number of iterations; ensemble weights

Reference: Mullainathan and Spiess (2017), Table 2.

take a unit hypercube in dimension and we put another hypercube within it that captures a fraction of observations within the cube. Each edge will be . For moderately high dimensions , . Need 80% data to cover 10% of sample.

Define d(p,N) as distance from the origin to the closest point. (closest point closer to the boundary than to the origin).

Regularised Regression

In general, we want to impose a penalty for model complexity in order to minimise MSE (trade off some bias for lower variance).

Estimate the following regression

where is a standardized design matrix (s.t. all Xs have unit variance). Let be the SVD of .

Then, ridge coefficients can also be written as

This can be used to compute the ridge coefficient efficiently for a fine grid of s.

Compute SVD of and save Compute and store for For each compute compute

The solution vector is ‘biased’ towards the leading right singular vectors of , which gives it the property of a ‘smoothed’ Principal Components regression. For Ridge regression,

Where s are the eigenvalues of the Covariance Matrix.

More generally, for any smoother matrix , , which may not be an integer for semi/non-parametric smoothers. In the special parametric case of OLS, , so the DoF is simply . Consider the objective function

fit using sequential coordinate descent. Coefficient vector is soft-thresholded:

where is a standardized design matrix (s.t. all Xs have unit variance), and is the norm. both cases, pick tuning parameter using cross-validation.

ML analogue to LASSO. Define

where

Combines ridge regression and the lasso by adding a penalty to the LASSO’s objective function

Generalise the penalty to a class of penalty functions of the form

where is a diagonal matrix whose diagonal elements are functions of the squared singular values.

Define the following objective function

Where is a diagonal matrix with diagonal entries equalling . This penalty term gives no weight to the component of that aligns with the first right singular vector of (i.e. the first principal component. This gives it better predictive accuracy in some settings.

Comparing principal-coordinate predictions of ridge and pcLASSO:

The latter corresponds to a more aggressive form of shrinkage towards the leading singular vectors.

Classification

Training sample where (can relabel to Bernoulli). A predictor , where the labels are produced by an (unknown) classifier . Let be an (unknown) distribution on . The error of w.r.t. is defined by

The empirical risk is defined as

A perfect classifier (in the sense that does not exist, so we aim for Probably Approximately Correct (PAC) learners that have w.p. . The space of models is restricted to be in finite set . It can be shown that , if , then where

loosely represents the expressive capacity of a set of functions.

Consider points and the set

we say that m shatters all the points if , i.e. all combinations are possible.

Linear functions can shatter 2 points.

The VC dimension of is

Let . A linear classifier can then be written as where

Suppose a hyperplane s.t. .

The hyperplane that separates the data and maximises the ‘margin’ is given by minimising . Typically, the function space is large and complex, so a natural idea is to learn iteratively. Loosely, estimate a model for from , which produces error . Next, estimate for from , which produces , and so on. So, after steps,

where the first error is and so on, and can also be seen as the gradient associated with the quadratic loss function, . So, an equivalent representation is

where is a space of ‘weak learners’ (typically step functions). To ensure ‘slow’ learning, one typically applies a shrinkage parameter . Arthur Charpentier’s series on probabilistic foundations of econometrics and machine learning covers this topic and includes a useful bibliography.

Goodness of Fit for Classification

calibration: Bin predicted probabilities into bins , and within each compute (average predicted probability) and . Plot the two average against each other. In a well calibrated model, the binned averages trace the identity line.

discrimination: Discrimination is a measure of whether observations have high , and correspondingly values have low . Many measures; listed below Observed


Predicted positive () True Positive (TP) False Positive (FP) Predicted negative () False Negative (FN) True Negative (TN) Total Positive(P) Total Negative(N) Accuracy = - Overall performance

Precision = - How accurate positive predictions are

Sensitivity = Recall = True positive Rate = - Coverage of actual positive sample

Specificity = True Negative Rate = - Coverage of actual negative sample

Brier Score =

F1 Score = : hybrid metric for unbalanced classes

Is the plot of TPR vs FPR by varying the threshold .

AUC Wikipedia table for confusion matrix

Suppose we have a training set , a test point , and a tree predictor

Equivalently,

where is partitioned into leaves , where leaves are constructed to maximise heterogeneity between nodes . Do this until all leaves have minimum leaf size observations. Regression trees overfit, so we need to use cross-validation + other tricks.

Random forests build and average many different trees by

Bagging / subsampling training set (Breiman)

Selecting the splitting variable at each step from out of randomly drawn features (Amit and Geman)

Generalised nonparametric regression with many ‘layers’, with components outline in the referenced figure. For the layer of the network and hidden layer of the unit, we have

where are the weight (coefficient), bias (intercept) and output respectively.

Neural Network Components Activation functions are used at the end of a hidden layer to introduce non-linearities into the model. Common ones are

Activation Functions

Networks frequently use the cross-entropy loss function.

Learning rate is denoted by , which is the pace at which the weights get updated. This can be fixed or adaptively changed using ADAM.

Back-propagation is a method to update the weights in the neural net by taking into account the actual output and desired output. The derivative with respect to weight is computed using the chain rule and is of the following form

So the weight is updated

  1. Take a batch of training data.
  2. Perform forward propagation to compute corresponding loss.
  3. Perform back propagation to compute gradients.
  4. Use the gradients to update the weights over the network.

Unsupervised Learning

There is no distinction between a label/outcome and predictor in a wide variety of problems. The goal of unsupervised methods is to characterise the joint distribution of the data using latent factors, clusters, etc.

Original data in . We approximate orthogonal unit vectors and associated scores ( weights ) to minimise reconstruction error

where subject to the constraint that the smoother matrix is orthonormal. Equivalently, the objective function can be written as

where is with in its rows.

The optimal solution sets each to be the l-th eigenvector of the empirical covariance matrix. Equivalently, , which contains the eigenvectors with the largest eigenvalues of empirical covariance matrix . If we rank singular values of the data matrix , we can construct a rank approximation, the truncated SVD

This is identical to the optimal reconstruction .

The error is the sum of remaining eigenvalues of the covariance matrix. Total variance explained = (sum of included eigenvalues)/(sum of all eigenvalues)