This note is a high-fidelity Markdown migration of the Bayesian Statistics chapter from the LaTeX source.

Parent map: index. Prerequisites: probability-and-mathstats, linear-regression, maximum-likelihood-and-machine-learning

Concept map

```mermaid
flowchart TD
  A[Bayes Theorem] --> B[Priors]
  B --> C[Conjugate Updating]
  A --> D[Posterior Predictive]
  A --> E[Model Selection]
  E --> F[Marginal Likelihood]
  C --> G[Hierarchical Models]
  G --> H[Empirical Bayes]
  A --> I[Computation]
  I --> J[MCMC]
  J --> K[Gibbs]
  J --> L[Metropolis-Hastings]
  I --> M[EM Algorithm]
  A --> N[Graphical Models]
```

Bayesian Statistics

Setup

Notation: per the Murphy textbook, some statements use $\mathcal{D}$ as shorthand for the data.

  1. Use Bayes' Rule to come up with a posterior probability of some hypothesis $H$ given event $E_1$. Your prior is $p(H)$:

$$p(H \mid E_1) = \frac{p(E_1 \mid H)\, p(H)}{p(E_1)}$$

Call the posterior probability $p(H \mid E_1)$.

  2. Given a second event $E_2$, use the posterior probability from step 1 as your prior in the second update step:

$$p(H \mid E_1, E_2) = \frac{p(E_2 \mid H, E_1)\, p(H \mid E_1)}{p(E_2 \mid E_1)}$$

A sequence of random variables is finitely exchangeable if their joint density remains the same under any re-ordering or re-labeling of the indices of the data.

Exchangeability justifies use of the prior: if the data are exchangeable, then there is a parameter $\theta$ that drives the stochastic model generating the data, and there exists a density $p(\theta)$ over that parameter that does not depend on the data itself. The data are conditionally i.i.d. given $\theta$.

Independence vs. exchangeability: independence is a stronger condition than exchangeability (it is a special case of exchangeability). Exchangeability only requires that the marginal distribution of each random variable is the same, i.e. $p(y_i) = p(y_j)$ for all $i, j$. Independence requires that $p(y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i)$. As a result, you can have exchangeability in situations where you do not have independence, most notably sampling without replacement. If the marginal probabilities are unknown, then we only have exchangeability (not independence) even if the samples are drawn with replacement, due to the possibility that there is only one unit with a particular value of $y$.

With the full posterior, one can compute the posterior mean, median, and mode (the latter is sometimes called the Maximum A Posteriori, or MAP, estimate).

One can also compute a $100(1-\alpha)\%$ Highest Posterior Density (HPD) region: a region $C$ such that $P(\theta \in C \mid y) = 1 - \alpha$, with the posterior density everywhere inside $C$ at least as high as outside it.

Consider out-of-sample prediction for a single observation $\tilde{y}$. The posterior predictive density is

$$p(\tilde{y} \mid y) = \int p(\tilde{y}, \theta \mid y)\, d\theta = \int p(\tilde{y} \mid \theta, y)\, p(\theta \mid y)\, d\theta$$

Because $\tilde{y}$ is independent of $y$ conditional on $\theta$ (exchangeability), we can simplify this as

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$$

This is just the data density for $\tilde{y}$ multiplied by the posterior density for $\theta$, integrated over $\theta$.

Consider the Bernoulli case, $\tilde{y} \mid \theta \sim \text{Bernoulli}(\theta)$. The posterior predictive density is

$$p(\tilde{y} = 1 \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$$

So if we want to know the posterior predictive probability $P(\tilde{y} = 1 \mid y)$, we can compute it as $E[\theta \mid y]$,

which is the posterior mean.
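A minimal numerical sketch of this identity in the Bernoulli case (Python; the Beta(α, β) prior and the function name are my choices for illustration, not from the source):

```python
import numpy as np

def posterior_predictive_bernoulli(y, alpha=1.0, beta=1.0):
    """P(y_tilde = 1 | y) for Bernoulli data with a Beta(alpha, beta) prior.

    The posterior is Beta(alpha + sum(y), beta + n - sum(y)), and the
    posterior predictive probability is its mean.
    """
    y = np.asarray(y)
    a_post = alpha + y.sum()
    b_post = beta + len(y) - y.sum()
    return a_post / (a_post + b_post)

# Three successes in four trials under a uniform Beta(1, 1) prior:
p = posterior_predictive_bernoulli([1, 0, 1, 1])  # (1 + 3) / (1 + 3 + 1 + 1)
```

With the uniform prior this reproduces Laplace's rule of succession, $(s + 1)/(n + 2)$.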

An uninformative prior on $\theta$ produces a posterior density that is proportional to the likelihood (differing only by the constant of proportionality). This implies that the mode of the posterior density is the $\theta$ that maximizes the likelihood function, i.e. the MAP estimate coincides with the MLE. An informative prior on $\theta$ yields a posterior mean that is a precision-weighted average of the prior mean and the MLE.

Stan dev team recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

As $n \to \infty$, the likelihood component of the posterior becomes dominant, and as a result frequentist and Bayesian inferences will be based on the same limiting multivariate normal distribution.

To choose between Bayesian models, we compute the posterior over models

$$p(m \mid \mathcal{D}) \propto p(\mathcal{D} \mid m)\, p(m)$$

which allows us to pick the MAP model $\hat{m} = \operatorname{argmax}_m p(m \mid \mathcal{D})$. If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which maximises

$$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid m)\, d\theta$$

which is called the marginal likelihood / integrated likelihood / evidence for model $m$.
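For a Binomial likelihood with a Beta prior the evidence has a closed form, $p(y \mid m) = \binom{n}{y} B(\alpha + y, \beta + n - y) / B(\alpha, \beta)$, which makes a small worked comparison possible (a sketch; the two candidate priors below are my own example, not from the source):

```python
from math import comb, lgamma, log, exp

def log_beta_fn(a, b):
    """log of the Beta function B(a, b) via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(y, n, alpha, beta):
    """log p(y | m) for y successes in n Binomial trials under a
    Beta(alpha, beta) prior on theta."""
    return (log(comb(n, y))
            + log_beta_fn(alpha + y, beta + n - y)
            - log_beta_fn(alpha, beta))

# Compare two models for y = 7 successes in n = 10 trials:
m1 = log_marginal_likelihood(7, 10, 1, 1)    # uniform prior on theta
m2 = log_marginal_likelihood(7, 10, 50, 50)  # prior concentrated near 0.5
```

Under the uniform prior the evidence is exactly $1/(n+1)$ for any $y$: a flat prior spreads probability evenly over the $n + 1$ possible outcomes.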

Conjugate Priors and Updating

Priors of the form $p(\theta) \propto c$ are improper because $\int p(\theta)\, d\theta = \infty$. Improper priors are generally not a problem as long as the resulting posterior is well defined.

Suppose we choose the flat prior $p(\theta) = 1$, and define a transformation $\phi = g(\theta)$. The resulting distribution of $\phi$ is $p(\phi) = \left|\frac{d\theta}{d\phi}\right|$, which is generally not flat: a prior that is uninformative on one scale can be informative on another.

Jeffreys’ Prior is a method of constructing invariant priors:

$$p(\theta) \propto \sqrt{I(\theta)}$$

where $I(\theta)$ is the Fisher information. For a multiparameter model,

$$p(\theta) \propto \sqrt{\det I(\theta)}$$

Given the score $\frac{\partial \log p(y \mid \theta)}{\partial \theta} = \frac{1}{p(y \mid \theta)} \frac{\partial p(y \mid \theta)}{\partial \theta}$, differentiating again gives

$$\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2} = -\left( \frac{\partial \log p(y \mid \theta)}{\partial \theta} \right)^2 + \frac{1}{p(y \mid \theta)} \frac{\partial^2 p(y \mid \theta)}{\partial \theta^2}$$

Taking expectations wrt the sample density sends the second piece to zero (since $\int \frac{\partial^2 p}{\partial \theta^2}\, dy = \frac{\partial^2}{\partial \theta^2} \int p\, dy = 0$), so

$$I(\theta) = E\left[ \left( \frac{\partial \log p(y \mid \theta)}{\partial \theta} \right)^2 \right] = -E\left[ \frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2} \right]$$

Analytically tractable expressions for the posterior are derived when sample and prior densities form a natural conjugate pair, defined as having the property that sample, prior, and posterior densities all lie in the same class of densities.

The exponential family is essentially the only class of densities to have natural conjugate priors.

A one-parameter member of the exponential family has a density for $N$ observations that can be expressed as

$$p(y \mid \theta) \propto g(\theta)^N \exp\left\{ \phi(\theta) \sum_{i=1}^{N} t(y_i) \right\}$$

Let $y \sim \text{Binomial}(n, \theta)$, and take prior $p(\theta)$. By Bayes' theorem, the posterior is of the form

$$p(\theta \mid y) \propto \theta^{y} (1 - \theta)^{n - y}\, p(\theta)$$

Instead we take $\theta \sim \text{Beta}(\alpha, \beta)$. The uniform prior is a special case with $\alpha = \beta = 1$. In general, the posterior is of the form

$$\theta \mid y \sim \text{Beta}(\alpha + y,\; \beta + n - y)$$

| Quantity | Formula |
| --- | --- |
| Posterior mean | $\frac{\alpha + y}{\alpha + \beta + n}$ |
| Posterior mode | $\frac{\alpha + y - 1}{\alpha + \beta + n - 2}$ |
| Posterior variance | $\frac{(\alpha + y)(\beta + n - y)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)}$ |
| Posterior predictive distribution | Beta-Binomial with updated parameters $\alpha + y$, $\beta + n - y$; `library(extraDistr); rbbinom(n, size, alpha, beta)` |
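The Beta posterior summaries can be checked numerically (a sketch in Python; the helper name and the example numbers are mine):

```python
def beta_binomial_posterior(y, n, alpha, beta):
    """Posterior summaries for theta ~ Beta(alpha, beta) after observing
    y successes in n Binomial trials: posterior is Beta(a, b) with
    a = alpha + y, b = beta + n - y."""
    a, b = alpha + y, beta + n - y
    return {
        "mean": a / (a + b),
        "mode": (a - 1) / (a + b - 2),  # valid when a, b > 1
        "variance": a * b / ((a + b) ** 2 * (a + b + 1)),
    }

post = beta_binomial_posterior(y=7, n=10, alpha=2, beta=2)
```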

Suppose we have a proportion $\bar{\theta}$ from a previous study, with variance $v$. Then we can create a constant

$$c = \frac{\bar{\theta}(1 - \bar{\theta})}{v} - 1$$

and compute the hyper-parameters $\alpha$ and $\beta$ for our Beta prior distribution as

$$\alpha = \bar{\theta}\, c, \qquad \beta = (1 - \bar{\theta})\, c$$

Surprisingly this works! See Jackman p.55 for a worked out example.
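The moment-matching step above is easy to verify: plugging the matched $\alpha, \beta$ back into the Beta mean and variance formulas recovers the study's numbers exactly (a sketch; the example mean 0.4 and variance 0.01 are mine, not Jackman's):

```python
def beta_from_mean_var(m, v):
    """Match Beta(alpha, beta) hyperparameters to a prior mean m and
    variance v, e.g. a proportion reported by a previous study."""
    c = m * (1 - m) / v - 1  # acts like a prior "sample size"
    return m * c, (1 - m) * c

alpha, beta = beta_from_mean_var(0.4, 0.01)
# Beta mean alpha/(alpha+beta) = m and variance m(1-m)/(c+1) = v by construction.
```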

Let $y_i \sim \text{Poisson}(\lambda)$, $i = 1, \dots, n$. This means that

$$p(y \mid \lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{y_i}}{y_i!} \propto e^{-n\lambda}\, \lambda^{\sum_i y_i}$$

We specify a Gamma$(a, b)$ prior on $\lambda$, which has density

$$p(\lambda) \propto \lambda^{a - 1} e^{-b\lambda}$$

So then the posterior for $\lambda$ is

$$\lambda \mid y \sim \text{Gamma}\left(a + \sum_i y_i,\; b + n\right)$$

A flat prior is the (improper) limiting case $a = 1, b = 0$.

| Quantity | Formula |
| --- | --- |
| Posterior mean | $\frac{a + \sum_i y_i}{b + n}$ |
| Posterior mode | $\frac{a + \sum_i y_i - 1}{b + n}$ |
| Posterior variance | $\frac{a + \sum_i y_i}{(b + n)^2}$ |
| Posterior predictive distribution | Negative Binomial with size $a + \sum_i y_i$ and prob $\frac{b + n}{b + n + 1}$; `rnbinom(n, size, prob)` |
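A quick sketch of the Gamma-Poisson update (Python; the function name and example counts are mine):

```python
import numpy as np

def gamma_posterior(y, a, b):
    """Posterior Gamma(a + sum(y), b + n) for a Poisson rate lambda
    under a Gamma(a, b) prior (shape a, rate b)."""
    y = np.asarray(y)
    return a + y.sum(), b + len(y)

a_post, b_post = gamma_posterior([2, 0, 3, 1], a=1.0, b=1.0)
post_mean = a_post / b_post  # (1 + 6) / (1 + 4)
# Posterior predictive is Negative Binomial with
# size = a_post and prob = b_post / (b_post + 1).
```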

With a Multinomial likelihood and a Dirichlet$(\alpha_1, \dots, \alpha_J)$ prior, the posterior is Dirichlet$(\alpha_1 + n_1, \dots, \alpha_J + n_J)$, where $n_j$ is the count of observations in category $j$. For 3 categories, the posterior is:

$$\theta \mid y \sim \text{Dirichlet}(\alpha_1 + n_1,\; \alpha_2 + n_2,\; \alpha_3 + n_3)$$

$y_i \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known but the mean $\mu$ is not known. The joint density of $y = (y_1, \dots, y_n)$ is

$$p(y \mid \mu) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\}$$

Given a normal prior $\mu \sim N(\mu_0, \tau_0^2)$, we can write the posterior density in the form $\mu \mid y \sim N(\mu_1, \tau_1^2)$,

where

$$\mu_1 = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2} \qquad \text{and} \qquad \tau_1^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$

The posterior mean is a weighted sum of the prior mean and the sample mean, with weights that reflect the precision of the likelihood via $n/\sigma^2$ and the prior precision $1/\tau_0^2$. Three cases (ref. Jackman p.80-94):
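The precision-weighting formula is short enough to sketch directly (Python; function name and example data are mine):

```python
import numpy as np

def normal_known_var_posterior(y, sigma2, mu0, tau0_2):
    """Posterior N(mu1, tau1^2) for mu when the data variance sigma2 is
    known and the prior is mu ~ N(mu0, tau0_2). The posterior mean is a
    precision-weighted average of the prior mean and the sample mean."""
    y = np.asarray(y)
    n, ybar = len(y), y.mean()
    prec = 1 / tau0_2 + n / sigma2           # posterior precision
    mu1 = (mu0 / tau0_2 + n * ybar / sigma2) / prec
    return mu1, 1 / prec

mu1, tau1_2 = normal_known_var_posterior([1.0, 2.0, 3.0],
                                         sigma2=1.0, mu0=0.0, tau0_2=1.0)
```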

  1. Variance known, mean unknown. Model: $y_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, prior $\mu \sim N(\mu_0, \tau_0^2)$.

| Quantity | Formula |
| --- | --- |
| Posterior mean | $\mu_1 = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}$ |
| Posterior variance | $\tau_1^2 = \left(1/\tau_0^2 + n/\sigma^2\right)^{-1}$ |
| Posterior predictive distribution | $\tilde{y} \mid y \sim N(\mu_1,\; \tau_1^2 + \sigma^2)$ |
  2. Variance and mean both unknown. Prior densities:

$$\mu \mid \sigma^2 \sim N(\mu_0,\, \sigma^2/n_0), \qquad \sigma^2 \sim \text{Inverse-Gamma}\left(\frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right)$$

Conditional posterior densities:

$$\mu \mid \sigma^2, y \sim N(\mu_1,\, \sigma^2/n_1), \qquad \sigma^2 \mid \mu, y \sim \text{Inverse-Gamma}\left(\frac{\nu_0 + n + 1}{2},\; \frac{\nu_0\sigma_0^2 + n_0(\mu - \mu_0)^2 + \sum_i (y_i - \mu)^2}{2}\right)$$

where

$$n_1 = n_0 + n, \qquad \mu_1 = \frac{n_0 \mu_0 + n\bar{y}}{n_0 + n}$$

Marginal posterior density of $\mu$:

$$\mu \mid y \sim t_{\nu_1}(\mu_1,\, \sigma_1^2/n_1)$$

where

$$\nu_1 = \nu_0 + n, \qquad \nu_1 \sigma_1^2 = \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{n_0 n}{n_0 + n}(\bar{y} - \mu_0)^2$$

Posterior predictive distribution for $\tilde{y}$:

$$\tilde{y} \mid y \sim t_{\nu_1}\left(\mu_1,\, \sigma_1^2\left(1 + \frac{1}{n_1}\right)\right)$$

where $s^2$ is the sample variance of the data.
  3. Improper reference prior. Prior density:

$$p(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$$

Posterior densities:

$$\mu \mid \sigma^2, y \sim N(\bar{y},\, \sigma^2/n), \qquad \sigma^2 \mid y \sim \text{Inv-}\chi^2(n - 1,\, s^2)$$

which implies

$$\mu \mid y \sim t_{n-1}(\bar{y},\, s^2/n)$$

Posterior predictive distribution:

$$\tilde{y} \mid y \sim t_{n-1}\left(\bar{y},\, s^2\left(1 + \frac{1}{n}\right)\right)$$

where $s^2 = \frac{1}{n-1}\sum_i (y_i - \bar{y})^2$ is the sample variance.

The posterior can often be approximated by simulation:

  • Draw $\theta^{(1)}, \dots, \theta^{(S)} \sim p(\theta \mid y)$.
  • A histogram of the draws approximates the posterior density $p(\theta \mid y)$.

Methods for this: Markov chain Monte Carlo (MCMC) algorithms such as Gibbs sampling, Metropolis-Hastings, and Hamiltonian Monte Carlo.

Conjugacy for Discrete Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
| --- | --- | --- |
| Bernoulli | Beta$(\alpha, \beta)$ | $\alpha + \sum_i y_i,\;\; \beta + n - \sum_i y_i$ |
| Binomial | Beta$(\alpha, \beta)$ | $\alpha + \sum_i y_i,\;\; \beta + \sum_i N_i - \sum_i y_i$ |
| Negative Binomial (known $r$) | Beta$(\alpha, \beta)$ | $\alpha + rn,\;\; \beta + \sum_i y_i$ |
| Poisson | Gamma$(a, b)$ | $a + \sum_i y_i,\;\; b + n$ |
| Multinomial | Dirichlet$(\alpha_1, \dots, \alpha_J)$ | $\alpha_j + n_j$ for each category $j$ |

Conjugacy for Continuous Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
| --- | --- | --- |
| Uniform$(0, \theta)$ | Pareto$(x_m, k)$ | $\max\{x_m, y_{(n)}\},\;\; k + n$ |
| Exponential | Gamma$(a, b)$ | $a + n,\;\; b + \sum_i y_i$ |
| Normal (known $\sigma^2$) | Normal on $\mu$ | $\mu_1, \tau_1^2$ as in the known-variance case above |
| Normal (known $\mu$) | Scaled inverse chi-square on $\sigma^2$ | $\nu_0 + n,\;\; \frac{\nu_0\sigma_0^2 + \sum_i (y_i - \mu)^2}{\nu_0 + n}$ |
| Normal (both unknown) | Normal-scaled inverse gamma | as in the unknown-variance case above |

Computation / Markov Chains

A stochastic process $\{\theta^{(t)}\}$ is a collection of random variables indexed by $t$.

The process is a Markov chain if

$$P\left(\theta^{(t+1)} \,\middle|\, \theta^{(t)}, \theta^{(t-1)}, \dots, \theta^{(0)}\right) = P\left(\theta^{(t+1)} \,\middle|\, \theta^{(t)}\right)$$

First rewrite the integral to be evaluated as follows:

$$\int_a^b h(\theta)\, d\theta = \int_a^b w(\theta) f(\theta)\, d\theta$$

where $w(\theta) = h(\theta)(b - a)$ and $f(\theta) = \frac{1}{b - a}$. Since $f$ is the probability density for a uniform r.v. over $(a, b)$, we can write the integral as $E[w(\theta)]$ where $\theta \sim \text{Uniform}(a, b)$. If we generate $\theta_1, \dots, \theta_S \sim \text{Uniform}(a, b)$, by the LLN,

$$\frac{1}{S}\sum_{s=1}^{S} w(\theta_s) \;\longrightarrow\; E[w(\theta)] = \int_a^b h(\theta)\, d\theta$$
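This Monte Carlo integration scheme fits in a few lines (Python; the function name and the test integrand $x^2$ are my own illustration):

```python
import random

def mc_integrate(h, a, b, n_draws=100_000, seed=0):
    """Estimate int_a^b h(x) dx by drawing x ~ Uniform(a, b) and
    averaging w(x) = (b - a) * h(x); converges by the LLN."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.uniform(a, b)
        total += (b - a) * h(x)
    return total / n_draws

est = mc_integrate(lambda x: x * x, 0.0, 1.0)  # true value is 1/3
```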

The goal is to generate sequences from $p(\theta \mid y)$. An MCMC scheme will generate samples from the posterior if

$$\int k\left(\theta^{(t+1)} \,\middle|\, \theta^{(t)}\right) p\left(\theta^{(t)} \,\middle|\, y\right) d\theta^{(t)} = p\left(\theta^{(t+1)} \,\middle|\, y\right)$$

where $k$ is the transition pdf of $\theta^{(t+1)}$ given $\theta^{(t)}$. The integrand on the LHS is the joint pdf of $(\theta^{(t)}, \theta^{(t+1)})$ from the chain, if $\theta^{(t)}$ is from $p(\theta \mid y)$. Integrating over $\theta^{(t)}$ yields the marginal of $\theta^{(t+1)}$, so the result states that if $\theta^{(t)}$ is from the correct posterior distribution, the chain generates $\theta^{(t+1)}$ also from the posterior $p(\theta \mid y)$: the posterior is the invariant distribution of the chain.

Basic idea - turn a high-dimensional problem into several one-dimensional problems. Suppose $\theta = (\theta_1, \theta_2)$ has joint density $p(\theta_1, \theta_2 \mid y)$. Suppose it is possible to simulate from the conditional distributions $p(\theta_1 \mid \theta_2, y)$ and $p(\theta_2 \mid \theta_1, y)$. Let $(\theta_1^{(0)}, \theta_2^{(0)})$ be starting values. Assuming we have drawn $(\theta_1^{(t)}, \theta_2^{(t)})$, we generate the next iterate as follows:

$$\theta_1^{(t+1)} \sim p\left(\theta_1 \,\middle|\, \theta_2^{(t)}, y\right), \qquad \theta_2^{(t+1)} \sim p\left(\theta_2 \,\middle|\, \theta_1^{(t+1)}, y\right)$$

For multiple parameters, each component is drawn in turn from its full conditional given the most recent values of all other components.

$y_i \sim N(\mu, \sigma^2)$, $i = 1, \dots, n$. Let $\bar{y} = \frac{1}{n}\sum_i y_i$. Define precision

$$\tau = \frac{1}{\sigma^2}$$

Likelihood:

$$p(y \mid \mu, \tau) \propto \tau^{n/2} \exp\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\}$$

(Noninformative) Prior:

$$p(\mu, \tau) \propto \frac{1}{\tau}$$

Posterior Distribution

$$p(\mu, \tau \mid y) \propto \tau^{n/2 - 1} \exp\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\}$$

full conditionals:

$$\mu \mid \tau, y \sim N\left(\bar{y},\, \frac{1}{n\tau}\right), \qquad \tau \mid \mu, y \sim \text{Gamma}\left(\frac{n}{2},\, \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right)$$
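The two-step Gibbs scan for this normal model can be sketched as follows (Python; function name, seeds, and simulated data are my own, and the Gamma is parameterized by rate, so NumPy's scale argument is its reciprocal):

```python
import numpy as np

def gibbs_normal(y, n_iter=5000, seed=0):
    """Gibbs sampler for y_i ~ N(mu, 1/tau) with prior p(mu, tau) ~ 1/tau.
    Full conditionals:
      mu  | tau, y ~ N(ybar, 1/(n*tau))
      tau | mu,  y ~ Gamma(n/2, rate = 0.5 * sum((y - mu)^2))
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n, ybar = len(y), y.mean()
    mu, tau = ybar, 1.0                      # starting values
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        mu = rng.normal(ybar, np.sqrt(1 / (n * tau)))
        rate = 0.5 * ((y - mu) ** 2).sum()
        tau = rng.gamma(n / 2, 1 / rate)     # NumPy uses scale = 1/rate
        draws[t] = mu, tau
    return draws

# Simulated data: n = 200 draws from N(5, sd = 2), so tau is about 0.25.
data = np.random.default_rng(1).normal(5.0, 2.0, size=200)
draws = gibbs_normal(data)
```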

However, it is often impossible to write out or sample from the full conditionals.

Let $q$ be an arbitrary, friendly distribution we can sample from. The conditional density $q(\theta^* \mid \theta)$ is called the proposal distribution. MH creates a sequence of observations as follows. Choose $\theta^{(0)}$ arbitrarily. Suppose we have generated $\theta^{(t)}$. Generate $\theta^{(t+1)}$ as follows:

  • Generate a proposal $\theta^* \sim q\left(\theta^* \mid \theta^{(t)}\right)$.

  • Evaluate the acceptance probability $r = \min(1, R)$ where

$$R = \frac{p(\theta^* \mid y)\; q\left(\theta^{(t)} \mid \theta^*\right)}{p\left(\theta^{(t)} \mid y\right)\; q\left(\theta^* \mid \theta^{(t)}\right)}$$

  • Set

$$\theta^{(t+1)} = \begin{cases} \theta^* & \text{with probability } r \\ \theta^{(t)} & \text{with probability } 1 - r \end{cases}$$
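A random-walk version of these steps, with a symmetric Gaussian proposal so the $q$ terms cancel in the ratio, can be sketched as follows (Python; names and the standard-normal target are my own illustration):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_iter=10_000, step=1.0, seed=0):
    """Random-walk Metropolis: propose x* ~ N(x, step^2) and accept with
    probability min(1, p(x*)/p(x)) -- the symmetric-proposal special
    case of the MH acceptance ratio, computed on the log scale."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_target(x0)
    out = np.empty(n_iter)
    for t in range(n_iter):
        x_prop = x + step * rng.normal()
        lp_prop = log_target(x_prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept
            x, lp = x_prop, lp_prop
        out[t] = x                                # else keep current state
    return out

# Target: standard normal, known only up to a constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0)
```

Note the target only needs to be known up to proportionality, which is exactly the situation with an unnormalized posterior.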

Let $x_i$ be observed and $z_i$ missing. The goal is to maximise the log-likelihood of the observed data

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]$$

We cannot push the log inside the sum because of the unobserved variables. EM tackles the problem as follows. Define the complete data log likelihood as $\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta)$. This cannot be computed, since $z_i$ is unknown.

Instead, define

$$Q\left(\theta, \theta^{t-1}\right) = E\left[ \ell_c(\theta) \,\middle|\, \mathcal{D}, \theta^{t-1} \right]$$

where $t$ is the iteration number and $Q$ is called the auxiliary function.

Expectation (E) Step: Compute $Q(\theta, \theta^{t-1})$, which is an expectation wrt the old params $\theta^{t-1}$.

Maximisation (M) Step: Optimise the $Q$ function wrt $\theta$.

Compute $\theta^t = \operatorname{argmax}_\theta Q(\theta, \theta^{t-1})$. For MAP estimation, the M step becomes $\theta^t = \operatorname{argmax}_\theta Q(\theta, \theta^{t-1}) + \log p(\theta)$.
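The E/M alternation above can be sketched on a classic latent-variable problem, a two-component Gaussian mixture with known unit variances (Python; the model, function name, and simulated data are my own illustration, not the source's example):

```python
import numpy as np

def em_gauss_mixture(x, n_iter=100):
    """EM for a two-component Gaussian mixture with known unit variances.

    E step: responsibilities r_i = q(z_i = 1) under the current params.
    M step: maximizing Q gives weighted means and the mixing weight
    in closed form.
    """
    x = np.asarray(x, float)
    pi, mu = 0.5, np.array([x.min(), x.max()])       # crude initialization
    for _ in range(n_iter):
        d0 = (1 - pi) * np.exp(-0.5 * (x - mu[0]) ** 2)  # comp 0, up to const
        d1 = pi * np.exp(-0.5 * (x - mu[1]) ** 2)        # comp 1, up to const
        r = d1 / (d0 + d1)                               # E step
        mu = np.array([((1 - r) * x).sum() / (1 - r).sum(),
                       (r * x).sum() / r.sum()])         # M step: means
        pi = r.mean()                                    # M step: weight
    return pi, mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 700)])
pi_hat, mu_hat = em_gauss_mixture(x)
```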

Probit has the form $p(y_i = 1 \mid x_i, w) = \Phi(w^T x_i)$, where $z_i \sim N(w^T x_i, 1)$ is the latent variable and $y_i = \mathbb{I}(z_i > 0)$. The complete data log likelihood, assuming a $N(0, V_0)$ prior on $w$, is

$$\ell(z, w \mid V_0) = \log p(w) - \frac{1}{2} \sum_i \left(z_i - w^T x_i\right)^2 + \text{const}$$

The posterior in the E step is a truncated Gaussian, with

$$E[z_i \mid w, x_i, y_i] = \begin{cases} \mu_i + \dfrac{\phi(\mu_i)}{\Phi(\mu_i)} & y_i = 1 \\[2ex] \mu_i - \dfrac{\phi(\mu_i)}{1 - \Phi(\mu_i)} & y_i = 0 \end{cases}$$

where $\mu_i = w^T x_i$. In the M step, we estimate $w$ by ridge regression of $E[z_i]$ on $x_i$, where the prior covariance $V_0$ plays the role of the regularizer.

Hierarchical Models

Parameters in a prior are modeled as having a distribution that depends on hyperparameters $\eta$. This results in joint posteriors of the form

$$p(\theta, \eta \mid y) \propto p(y \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

Represented by the graphical model $\eta \to \theta \to y$.

We are typically interested in the marginal posterior of $\theta$, which is obtained by integrating the joint posterior w.r.t. $\eta$:

$$p(\theta \mid y) = \int p(\theta, \eta \mid y)\, d\eta$$

By treating $\eta$ as a latent variable, we allow data-poor observations to borrow strength from data-rich ones.

Empirical Bayes

In hierarchical models, we need to compute the posterior on multiple layers of latent variables. For example, for a two-level model, we need

$$p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

We can employ a computational shortcut by approximating the posterior on the hyper-parameters with a point estimate, $p(\eta \mid \mathcal{D}) \approx \delta_{\hat{\eta}}(\eta)$, where $\hat{\eta} = \operatorname{argmax}_\eta p(\eta \mid \mathcal{D})$. Since $\eta$ is usually much smaller than $\theta$ in dimensionality, we can safely use a uniform prior on $\eta$. Then the estimate becomes

$$\hat{\eta} = \operatorname{argmax}_\eta p(\mathcal{D} \mid \eta) = \operatorname{argmax}_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$$

This violates the principle that the prior should be chosen independently of the data, but it is a cheap computational trick. It produces a hierarchy of Bayesian methods in increasing order of the number of integrals performed.

Suppose we measure the number of people in various cities, $N_i$, and the number of people who died of cancer in each, $x_i$. We assume $x_i \sim \text{Bin}(N_i, \theta_i)$ and want to estimate the cancer rates $\theta_i$. The MLE solution would be to either estimate them all separately (overfitting the data-poor cities), or estimate a single pooled $\theta$ for all cities (ignoring variation across cities).

The hierarchical approach is to model $\theta_i \sim \text{Beta}(a, b)$, and write a joint distribution

$$p(\mathcal{D}, \theta \mid \eta) = \prod_{i=1}^{N} \text{Bin}(x_i \mid N_i, \theta_i)\, \text{Beta}(\theta_i \mid a, b)$$

where $\eta = (a, b)$. We can also put covariates on $\eta$.
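An empirical-Bayes version of this model can be sketched by moment-matching the Beta hyperparameters to the raw rates and then shrinking each city's estimate (Python; function name and toy counts are mine, and the moment-matching estimator of $(a, b)$ is one simple choice, not the source's):

```python
import numpy as np

def eb_binomial_rates(x, N):
    """Empirical-Bayes shrinkage for rates theta_i with x_i ~ Bin(N_i, theta_i)
    and theta_i ~ Beta(a, b): fit (a, b) by moment-matching the raw rates,
    then report posterior means (a + x_i) / (a + b + N_i)."""
    x, N = np.asarray(x, float), np.asarray(N, float)
    p = x / N                         # raw per-city rates
    m, v = p.mean(), p.var()
    c = m * (1 - m) / v - 1           # matched Beta "prior sample size"
    a, b = m * c, (1 - m) * c
    return (a + x) / (a + b + N), (a, b)

# Toy data: a city with zero deaths is pulled toward the pooled rate.
rates, (a, b) = eb_binomial_rates(x=[0, 2, 5, 1], N=[50, 60, 55, 500])
```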

Hierarchy of Bayesianity

| Method | Definition |
| --- | --- |
| Maximum Likelihood | $\hat{\theta} = \operatorname{argmax}_\theta\, p(\mathcal{D} \mid \theta)$ |
| MAP Estimation | $\hat{\theta} = \operatorname{argmax}_\theta\, p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)$ |
| Empirical Bayes (ML-II) | $\hat{\eta} = \operatorname{argmax}_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$ |
| MAP-II | $\hat{\eta} = \operatorname{argmax}_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)\, d\theta$ |
| Full Bayes | $p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$ |

Graphical Models

Any joint distribution can be represented as follows, using the chain rule:

$$p(x_{1:V}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_V \mid x_{1:V-1})$$

where $V$ is the number of variables [and we have dropped the parameter vector $\theta$]. It follows that, without further assumptions, the representation grows unwieldy as $V$ increases, which motivates conditional independence assumptions.

The joint distribution can then be written as

$$p(x_{1:V}) = \prod_{v=1}^{V} p\left(x_v \mid x_{\text{Pa}(v)}\right)$$

where $\text{Pa}(v)$ denotes the parent nodes of $v$, which are the nodes that have arrows pointing to $v$.

$X$ and $Y$ are said to be conditionally independent given $Z$, written $X \perp Y \mid Z$, iff the conditional joint can be written as the product of the conditional marginals:

$$p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$$

If the condition $X \perp Y \mid Z$ holds in the graph, it must be the case that all paths between $X$ and $Y$ are blocked by $Z$. All paths are blocked iff, at some node on each path, either:

  • arrows on the path meet head to tail or tail to tail at the node, and the node is in the conditioning set $Z$; or

  • arrows meet head to head at the node, and neither the node nor any of its descendants is in the conditioning set $Z$.

Empirical Bayes: Batting Averages

We suppose that each player’s MLE value $\hat{p}_i$ (his batting average in the first 90 tries) is a binomial proportion,

$$\hat{p}_i \sim \text{Bin}(90, p_i)/90$$

Here $p_i$ is his true average, how he would perform over an infinite number of tries; TRUTH is itself a binomial proportion, taken over an average of 370 more tries per player.

At this point there are two ways to proceed. The simplest uses a normal approximation to (7.17),

$$\hat{p}_i \,\dot{\sim}\, N(p_i, \sigma_0^2)$$

where $\sigma_0^2$ is the binomial variance

$$\sigma_0^2 = \frac{\bar{p}(1 - \bar{p})}{90}$$

with $\bar{p}$ the average of the $\hat{p}_i$‘s. Letting $N$ be the number of players, applying (7.13), and transforming back to the proportion scale gives James-Stein estimates

$$\hat{p}_i^{\text{JS}} = \bar{p} + \left( 1 - \frac{(N - 3)\,\sigma_0^2}{\sum_i (\hat{p}_i - \bar{p})^2} \right) \left(\hat{p}_i - \bar{p}\right)$$
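The James-Stein shrinkage step can be sketched directly (Python; the function name and the example averages are my own, not the book's data, and the shrinkage factor is used without the positive-part truncation sometimes applied in practice):

```python
import numpy as np

def james_stein(p_hat, n_trials):
    """James-Stein shrinkage of binomial proportions toward their grand
    mean, using the normal approximation sigma0^2 = pbar(1-pbar)/n_trials."""
    p_hat = np.asarray(p_hat, float)
    N = len(p_hat)
    pbar = p_hat.mean()
    sigma0_2 = pbar * (1 - pbar) / n_trials
    S = ((p_hat - pbar) ** 2).sum()
    shrink = 1 - (N - 3) * sigma0_2 / S   # shrinkage factor in (0, 1) here
    return pbar + shrink * (p_hat - pbar)

# Eight hypothetical batting averages after 90 at-bats:
p_js = james_stein([0.40, 0.38, 0.36, 0.33, 0.30, 0.28, 0.25, 0.22],
                   n_trials=90)
```

Every estimate moves toward the grand mean, so the extremes are pulled in the most while the overall average is unchanged.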

A second approach begins with the arcsin transformation