This note is a high-fidelity Markdown migration of the Bayesian Statistics chapter from the LaTeX source.
Parent map: index. Prerequisites: probability-and-mathstats, linear-regression, maximum-likelihood-and-machine-learning.
Concept map:

```mermaid
flowchart TD
    A[Bayes Theorem] --> B[Priors]
    B --> C[Conjugate Updating]
    A --> D[Posterior Predictive]
    A --> E[Model Selection]
    E --> F[Marginal Likelihood]
    C --> G[Hierarchical Models]
    G --> H[Empirical Bayes]
    A --> I[Computation]
    I --> J[MCMC]
    J --> K[Gibbs]
    J --> L[Metropolis-Hastings]
    I --> M[EM Algorithm]
    A --> N[Graphical Models]
```
# Bayesian Statistics

## Setup
Notation: following the Murphy textbook, some statements use $\mathcal{D}$ as shorthand for the data.
- Use Bayes Rule to come up with a posterior probability of some hypothesis $H$ given event $E_1$. Your prior is $P(H)$:
  $$P(H \mid E_1) = \frac{P(E_1 \mid H)\, P(H)}{P(E_1)}$$
  Call $P(H \mid E_1)$ the posterior probability.
- Given a second event $E_2$, use the posterior probability from step 1 as your prior in the second update step.
A sequence of random variables $y_1, \dots, y_n$ is finitely exchangeable if their joint density remains the same under any re-ordering or re-labeling of the indices of the data: $p(y_1, \dots, y_n) = p(y_{\pi(1)}, \dots, y_{\pi(n)})$ for every permutation $\pi$.
Exchangeability justifies use of the prior: if the data are exchangeable, then there is a parameter $\theta$ that drives the stochastic model generating the data, and there exists a density $\pi(\theta)$ over $\theta$ that does not depend on the data itself. The data are conditionally i.i.d. given $\theta$.

Independence vs. Exchangeability: independence is a stronger condition than exchangeability (it is a special case of exchangeability). Exchangeability only requires that the marginal distribution of each random variable is the same, i.e. $p(y_i) = p(y_j)$ for all $i, j$. Independence additionally requires that $p(y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i)$. As a result, you can have exchangeability in situations where you do not have independence, most notably sampling without replacement. If the marginal probabilities are unknown, then we only have exchangeability (not independence) even if the samples are drawn with replacement, due to the possibility that there is only one unit with a particular value of $y$.
With the full posterior, one can compute the posterior mean, median, and mode (the latter is sometimes called the Maximum A Posteriori, or MAP, estimate).
One can also compute a $100(1-\alpha)\%$ Highest Posterior Density region, which is a region $C$ such that the parameter lies in the region with probability $1 - \alpha$, $\Pr(\theta \in C \mid y) = 1 - \alpha$, and the posterior density everywhere inside $C$ is at least as high as everywhere outside it.
Consider out-of-sample prediction for a single observation $\tilde{y}$. The posterior predictive density is

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta, y)\, \pi(\theta \mid y)\, d\theta$$

Because $\tilde{y}$ is independent of $y$ conditional on $\theta$ (exchangeability), we can simplify this as

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, \pi(\theta \mid y)\, d\theta$$

This is just the data density for $\tilde{y}$ multiplied by the posterior density for $\theta$, integrated over $\theta$.
Consider binary $\tilde{y} \in \{0, 1\}$ with $\tilde{y} \mid \theta \sim \text{Bernoulli}(\theta)$. The posterior predictive density is

$$p(\tilde{y} \mid y) = \int \theta^{\tilde{y}} (1 - \theta)^{1 - \tilde{y}}\, \pi(\theta \mid y)\, d\theta$$

So if we want to know the posterior predictive probability $P(\tilde{y} = 1 \mid y)$, we can compute it as

$$P(\tilde{y} = 1 \mid y) = \int \theta\, \pi(\theta \mid y)\, d\theta = E[\theta \mid y]$$

which is the posterior mean.
An uninformative (flat) prior on $\theta$ produces a posterior density that is proportional to the likelihood (differing only by the constant of proportionality). This implies that the mode of the posterior density is the $\theta$ that maximizes the likelihood function, i.e. the MLE. An informative prior on $\theta$ yields a posterior mean that is a precision-weighted average of the prior mean and the MLE.
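This can be checked numerically with a grid approximation; a minimal Python sketch with made-up data (the 8-trial Bernoulli sample is purely illustrative):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])             # invented data: 6 successes in 8
theta = np.linspace(0.001, 0.999, 999)             # grid over the parameter

lik = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())
post = lik / (lik.sum() * (theta[1] - theta[0]))   # flat prior: normalized likelihood

mle = theta[np.argmax(lik)]
post_mode = theta[np.argmax(post)]
print(mle, post_mode)                              # both ~0.75, the MLE 6/8
```

Because the flat prior only rescales the likelihood, the posterior mode and the MLE land on the same grid point.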
Stan dev team recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
As $n \to \infty$, the likelihood component of the posterior becomes dominant, and as a result frequentist and Bayesian inferences will be based on the same limiting multivariate normal distribution.
To choose between Bayesian models, we compute the posterior over models

$$p(m \mid \mathcal{D}) \propto p(\mathcal{D} \mid m)\, p(m)$$

which allows us to pick the MAP model $\hat{m} = \arg\max_m p(m \mid \mathcal{D})$. If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which maximises

$$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

which is called the marginal likelihood / integrated likelihood / evidence for model $m$.
## Conjugate Priors and Updating
Priors of the form $\pi(\theta) \propto c$ are improper because $\int \pi(\theta)\, d\theta = \infty$. Improper priors are generally not a problem as long as the resulting posterior is well defined.
Flat priors are not invariant to reparameterization. Suppose we choose the prior $\pi(\theta) = 1$ and define the transformation $\phi = h(\theta)$. By change of variables, the resulting distribution of $\phi$ is $\pi_\phi(\phi) = \left| \frac{d}{d\phi} h^{-1}(\phi) \right|$, which is in general not flat.
Jeffreys’ Prior is a method of constructing invariant priors:

$$\pi(\theta) \propto \sqrt{I(\theta)}$$

where $I(\theta)$ is the Fisher information. For a multiparameter model, $\pi(\theta) \propto \sqrt{\det I(\theta)}$.

Writing $\ell(\theta) = \log p(y \mid \theta)$ and differentiating twice gives

$$-\frac{\partial^2 \ell}{\partial \theta^2} = \left( \frac{\partial \ell}{\partial \theta} \right)^2 - \frac{p''(y \mid \theta)}{p(y \mid \theta)}$$

Taking expectations wrt the sample density sends the second piece to zero (since $\int p''(y \mid \theta)\, dy = \frac{\partial^2}{\partial \theta^2} \int p(y \mid \theta)\, dy = 0$), so

$$I(\theta) \equiv E\left[ \left( \frac{\partial \ell}{\partial \theta} \right)^2 \right] = -E\left[ \frac{\partial^2 \ell}{\partial \theta^2} \right]$$
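As a numerical sanity check (illustrative, not from the text): for the Bernoulli model $I(\theta) = 1/[\theta(1-\theta)]$, so Jeffreys' prior is Beta(1/2, 1/2); the expected negative Hessian can be verified by finite differences:

```python
import numpy as np

def neg_expected_hessian(theta, eps=1e-5):
    """E[-d^2/dtheta^2 log p(y|theta)] for Bernoulli, by finite differences."""
    def loglik(y, t):
        return y * np.log(t) + (1 - y) * np.log(1 - t)
    total = 0.0
    for y, w in [(1, theta), (0, 1 - theta)]:   # exact expectation over y in {0, 1}
        second = (loglik(y, theta + eps) - 2 * loglik(y, theta)
                  + loglik(y, theta - eps)) / eps ** 2
        total += -w * second
    return total

theta = 0.3
print(neg_expected_hessian(theta), 1 / (theta * (1 - theta)))  # both ~4.762
```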
Analytically tractable expressions for the posterior arise when the sample and prior densities form a natural conjugate pair, defined by the property that the sample, prior, and posterior densities all lie in the same class of densities. The exponential family is essentially the only class of densities to have natural conjugate priors.
A one-parameter member of the exponential family has a density for $n$ observations that can be expressed as

$$p(y \mid \theta) = \left[ \prod_{i=1}^{n} f(y_i) \right] g(\theta)^n \exp\left( \phi(\theta) \sum_{i=1}^{n} t(y_i) \right)$$

Let $y_i \mid \theta \sim \text{Bernoulli}(\theta)$, and take the prior $\pi(\theta) = 1$ on $[0, 1]$. By Bayes' theorem, the posterior is of the form

$$\pi(\theta \mid y) \propto \theta^{\sum y_i} (1 - \theta)^{n - \sum y_i}$$

i.e. a $\text{Beta}\left(\sum y_i + 1,\; n - \sum y_i + 1\right)$ density. Instead we take $\theta \sim \text{Beta}(\alpha, \beta)$; the uniform prior is a special case with $\alpha = \beta = 1$. In general, the posterior is of the form

$$\theta \mid y \sim \text{Beta}\left( \alpha + \sum y_i,\; \beta + n - \sum y_i \right)$$
| Quantity | Formula |
|---|---|
| Posterior Mean | $\dfrac{\alpha + \sum y_i}{\alpha + \beta + n}$ |
| Posterior Mode | $\dfrac{\alpha + \sum y_i - 1}{\alpha + \beta + n - 2}$ |
| Posterior Variance | $\dfrac{(\alpha + \sum y_i)(\beta + n - \sum y_i)}{(\alpha + \beta + n)^2 (\alpha + \beta + n + 1)}$ |
| Posterior Predictive Distribution | Beta-Binomial with updated parameters $\alpha_1 = \alpha + \sum y_i$, $\beta_1 = \beta + n - \sum y_i$; `library(extraDistr); rbbinom(n, size, alpha, beta)` |
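The updating rule and the Beta-Binomial posterior predictive can be sketched in Python (the Beta(2, 2) prior and the simulated data are assumptions; this mirrors what the R call `rbbinom` does):

```python
import numpy as np
rng = np.random.default_rng(0)

a, b = 2.0, 2.0                         # assumed Beta(2, 2) prior
y = rng.binomial(1, 0.7, size=50)       # simulated Bernoulli data, true rate 0.7
a_n, b_n = a + y.sum(), b + len(y) - y.sum()

post_mean = a_n / (a_n + b_n)
post_var = a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1))

# Posterior predictive for 20 future trials: draw theta, then the count given theta
theta = rng.beta(a_n, b_n, size=10_000)
pred = rng.binomial(20, theta)          # Beta-Binomial draws, like rbbinom
print(post_mean, pred.mean() / 20)
```

The predictive mean per trial agrees with the posterior mean, as the table's formulas imply.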
Suppose we have a proportion from a previous study, $p_0$, with variance $v$. Then we can create a constant

$$n_0 = \frac{p_0 (1 - p_0)}{v} - 1$$

and compute the hyper-parameters $\alpha$ and $\beta$ for our Beta prior distribution as

$$\alpha = p_0 n_0, \qquad \beta = (1 - p_0) n_0$$
Surprisingly this works! See Jackman p.55 for a worked out example.
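A quick numeric check of the moment-matching recipe (the proportion 0.4 and variance 0.01 are made-up numbers, not Jackman's example):

```python
p0, v = 0.4, 0.01                  # proportion and variance from a "previous study"

n0 = p0 * (1 - p0) / v - 1         # effective prior sample size
alpha, beta = p0 * n0, (1 - p0) * n0

# The Beta(alpha, beta) prior reproduces the target mean and variance
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(alpha, beta, mean, var)      # mean is 0.4, variance is 0.01
```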
Let $y_i \mid \lambda \sim \text{Poisson}(\lambda)$. This means that

$$p(y \mid \lambda) \propto \lambda^{\sum y_i} e^{-n\lambda}$$

We specify a Gamma$(\alpha, \beta)$ prior on $\lambda$, which has density

$$\pi(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}$$

So then the posterior for $\lambda$ is

$$\lambda \mid y \sim \text{Gamma}\left( \alpha + \sum y_i,\; \beta + n \right)$$

A flat prior $\pi(\lambda) \propto 1$ is the improper limiting case $\alpha = 1, \beta = 0$.
| Quantity | Formula |
|---|---|
| Posterior Mean | $\dfrac{\alpha + \sum y_i}{\beta + n}$ |
| Posterior Mode | $\dfrac{\alpha + \sum y_i - 1}{\beta + n}$ |
| Posterior Variance | $\dfrac{\alpha + \sum y_i}{(\beta + n)^2}$ |
| Posterior Predictive Distribution | Negative Binomial with $\text{size} = \alpha + \sum y_i$ and $\text{prob} = \dfrac{\beta + n}{\beta + n + 1}$; `rnbinom(n, size, prob)` |
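A Python sketch of the Gamma-Poisson update and its negative-binomial predictive (the prior and data are invented; the size/prob parameterization mirrors the R `rnbinom` call):

```python
import numpy as np
rng = np.random.default_rng(1)

a, b = 2.0, 1.0                        # Gamma(2, 1) prior, rate parameterization
y = rng.poisson(4.0, size=100)         # simulated counts, true rate 4
a_n, b_n = a + y.sum(), b + len(y)

post_mean = a_n / b_n
# Posterior predictive: Negative Binomial(size = a_n, prob = b_n / (b_n + 1))
pred = rng.negative_binomial(a_n, b_n / (b_n + 1), size=10_000)
print(post_mean, pred.mean())          # both close to the true rate 4
```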
With a Dirichlet$(\alpha_1, \dots, \alpha_J)$ prior on multinomial probabilities, the posterior is Dirichlet$(\alpha_1 + n_1, \dots, \alpha_J + n_J)$, where $n_j$ is the count of observations in category $j$. For 3 categories, the posterior is:

$$\pi(\theta \mid y) \propto \theta_1^{\alpha_1 + n_1 - 1}\, \theta_2^{\alpha_2 + n_2 - 1}\, \theta_3^{\alpha_3 + n_3 - 1}$$
Let $y_i \mid \mu \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known but the mean $\mu$ is not known. The joint density of $y$ is

$$p(y \mid \mu) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)$$

Given a normal prior $\mu \sim N(\mu_0, \tau_0^2)$, we can write the posterior density in the form $\mu \mid y \sim N(\mu_1, \tau_1^2)$,

where

$$\mu_1 = \frac{\mu_0 / \tau_0^2 + n \bar{y} / \sigma^2}{1 / \tau_0^2 + n / \sigma^2} \qquad \text{and} \qquad \tau_1^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$

The posterior mean is a weighted sum of the prior mean and the sample mean, with weights that reflect the precision of the likelihood via $n / \sigma^2$ and the prior precision $1 / \tau_0^2$. Three cases (ref. Jackman pp. 80-94):
- Variance known, mean unknown. Model: $y_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known; prior $\mu \sim N(\mu_0, \tau_0^2)$.

| Quantity | Formula |
|---|---|
| Posterior Mean | $\mu_1 = \left( \dfrac{\mu_0}{\tau_0^2} + \dfrac{n \bar{y}}{\sigma^2} \right) \Big/ \left( \dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2} \right)$ |
| Posterior Variance | $\tau_1^2 = \left( \dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2} \right)^{-1}$ |
| Posterior Predictive Distribution | $\tilde{y} \mid y \sim N(\mu_1,\; \tau_1^2 + \sigma^2)$, with $\mu_1$ and $\tau_1^2$ as above |
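The precision-weighted update can be verified with a small sketch (all numbers are made up):

```python
import numpy as np

mu0, tau0_sq = 0.0, 4.0           # prior N(0, 4)
sigma_sq = 1.0                    # known data variance
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
n, ybar = len(y), y.mean()

prec = 1 / tau0_sq + n / sigma_sq                     # posterior precision
mu1 = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec    # precision-weighted mean
var1 = 1 / prec
pred_var = var1 + sigma_sq                            # predictive variance
print(mu1, var1, pred_var)
```

Note how the posterior mean sits between the prior mean 0 and the sample mean, pulled toward the data because $n/\sigma^2$ dominates $1/\tau_0^2$ here.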
- Variance and mean both unknown. Prior densities:

$$\mu \mid \sigma^2 \sim N(\mu_0,\; \sigma^2 / n_0), \qquad \sigma^2 \sim \text{Inv-Gamma}\left( \frac{\nu_0}{2},\; \frac{\nu_0 \sigma_0^2}{2} \right)$$

Conditional posterior densities:

$$\mu \mid \sigma^2, y \sim N(\mu_1,\; \sigma^2 / n_1), \qquad \sigma^2 \mid \mu, y \sim \text{Inv-Gamma}\left( \frac{\nu_0 + n + 1}{2},\; \frac{\nu_0 \sigma_0^2 + n_0 (\mu - \mu_0)^2 + \sum_i (y_i - \mu)^2}{2} \right)$$

where

$$n_1 = n_0 + n, \qquad \mu_1 = \frac{n_0 \mu_0 + n \bar{y}}{n_0 + n}$$

Marginal posterior density of $\mu$:

$$\mu \mid y \sim t_{\nu_1}\left( \mu_1,\; \sigma_1^2 / n_1 \right)$$

where

$$\nu_1 = \nu_0 + n, \qquad \nu_1 \sigma_1^2 = \nu_0 \sigma_0^2 + (n - 1) s^2 + \frac{n_0 n}{n_1} (\bar{y} - \mu_0)^2$$

Posterior predictive distribution for $\tilde{y}$:

$$\tilde{y} \mid y \sim t_{\nu_1}\left( \mu_1,\; \sigma_1^2 \left( 1 + \frac{1}{n_1} \right) \right)$$

where $s^2 = \frac{1}{n-1} \sum_i (y_i - \bar{y})^2$ is the sample variance.
- Improper reference prior. Prior densities:

$$\pi(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$$

Posterior densities:

$$\mu \mid \sigma^2, y \sim N(\bar{y},\; \sigma^2 / n), \qquad \sigma^2 \mid y \sim \text{Inv-Gamma}\left( \frac{n-1}{2},\; \frac{(n-1) s^2}{2} \right)$$

which implies

$$\mu \mid y \sim t_{n-1}\left( \bar{y},\; s^2 / n \right)$$

Posterior predictive distribution:

$$\tilde{y} \mid y \sim t_{n-1}\left( \bar{y},\; s^2 \left( 1 + \frac{1}{n} \right) \right)$$

where $s^2$ is the sample variance.
The posterior can often be approximated by simulation:

- Draw $\theta^{(s)} \sim \pi(\theta \mid y)$ for $s = 1, \dots, S$.
- A histogram of the draws $\{\theta^{(s)}\}$ approximates the posterior density $\pi(\theta \mid y)$.

Methods for this: Markov Chain Monte Carlo (e.g. Gibbs sampling, Metropolis-Hastings, Hamiltonian Monte Carlo).
## Conjugacy for Discrete Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Bernoulli | Beta$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + n - \sum y_i$ |
| Binomial | Beta$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + \sum N_i - \sum y_i$ |
| Negative Binomial (known $r$) | Beta$(\alpha, \beta)$ | $\alpha + rn,\; \beta + \sum y_i$ |
| Poisson | Gamma$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + n$ |
| Multinomial | Dirichlet$(\alpha_1, \dots, \alpha_J)$ | $\alpha_j + n_j$ for each category $j$ |
## Conjugacy for Continuous Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Uniform$(0, \theta)$ | Pareto$(x_m, k)$ | $\max\{x_m, y_{(n)}\},\; k + n$ |
| Exponential | Gamma$(\alpha, \beta)$ | $\alpha + n,\; \beta + \sum y_i$ |
| Normal (known $\sigma^2$) | Normal$(\mu_0, \tau_0^2)$ | $\mu_1, \tau_1^2$ as given above |
| Normal (known $\mu$) | Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ | $\nu + n,\; \dfrac{\nu \sigma_0^2 + \sum_i (y_i - \mu)^2}{\nu + n}$ |
| Normal | Normal-Scaled Inverse Gamma | $\mu_1, n_1, \nu_1, \sigma_1^2$ as given above |
## Computation / Markov Chains

A stochastic process $\{\theta^{(t)}\}$ is a collection of random variables.

The process is a Markov Chain if

$$P\left( \theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \dots, \theta^{(0)} \right) = P\left( \theta^{(t+1)} \mid \theta^{(t)} \right)$$
First rewrite the integral to be evaluated as follows:

$$\int_a^b h(\theta)\, d\theta = (b - a) \int_a^b h(\theta) \frac{1}{b - a}\, d\theta = (b - a)\, E[h(U)]$$

where $U \sim \text{Uniform}(a, b)$ and $\frac{1}{b - a}$ is the probability density for a uniform r.v. over $(a, b)$. If we generate $u_1, \dots, u_S \sim \text{Uniform}(a, b)$, then by the LLN,

$$(b - a) \frac{1}{S} \sum_{s=1}^{S} h(u_s) \longrightarrow (b - a)\, E[h(U)] = \int_a^b h(\theta)\, d\theta$$
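The uniform-sampling estimator above in a few lines (the integrand $\sin$ on $(0, \pi)$, with known answer 2, is an illustrative choice):

```python
import numpy as np
rng = np.random.default_rng(2)

a, b = 0.0, np.pi
h = np.sin                           # integral of sin over (0, pi) is exactly 2

u = rng.uniform(a, b, size=200_000)  # u_1, ..., u_S ~ Uniform(a, b)
estimate = (b - a) * h(u).mean()
print(estimate)                      # near 2 by the LLN
```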
The goal is to generate sequences $\{\theta^{(t)}\}$ from $\pi(\theta \mid y)$. An MCMC scheme with transition kernel $p(\theta^{(t+1)} \mid \theta^{(t)})$ will generate samples from the posterior if the posterior is invariant under the kernel:

$$\pi\left( \theta^{(t+1)} \mid y \right) = \int p\left( \theta^{(t+1)} \mid \theta^{(t)} \right) \pi\left( \theta^{(t)} \mid y \right) d\theta^{(t)}$$

The integrand on the RHS is the joint pdf of $(\theta^{(t)}, \theta^{(t+1)})$ from the chain when $\theta^{(t)}$ is from the posterior. Integrating the RHS over $\theta^{(t)}$ yields the marginal of $\theta^{(t+1)}$, so the result states that if a given $\theta^{(t)}$ is from the correct posterior distribution, the chain generates $\theta^{(t+1)}$ also from the posterior.
Basic idea: turn a high-dimensional problem into several one-dimensional problems. Suppose $\theta = (\theta_1, \theta_2)$ has joint posterior density $\pi(\theta_1, \theta_2 \mid y)$. Suppose it is possible to simulate from the conditional distributions $\pi(\theta_1 \mid \theta_2, y)$ and $\pi(\theta_2 \mid \theta_1, y)$. Let $(\theta_1^{(0)}, \theta_2^{(0)})$ be starting values. Assuming we have drawn $(\theta_1^{(t)}, \theta_2^{(t)})$, we generate $(\theta_1^{(t+1)}, \theta_2^{(t+1)})$ as follows:

1. $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, y)$
2. $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, y)$

For multiple parameters, cycle through each full conditional in turn.
Example: $y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$. Let $\bar{y} = \frac{1}{n} \sum_i y_i$. Define the precision $\tau = 1 / \sigma^2$.

Likelihood:

$$p(y \mid \mu, \tau) \propto \tau^{n/2} \exp\left( -\frac{\tau}{2} \sum_i (y_i - \mu)^2 \right)$$

(Noninformative) Prior:

$$\pi(\mu, \tau) \propto \frac{1}{\tau}$$

Posterior Distribution:

$$\pi(\mu, \tau \mid y) \propto \tau^{n/2 - 1} \exp\left( -\frac{\tau}{2} \sum_i (y_i - \mu)^2 \right)$$

Full conditionals:

$$\mu \mid \tau, y \sim N\left( \bar{y},\; \frac{1}{n\tau} \right), \qquad \tau \mid \mu, y \sim \text{Gamma}\left( \frac{n}{2},\; \frac{1}{2} \sum_i (y_i - \mu)^2 \right)$$
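These two full conditionals give a complete Gibbs sampler; a sketch on simulated data (starting values and chain length are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(3)

y = rng.normal(5.0, 2.0, size=200)   # simulated data: true mu = 5, sigma = 2
n, ybar = len(y), y.mean()

S = 5_000
mu, tau = 0.0, 1.0                   # arbitrary starting values
draws = np.empty((S, 2))
for s in range(S):
    mu = rng.normal(ybar, np.sqrt(1 / (n * tau)))              # mu | tau, y
    tau = rng.gamma(n / 2, 1 / (0.5 * ((y - mu) ** 2).sum()))  # tau | mu, y (rate -> scale)
    draws[s] = mu, tau

burned = draws[1_000:]               # discard burn-in
print(burned[:, 0].mean(), 1 / np.sqrt(burned[:, 1].mean()))   # ~5 and ~2
```

NumPy's `gamma` is parameterized by shape and scale, so the Gamma rate above is inverted when sampling.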
However, it is typically impossible to write out or sample from the full conditionals.
Let $q(\theta^* \mid \theta)$ be an arbitrary, friendly distribution we can sample from; it is called the proposal distribution. MH creates a sequence of observations $\{\theta^{(t)}\}$ as follows. Choose $\theta^{(0)}$ arbitrarily. Suppose we have generated $\theta^{(t)}$. Generate $\theta^{(t+1)}$ as follows:

- Generate a proposal $\theta^* \sim q(\theta^* \mid \theta^{(t)})$.
- Evaluate the acceptance probability $r = \min\{1, R\}$ where

$$R = \frac{\pi(\theta^* \mid y)\, q(\theta^{(t)} \mid \theta^*)}{\pi(\theta^{(t)} \mid y)\, q(\theta^* \mid \theta^{(t)})}$$

- Set

$$\theta^{(t+1)} = \begin{cases} \theta^* & \text{with probability } r \\ \theta^{(t)} & \text{with probability } 1 - r \end{cases}$$
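A minimal random-walk MH sketch; the standard-normal target and the step size are illustrative assumptions, and with a symmetric proposal the $q$ terms in $R$ cancel:

```python
import numpy as np
rng = np.random.default_rng(4)

def log_target(theta):
    return -0.5 * theta ** 2            # unnormalized log posterior (toy target)

S, step = 20_000, 1.0
theta = 0.0                             # arbitrary starting value
chain = np.empty(S)
for s in range(S):
    prop = theta + rng.normal(0.0, step)       # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        theta = prop                            # accept with probability min(1, R)
    chain[s] = theta

print(chain.mean(), chain.std())        # ~0 and ~1 for the standard normal target
```

Note that the target only needs to be known up to a constant: the normalizing constant cancels in $R$, which is what makes MH usable for posteriors.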
Let $x$ be observed and $z$ missing. The goal is to maximise the log-likelihood of the observed data:

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]$$

We cannot push the log inside the sum because of the unobserved variables. EM tackles the problem as follows. Define the complete data log likelihood as

$$\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta)$$

This cannot be computed, since $z_i$ is unknown. Instead, define

$$Q(\theta, \theta^{t-1}) = E\left[ \ell_c(\theta) \mid \mathcal{D}, \theta^{t-1} \right]$$

where $t$ is the iteration number and $Q$ is called the auxiliary function.

Expectation (E) Step: Compute $Q(\theta, \theta^{t-1})$, which is an expectation wrt the old parameters $\theta^{t-1}$.

Maximisation (M) Step: Optimise the $Q$ function wrt $\theta$:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1})$$

For MAP estimation, the M step becomes $\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1}) + \log \pi(\theta)$.
Probit regression has the form $p(y_i = 1 \mid x_i, w) = \Phi(w^T x_i)$, which can be written with a latent variable $z_i = w^T x_i + \epsilon_i$, $\epsilon_i \sim N(0, 1)$, and $y_i = \mathbb{1}(z_i > 0)$. We form the complete data log likelihood, assuming a Gaussian prior on $w$. The posterior in the E step is a truncated Gaussian, with mean

$$E[z_i \mid w] = \begin{cases} \mu_i + \dfrac{\phi(\mu_i)}{\Phi(\mu_i)} & \text{if } y_i = 1 \\[2ex] \mu_i - \dfrac{\phi(\mu_i)}{\Phi(-\mu_i)} & \text{if } y_i = 0 \end{cases}$$

where $\mu_i = w^T x_i$. In the M step, we estimate $w$ using ridge regression of $E[z]$ on $X$.
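A sketch of this EM scheme for probit (the simulated data, the ridge penalty `lam`, and the iteration count are arbitrary choices, not from the text):

```python
import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(5)

n, d, lam = 500, 3, 1.0                      # lam: arbitrary ridge penalty
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])          # invented true weights
y = (X @ w_true + rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
for _ in range(200):
    mu = X @ w
    # E step: mean of z_i ~ N(mu_i, 1) truncated to z_i > 0 (y=1) or z_i < 0 (y=0)
    ez = np.where(y == 1,
                  mu + norm.pdf(mu) / norm.cdf(mu),
                  mu - norm.pdf(mu) / norm.cdf(-mu))
    # M step: ridge regression of E[z] on X
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ ez)

print(w)   # sign pattern matches w_true
```

The M step uses only $E[z_i]$ because the quadratic complete-data objective is linear in $z_i$ once expanded, so the variance terms drop out of the argmax.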
## Hierarchical Models

Parameters in a prior are modeled as having a distribution that depends on hyperparameters $\eta$. This results in joint posteriors of the form

$$\pi(\theta, \eta \mid y) \propto p(y \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

Represented by the graphical model $\eta \to \theta \to y$.

We are typically interested in the marginal posterior of $\theta$, which is obtained by integrating the joint posterior w.r.t. $\eta$:

$$\pi(\theta \mid y) = \int \pi(\theta, \eta \mid y)\, d\eta$$

By treating $\theta$ as a latent variable, we allow data-poor observations to borrow strength from data-rich ones.
## Empirical Bayes

In hierarchical models, we need to compute the posterior on multiple layers of latent variables. For example, for a two-level model, we need

$$p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

We can employ a computational shortcut by approximating the posterior on the hyper-parameters with a point estimate, $p(\eta \mid \mathcal{D}) \approx \delta_{\hat{\eta}}(\eta)$, where $\hat{\eta} = \arg\max_\eta p(\eta \mid \mathcal{D})$. Since $\eta$ is usually much smaller than $\theta$ in dimensionality, we can safely use a uniform prior on $\eta$. Then the estimate becomes

$$\hat{\eta} = \arg\max_\eta p(\mathcal{D} \mid \eta) = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$$

This violates the principle that the prior should be chosen independently of the data, but it is a cheap computational trick. It produces a hierarchy of Bayesian methods in increasing order of the number of integrals performed.
Suppose we measure the number of people in various cities, $N_i$, and the number of people who died of cancer in each, $y_i$. We assume $y_i \sim \text{Bin}(N_i, \theta_i)$ and want to estimate the cancer rates $\theta_i$. The MLE approach would be to either estimate them all separately, or estimate a single pooled $\theta$ for all cities.

The hierarchical approach is to model $\theta_i \sim \text{Beta}(a, b)$ and write a joint distribution

$$p(\mathcal{D}, \theta \mid \eta) = \prod_{i} \text{Bin}(y_i \mid N_i, \theta_i)\, \text{Beta}(\theta_i \mid a, b)$$

where $\eta = (a, b)$. We can also put covariates on $\eta$.
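An empirical-Bayes sketch for this example on simulated data, fitting $(a, b)$ by moment matching on the city-level rates rather than by maximising the marginal likelihood (a simplification), then shrinking each city's MLE toward the pooled prior mean:

```python
import numpy as np
rng = np.random.default_rng(6)

n_cities = 50
N = rng.integers(500, 5_000, size=n_cities)       # city populations (simulated)
theta_true = rng.beta(2, 200, size=n_cities)      # true cancer rates (invented)
y = rng.binomial(N, theta_true)                   # observed deaths

rates = y / N                                     # per-city MLEs
m, v = rates.mean(), rates.var()
k = m * (1 - m) / v - 1                           # moment-matched prior "sample size"
a_hat, b_hat = m * k, (1 - m) * k                 # fitted Beta(a, b) hyperparameters

# Each posterior mean shrinks the city MLE toward the pooled prior mean
post = (a_hat + y) / (a_hat + b_hat + N)
print(a_hat, b_hat)
```

Each `post[i]` is a weighted average of the prior mean $\hat{a}/(\hat{a}+\hat{b})$ and the raw rate $y_i/N_i$, so data-poor cities are pulled further toward the pooled estimate.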
## Hierarchy of Bayesianity

| Method | Definition |
|---|---|
| Maximum Likelihood | $\hat{\theta} = \arg\max_\theta p(\mathcal{D} \mid \theta)$ |
| MAP Estimation | $\hat{\theta} = \arg\max_\theta p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)$ |
| Empirical Bayes (ML-II) | $\hat{\eta} = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta = \arg\max_\eta p(\mathcal{D} \mid \eta)$ |
| MAP-II | $\hat{\eta} = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)\, d\theta = \arg\max_\eta p(\mathcal{D} \mid \eta)\, p(\eta)$ |
| Full Bayes | $p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$ |
## Graphical Models

Any joint distribution can be represented as follows:

$$p(x_{1:V}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_V \mid x_{1:V-1})$$

where $V$ is the number of variables [and we have dropped the parameter vector $\theta$]. It follows that, once conditional independence assumptions are encoded in a directed acyclic graph, the joint distribution can be written as

$$p(x_{1:V}) = \prod_{t=1}^{V} p\left( x_t \mid x_{\text{Pa}(t)} \right)$$

where $\text{Pa}(t)$ denotes the parent nodes of $x_t$, which are nodes that have arrows pointing to $x_t$.

$x$ and $y$ are said to be conditionally independent given $z$, written $x \perp y \mid z$, iff the conditional joint can be written as the product of the conditional marginals:

$$p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$$

If the condition $A \perp B \mid C$ holds, it must be the case that all paths from $A$ to $B$ are blocked. All paths are blocked iff every path contains a node at which either:

- Arrows on the path meet either head to tail or tail to tail at the node, and the node is in the conditioning set $C$; or
- Arrows meet head to head at the node, and neither the node nor any of its descendants is in the set $C$.
## Empirical Bayes: Batting Averages (James-Stein)
We suppose that each player's MLE value (his batting average in the first 90 tries) is a binomial proportion,

$$\text{MLE}_i \sim \frac{1}{90} \text{Bin}(90, \text{TRUTH}_i)$$

Here $\text{TRUTH}_i$ is his true average, how he would perform over an infinite number of tries; TRUTH is itself a binomial proportion, taken over an average of 370 more tries per player.

At this point there are two ways to proceed. The simplest uses a normal approximation to (7.17),

$$\text{MLE}_i \;\dot\sim\; N(\text{TRUTH}_i, \sigma_0^2)$$

where $\sigma_0^2$ is the binomial variance

$$\sigma_0^2 = \frac{\bar{p}(1 - \bar{p})}{90}$$

with $\bar{p}$ the average of the $\text{MLE}_i$'s. Letting $\bar{y} = \bar{p}$, applying (7.13), and transforming back gives the James-Stein estimates

$$\text{JS}_i = \bar{p} + \left( 1 - \frac{(N - 3)\,\sigma_0^2}{\sum_i (\text{MLE}_i - \bar{p})^2} \right) \left( \text{MLE}_i - \bar{p} \right)$$
A second approach begins with the arcsin transformation
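The James-Stein estimates above can be sketched on simulated batting data (the Beta(10, 15) truth distribution and all numbers are invented, not Efron's dataset):

```python
import numpy as np
rng = np.random.default_rng(7)

N, tries = 18, 90
truth = rng.beta(10, 15, size=N)                 # true averages near 0.4 (invented)
mle = rng.binomial(tries, truth) / tries         # each player's first-90-tries average

pbar = mle.mean()
sigma0_sq = pbar * (1 - pbar) / tries            # binomial variance sigma_0^2
ss = ((mle - pbar) ** 2).sum()
shrink = 1 - (N - 3) * sigma0_sq / ss            # common shrinkage factor
js = pbar + shrink * (mle - pbar)                # James-Stein estimates

print(shrink)                                    # shrinkage factor in (0, 1)
```

Every estimate is pulled toward the grand mean by the same factor, which is what lets extreme early-season averages borrow strength from the rest of the league.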