This note is a high-fidelity Markdown migration of the Bayesian Statistics chapter from the LaTeX source.
Parent map: index. Prerequisites: probability-and-mathstats, linear-regression, maximum-likelihood-and-machine-learning.
Concept map:

```mermaid
flowchart TD
    A[Bayes Theorem] --> B[Priors]
    B --> C[Conjugate Updating]
    A --> D[Posterior Predictive]
    A --> E[Model Selection]
    E --> F[Marginal Likelihood]
    C --> G[Hierarchical Models]
    G --> H[Empirical Bayes]
    A --> I[Computation]
    I --> J[MCMC]
    J --> K[Gibbs]
    J --> L[Metropolis-Hastings]
    I --> M[EM Algorithm]
    A --> N[Graphical Models]
```
# Bayesian Statistics

## Setup
Notation: following the Murphy textbook, some statements use $\mathcal{D}$ as shorthand for the data.
- Use Bayes Rule to come up with a posterior probability of some hypothesis $H$ given event $E_1$. Your prior is $P(H)$:
  $$P(H \mid E_1) = \frac{P(E_1 \mid H)\, P(H)}{P(E_1)}$$
  Call $P(H \mid E_1)$ the posterior probability.
- Given a second event $E_2$, use the posterior probability from step 1 as your prior in the second update step.
A sequence of random variables $y_1, \dots, y_n$ is finitely exchangeable if their joint density remains the same under any re-ordering or re-labeling of the indices of the data: $p(y_1, \dots, y_n) = p(y_{\pi(1)}, \dots, y_{\pi(n)})$ for every permutation $\pi$.
Exchangeability justifies use of the prior: if the data are exchangeable, then there is a parameter $\theta$ that drives the stochastic model generating the data, and there exists a density $\pi(\theta)$ over $\theta$ that does not depend on the data itself. The data are conditionally i.i.d. given $\theta$.

Independence vs. Exchangeability: independence is a stronger condition than exchangeability (it is a special case of exchangeability). Exchangeability only requires that the marginal distribution of each random variable is the same, i.e. $p(y_i) = p(y_j)$ for all $i, j$. Independence additionally requires that $p(y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i)$. As a result, you can have exchangeability in situations where you do not have independence, most notably sampling without replacement. If the marginal probabilities are unknown, then we only have exchangeability (not independence) even if the samples are drawn with replacement, due to the possibility that there is only one unit with a particular value of $y$.
With the full posterior, one can compute the posterior mean, median, and mode (the latter is sometimes called the Maximum A Posteriori, or MAP, estimate).
One can also compute a $100(1-\alpha)\%$ Highest Posterior Density region, which is a region $C$ such that the parameter lies in the region with probability $1 - \alpha$, $\Pr(\theta \in C \mid y) = 1 - \alpha$, and the posterior density everywhere inside $C$ is at least as high as everywhere outside it.
Consider out-of-sample prediction for a single observation $\tilde{y}$. The posterior predictive density is

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta, y)\, \pi(\theta \mid y)\, d\theta$$

Because $\tilde{y}$ is independent of $y$ conditional on $\theta$ (exchangeability), we can simplify this as

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, \pi(\theta \mid y)\, d\theta$$

This is just the data density for $\tilde{y}$ multiplied by the posterior density for $\theta$, integrated over $\theta$.
Consider binary $\tilde{y} \in \{0, 1\}$ with $\tilde{y} \mid \theta \sim \text{Bernoulli}(\theta)$. The posterior predictive density is

$$p(\tilde{y} \mid y) = \int \theta^{\tilde{y}} (1 - \theta)^{1 - \tilde{y}}\, \pi(\theta \mid y)\, d\theta$$

So if we want to know the posterior predictive probability $P(\tilde{y} = 1 \mid y)$, we can compute it as

$$P(\tilde{y} = 1 \mid y) = \int \theta\, \pi(\theta \mid y)\, d\theta = E[\theta \mid y]$$

which is the posterior mean.
An uninformative (flat) prior on $\theta$ produces a posterior density that is proportional to the likelihood (differing only by the constant of proportionality). This implies that the mode of the posterior density is the $\theta$ that maximizes the likelihood function, i.e. the MLE. An informative prior on $\theta$ yields a posterior mean that is a precision-weighted average of the prior mean and the MLE.
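This can be checked numerically with a grid approximation; a minimal Python sketch with made-up data (the 8-trial Bernoulli sample is purely illustrative):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])             # invented data: 6 successes in 8
theta = np.linspace(0.001, 0.999, 999)             # grid over the parameter

lik = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())
post = lik / (lik.sum() * (theta[1] - theta[0]))   # flat prior: normalized likelihood

mle = theta[np.argmax(lik)]
post_mode = theta[np.argmax(post)]
print(mle, post_mode)                              # both ~0.75, the MLE 6/8
```

Because the flat prior only rescales the likelihood, the posterior mode and the MLE land on the same grid point.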
Stan dev team recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
As $n \to \infty$, the likelihood component of the posterior becomes dominant, and as a result frequentist and Bayesian inferences will be based on the same limiting multivariate normal distribution.
To choose between Bayesian models, we compute the posterior over models

$$p(m \mid \mathcal{D}) \propto p(\mathcal{D} \mid m)\, p(m)$$

which allows us to pick the MAP model $\hat{m} = \arg\max_m p(m \mid \mathcal{D})$. If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which maximises

$$p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

which is called the marginal likelihood / integrated likelihood / evidence for model $m$.
## Conjugate Priors and Updating
Priors of the form $\pi(\theta) \propto c$ are improper because $\int \pi(\theta)\, d\theta = \infty$. Improper priors are generally not a problem as long as the resulting posterior is well defined.
Flat priors are not invariant to reparameterization. Suppose we choose the prior $\pi(\theta) = 1$ and define the transformation $\phi = h(\theta)$. By change of variables, the resulting distribution of $\phi$ is $\pi_\phi(\phi) = \left| \frac{d}{d\phi} h^{-1}(\phi) \right|$, which is in general not flat.
Jeffreys’ Prior is a method of constructing invariant priors:

$$\pi(\theta) \propto \sqrt{I(\theta)}$$

where $I(\theta)$ is the Fisher information. For a multiparameter model, $\pi(\theta) \propto \sqrt{\det I(\theta)}$.

Writing $\ell(\theta) = \log p(y \mid \theta)$ and differentiating twice gives

$$-\frac{\partial^2 \ell}{\partial \theta^2} = \left( \frac{\partial \ell}{\partial \theta} \right)^2 - \frac{p''(y \mid \theta)}{p(y \mid \theta)}$$

Taking expectations wrt the sample density sends the second piece to zero (since $\int p''(y \mid \theta)\, dy = \frac{\partial^2}{\partial \theta^2} \int p(y \mid \theta)\, dy = 0$), so

$$I(\theta) \equiv E\left[ \left( \frac{\partial \ell}{\partial \theta} \right)^2 \right] = -E\left[ \frac{\partial^2 \ell}{\partial \theta^2} \right]$$
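As a numerical sanity check (illustrative, not from the text): for the Bernoulli model $I(\theta) = 1/[\theta(1-\theta)]$, so Jeffreys' prior is Beta(1/2, 1/2); the expected negative Hessian can be verified by finite differences:

```python
import numpy as np

def neg_expected_hessian(theta, eps=1e-5):
    """E[-d^2/dtheta^2 log p(y|theta)] for Bernoulli, by finite differences."""
    def loglik(y, t):
        return y * np.log(t) + (1 - y) * np.log(1 - t)
    total = 0.0
    for y, w in [(1, theta), (0, 1 - theta)]:   # exact expectation over y in {0, 1}
        second = (loglik(y, theta + eps) - 2 * loglik(y, theta)
                  + loglik(y, theta - eps)) / eps ** 2
        total += -w * second
    return total

theta = 0.3
print(neg_expected_hessian(theta), 1 / (theta * (1 - theta)))  # both ~4.762
```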
Analytically tractable expressions for the posterior arise when the sample and prior densities form a natural conjugate pair, defined by the property that the sample, prior, and posterior densities all lie in the same class of densities. The exponential family is essentially the only class of densities to have natural conjugate priors.
A one-parameter member of the exponential family has a density for $n$ observations that can be expressed as

$$p(y \mid \theta) = \left[ \prod_{i=1}^{n} f(y_i) \right] g(\theta)^n \exp\left( \phi(\theta) \sum_{i=1}^{n} t(y_i) \right)$$

Let $y_i \mid \theta \sim \text{Bernoulli}(\theta)$, and take the prior $\pi(\theta) = 1$ on $[0, 1]$. By Bayes' theorem, the posterior is of the form

$$\pi(\theta \mid y) \propto \theta^{\sum y_i} (1 - \theta)^{n - \sum y_i}$$

i.e. a $\text{Beta}\left(\sum y_i + 1,\; n - \sum y_i + 1\right)$ density. Instead we take $\theta \sim \text{Beta}(\alpha, \beta)$; the uniform prior is a special case with $\alpha = \beta = 1$. In general, the posterior is of the form

$$\theta \mid y \sim \text{Beta}\left( \alpha + \sum y_i,\; \beta + n - \sum y_i \right)$$
| Quantity | Formula |
|---|---|
| Posterior Mean | $\dfrac{\alpha + \sum y_i}{\alpha + \beta + n}$ |
| Posterior Mode | $\dfrac{\alpha + \sum y_i - 1}{\alpha + \beta + n - 2}$ |
| Posterior Variance | $\dfrac{(\alpha + \sum y_i)(\beta + n - \sum y_i)}{(\alpha + \beta + n)^2 (\alpha + \beta + n + 1)}$ |
| Posterior Predictive Distribution | Beta-Binomial with updated parameters $\alpha_1 = \alpha + \sum y_i$, $\beta_1 = \beta + n - \sum y_i$; `library(extraDistr); rbbinom(n, size, alpha, beta)` |
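The updating rule and the Beta-Binomial posterior predictive can be sketched in Python (the Beta(2, 2) prior and the simulated data are assumptions; this mirrors what the R call `rbbinom` does):

```python
import numpy as np
rng = np.random.default_rng(0)

a, b = 2.0, 2.0                         # assumed Beta(2, 2) prior
y = rng.binomial(1, 0.7, size=50)       # simulated Bernoulli data, true rate 0.7
a_n, b_n = a + y.sum(), b + len(y) - y.sum()

post_mean = a_n / (a_n + b_n)
post_var = a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1))

# Posterior predictive for 20 future trials: draw theta, then the count given theta
theta = rng.beta(a_n, b_n, size=10_000)
pred = rng.binomial(20, theta)          # Beta-Binomial draws, like rbbinom
print(post_mean, pred.mean() / 20)
```

The predictive mean per trial agrees with the posterior mean, as the table's formulas imply.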
Suppose we have a proportion from a previous study, $p_0$, with variance $v$. Then we can create a constant

$$n_0 = \frac{p_0 (1 - p_0)}{v} - 1$$

and compute the hyper-parameters $\alpha$ and $\beta$ for our Beta prior distribution as

$$\alpha = p_0 n_0, \qquad \beta = (1 - p_0) n_0$$
Surprisingly this works! See Jackman p.55 for a worked out example.
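A quick numeric check of the moment-matching recipe (the proportion 0.4 and variance 0.01 are made-up numbers, not Jackman's example):

```python
p0, v = 0.4, 0.01                  # proportion and variance from a "previous study"

n0 = p0 * (1 - p0) / v - 1         # effective prior sample size
alpha, beta = p0 * n0, (1 - p0) * n0

# The Beta(alpha, beta) prior reproduces the target mean and variance
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(alpha, beta, mean, var)      # mean is 0.4, variance is 0.01
```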
Let $y_i \mid \lambda \sim \text{Poisson}(\lambda)$. This means that

$$p(y \mid \lambda) \propto \lambda^{\sum y_i} e^{-n\lambda}$$

We specify a Gamma$(\alpha, \beta)$ prior on $\lambda$, which has density

$$\pi(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}$$

So then the posterior for $\lambda$ is

$$\lambda \mid y \sim \text{Gamma}\left( \alpha + \sum y_i,\; \beta + n \right)$$

A flat prior $\pi(\lambda) \propto 1$ is the improper limiting case $\alpha = 1, \beta = 0$.
| Quantity | Formula |
|---|---|
| Posterior Mean | $\dfrac{\alpha + \sum y_i}{\beta + n}$ |
| Posterior Mode | $\dfrac{\alpha + \sum y_i - 1}{\beta + n}$ |
| Posterior Variance | $\dfrac{\alpha + \sum y_i}{(\beta + n)^2}$ |
| Posterior Predictive Distribution | Negative Binomial with $\text{size} = \alpha + \sum y_i$ and $\text{prob} = \dfrac{\beta + n}{\beta + n + 1}$; `rnbinom(n, size, prob)` |
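A Python sketch of the Gamma-Poisson update and its negative-binomial predictive (the prior and data are invented; the size/prob parameterization mirrors the R `rnbinom` call):

```python
import numpy as np
rng = np.random.default_rng(1)

a, b = 2.0, 1.0                        # Gamma(2, 1) prior, rate parameterization
y = rng.poisson(4.0, size=100)         # simulated counts, true rate 4
a_n, b_n = a + y.sum(), b + len(y)

post_mean = a_n / b_n
# Posterior predictive: Negative Binomial(size = a_n, prob = b_n / (b_n + 1))
pred = rng.negative_binomial(a_n, b_n / (b_n + 1), size=10_000)
print(post_mean, pred.mean())          # both close to the true rate 4
```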
With a Dirichlet$(\alpha_1, \dots, \alpha_J)$ prior on multinomial probabilities, the posterior is Dirichlet$(\alpha_1 + n_1, \dots, \alpha_J + n_J)$, where $n_j$ is the count of observations in category $j$. For 3 categories, the posterior is:

$$\pi(\theta \mid y) \propto \theta_1^{\alpha_1 + n_1 - 1}\, \theta_2^{\alpha_2 + n_2 - 1}\, \theta_3^{\alpha_3 + n_3 - 1}$$
Let $y_i \mid \mu \sim N(\mu, \sigma^2)$, where $\sigma^2$ is known but the mean $\mu$ is not known. The joint density of $y$ is

$$p(y \mid \mu) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)$$

Given a normal prior $\mu \sim N(\mu_0, \tau_0^2)$, we can write the posterior density in the form $\mu \mid y \sim N(\mu_1, \tau_1^2)$,

where

$$\mu_1 = \frac{\mu_0 / \tau_0^2 + n \bar{y} / \sigma^2}{1 / \tau_0^2 + n / \sigma^2} \qquad \text{and} \qquad \tau_1^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$

The posterior mean is a weighted sum of the prior mean and the sample mean, with weights that reflect the precision of the likelihood via $n / \sigma^2$ and the prior precision $1 / \tau_0^2$. Three cases (ref. Jackman pp. 80-94):
- Variance known, mean unknown. Model: $y_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known; prior $\mu \sim N(\mu_0, \tau_0^2)$.

| Quantity | Formula |
|---|---|
| Posterior Mean | $\mu_1 = \left( \dfrac{\mu_0}{\tau_0^2} + \dfrac{n \bar{y}}{\sigma^2} \right) \Big/ \left( \dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2} \right)$ |
| Posterior Variance | $\tau_1^2 = \left( \dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2} \right)^{-1}$ |
| Posterior Predictive Distribution | $\tilde{y} \mid y \sim N(\mu_1,\; \tau_1^2 + \sigma^2)$, with $\mu_1$ and $\tau_1^2$ as above |
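The precision-weighted update can be verified with a small sketch (all numbers are made up):

```python
import numpy as np

mu0, tau0_sq = 0.0, 4.0           # prior N(0, 4)
sigma_sq = 1.0                    # known data variance
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
n, ybar = len(y), y.mean()

prec = 1 / tau0_sq + n / sigma_sq                     # posterior precision
mu1 = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec    # precision-weighted mean
var1 = 1 / prec
pred_var = var1 + sigma_sq                            # predictive variance
print(mu1, var1, pred_var)
```

Note how the posterior mean sits between the prior mean 0 and the sample mean, pulled toward the data because $n/\sigma^2$ dominates $1/\tau_0^2$ here.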
- Variance and mean both unknown. Prior densities:

$$\mu \mid \sigma^2 \sim N(\mu_0,\; \sigma^2 / n_0), \qquad \sigma^2 \sim \text{Inv-Gamma}\left( \frac{\nu_0}{2},\; \frac{\nu_0 \sigma_0^2}{2} \right)$$

Conditional posterior densities:

$$\mu \mid \sigma^2, y \sim N(\mu_1,\; \sigma^2 / n_1), \qquad \sigma^2 \mid \mu, y \sim \text{Inv-Gamma}\left( \frac{\nu_0 + n + 1}{2},\; \frac{\nu_0 \sigma_0^2 + n_0 (\mu - \mu_0)^2 + \sum_i (y_i - \mu)^2}{2} \right)$$

where

$$n_1 = n_0 + n, \qquad \mu_1 = \frac{n_0 \mu_0 + n \bar{y}}{n_0 + n}$$

Marginal posterior density of $\mu$:

$$\mu \mid y \sim t_{\nu_1}\left( \mu_1,\; \sigma_1^2 / n_1 \right)$$

where

$$\nu_1 = \nu_0 + n, \qquad \nu_1 \sigma_1^2 = \nu_0 \sigma_0^2 + (n - 1) s^2 + \frac{n_0 n}{n_1} (\bar{y} - \mu_0)^2$$

Posterior predictive distribution for $\tilde{y}$:

$$\tilde{y} \mid y \sim t_{\nu_1}\left( \mu_1,\; \sigma_1^2 \left( 1 + \frac{1}{n_1} \right) \right)$$

where $s^2 = \frac{1}{n-1} \sum_i (y_i - \bar{y})^2$ is the sample variance.
- Improper reference prior. Prior densities:

$$\pi(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$$

Posterior densities:

$$\mu \mid \sigma^2, y \sim N(\bar{y},\; \sigma^2 / n), \qquad \sigma^2 \mid y \sim \text{Inv-Gamma}\left( \frac{n-1}{2},\; \frac{(n-1) s^2}{2} \right)$$

which implies

$$\mu \mid y \sim t_{n-1}\left( \bar{y},\; s^2 / n \right)$$

Posterior predictive distribution:

$$\tilde{y} \mid y \sim t_{n-1}\left( \bar{y},\; s^2 \left( 1 + \frac{1}{n} \right) \right)$$

where $s^2$ is the sample variance.
The posterior can often be approximated by simulation:

- Draw $\theta^{(s)} \sim \pi(\theta \mid y)$ for $s = 1, \dots, S$.
- A histogram of the draws $\{\theta^{(s)}\}$ approximates the posterior density $\pi(\theta \mid y)$.

Methods for this: Markov Chain Monte Carlo (e.g. Gibbs sampling, Metropolis-Hastings, Hamiltonian Monte Carlo).
## Conjugacy for Discrete Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Bernoulli | Beta$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + n - \sum y_i$ |
| Binomial | Beta$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + \sum N_i - \sum y_i$ |
| Negative Binomial (known $r$) | Beta$(\alpha, \beta)$ | $\alpha + rn,\; \beta + \sum y_i$ |
| Poisson | Gamma$(\alpha, \beta)$ | $\alpha + \sum y_i,\; \beta + n$ |
| Multinomial | Dirichlet$(\alpha_1, \dots, \alpha_J)$ | $\alpha_j + n_j$ for each category $j$ |
## Conjugacy for Continuous Distributions

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Uniform$(0, \theta)$ | Pareto$(x_m, k)$ | $\max\{x_m, y_{(n)}\},\; k + n$ |
| Exponential | Gamma$(\alpha, \beta)$ | $\alpha + n,\; \beta + \sum y_i$ |
| Normal (known $\sigma^2$) | Normal$(\mu_0, \tau_0^2)$ | $\mu_1, \tau_1^2$ as given above |
| Normal (known $\mu$) | Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ | $\nu + n,\; \dfrac{\nu \sigma_0^2 + \sum_i (y_i - \mu)^2}{\nu + n}$ |
| Normal | Normal-Scaled Inverse Gamma | $\mu_1, n_1, \nu_1, \sigma_1^2$ as given above |
## Computation / Markov Chains

A stochastic process $\{\theta^{(t)}\}$ is a collection of random variables.

The process is a Markov Chain if

$$P\left( \theta^{(t+1)} \mid \theta^{(t)}, \theta^{(t-1)}, \dots, \theta^{(0)} \right) = P\left( \theta^{(t+1)} \mid \theta^{(t)} \right)$$
First rewrite the integral to be evaluated as follows:

$$\int_a^b h(\theta)\, d\theta = (b - a) \int_a^b h(\theta) \frac{1}{b - a}\, d\theta = (b - a)\, E[h(U)]$$

where $U \sim \text{Uniform}(a, b)$ and $\frac{1}{b - a}$ is the probability density for a uniform r.v. over $(a, b)$. If we generate $u_1, \dots, u_S \sim \text{Uniform}(a, b)$, then by the LLN,

$$(b - a) \frac{1}{S} \sum_{s=1}^{S} h(u_s) \longrightarrow (b - a)\, E[h(U)] = \int_a^b h(\theta)\, d\theta$$
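The uniform-sampling estimator above in a few lines (the integrand $\sin$ on $(0, \pi)$, with known answer 2, is an illustrative choice):

```python
import numpy as np
rng = np.random.default_rng(2)

a, b = 0.0, np.pi
h = np.sin                           # integral of sin over (0, pi) is exactly 2

u = rng.uniform(a, b, size=200_000)  # u_1, ..., u_S ~ Uniform(a, b)
estimate = (b - a) * h(u).mean()
print(estimate)                      # near 2 by the LLN
```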
The goal is to generate sequences $\{\theta^{(t)}\}$ from $\pi(\theta \mid y)$. An MCMC scheme with transition kernel $p(\theta^{(t+1)} \mid \theta^{(t)})$ will generate samples from the posterior if the posterior is invariant under the kernel:

$$\pi\left( \theta^{(t+1)} \mid y \right) = \int p\left( \theta^{(t+1)} \mid \theta^{(t)} \right) \pi\left( \theta^{(t)} \mid y \right) d\theta^{(t)}$$

The integrand on the RHS is the joint pdf of $(\theta^{(t)}, \theta^{(t+1)})$ from the chain when $\theta^{(t)}$ is from the posterior. Integrating the RHS over $\theta^{(t)}$ yields the marginal of $\theta^{(t+1)}$, so the result states that if a given $\theta^{(t)}$ is from the correct posterior distribution, the chain generates $\theta^{(t+1)}$ also from the posterior.
Basic idea: turn a high-dimensional problem into several one-dimensional problems. Suppose $\theta = (\theta_1, \theta_2)$ has joint posterior density $\pi(\theta_1, \theta_2 \mid y)$. Suppose it is possible to simulate from the conditional distributions $\pi(\theta_1 \mid \theta_2, y)$ and $\pi(\theta_2 \mid \theta_1, y)$. Let $(\theta_1^{(0)}, \theta_2^{(0)})$ be starting values. Assuming we have drawn $(\theta_1^{(t)}, \theta_2^{(t)})$, we generate $(\theta_1^{(t+1)}, \theta_2^{(t+1)})$ as follows:

1. $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, y)$
2. $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, y)$

For multiple parameters, cycle through each full conditional in turn.
Example: $y_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$. Let $\bar{y} = \frac{1}{n} \sum_i y_i$. Define the precision $\tau = 1 / \sigma^2$.

Likelihood:

$$p(y \mid \mu, \tau) \propto \tau^{n/2} \exp\left( -\frac{\tau}{2} \sum_i (y_i - \mu)^2 \right)$$

(Noninformative) Prior:

$$\pi(\mu, \tau) \propto \frac{1}{\tau}$$

Posterior Distribution:

$$\pi(\mu, \tau \mid y) \propto \tau^{n/2 - 1} \exp\left( -\frac{\tau}{2} \sum_i (y_i - \mu)^2 \right)$$

Full conditionals:

$$\mu \mid \tau, y \sim N\left( \bar{y},\; \frac{1}{n\tau} \right), \qquad \tau \mid \mu, y \sim \text{Gamma}\left( \frac{n}{2},\; \frac{1}{2} \sum_i (y_i - \mu)^2 \right)$$
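These two full conditionals give a complete Gibbs sampler; a sketch on simulated data (starting values and chain length are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(3)

y = rng.normal(5.0, 2.0, size=200)   # simulated data: true mu = 5, sigma = 2
n, ybar = len(y), y.mean()

S = 5_000
mu, tau = 0.0, 1.0                   # arbitrary starting values
draws = np.empty((S, 2))
for s in range(S):
    mu = rng.normal(ybar, np.sqrt(1 / (n * tau)))              # mu | tau, y
    tau = rng.gamma(n / 2, 1 / (0.5 * ((y - mu) ** 2).sum()))  # tau | mu, y (rate -> scale)
    draws[s] = mu, tau

burned = draws[1_000:]               # discard burn-in
print(burned[:, 0].mean(), 1 / np.sqrt(burned[:, 1].mean()))   # ~5 and ~2
```

NumPy's `gamma` is parameterized by shape and scale, so the Gamma rate above is inverted when sampling.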
However, it is typically impossible to write out or sample from the full conditionals.
Let $q(\theta^* \mid \theta)$ be an arbitrary, friendly distribution we can sample from; it is called the proposal distribution. MH creates a sequence of observations $\{\theta^{(t)}\}$ as follows. Choose $\theta^{(0)}$ arbitrarily. Suppose we have generated $\theta^{(t)}$. Generate $\theta^{(t+1)}$ as follows:

- Generate a proposal $\theta^* \sim q(\theta^* \mid \theta^{(t)})$.
- Evaluate the acceptance probability $r = \min\{1, R\}$ where

$$R = \frac{\pi(\theta^* \mid y)\, q(\theta^{(t)} \mid \theta^*)}{\pi(\theta^{(t)} \mid y)\, q(\theta^* \mid \theta^{(t)})}$$

- Set

$$\theta^{(t+1)} = \begin{cases} \theta^* & \text{with probability } r \\ \theta^{(t)} & \text{with probability } 1 - r \end{cases}$$
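A minimal random-walk MH sketch; the standard-normal target and the step size are illustrative assumptions, and with a symmetric proposal the $q$ terms in $R$ cancel:

```python
import numpy as np
rng = np.random.default_rng(4)

def log_target(theta):
    return -0.5 * theta ** 2            # unnormalized log posterior (toy target)

S, step = 20_000, 1.0
theta = 0.0                             # arbitrary starting value
chain = np.empty(S)
for s in range(S):
    prop = theta + rng.normal(0.0, step)       # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        theta = prop                            # accept with probability min(1, R)
    chain[s] = theta

print(chain.mean(), chain.std())        # ~0 and ~1 for the standard normal target
```

Note that the target only needs to be known up to a constant: the normalizing constant cancels in $R$, which is what makes MH usable for posteriors.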
Let $x$ be observed and $z$ missing. The goal is to maximise the log-likelihood of the observed data:

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) = \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(x_i, z_i \mid \theta) \right]$$

We cannot push the log inside the sum because of the unobserved variables. EM tackles the problem as follows. Define the complete data log likelihood as

$$\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta)$$

This cannot be computed, since $z_i$ is unknown. Instead, define

$$Q(\theta, \theta^{t-1}) = E\left[ \ell_c(\theta) \mid \mathcal{D}, \theta^{t-1} \right]$$

where $t$ is the iteration number and $Q$ is called the auxiliary function.

Expectation (E) Step: Compute $Q(\theta, \theta^{t-1})$, which is an expectation wrt the old parameters $\theta^{t-1}$.

Maximisation (M) Step: Optimise the $Q$ function wrt $\theta$:

$$\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1})$$

For MAP estimation, the M step becomes $\theta^t = \arg\max_\theta Q(\theta, \theta^{t-1}) + \log \pi(\theta)$.
Probit regression has the form $p(y_i = 1 \mid x_i, w) = \Phi(w^T x_i)$, which can be written with a latent variable $z_i = w^T x_i + \epsilon_i$, $\epsilon_i \sim N(0, 1)$, and $y_i = \mathbb{1}(z_i > 0)$. We form the complete data log likelihood, assuming a Gaussian prior on $w$. The posterior in the E step is a truncated Gaussian, with mean

$$E[z_i \mid w] = \begin{cases} \mu_i + \dfrac{\phi(\mu_i)}{\Phi(\mu_i)} & \text{if } y_i = 1 \\[2ex] \mu_i - \dfrac{\phi(\mu_i)}{\Phi(-\mu_i)} & \text{if } y_i = 0 \end{cases}$$

where $\mu_i = w^T x_i$. In the M step, we estimate $w$ using ridge regression of $E[z]$ on $X$.
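A sketch of this EM scheme for probit (the simulated data, the ridge penalty `lam`, and the iteration count are arbitrary choices, not from the text):

```python
import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(5)

n, d, lam = 500, 3, 1.0                      # lam: arbitrary ridge penalty
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])          # invented true weights
y = (X @ w_true + rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
for _ in range(200):
    mu = X @ w
    # E step: mean of z_i ~ N(mu_i, 1) truncated to z_i > 0 (y=1) or z_i < 0 (y=0)
    ez = np.where(y == 1,
                  mu + norm.pdf(mu) / norm.cdf(mu),
                  mu - norm.pdf(mu) / norm.cdf(-mu))
    # M step: ridge regression of E[z] on X
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ ez)

print(w)   # sign pattern matches w_true
```

The M step uses only $E[z_i]$ because the quadratic complete-data objective is linear in $z_i$ once expanded, so the variance terms drop out of the argmax.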
## Hierarchical Models

Parameters in a prior are modeled as having a distribution that depends on hyperparameters $\eta$. This results in joint posteriors of the form

$$\pi(\theta, \eta \mid y) \propto p(y \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

Represented by the graphical model $\eta \to \theta \to y$.

We are typically interested in the marginal posterior of $\theta$, which is obtained by integrating the joint posterior w.r.t. $\eta$:

$$\pi(\theta \mid y) = \int \pi(\theta, \eta \mid y)\, d\eta$$

By treating $\theta$ as a latent variable, we allow data-poor observations to borrow strength from data-rich ones.
## Empirical Bayes

In hierarchical models, we need to compute the posterior on multiple layers of latent variables. For example, for a two-level model, we need

$$p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$$

We can employ a computational shortcut by approximating the posterior on the hyper-parameters with a point estimate, $p(\eta \mid \mathcal{D}) \approx \delta_{\hat{\eta}}(\eta)$, where $\hat{\eta} = \arg\max_\eta p(\eta \mid \mathcal{D})$. Since $\eta$ is usually much smaller than $\theta$ in dimensionality, we can safely use a uniform prior on $\eta$. Then the estimate becomes

$$\hat{\eta} = \arg\max_\eta p(\mathcal{D} \mid \eta) = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$$

This violates the principle that the prior should be chosen independently of the data, but it is a cheap computational trick. It produces a hierarchy of Bayesian methods in increasing order of the number of integrals performed.
Suppose we measure the number of people in various cities, $N_i$, and the number of people who died of cancer in each, $y_i$. We assume $y_i \sim \text{Bin}(N_i, \theta_i)$ and want to estimate the cancer rates $\theta_i$. The MLE approach would be to either estimate them all separately, or estimate a single pooled $\theta$ for all cities.

The hierarchical approach is to model $\theta_i \sim \text{Beta}(a, b)$ and write a joint distribution

$$p(\mathcal{D}, \theta \mid \eta) = \prod_{i} \text{Bin}(y_i \mid N_i, \theta_i)\, \text{Beta}(\theta_i \mid a, b)$$

where $\eta = (a, b)$. We can also put covariates on $\eta$.
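An empirical-Bayes sketch for this example on simulated data, fitting $(a, b)$ by moment matching on the city-level rates rather than by maximising the marginal likelihood (a simplification), then shrinking each city's MLE toward the pooled prior mean:

```python
import numpy as np
rng = np.random.default_rng(6)

n_cities = 50
N = rng.integers(500, 5_000, size=n_cities)       # city populations (simulated)
theta_true = rng.beta(2, 200, size=n_cities)      # true cancer rates (invented)
y = rng.binomial(N, theta_true)                   # observed deaths

rates = y / N                                     # per-city MLEs
m, v = rates.mean(), rates.var()
k = m * (1 - m) / v - 1                           # moment-matched prior "sample size"
a_hat, b_hat = m * k, (1 - m) * k                 # fitted Beta(a, b) hyperparameters

# Each posterior mean shrinks the city MLE toward the pooled prior mean
post = (a_hat + y) / (a_hat + b_hat + N)
print(a_hat, b_hat)
```

Each `post[i]` is a weighted average of the prior mean $\hat{a}/(\hat{a}+\hat{b})$ and the raw rate $y_i/N_i$, so data-poor cities are pulled further toward the pooled estimate.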
## Hierarchy of Bayesianity

| Method | Definition |
|---|---|
| Maximum Likelihood | $\hat{\theta} = \arg\max_\theta p(\mathcal{D} \mid \theta)$ |
| MAP Estimation | $\hat{\theta} = \arg\max_\theta p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)$ |
| Empirical Bayes (ML-II) | $\hat{\eta} = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta = \arg\max_\eta p(\mathcal{D} \mid \eta)$ |
| MAP-II | $\hat{\eta} = \arg\max_\eta \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)\, d\theta = \arg\max_\eta p(\mathcal{D} \mid \eta)\, p(\eta)$ |
| Full Bayes | $p(\theta, \eta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$ |
## Graphical Models

Any joint distribution can be represented as follows:

$$p(x_{1:V}) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_V \mid x_{1:V-1})$$

where $V$ is the number of variables [and we have dropped the parameter vector $\theta$]. It follows that, once conditional independence assumptions are encoded in a directed acyclic graph, the joint distribution can be written as

$$p(x_{1:V}) = \prod_{t=1}^{V} p\left( x_t \mid x_{\text{Pa}(t)} \right)$$

where $\text{Pa}(t)$ denotes the parent nodes of $x_t$, which are nodes that have arrows pointing to $x_t$.

$x$ and $y$ are said to be conditionally independent given $z$, written $x \perp y \mid z$, iff the conditional joint can be written as the product of the conditional marginals:

$$p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$$

If the condition $A \perp B \mid C$ holds, it must be the case that all paths from $A$ to $B$ are blocked. All paths are blocked iff every path contains a node at which either:

- Arrows on the path meet either head to tail or tail to tail at the node, and the node is in the conditioning set $C$; or
- Arrows meet head to head at the node, and neither the node nor any of its descendants is in the set $C$.
## Empirical Bayes: Batting Averages (James-Stein)
We suppose that each player's MLE value (his batting average in the first 90 tries) is a binomial proportion,

$$\text{MLE}_i \sim \frac{1}{90} \text{Bin}(90, \text{TRUTH}_i)$$

Here $\text{TRUTH}_i$ is his true average, how he would perform over an infinite number of tries; TRUTH is itself a binomial proportion, taken over an average of 370 more tries per player.

At this point there are two ways to proceed. The simplest uses a normal approximation to (7.17),

$$\text{MLE}_i \;\dot\sim\; N(\text{TRUTH}_i, \sigma_0^2)$$

where $\sigma_0^2$ is the binomial variance

$$\sigma_0^2 = \frac{\bar{p}(1 - \bar{p})}{90}$$

with $\bar{p}$ the average of the $\text{MLE}_i$'s. Letting $\bar{y} = \bar{p}$, applying (7.13), and transforming back gives the James-Stein estimates

$$\text{JS}_i = \bar{p} + \left( 1 - \frac{(N - 3)\,\sigma_0^2}{\sum_i (\text{MLE}_i - \bar{p})^2} \right) \left( \text{MLE}_i - \bar{p} \right)$$
A second approach begins with the arcsin transformation
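The James-Stein estimates above can be sketched on simulated batting data (the Beta(10, 15) truth distribution and all numbers are invented, not Efron's dataset):

```python
import numpy as np
rng = np.random.default_rng(7)

N, tries = 18, 90
truth = rng.beta(10, 15, size=N)                 # true averages near 0.4 (invented)
mle = rng.binomial(tries, truth) / tries         # each player's first-90-tries average

pbar = mle.mean()
sigma0_sq = pbar * (1 - pbar) / tries            # binomial variance sigma_0^2
ss = ((mle - pbar) ** 2).sum()
shrink = 1 - (N - 3) * sigma0_sq / ss            # common shrinkage factor
js = pbar + shrink * (mle - pbar)                # James-Stein estimates

print(shrink)                                    # shrinkage factor in (0, 1)
```

Every estimate is pulled toward the grand mean by the same factor, which is what lets extreme early-season averages borrow strength from the rest of the league.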