This note is a high-fidelity Markdown migration of the Dependent Data: Time Series and Spatial Statistics chapter from the LaTeX source.
Parent map: index
Prerequisites: probability-and-mathstats, linear-regression, maximum-likelihood-and-machine-learning
Concept map
flowchart TD
    A[Dependent Data] --> B[Time Series]
    B --> C[Stationarity]
    C --> D[Ergodicity]
    B --> E[AR MA ARMA]
    B --> F[Unit Root]
    F --> G[Cointegration]
    B --> H[HAC Inference]
    A --> I[Spatial Statistics]
    I --> J[Kriging]
    I --> K[Spatial Autocorrelation]
    K --> L[Variogram]
    I --> M[Spatial Regression]
    I --> N[GMRF GP CAR]
Dependent Data: Time series and spatial statistics
Time Series
A time series is a sequence of data points observed over time. In a random sample, points are iid, so the joint distribution . In time series, this is clearly violated, since observations that are temporally close to each other tend to be more similar.
A stochastic process is a sequence of random variables \(\{Y_t\}\) indexed by elements \(t\) in a set of indices \(T\). Hypothetical repeated realisations of a stochastic process look like \(\{y_t^{(1)}\}_{t \in T}, \{y_t^{(2)}\}_{t \in T}, \dots\)
The index set \(T\) may be either countable, in which case we get a discrete-time process, or uncountable, in which case we get a continuous-time process.
State Space
We assume \(Y_t \in S\) for some set \(S\). Then, \(S\) is called the state space of the stochastic process.
Consider a random process \(\{Y_t\}\) and an increasing sequence of information sets, i.e. a collection of \(\sigma\)-fields \(\{\mathcal{F}_t\}\) s.t. \(\mathcal{F}_t \subseteq \mathcal{F}_{t+1}\). If \(Y_t\) belongs to the information set \(\mathcal{F}_t\), is absolutely integrable [i.e. \(\mathbb{E}|Y_t| < \infty\)], and
\[\mathbb{E}[Y_{t+1} \mid \mathcal{F}_t] = Y_t,\]
then \(\{Y_t\}\) is called a martingale. In words, the conditional expected value of the next observation, given all the past observations, is equal to the most recent observation.
The autocovariance of \(Y_t\) is the covariance between \(Y_t\) and its \(j\)-th lagged value:
\[\gamma_j = \mathrm{Cov}(Y_t, Y_{t-j}) = \mathbb{E}\left[ (Y_t - \mu)(Y_{t-j} - \mu) \right]\]
The variance-covariance matrix of \((Y_1, \dots, Y_n)\) has Toeplitz form:
\[\begin{pmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{n-1} & \gamma_{n-2} & \cdots & \gamma_0 \end{pmatrix}\]
The \(j\)-th order autocorrelation coefficient is \(\rho_j = \gamma_j / \gamma_0\).
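The lag-\(j\) autocovariance and autocorrelation are easy to estimate from a single realisation. A minimal NumPy sketch, using an illustrative MA(1) series with \(\theta = 0.5\) (whose theoretical first autocorrelation is \(\theta/(1+\theta^2) = 0.4\)):

```python
import numpy as np

def acov(y, j):
    """Sample autocovariance at lag j: mean of (y_t - ybar)(y_{t-j} - ybar)."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    return np.mean((y[j:] - ybar) * (y[:len(y) - j] - ybar))

def acorr(y, j):
    """Sample autocorrelation rho_j = gamma_j / gamma_0."""
    return acov(y, j) / acov(y, 0)

rng = np.random.default_rng(0)
eps = rng.standard_normal(10_000)
y = np.convolve(eps, [1.0, 0.5], mode="valid")   # MA(1) with theta = 0.5

# theoretical rho_1 = theta / (1 + theta^2) = 0.4; higher lags ~ 0
print(acorr(y, 1), acorr(y, 5))
```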
A random process is said to be (strictly) stationary if the distribution functions of \((Y_{t_1}, \dots, Y_{t_k})\) and \((Y_{t_1 + h}, \dots, Y_{t_k + h})\) are the same for every \(k\), every set of dates \(t_1, \dots, t_k\), and every shift \(h\).
A process is said to be covariance (or weakly) stationary if
\[\mathbb{E}[Y_t] = \mu \quad \text{and} \quad \mathrm{Cov}(Y_t, Y_{t-j}) = \gamma_j \quad \forall\, t, j,\]
i.e. neither the mean nor the autocovariances depend on the date \(t\): stationary expectation, variance, and covariances. Most relevant variables aren’t stationary, but their detrended or first-differenced versions may be.
If \(\{Y_t\}\) is a Markov process,
\[\Pr(Y_{t+1} \leq y \mid Y_t, Y_{t-1}, \dots) = \Pr(Y_{t+1} \leq y \mid Y_t),\]
that is, the conditional distribution of \(Y_{t+1}\) given the entire history depends only on \(Y_t\).
Markov Chain
A Markov chain is simply a Markov process in which the state space is a countable set. Since a Markov chain is a Markov process, the conditional distribution of \(Y_{t+1}\) depends only on \(Y_t\). The conditional distribution is often represented by a transition matrix \(P\) where
\[P_{ij} = \Pr(Y_{t+1} = j \mid Y_t = i)\]
If \(P_{ij}\) is the same for all \(t\), we say the Markov chain has stationary transition probabilities.
A stationary process is ergodic if any two variables positioned far apart in the sequence are almost independently distributed.
\(\{Y_t\}\) is ergodic if, for any two bounded functions \(f: \mathbb{R}^k \to \mathbb{R}\) and \(g: \mathbb{R}^\ell \to \mathbb{R}\),
\[\lim_{n \to \infty} \left| \mathbb{E}\left[ f(Y_t, \dots, Y_{t+k-1})\, g(Y_{t+n}, \dots, Y_{t+n+\ell-1}) \right] \right| = \left| \mathbb{E}\, f(Y_t, \dots, Y_{t+k-1}) \right| \cdot \left| \mathbb{E}\, g(Y_{t+n}, \dots, Y_{t+n+\ell-1}) \right|\]
A sufficient condition for ergodicity is that \(\{Y_t\}\) be covariance stationary and \(\sum_{j=0}^{\infty} |\gamma_j| < \infty\).
Ergodic processes obey an ergodic theorem:
\[\frac{1}{n} \sum_{t=1}^{n} Y_t \xrightarrow{a.s.} \mathbb{E}[Y_t]\]
i.e. the time average of a single realisation converges to the population mean. This permits us to swap \(\frac{1}{n}\sum_t\)s for \(\mathbb{E}\)s and derive asymptotic theory with dependent observations, such as LLNs and CLTs.
A family of r.v.s \(\{W(t)\}\) indexed by a continuous variable \(t\) over \([0, \infty)\) is a Brownian motion iff
- \(W(0) = 0\) and sample paths are continuous a.s.,
- increments \(W(t_2) - W(t_1), W(t_4) - W(t_3), \dots\) over an arbitrary collection of disjoint intervals are independent r.v.s,
- \(W(t) - W(s) \sim N(0, t - s)\) for \(t > s\).
White noise is a sequence \(\{\varepsilon_t\}\) whose elements have mean zero and variance \(\sigma^2\), and for which the \(\varepsilon_t\)'s are uncorrelated over time:
\[\mathbb{E}[\varepsilon_t] = 0, \quad \mathbb{E}[\varepsilon_t^2] = \sigma^2, \quad \mathbb{E}[\varepsilon_t \varepsilon_s] = 0 \ \text{for} \ t \neq s\]
A moving average of order \(q\), MA(\(q\)), is a weighted average of the \(q\) most recent values of a white noise, defined as
\[Y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\]
An autoregressive process of order \(p\), AR(\(p\)), gives \(Y_t\) as a linear combination of \(p\) lags of itself and one white noise term:
\[Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t\]
ARMA(\(p, q\)) combines AR(\(p\)) and MA(\(q\)):
\[Y_t = c + \phi_1 Y_{t-1} + \dots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\]
Consider AR(1): \(Y_t = \phi Y_{t-1} + \varepsilon_t\). Since this holds at \(t\), it holds at \(t - 1\): \(Y_{t-1} = \phi Y_{t-2} + \varepsilon_{t-1}\). Substitute into the original to get \(Y_t = \phi^2 Y_{t-2} + \phi \varepsilon_{t-1} + \varepsilon_t\). Repeat ad infinitum to obtain, as long as \(|\phi| < 1\),
\[Y_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}\]
In other words, AR(1) = MA(\(\infty\)); they are different representations of the same underlying stochastic process.
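The AR(1) = MA(\(\infty\)) equivalence can be checked numerically: simulate the recursion and compare it with a truncated \(\sum_j \phi^j \varepsilon_{t-j}\) built from the same shocks. A sketch; \(\phi = 0.8\), the burn-in length, and the truncation point \(J = 100\) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, T = 0.8, 200
eps = rng.standard_normal(T + 500)          # 500 burn-in shocks

# AR(1) by recursion: y_t = phi * y_{t-1} + eps_t
y = np.zeros(T + 500)
for t in range(1, T + 500):
    y[t] = phi * y[t - 1] + eps[t]
y = y[500:]

# Truncated MA(inf): y_t ~ sum_{j=0}^{J} phi^j eps_{t-j}
J = 100
weights = phi ** np.arange(J + 1)
y_ma = np.array([weights @ eps[t - J:t + 1][::-1] for t in range(500, 500 + T)])

print(np.max(np.abs(y - y_ma)))             # tiny: remote shocks carry weight phi^{J+1}
```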
Wold representation: all covariance-stationary time series processes can be represented by / decomposed into a deterministic component and a stochastic MA(\(\infty\)) component:
\[Y_t = \kappa_t + \sum_{j=0}^{\infty} \psi_j \varepsilon_{t-j}, \qquad \psi_0 = 1, \quad \sum_{j=0}^{\infty} \psi_j^2 < \infty\]
In a stationary process, \(\mathbb{E}[Y_t] = \mu\) for all \(t\), which is seldom true in practice. A less restrictive assumption that allows for nonstationarity is to specify the mean as a function of time.
A random walk (\(\phi = 1\)) is a process such that
\[Y_t = Y_{t-1} + \varepsilon_t\]
= an AR(1) process with a unit root. Rewrite as
\[\Delta Y_t = \varepsilon_t \quad \text{or} \quad Y_t = Y_0 + \sum_{s=1}^{t} \varepsilon_s\]
Random walk with drift: \(Y_t = \delta + Y_{t-1} + \varepsilon_t\).
For the AR(1) model
\[Y_t = \phi Y_{t-1} + \varepsilon_t,\]
test \(H_0: \phi = 1\) against \(H_1: \phi < 1\); equivalently, estimate \(\Delta Y_t = \rho Y_{t-1} + \varepsilon_t\) with \(\rho = \phi - 1\) and test \(\rho = 0\). The distribution of the \(t\)-statistic under the null is non-standard: the CLT is not valid. Tests to use: Dickey-Fuller, Augmented Dickey-Fuller, Phillips-Perron.
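A sketch of the Dickey-Fuller idea in NumPy: regress \(\Delta y_t\) on \(y_{t-1}\) and form the usual \(t\)-ratio, remembering that it must be compared against Dickey-Fuller (not normal) critical values. In practice one would use a library implementation (e.g. `adfuller` in statsmodels); the hand-rolled version below omits the constant and trend and is purely illustrative:

```python
import numpy as np

def df_tstat(y):
    """t-ratio from the Dickey-Fuller regression dy_t = rho * y_{t-1} + e_t
    (no constant, no trend); compare with DF critical values, not N(0,1)."""
    dy, ylag = np.diff(y), y[:-1]
    rho_hat = (ylag @ dy) / (ylag @ ylag)
    resid = dy - rho_hat * ylag
    s2 = resid @ resid / (len(dy) - 1)
    return rho_hat / np.sqrt(s2 / (ylag @ ylag))

rng = np.random.default_rng(2)
walk = np.cumsum(rng.standard_normal(500))   # unit root: rho = 0
ar1 = np.zeros(500)
for t in range(1, 500):                      # stationary: phi = 0.5
    ar1[t] = 0.5 * ar1[t - 1] + rng.standard_normal()

# the stationary series gives a large negative statistic; the random walk
# typically does not (the 5% DF critical value is about -1.95 in this case)
print(df_tstat(walk), df_tstat(ar1))
```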
Let \(X_t, Y_t \sim I(1)\). \(X_t\) and \(Y_t\) are said to be cointegrated if there exists \(\beta\) such that \(Y_t - \beta X_t \sim I(0)\). For example, let
\[Y_t = \beta X_t + u_t, \qquad X_t = X_{t-1} + v_t\]
where \((u_t, v_t)\) is white noise. Then \(X_t, Y_t \sim I(1)\), but \(Y_t - \beta X_t = u_t \sim I(0)\), with cointegration vector \((1, -\beta)\).
The Hodrick-Prescott (HP) filter decomposes an observed time series \(y_t\) into a trend \(\tau_t\) and a stationary component so that the trend minimises
\[\sum_{t=1}^{T} (y_t - \tau_t)^2 + \lambda \sum_{t=2}^{T-1} \left[ (\tau_{t+1} - \tau_t) - (\tau_t - \tau_{t-1}) \right]^2\]
\(\lambda\) is a tuning parameter. In quarterly data, \(\lambda = 1600\).
Regression with time series
The basic assumption in conventional OLS with time series is no serial correlation in the errors: \(\mathbb{E}[\varepsilon_t \varepsilon_s] = 0\) for \(t \neq s\). Equivalently, \(\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I\), where \(\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'\). The second classical assumption is exogeneity, \(\mathbb{E}[\varepsilon_t \mid X] = 0\).
\(\mathbb{E}[\varepsilon_t \varepsilon_s] \neq 0\) for \(t \neq s\) is called autocorrelation. Fix: the Newey-West HAC consistent variance estimator, with ‘meat’
\[\hat{S} = \hat{\Gamma}_0 + \sum_{j=1}^{L} \left( 1 - \frac{j}{L+1} \right) \left( \hat{\Gamma}_j + \hat{\Gamma}_j' \right), \qquad \hat{\Gamma}_j = \frac{1}{n} \sum_{t=j+1}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_{t-j}\, x_t x_{t-j}'\]
with the variance estimated the normal (sandwich) way:
\[\widehat{\mathrm{Var}}(\hat{\beta}) = n \left( X'X \right)^{-1} \hat{S} \left( X'X \right)^{-1}\]
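A minimal NumPy sketch of the Newey-West sandwich (the AR(1) error design and the lag length \(L = 8\) are illustrative assumptions; the \(1/n\) scale factors cancel in the sandwich, so they are omitted):

```python
import numpy as np

def newey_west_vcov(X, resid, L):
    """HAC sandwich (X'X)^{-1} S (X'X)^{-1} with Bartlett-kernel meat
    S = Gamma_0 + sum_{j=1}^{L} (1 - j/(L+1)) (Gamma_j + Gamma_j')."""
    u = X * resid[:, None]                 # score contributions x_t * e_t
    S = u.T @ u
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)
        G = u[j:].T @ u[:-j]               # Gamma_j = sum_t u_t u_{t-j}'
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv

# usage sketch: OLS with serially correlated (AR(1)) errors
rng = np.random.default_rng(4)
n = 1000
x = rng.standard_normal(n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.standard_normal()
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + e
beta = np.linalg.lstsq(X, y, rcond=None)[0]
V = newey_west_vcov(X, y - X @ beta, L=8)
se_hac = np.sqrt(np.diag(V))               # HAC standard errors
```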
Consider the ADL(1,1) model
\[y_t = \alpha + \phi y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t\]
Subtracting \(y_{t-1}\) from both sides, and subtracting and adding \(\beta_0 x_{t-1}\) on the r.h.s., we get the error-correction form
\[\Delta y_t = \alpha + \beta_0 \Delta x_t - (1 - \phi)\left( y_{t-1} - \theta x_{t-1} \right) + \varepsilon_t\]
where
\[\theta = \frac{\beta_0 + \beta_1}{1 - \phi}\]
is the long-run effect.
A Quandt likelihood ratio (QLR) test begins with no knowledge of when the trend break occurs [although researchers typically know of the timing for substantive reasons], and sequentially estimates the following model:
\[\Delta y_t = \alpha + \delta D_t(\tau) + \varepsilon_t\]
where \(\Delta y_t\) is the first difference of the outcome, and \(D_t(\tau)\) is an indicator variable equal to zero for all years before \(\tau\) and one for all subsequent years. The researcher varies \(\tau\) and tests the null that \(\delta = 0\), and the largest F-statistic is used to determine the best possible break point. Use Andrews (2003) critical values to account for multiple testing.
Spatial Statistics
A spatial stochastic process is a collection of random variables indexed by location: \(\{Y(s) : s \in S\}\), where \(S\) is either a continuous surface or a finite set of discrete locations.
For each location \(s\), \(Y(s)\) is a random variable, and thus needs to be modeled. The basic approach is to assume \(\mathbb{E}[Y(s)]\) exists, and decompose
\[Y(s) = \mu(s) + \varepsilon(s)\]
into a mean function \(\mu(\cdot)\) and a stochastic error process \(\varepsilon(\cdot)\).
Kriging - modeling
Main reference: Christensen (2019, ch. 8).
Assume a linear structure for \(\mu(\cdot)\): take \(x_1(\cdot), \dots, x_p(\cdot)\), known functions of the locations, s.t.
\[\mu(s) = \sum_{j=1}^{p} \beta_j x_j(s)\]
A special case of this is the ordinary kriging model where
\[\mu(s) = \mu\]
for unknown \(\mu\). The most basic model is simple kriging, where
\[\mu(s) = \mu_0(s)\]
with \(\mu_0(\cdot)\) known.
Assume the universal kriging model holds, that we have data at locations \(s_1, \dots, s_n\), and that we wish to predict the value of \(y(s_0)\) at a new location \(s_0\). The model can be written
\[Y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Cov}(\varepsilon) = \Sigma\]
Let \(x_0 = (x_1(s_0), \dots, x_p(s_0))'\) and \(\sigma_0 = \mathrm{Cov}(Y, y(s_0))\).
The best linear unbiased predictor (BLUP) of \(y(s_0)\) is
\[\hat{y}(s_0) = x_0' \hat{\beta}_{GLS} + \sigma_0' \Sigma^{-1} \left( Y - X \hat{\beta}_{GLS} \right)\]
where \(\hat{\beta}_{GLS} = (X' \Sigma^{-1} X)^{-1} X' \Sigma^{-1} Y\).
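A sketch of the universal-kriging BLUP under an assumed exponential covariance; the covariance parameters, locations, and data below are all illustrative, and a tiny jitter stabilises the solve. With no nugget, the predictor interpolates exactly at observed sites:

```python
import numpy as np

def exp_cov(S1, S2, sigma2=1.0, rho=1.0):
    """Exponential covariance between two sets of 2-D locations (assumed form)."""
    d = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=2)
    return sigma2 * np.exp(-d / rho)

def universal_krige(X, S, y, x0, s0, sigma2=1.0, rho=1.0):
    """BLUP: y_hat(s0) = x0' b_gls + sig0' Sigma^{-1} (y - X b_gls)."""
    Sigma = exp_cov(S, S, sigma2, rho) + 1e-10 * np.eye(len(y))  # jitter
    sig0 = exp_cov(S, s0[None, :], sigma2, rho)[:, 0]
    Si = np.linalg.inv(Sigma)
    b = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)              # GLS coefficients
    return x0 @ b + sig0 @ Si @ (y - X @ b)

rng = np.random.default_rng(5)
S = rng.uniform(0, 1, size=(30, 2))       # observed locations
X = np.column_stack([np.ones(30), S])     # intercept + coordinates as covariates
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(30)
s0 = np.array([0.5, 0.5])                 # prediction location
x0 = np.array([1.0, 0.5, 0.5])
yhat = universal_krige(X, S, y, x0, s0)
```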
Spatial Autocorrelation: Modelling
Spatial autocorrelation is expressed as
\[\mathrm{Cov}\left( Y(s_i), Y(s_j) \right) \neq 0 \quad \text{for } i \neq j\]
Covariance is often modelled in terms of an unknown parameter \(\theta\), in which case we write \(\Sigma(\theta)\). Assumptions made about \(\Sigma(\theta)\) include:
- second-order stationarity,
- strict stationarity,
- intrinsic stationarity,
- increment stationarity,
- isotropy.
Covariance functions can be modelled in three basic ways:
- specify a functional form for the stochastic process generating \(Y(s)\), and derive covariance from that process,
- model covariance directly as a function of a small number of parameters,
- leave covariance unspecified and estimate nonparametrically.
A process is strictly stationary if for all \(n\), locations \(s_1, \dots, s_n\), Borel sets \(B_1, \dots, B_n\), and shifts \(h\),
\[\Pr\left( Y(s_1 + h) \in B_1, \dots, Y(s_n + h) \in B_n \right) = \Pr\left( Y(s_1) \in B_1, \dots, Y(s_n) \in B_n \right)\]
This implies translation invariance of the joint law. In particular, \(\mathbb{E}[Y(s)]\) is constant and \(\mathrm{Cov}(Y(s), Y(s+h))\) depends only on \(h\).
If, in addition, the finite-dimensional distributions are multivariate Gaussian, the process is a Gaussian Process.
Second-order (weak) stationarity imposes the same constant mean and covariance depending only on distance, but does not require full strict stationarity.
Increment-stationarity requires invariant increment laws:
\[Y(s + h) - Y(s) \overset{d}{=} Y(s' + h) - Y(s') \quad \text{for all } s, s', h\]
Brownian motion is increment-stationary but not strictly stationary.
For increment-stationary processes, the semivariogram is
\[\gamma(h) = \frac{1}{2} \mathrm{Var}\left( Y(s + h) - Y(s) \right)\]
The variogram is \(2\gamma(h)\). Under increment-stationarity, \(\gamma(h)\) does not depend on the base location \(s\).
An intrinsically-stationary process satisfies the constant-mean restriction and this semivariogram invariance condition. All second-order stationary processes are intrinsically stationary, but not vice versa.
For a linear model, stipulate a nonnegative definite weighting matrix, and fit
\[Y = X\beta + \varepsilon\]
to obtain residuals \(\hat{\varepsilon}\). For any vector \(h\), there is a finite number \(N(h)\) of pairs of observations for which \(s_i - s_j = h\). For each of these pairs, list the corresponding residual pairs \((\hat{\varepsilon}_i, \hat{\varepsilon}_j)\). If \(N(h) > 0\), the traditional empirical covariance estimator is
\[\hat{C}(h) = \frac{1}{N(h)} \sum_{(i,j):\, s_i - s_j = h} \hat{\varepsilon}_i \hat{\varepsilon}_j\]
The traditional empirical semivariogram estimator in ordinary kriging (no covariates) is
\[\hat{\gamma}(h) = \frac{1}{2 N(h)} \sum_{(i,j):\, s_i - s_j = h} \left( y_i - y_j \right)^2\]
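In practice the estimator is binned: pairs whose separation distance falls in the same bin are pooled. A NumPy sketch; the iid-noise data are an illustrative choice for which the semivariogram should be flat at the marginal variance (here 1):

```python
import numpy as np

def empirical_semivariogram(S, y, bins):
    """gamma_hat(bin) = (1 / 2N) * sum over pairs in the distance bin of (y_i - y_j)^2."""
    d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    iu = np.triu_indices(len(y), k=1)              # count each pair once
    dist, sqdiff = d[iu], (y[iu[0]] - y[iu[1]]) ** 2
    gamma = np.empty(len(bins) - 1)
    for b in range(len(bins) - 1):
        mask = (dist >= bins[b]) & (dist < bins[b + 1])
        gamma[b] = 0.5 * sqdiff[mask].mean() if mask.any() else np.nan
    return gamma

rng = np.random.default_rng(6)
S = rng.uniform(0, 10, size=(400, 2))
y = rng.standard_normal(400)                       # no spatial structure
bins = np.linspace(0, 5, 6)
g = empirical_semivariogram(S, y, bins)            # roughly flat at 1
```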
A second-order stationary process is said to be isotropic if \(C(h)\) depends on \(h\) only through its length:
\[C(h) = C(\lVert h \rVert)\]
An intrinsically stationary process is isotropic if
\[\gamma(h) = \gamma(\lVert h \rVert)\]
A parsimonious specification of the covariance matrix in terms of a small number of parameters is typically presumed, e.g.
\[\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 f(d_{ij}; \theta)\]
where \(\varepsilon\) are residuals, \(\sigma^2\) is the error variance, \(d_{ij}\) is the distance between \(i\) and \(j\), and \(f\) is a distance-decay function such that \(f(0; \theta) = 1\) and \(f(d; \theta) \to 0\) as \(d \to \infty\), with \(\theta\) being a parameter vector.
The generalised Moran’s I is a weighted, scaled cross-product
\[I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{\sum_i (y_i - \bar{y})^2}\]
where \(W = (w_{ij})\) is a spatial weight matrix. Its expected value under the null of no spatial autocorrelation is \(\mathbb{E}[I] = -\frac{1}{n-1}\).
A test for Moran’s I involves shuffling the values across locations and recomputing \(I\), \(M\) times. This produces a randomisation distribution under \(H_0\) of no spatial autocorrelation. A Monte Carlo p-value is
\[\hat{p} = \frac{1 + \#\{m : I_m \geq I_{\text{obs}}\}}{1 + M}\]
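A sketch of Moran's I with a permutation (randomisation) p-value, on an assumed 10×10 rook-contiguity grid with a smooth spatial gradient in the outcome:

```python
import numpy as np

def morans_i(y, W):
    """I = (n / sum W) * (z' W z / z'z), z = y - ybar."""
    z = y - y.mean()
    return len(y) / W.sum() * (z @ W @ z) / (z @ z)

def moran_perm_test(y, W, M=999, seed=0):
    """Permutation test: shuffle values across locations M times;
    p = (1 + #{I_perm >= I_obs}) / (1 + M)."""
    rng = np.random.default_rng(seed)
    i_obs = morans_i(y, W)
    i_perm = np.array([morans_i(rng.permutation(y), W) for _ in range(M)])
    return i_obs, (1 + np.sum(i_perm >= i_obs)) / (1 + M)

# rook-neighbour weights on a 10x10 grid
n_side = 10
n = n_side ** 2
W = np.zeros((n, n))
for i in range(n):
    r, c = divmod(i, n_side)
    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < n_side and 0 <= cc < n_side:
            W[i, rr * n_side + cc] = 1.0

xs, ys = np.meshgrid(np.arange(n_side), np.arange(n_side))
y = (xs + ys).ravel().astype(float)      # smooth gradient: strong autocorrelation
i_obs, pval = moran_perm_test(y, W)
```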
Spatial Linear Regression
A simple spatial (autoregressive) regression is
\[y = \rho W y + X\beta + \varepsilon\]
Solving for \(y\), its reduced form is
\[y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1} \varepsilon\]
The spatial lag term \(Wy\) induces correlation between the error and explanatory variables, and thus must be treated as an endogenous variable.
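The reduced form can be verified mechanically: build a weight matrix, solve the linear system, and check that the structural equation holds. The ring-shaped \(W\) and the parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 50, 0.4
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # row-standardised ring

X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([1.0, 2.0])
eps = rng.standard_normal(n)

# reduced form: y = (I - rho W)^{-1} (X beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
# y now satisfies the structural equation y = rho W y + X beta + eps,
# and W y is a function of eps, which is the source of the endogeneity
```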
A spatial error model is simply a linear model with a non-spherical, but typically parametric, structure in the error covariance matrix.
A covariance function decomposes into a systematic part and idiosyncratic noise as follows:
\[\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \rho(d_{ij}) + \tau^2 \mathbf{1}(i = j)\]
where \(\rho(\cdot)\) is a correlation function and \(d_{ij}\) is the distance between points \(i\) and \(j\).
Kelly recommends using a Whittle-Matern function, defined below. These parameters can be fitted on the error distribution to estimate the covariance matrix.
A covariance function describes the joint variability between a stochastic process at two locations and . This covariance function is vital in spatial prediction. The fields package includes common parametric covariance families (e.g. exponential and Matern) as well as nonparametric models (e.g. radial and tensor basis functions).
When modeling we are often forced to make simplifying assumptions.
- Stationarity assumes we can represent the covariance function as
\[\mathrm{Cov}(Y(s), Y(s')) = C(h)\]
for some function \(C\), where \(h = s - s'\).
- Isotropy assumes we can represent the covariance function as
\[\mathrm{Cov}(Y(s), Y(s')) = C(\lVert s - s' \rVert)\]
for some function \(C\), where \(\lVert \cdot \rVert\) is a vector norm.
Exponential:
\[C(d) = \sigma^2 \exp(-d / \rho)\]
Matern:
\[C(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d}{\rho} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d}{\rho} \right)\]
where \(K_{\nu}\) is a modified Bessel function of the second kind, of order \(\nu\).
The Matern covariance depends on \((\rho, \sigma^2, \tau^2, \nu)\), while the exponential depends on \((\rho, \sigma^2, \tau^2)\), where
- \(\rho\): the range of the process, beyond which observations become (effectively) uncorrelated
- \(\sigma^2\): marginal variance / ‘sill’
- \(\tau^2\): small-scale variation such as measurement error (‘nugget’)
- \(\nu\): smoothness
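A sketch of the Matern family using `scipy.special.kv`; a standard sanity check is that \(\nu = 1/2\) recovers the exponential covariance \(\sigma^2 \exp(-d/\rho)\):

```python
import numpy as np
from scipy.special import gamma as gamma_fn, kv

def matern_cov(d, sigma2=1.0, rho=1.0, nu=0.5):
    """Matern: sigma2 * 2^{1-nu}/Gamma(nu) * (sqrt(2 nu) d / rho)^nu * K_nu(...)."""
    d = np.asarray(d, dtype=float)
    out = np.full(d.shape, sigma2)        # C(0) = sigma2 (limit as d -> 0)
    pos = d > 0
    u = np.sqrt(2.0 * nu) * d[pos] / rho
    out[pos] = sigma2 * (2.0 ** (1.0 - nu) / gamma_fn(nu)) * u ** nu * kv(nu, u)
    return out

d = np.linspace(0.0, 3.0, 50)
c_half = matern_cov(d, nu=0.5)            # equals exp(-d) for sigma2 = rho = 1
```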
Here, \(W\) is a weight matrix (typically row-standardised), so \(Wy\) is a spatial lag of \(y\). In spatial econometrics, the general nesting form
\[y = \rho W y + X\beta + W X \gamma + u, \qquad u = \lambda W u + \varepsilon\]
nests many popular regressions:
- Spatially autoregressive (SAR) model: \(y = \rho W y + X\beta + \varepsilon\)
- Spatially lagged \(X\) (SLX): \(y = X\beta + W X \gamma + \varepsilon\)
- Spatial Durbin model: \(y = \rho W y + X\beta + W X \gamma + \varepsilon\)
- Spatial error model: \(y = X\beta + u\), \(u = \lambda W u + \varepsilon\)
In the Social Interactions literature (e.g., Manski 1993), the above expression is written in the form of conditional expectations:
\[y_i = \alpha + \rho\, \mathbb{E}[y \mid g_i] + x_i'\beta + \mathbb{E}[x \mid g_i]'\gamma + u_i\]
where \(g_i\) denotes \(i\)'s reference group. In practice, the expectations are replaced with empirical counterparts (group means \(\bar{y}_{g_i}, \bar{x}_{g_i}\)) and so on, so the estimation steps are isomorphic.
Define unobservables as \(u\), and assume they are uncorrelated with observables \(X\); that is, there is no sorting and no omitted spatial variables. Then, we can write
\[y = \rho W y + X\beta + W X \gamma + u\]
Premultiplying by \(W\) gives
\[W y = \rho W^2 y + W X \beta + W^2 X \gamma + W u\]
This shows that \(Wy\) is a function of \(u\), i.e. \(\mathrm{Cov}(Wy, u) \neq 0\), and least squares estimates of the above regression are biased.
If we assume \(W\) is idempotent (\(W^2 = W\), by constructing a block-diagonal, transitive matrix), we can simplify the above expression to
\[W y = \frac{1}{1 - \rho} \left( W X (\beta + \gamma) + W u \right)\]
In summary, \(\rho\) and \(\gamma\) cannot be separately identified from the composite parameter \(\frac{\beta + \gamma}{1 - \rho}\). This is the reflection problem discussed by Manski (1993).
Spatial Modelling
Based on Rue and Held (2005) and lecture notes.
\(x\) and \(y\) are conditionally independent given \(z\) if, for a given value of \(z\), learning \(y\) gives one no additional information about \(x\). The density representation is therefore
\[\pi(x, y \mid z) = \pi(x \mid z)\, \pi(y \mid z),\]
which is a simplification of the general representation \(\pi(x, y \mid z) = \pi(x \mid z)\, \pi(y \mid x, z)\). Equivalently, \(x \perp y \mid z\) iff
\[\pi(x, y, z) = f(x, z)\, g(y, z)\]
for some functions \(f\) and \(g\).
Consider a first-order autoregression \(x_t = \phi x_{t-1} + \varepsilon_t\), \(\varepsilon_t \sim N(0, 1)\). Its joint density can be re-expressed as
\[\pi(x) = \pi(x_1) \prod_{t=2}^{n} \pi(x_t \mid x_{t-1})\]
So, for \(1 < t < n\),
\[\pi(x_t \mid x_{-t}) = \pi(x_t \mid x_{t-1}, x_{t+1})\]
In addition to the conditional distributions, also assume the marginal distribution of \(x_1\) to be the stationary distribution of this process. Then, the joint distribution of \(x = (x_1, \dots, x_n)'\) is
\[\pi(x) \propto \exp\left( -\frac{1}{2} x' Q x \right)\]
where \(Q\) is a precision matrix of the tridiagonal form
\[Q = \begin{pmatrix} 1 & -\phi & & & \\ -\phi & 1 + \phi^2 & -\phi & & \\ & \ddots & \ddots & \ddots & \\ & & -\phi & 1 + \phi^2 & -\phi \\ & & & -\phi & 1 \end{pmatrix}\]
This tridiagonal form is due to the fact that \(x_t \perp x_s \mid x_{-\{t,s\}}\) whenever \(|t - s| > 1\). This is generally true for any GMRF: \(Q_{ij} = 0 \iff x_i \perp x_j \mid x_{-\{i,j\}}\).
While the conditional independence structure is readily apparent from the precision matrix, it isn’t evident in the covariance matrix \(\Sigma = Q^{-1}\), which is completely dense with entries \(\Sigma_{ij} \propto \phi^{|i - j|}\). Entries of the covariance matrix only give direct information about the marginal dependence structure, not the conditional one.
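The AR(1) example can be checked numerically: build the tridiagonal precision and confirm that its inverse is the familiar dense AR(1) covariance \(\phi^{|i-j|} / (1 - \phi^2)\) (unit innovation variance assumed):

```python
import numpy as np

phi, n = 0.7, 8

# tridiagonal precision of a stationary AR(1), unit innovation variance:
# diag = (1, 1 + phi^2, ..., 1 + phi^2, 1), off-diagonal = -phi
Q = np.zeros((n, n))
for t in range(n):
    Q[t, t] = 1.0 + phi ** 2 if 0 < t < n - 1 else 1.0
    if t > 0:
        Q[t, t - 1] = Q[t - 1, t] = -phi

Sigma = np.linalg.inv(Q)                  # completely dense
i, j = np.indices((n, n))
Sigma_theory = phi ** np.abs(i - j) / (1.0 - phi ** 2)
```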
A spatial process is said to follow a Gaussian process if any realisation at a finite number of locations \(s_1, \dots, s_n\) follows an \(n\)-variate Gaussian. More precisely, let \(\mu(s)\) denote a mean function returning a mean at location \(s\) (typically assumed to be linear in covariates \(x(s)\)) and \(C(s, s')\) denote a covariance function. Then \(Y = (Y(s_1), \dots, Y(s_n))'\) follows a spatial Gaussian process, and has density
\[\pi(y) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (y - \mu)' \Sigma^{-1} (y - \mu) \right)\]
where \(\mu = (\mu(s_1), \dots, \mu(s_n))'\) is the mean vector and \(\Sigma_{ij} = C(s_i, s_j)\) is the covariance matrix. Evaluating this density requires \(O(n^3)\) operations and \(O(n^2)\) memory, which means it does not scale well to large datasets. See Heaton et al. (2019) for an overview of alternatives.
Let \(x_1, \dots, x_n\) be associated with some property of points (typically location), with no natural ordering of the indices. The joint density of a zero-mean GMRF can be specified through each of the full conditionals
\[x_i \mid x_{-i} \sim N\left( \sum_{j \neq i} \beta_{ij} x_j,\ \kappa_i^{-1} \right)\]
These are called CAR (conditionally autoregressive) models. The associated precision matrix is
\[Q_{ii} = \kappa_i, \qquad Q_{ij} = -\kappa_i \beta_{ij} \quad (i \neq j),\]
which is required to be symmetric (\(\kappa_i \beta_{ij} = \kappa_j \beta_{ji}\)) and positive-definite.
A random vector \(x = (x_1, \dots, x_n)'\) is called a GMRF wrt a labelled graph \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\) with mean \(\mu\) and precision matrix \(Q \succ 0\) iff its density has the form
\[\pi(x) = (2\pi)^{-n/2} |Q|^{1/2} \exp\left( -\frac{1}{2} (x - \mu)' Q (x - \mu) \right)\]
and \(Q_{ij} \neq 0 \iff \{i, j\} \in \mathcal{E}\) for all \(i \neq j\). If \(Q\) is completely dense, \(\mathcal{G}\) is completely connected. In spatial settings, \(Q\) is typically sparse [depending on how neighbours are defined.]
Key summary quantities:
\[\mathbb{E}[x_i \mid x_{-i}] = \mu_i - \frac{1}{Q_{ii}} \sum_{j \neq i} Q_{ij} (x_j - \mu_j), \qquad \mathrm{Prec}(x_i \mid x_{-i}) = Q_{ii},\]
and
\[\mathrm{Corr}(x_i, x_j \mid x_{-\{i,j\}}) = -\frac{Q_{ij}}{\sqrt{Q_{ii} Q_{jj}}}\]
Let \(x\) be a GMRF wrt \(\mathcal{G} = (\mathcal{V}, \mathcal{E})\). The following are equivalent:
- Pairwise Markov property: \(x_i \perp x_j \mid x_{-\{i,j\}}\) for \(\{i, j\} \notin \mathcal{E}\), \(i \neq j\).
- Local Markov property: \(x_i \perp x_{-\{i,\, \mathrm{ne}(i)\}} \mid x_{\mathrm{ne}(i)}\), where \(\mathrm{ne}(i)\) denotes the neighbours of \(i\).
- Global Markov property: \(x_A \perp x_B \mid x_C\) for disjoint sets \((A, B, C)\) where \(C\) separates \(A\) and \(B\), and \(A\) and \(B\) are nonempty.
Let the spatial process at location \(s_i\) be
\[Y(s_i) = x(s_i)'\beta + w(s_i) + \varepsilon(s_i)\]
where \(x(s_i)\) collects a vector of covariates for site \(i\), and \(\beta\) is a \(p\)-vector of coefficients. Spatial dependence can be imposed by modelling \(w(\cdot)\) as a zero-mean stationary Gaussian process. Distributionally, this implies that for any \(n\), if we let \(w = (w(s_1), \dots, w(s_n))'\) and \(\theta\) be the parameters of the model,
\[w \mid \theta \sim N_n\left( 0, C(\theta) \right)\]
where \(C(\theta)\) is the covariance matrix of an \(n\)-dimensional normal density. We need \(C(\theta)\) to be symmetric and positive-definite for this distribution to be proper.
Special cases:
Exponential covariance matrix: \(C(\theta) = \tau^2 I_n + \sigma^2 H(\phi)\), where the \((i,j)\)-th element of \(H(\phi)\) is \(\exp(-\phi d_{ij})\). The ‘nugget’ \(\tau^2\) is the variance of the non-spatial error, \(\sigma^2\) dictates the scale, and \(\phi\) dictates the range of the spatial dependence.
Matern covariance:
\[C(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d}{\rho} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d}{\rho} \right)\]
for distance \(d\), where \(K_{\nu}\) is a modified Bessel function of order \(\nu\).
Specifying \(C(\theta)\) directly can be awkward when dealing with irregular spatial data [i.e. every real use case].
So, random effects are modelled conditionally. Let \(w_{-i}\) denote the vector of \(w\) excluding \(w_i\). Model \(w_i\) in terms of its full conditional:
\[w_i \mid w_{-i} \sim N\left( \sum_{j \neq i} a_{ij} w_j,\ \tau_i^2 \right)\]
where \(a_{ij}\) describes the neighbourhood structure.
Besag (1974) proved that these full conditionals correspond to a valid joint distribution \(w \sim N(0, \Sigma_w)\) if \(\Sigma_w^{-1}\) is symmetric positive-definite, with \(1/\tau_i^2\) on the diagonals and \(-a_{ij}/\tau_i^2\) in the off-diagonals. The simplest version assumes a common precision parameter \(\tau^2\).
Intrinsic GMRF: drop the positive-definiteness requirement, yielding an improper joint density. When \(a_{ij} \in \{0, 1\}\) for neighbours (i.e. an adjacency matrix instead of distances), the full conditional simplifies further to
\[w_i \mid w_{-i} \sim N\left( \bar{w}_{\mathrm{ne}(i)},\ \frac{\tau^2}{n_i} \right)\]
where \(\mathrm{ne}(i)\) are the neighbours of \(i\) and \(n_i\) is their number.
Let \(Y(s)\) and \(w(s)\) be two spatial processes on \(S\). Assume the \(Y(s_i)\)'s are conditionally independent given random effects \(w(s_i)\), and that they follow some common distributional form (e.g. an exponential family), so
\[\pi\left( y \mid w \right) = \prod_{i=1}^{n} \pi\left( y(s_i) \mid w(s_i) \right)\]
Let \(\eta_i = g\left( \mathbb{E}[Y(s_i) \mid w(s_i)] \right)\) for some known link function \(g\), e.g. \(g(p) = \log\frac{p}{1-p}\) for logit. Assume a linear form for the projection:
\[\eta_i = x(s_i)'\beta + w(s_i)\]
Spatial dependence enters via \(w \sim N(0, C(\theta))\), where \(C\) is often Matern.