This note is a high-fidelity Markdown migration of the Mathematical Background appendix from the LaTeX source.

Parent map: index Related chapters: probability-and-mathstats, linear-regression, maximum-likelihood-and-machine-learning

Concept map

flowchart TD
  A[Proof Techniques] --> B[Set Theory]
  B --> C[Analysis and Topology]
  C --> D[Measure and Integration]
  D --> E[Probability Theory]
  C --> F[Linear Algebra]
  F --> G[Function Spaces]
  G --> H[Calculus and Optimisation]

Mathematical Background

Proof Techniques

Direct Proof / modus ponens: If $R$ is a true statement and $R ⟹ S$ is a true conditional statement, then $S$ is a true statement. Direct proofs typically involve forwards-backwards reasoning: list the statements that follow from $R$ and might relate to $S$ (the forward set), then list the statements from which $S$ follows (the backward set). Then look for a pair of statements $r$ (forward) and $s$ (backward) such that $r ⟹ s$ has a straightforward proof, and write a proof of the form $R ⟹ r ⟹ s ⟹ S$ .
Contrapositive: Since every conditional statement is equivalent to its contrapositive, proving $\neg Q ⟹ \neg P$ is equivalent to proving $P ⟹ Q$ .
Proof by contradiction: Assume $P$ is true, and assume $Q$ is false [i.e. $\neg Q$ is true], and show that $\neg Q ⟹ S$ (using $P$ and other possible intermediate results) where $S$ is known to be false. Conclude that $\neg Q$ must be false, so $Q$ must be true, and we have proved that $P ⟹ Q$ .
Induction [only applies to statements indexed by a well-ordered set such as $N$ ]
- Prove a base case: $P (0)$ is a true statement
- Prove that whenever $P (k)$ is true, $P (k + 1)$ is true
- Conclude that $P (n)$ is true for every $n \in N$

Set Theory

A set is a collection of objects. E.g. $R, Q, Z, N$ .

Set operations:

Intersection : $A \cap B$
Union : $A \cup B$
Difference: $A ∖ B : = \{x : x \in A \land x \notin B\}$
Cartesian product: $A \times B : = \{(a, b) : a \in A \land b \in B\}$

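Definition. The set of all subsets of $S$ is itself a set, called the power set of $S$ and denoted $P (S)$ .

$∣ C ∣ = ∣ R ∣ > ∣ Q ∣ = ∣ Z ∣ = ∣ N ∣$ : $N$ , $Z$ and $Q$ are countably infinite, while $R$ and $C$ are uncountable and have the same cardinality.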

Relations

Given two sets $X$ and $Y$ , any subset $R$ of their Cartesian product $X \times Y$ is called a binary relation. For any pair of elements, $(x, y) \in R \subseteq X \times Y$ is written $x R y$ .

Types of Relations

Properties of binary relations:

reflexive $x R x \forall x \in X$
transitive if $x R y \land y R z ⟹ x R z$
symmetric If $x R y ⟹ y R x$
antisymmetric If $x R y \land y R x ⟹ x = y$
asymmetric if $x R y ⟹ \neg (y R x)$
complete if either $x R y$ or $y R x$ or both, $\forall x, y \in X$
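
The properties above can be checked mechanically on a finite set. The following is a minimal sketch, assuming a relation is stored as a Python set of ordered pairs; the function names and the two example relations (`leq`, `same_parity`) are illustrative choices, not part of the original notes.

```python
from itertools import product

def is_reflexive(X, R):
    # x R x for every x in X
    return all((x, x) in R for x in X)

def is_symmetric(R):
    # x R y implies y R x
    return all((y, x) in R for (x, y) in R)

def is_antisymmetric(R):
    # x R y and y R x imply x = y
    return all(x == y for (x, y) in R if (y, x) in R)

def is_transitive(R):
    # x R y and y R z imply x R z
    return all((x, z) in R for (x, y) in R for (w, z) in R if y == w)

def is_complete(X, R):
    # every pair is ranked in at least one direction
    return all((x, y) in R or (y, x) in R for x, y in product(X, repeat=2))

X = {1, 2, 3}
# The usual order on X: reflexive, transitive, antisymmetric, complete.
leq = {(x, y) for x, y in product(X, repeat=2) if x <= y}
# "Same parity": reflexive, symmetric, transitive, so an equivalence relation.
same_parity = {(x, y) for x, y in product(X, repeat=2) if (x - y) % 2 == 0}

print(is_reflexive(X, leq), is_transitive(leq), is_antisymmetric(leq), is_complete(X, leq))
print(is_reflexive(X, same_parity), is_symmetric(same_parity), is_transitive(same_parity))
```

Here `leq` behaves like a weak order, while `same_parity` is not complete because 1 and 2 are unrelated in either direction.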

Definition. An equivalence relation $R$ on a set $X$ is a relation that is reflexive, transitive, and symmetric. Given an equivalence relation $\sim$ , the set of elements that are related to a given element $a$ :

$\sim (a) : = {x \in X : x \sim a}$

is called the equivalence class of $a$ . E.g. the indifference relation $\sim$ is an equivalence relation, but the weak preference relation $⪰$ is not because it isn’t symmetric.

Definition. A relation that is reflexive and transitive but not symmetric is called an order relation: $x ⪰ y$ . This is also called a weak order. $≻$ is not an order relation because it is not reflexive (and is called a strong order). Every order relation also induces an equivalence relation: $x \sim y ⟺ x ⪰ y \land y ⪰ x$

An ordered set $(X, ⪰)$ consists of a set $X$ together with an order relation $⪰$ defined on X.

Intervals and Contour Sets

Given an ordered set and two elements $a, b \in X s.t. b ⪰ a$ , we can define

The open interval $(a, b)$ : set of all elements strictly between $a$ and $b$ .
The closed interval $[a, b]$ : set of all elements between $a$ and $b$ $s.t. [a, b] = {x \in X : a ≼ x ≼ b}$

Analogously, for arbitrary ordered sets, $(X, ⪰)$ we can define

Upper contour set $⪰ (a) : = {x \in X : x ⪰ a}$ : set of all elements that follow or dominate a
Lower contour set $≼ (a) : = {x \in X : x ≼ a}$ : set of all elements that precede $a$ in the order $⪰$

A partial order is a relation that is reflexive, transitive, and antisymmetric.

Definition. The join of a partially ordered set $S$ is the supremum and is denoted $⋁ S$ . $max (a, b)$ is sometimes written $a \lor b$ .

The meet of a poset is the infimum and is denoted $⋀ S$ . $min (a, b)$ is sometimes written $a \land b$ .

Algebra

Definition. Let $G$ be a set and $\otimes : G \times G \to G$ an operation defined on $G$ . Then $G : = (G, \otimes)$ is called a group if the following conditions hold:

Closure of $G$ under $\otimes$ : $\forall x, y \in G, x \otimes y \in G$
Associativity: $\forall x, y, z \in G, (x \otimes y) \otimes z = x \otimes (y \otimes z)$
Neutral element: $\exists e \in G$ s.t. $\forall x \in G, x \otimes e = e \otimes x = x$
Inverse element: $\forall x \in G, \exists y \in G : x \otimes y = e \land y \otimes x = e$ .
If additionally $\forall x, y \in G : x \otimes y = y \otimes x$ , then $G$ is an Abelian/Commutative group

$(Z, +)$ , $(R^{m \times n}, +)$ and $(R ∖ \{0\}, \cdot)$ are all groups.

Definition. A real-valued vector space $V = (V, +, \cdot)$ is a set $V$ with two operations

$+ : V \times V \to V$
$\cdot : R \times V \to V$

Where

$(V, +)$ is an Abelian Group
Distributivity
- $\forall λ \in R, x, y \in V : λ \cdot (x + y) = λ \cdot x + λ \cdot y$
- $\forall λ, ψ \in R, x \in V : (λ + ψ) \cdot x = λ \cdot x + ψ \cdot x$
Associativity : $\forall λ, ψ \in R, x \in V : λ \cdot (ψ \cdot x) = (λ ψ) \cdot x$
Neutral Item wrt outer operation: $\forall x \in V : 1 \cdot x = x$

Analysis and Topology

Preliminaries:

Vectors : $x : = (x_{1}, \dots, x_{k})$ , where $x_{i} \in R$

Metric Spaces

Definition. The Euclidean distance between $x, y \in R^{k}$ is

$d_{2} (x, y) : = ∥ x - y ∥_{2} : = (\sum_{i = 1}^{k} (x_{i} - y_{i})^{2})^{1/2}$

Requirements for a metric (e.g. $d_{2} : R^{k} \times R^{k} \to R$ ), for all $x, y, v \in R^{k}$ :

$d_{2} (x, y) = 0 ⟺ x = y$ : a point is at zero distance from itself
$d_{2} (x, y) = d_{2} (y, x)$ : distance is symmetric
$d_{2} (x, y) \leq d_{2} (x, v) + d_{2} (v, y)$ : triangle inequality

We can generalise this definition to arbitrary nonempty sets $S$ .

Definition. A metric space is a nonempty set $S$ together with a metric (or distance function) $ρ : S \times S \to R$ s.t. $\forall x, y, v \in S$ :

$ρ (x, y) = 0 ⟺ x = y$
$ρ (x, y) = ρ (y, x)$
$ρ (x, y) \leq ρ (x, v) + ρ (v, y)$

For example, $(R^{k}, d_{2})$ is a metric space. Many additional metric spaces in $R^{k}$ are generated by a norm.

Definition. A norm on $R^{k}$ is a mapping $R^{k} ∋ x \mapsto ∥ x ∥ \in R$ that satisfies, $\forall x, y \in R^{k}$ and $γ \in R$ :

Nonnegativity: $∥ x ∥ \geq 0$
Non degeneracy: $∥ x ∥ = 0 ⟺ x = 0$
Homogeneity: $∥ γ x ∥ = ∣ γ ∣ ∥ x ∥$
Triangle Inequality: $∥ x + y ∥ \leq ∥ x ∥ + ∥ y ∥$

Each norm $∥ . ∥$ on $R^{k}$ generates a metric $ρ$ on $R^{k}$ via $ρ (x, y) : = ∥ x - y ∥$ .

E.g. $∥ x ∥_{2} : = (\sum_{i = 1}^{k} x_{i}^{2})^{1/2}$ generates Euclidean distance $d_{2}$ .

The pair $(X, ∥ \cdot ∥)$ consisting of a vector space $X$ together with a norm $∥ \cdot ∥$ is called a normed linear space.

Definition. A Banach space $(X, ∥ \cdot ∥)$ is a normed linear space that is a complete (in the Cauchy-convergence sense) metric space with respect to the metric derived from its norm.

Definition ( $p$ -norm, also known as the Minkowski norm). A class of norms that includes $∥ \cdot ∥_{2}$ as a special case is the $∥ \cdot ∥_{p}$ norm, defined for $p \geq 1$ by

$∥ x ∥_{p} : = (\sum_{i = 1}^{k} ∣ x_{i} ∣^{p})^{1/ p}, x \in R^{k}$

$∥ \cdot ∥_{p}$ norms give rise to a class of metric spaces $(R^{k}, d_{p})$ where $d_{p} (x, y) : = ∥ x - y ∥_{p} \forall x, y \in R^{k}$ .

Examples:

$p = 1$ : Taxicab (Manhattan)
$p = 2$ : Euclidean
$p = \infty$ : Chebyshev, $∥ x ∥_{\infty} : = \max_{i} ∣ x_{i} ∣$
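
As a quick numerical illustration of the three cases, here is a minimal sketch in plain Python. The helper names `p_norm` and `d_p` are assumptions made for this example, and `math.inf` is used to stand for the $p = \infty$ case.

```python
import math

def p_norm(x, p):
    """Return ||x||_p for p >= 1; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def d_p(x, y, p):
    """Metric induced by the p-norm: d_p(x, y) = ||x - y||_p."""
    return p_norm([xi - yi for xi, yi in zip(x, y)], p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(d_p(x, y, 1))         # taxicab distance
print(d_p(x, y, 2))         # Euclidean distance
print(d_p(x, y, math.inf))  # Chebyshev distance
```

For the points $(0, 0)$ and $(3, 4)$ the three distances are 7, 5 and 4, which shows how the same pair of points can be closer or farther depending on the norm.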

Definition. The Frobenius norm of an $M \times N$ matrix $A$ is

$∥ A ∥_{F} = (\sum_{i = 1}^{M} \sum_{j = 1}^{N} a_{ij}^{2})^{1/2} = (tr (A^{'} A))^{1/2}$
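
The link between the sum of squared entries and the trace of $A^{'} A$ can be checked directly. The sketch below takes the Frobenius norm to be the square root of the sum of squared entries and uses small hand-written helpers (`transpose`, `matmul`, `trace`) that are assumptions of this example rather than library functions.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    # Plain triple-loop matrix product on lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def frobenius_norm(A):
    # Square root of the sum of squared entries.
    return math.sqrt(sum(a * a for row in A for a in row))

A = [[1.0, 2.0], [3.0, 4.0]]
AtA = matmul(transpose(A), A)
print(frobenius_norm(A) ** 2, trace(AtA))
```

Both printed numbers equal the sum of squared entries of `A`, up to floating-point rounding.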

Definition. For $a, b \in R$ , $[a, b]$ denotes the set of real numbers $x$ satisfying $a \leq x \leq b$ . Replacing a square bracket by $($ or $)$ makes the corresponding inequality strict (i.e. the interval is open from below or above).

If $S \subset R$ is bounded from above, $\exists y$ s.t. $x \leq y \forall x \in S$ ; any such $y$ is an upper bound of $S$ . The smallest upper bound is called the least upper bound or supremum, denoted $\sup_{x \in S} x$ or $\sup \{x : x \in S\}$ . If $S$ is not bounded from above, we write $\sup_{x \in S} x = \infty$ .

Similarly, the greatest lower bound or infimum of a set is denoted $\inf_{x \in S} x$ or $\inf \{x : x \in S\}$ .

Definition. A sequence $x_{1}, x_{2}, \dots$ is denoted by $\{x_{i}\}_{i = 1}^{\infty}$ or $\{x_{i}\}$ when the range of the indices is clear.

Let $\{x_{n}\}$ be an infinite sequence of real numbers and suppose $\exists S$ s.t. (1) $\forall ε > 0, \exists N$ s.t. $\forall n > N, x_{n} < S + ε$ , and (2) $\forall ε > 0$ and $M > 0$ , $\exists n > M$ s.t. $x_{n} > S - ε$ . Then $S$ is the $\limsup$ of $\{x_{n}\}$ .

If $\{x_{n}\}$ is not bounded from above, $\limsup x_{n} = \infty$ .

Definition. A sequence $(x_{n})$ in a metric space $(S, ρ)$ is said to be a Cauchy sequence if, $\forall ϵ > 0, \exists N \in N s.t. ρ (x_{j}, x_{k}) < ϵ$ whenever $j, k \geq N$ (intuitively, points in a Cauchy sequence get tighter together).

Let $(x_{n})$ be a sequence of vectors in $R^{k}$ . Suppose for any $ϵ > 0, \exists n \in N$ s.t. $\forall p, q > n, ρ (x_{p}, x_{q}) < ϵ$ (i.e. the sequence is Cauchy). Then $(x_{n})$ has a limit in $R^{k}$ .

More basic definition: $\{a_{n}\}_{n = 1}^{\infty} \to A$ if $\forall ϵ > 0, \exists N$ s.t. $\forall n \geq N, ∣ a_{n} - A ∣ < ϵ$ . If $a_{n} \to A$ and $b_{n} \to B$ , then

$\{a_{n} b_{n}\} \to A B$
$\{a_{n} + b_{n}\} \to A + B$

Sequences

Let $S = (S, ρ)$ be a metric space. A sequence $(x_{n}) \subset S$ is said to converge to $x \in S$ if $\forall ϵ > 0, \exists N \in N s.t. n \geq N ⟹ ρ (x_{n}, x) < ϵ$ .

Theorem. A sequence in $(S, ρ)$ can have at most one limit

Definition. The open ball centered on $x \in S$ with radius $ϵ > 0$ is the set

$B (ϵ, x) : = \{z \in S : ρ (z, x) < ϵ\}$

Set Definitions

Definition. A subset $E$ of $S$ is called bounded if $E \subset B (n, x)$ for some $x \in S$ and some suitably large $n \in N$ (intuition: some sufficiently large ball can fit $E$ inside it).

A sequence $(x_{n})$ in $S$ is called bounded if its range ${x_{n} : n \in N}$ is a bounded set.

Definition. A set $F \subset S$ is closed IFF for every convergent sequence contained in $F$ , the limit of the sequence is also in $F$ .

A closed set contains all its limit points. That is, if $F$ is closed and $(x_{k})$ is a convergent sequence of points in $F$ , then $\lim_{k \to \infty} x_{k}$ is in $F$ as well.

Definition. A subset of an arbitrary metric space $S$ is open iff its complement is closed, and closed iff its complement is open.

A set $S \subset R^{k}$ is called open if, $\forall x \in S, \exists ϵ > 0$ s.t. every $y \in B (ϵ, x)$ (i.e. every $y$ with $ρ (x, y) < ϵ$ ) is in $S$ .

If $F$ is a nonempty, closed, bounded subset of $(R, ∣ \cdot ∣)$ , then $\sup F \in F$ .

A set $S \subset R^{k}$ is open iff its complement is closed.
the union of any number of open sets is open
the intersection of a finite number of open sets is open.
the intersection of any number of closed sets is closed
the union of a finite number of closed sets is closed.

Definition. A point $x \in S$ is called an interior point of $S$ if the set $\{y : ρ (y, x) < ϵ\}$ is contained in $S$ for all $ϵ > 0$ sufficiently small. A point $x$ is called a boundary point of $S$ if both $\{y : ρ (y, x) < ϵ\} \cap S$ and $\{y : ρ (y, x) < ϵ\} \cap S^{c}$ are non-empty for all $ϵ > 0$ . The set of all boundary points of a set $A$ is denoted by $\partial A$ .

The closure of a set $S$ is the set $S$ combined with all points that are the limits of sequence of points in $S$ .

Definition. A subset $A \subset S$ is said to be complete iff every Cauchy sequence in $A$ converges to some point in $A$ .

Definition. The set $K \subset S$ is called compact if every sequence contained in $K$ has a subsequence that converges to a point in $K$ .

Definition. A set $S \subset R^{k}$ is called convex if, $\forall λ \in [0, 1]$ and $a, a^{'} \in S$ , we have $λa + (1 - λ) a^{'} \in S$ . (i.e. all convex combinations of two points in a set are also in the set).

Theorem. Every bounded sequence in Euclidean space $(R^{k}, d_{2})$ has at least one convergent subsequence.

Theorem. A subset of $(R^{k}, d_{2})$ is precompact in that same space iff it is bounded, and compact iff it is closed and bounded.

IOW : Compact $⟺$ Closed $\land$ Bounded

Theorem. All metrics on $R^{k}$ induced by a norm are equivalent.

Functions

A function $f$ from set $A$ to $B$ , written as $A ∋ x \mapsto f (x) \in B$ or $f : A \to B$ , is a rule associating every element of $A$ with one and only one element of $B$ . The point $b \in B$ associated with $a \in A$ is written as $f (a)$ , and is called the image of $a$ under $f$ . For $D \subset B$ , the set $f^{- 1} (D)$ is the set of all points in $A$ that map into $D$ under $f$ , and is called the preimage of $D$ under $f$ : $f^{- 1} (D) : = \{a \in A : f (a) \in D\}$

a function $f : A \to B$ is called

injective / one-to-one if distinct elements of $A$ are always mapped into distinct elements of $B$
surjective / onto if every element of B is the image under $f$ of at least one point in $A$
bijective if a function is both injective and surjective
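
For functions between finite sets, these three properties and the preimage can be checked by enumeration. The sketch below is illustrative only: the helper names and the example map `square` on a five-point set are assumptions, not part of the original notes.

```python
def is_injective(f, A):
    # Distinct elements of A must have distinct images.
    images = [f(a) for a in A]
    return len(set(images)) == len(images)

def is_surjective(f, A, B):
    # Every element of B must be hit by some element of A.
    return {f(a) for a in A} == set(B)

def preimage(f, A, D):
    # All points of A that f maps into D.
    return {a for a in A if f(a) in D}

A = {-2, -1, 0, 1, 2}
B = {0, 1, 4}
square = lambda x: x * x

print(is_injective(square, A))    # -1 and 1 share an image
print(is_surjective(square, A, B))
print(preimage(square, A, {1}))
```

Restricting the domain to `{0, 1, 2}` makes the same map injective, which is the usual way a non-injective function is turned into a bijection onto its image.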

Definition. A real valued function $f$ on $R^{k}$ is continuous at point $a$ if $\forall ϵ > 0, \exists δ > 0 s.t. ρ (x, a) < δ ⟹ ∣ f (x) - f (a) ∣ < ϵ$ .

Equivalently, $lim_{x \to a} f (x) = f (a)$

A function is said to be continuous on the set $S \subset R^{k}$ if, $\forall a \in S \land \forall ϵ > 0, \exists δ > 0 s.t. \forall {x : ρ (x, a) < δ}, ∣ f (x) - f (a) ∣ < ϵ$ . Equivalently, in $lim_{x \to a} f (x)$ , we require the sequence of points that converge to $a$ to be entirely in $S$ .

The sum of two continuous functions is continuous
The product of two continuous functions is continuous
The quotient of two continuous functions is continuous at any point where the denominator is nonzero

Definition. The limit of $f$ at $c$ is $L$ if

$\lim_{x \to c} f (x) = L ⟺ \forall ε > 0, \exists δ > 0 \text{ s.t. } 0 < ∣ x - c ∣ < δ ⟹ ∣ f (x) - L ∣ < ε$

Definition. Given two metric spaces $(X, ρ_{X})$ , $(Y, ρ_{Y})$ , a function $f : X \to Y$ is called Lipschitz continuous if $\exists K \in R s.t. \forall x_{1}, x_{2} \in X$ , $ρ_{Y} (f (x_{1}), f (x_{2})) \leq K ρ_{X} (x_{1}, x_{2})$

such a $K$ is referred to as a Lipschitz constant for the function $f$ .

A Real valued function $f : R \to R$ is Lipschitz if $\exists K > 0$ such that $∣ f (x_{1}) - f (x_{2}) ∣ \leq K ∣ x_{1} - x_{2} ∣$

This limits how fast a function can change. Every function that has bounded first derivatives is Lipschitz continuous. A differentiable function is Lipschitz if and only if it has a bounded derivative.

Definition. A function defined on $X$ is said to be Holder of order $α > 0$ if $\exists M \geq 0$ such that $ρ_{Y} (f (x), f (y)) \leq M ρ_{X} (x, y)^{α} \forall x, y \in X$

this is also called Uniform Lipschitz.

Theorem. A function $f : S \to Y$ is continuous iff the preimage $f^{- 1} (G)$ of every open set $G \subset Y$ is open in $S$ .

Definition. $f : R \to R$ is continuous at $x_{0}$ if $\forall ϵ > 0$ , $\exists δ > 0$ s.t. $\forall x$ , $∣ x - x_{0} ∣ < δ ⟹ ∣ f (x) - f (x_{0}) ∣ < ϵ$ .

Theorem. Let $f : S \to Y$ , where $S, Y$ are metric spaces and $f$ is continuous. If $K \subset S$ is compact, then so is $f (K)$ , the image of $K$ under $f$ .

Result. The Gamma function is

$Γ (α) : = \int_{0}^{\infty} t^{α - 1} e^{- t} d t = \int_{0}^{1} (\log (1/ t))^{α - 1} d t$

Beta function : $B (α, β) = Γ (α) \cdot Γ (β) /Γ (α + β)$
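
A short numerical check of these two results, using `math.gamma` from the Python standard library. The midpoint-rule helper `gamma_numeric` and its truncation point `upper=50.0` are assumptions made for this sketch; it approximates the integral definition and is not a recommended way to evaluate $Γ$ .

```python
import math

def beta(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def gamma_numeric(alpha, upper=50.0, n=100_000):
    # Midpoint rule for the integral of t^(alpha - 1) * exp(-t) over [0, upper].
    h = upper / n
    total = 0.0
    for j in range(n):
        t = (j + 0.5) * h
        total += t ** (alpha - 1) * math.exp(-t)
    return total * h

print(math.gamma(5.0))                    # Gamma(5) = 4! for integer arguments
print(beta(2.0, 3.0))                     # Gamma(2) * Gamma(3) / Gamma(5)
print(gamma_numeric(3.0), math.gamma(3.0))
```

For integer arguments $Γ (n) = (n - 1)!$ , so the first line reproduces $4! = 24$ and the Beta value is $1 \cdot 2 / 24 = 1/12$ .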

Theorem. Let $f : K \to R$ , where $K \subset (S, ρ)$ (an arbitrary metric space). If $f$ is continuous and $K$ is compact, then $f$ attains its supremum and infimum on $K$ .

In case of continuous functions on compact domains, optima always exist.

Definition. The function $f : R \to R$ is differentiable at $x_{0}$ if the difference quotient $(f (x) - f (x_{0})) / (x - x_{0})$ has a limit as $x \to x_{0}$ , i.e. if $\lim_{x \to x_{0}} \frac{f (x) - f (x_{0})}{x - x_{0}} = : f^{'} (x_{0})$ exists.

The derivative of $f$ at $x_{0}$ is this limit and is denoted $f^{'} (x_{0})$ or $\frac{\partial f}{\partial x} ∣_{x = x_{0}}$

Theorem. Differentiability $\subset$ Continuity $\subset$ $\exists$ Limit i.e. not all functions with limits are continuous, not all continuous functions are differentiable.

More generally,

Continuously Differentiable $\subset$ Lipschitz Continuous $\subset$ $α -$ Holder Continuous $\subset$ Uniformly Continuous $\subset$ Continuous (for functions on a closed, bounded interval of $R$ , with $0 < α \leq 1$ )

Theorem.

Linearity: $f, g : X \to Y$ are differentiable at $x$ , then $f + g$ and $α f$ are differentiable at $x$ with $\nabla (f + g) (x) = \nabla f (x) + \nabla g (x)$ and $\nabla (α f) (x) = α \nabla f (x)$ .
Chain Rule: If $f$ and $g$ are differentiable, then $g \circ f$ is differentiable and $\nabla (g \circ f) (x) = \nabla g (f (x)) \nabla f (x)$ .

Theorem (Rolle). Let $f : [a, b] \to R$ be continuous on $[a, b]$ and differentiable on $(a, b)$ . Then $f (a) = f (b) ⟹ \exists c \in (a, b)$ s.t. $f^{'} (c) = 0$ .

Theorem (Mean Value). Let $f : [a, b] \to R$ be continuous on $[a, b]$ and differentiable on $(a, b)$ . Then $\exists c \in (a, b)$ s.t.

$f^{'} (c) = \frac{f ( b ) - f ( a )}{b - a}$

Definition. The epigraph of a function $f$ is $epi f : = \{(x, t) : f (x) \leq t\}$ (i.e. the area above the function).

Definition. Let $f : [a, b] \to R, x, y \in [a, b]; t \in (0, 1)$ . Then,

$f$ is convex if $f ((1 - t) x + t y) \leq (1 - t) f (x) + t f (y)$ . Equivalently, the epigraph of $f$ is a convex set; if $f$ is twice differentiable, this holds iff $f^{''} \geq 0$ .
$f$ is concave if $f ((1 - t) x + t y) \geq (1 - t) f (x) + t f (y)$ ; if $f$ is twice differentiable, this holds iff $f^{''} \leq 0$ .

Fixed Points

Definition. Let $T : S \to S$ , where $S$ is any set. An $x^{*} \in S$ is called a fixed point of $T$ on S if $T x^{*} = x^{*}$ .

If $S \subset R$ , then fixed points of $T$ are those points in $S$ where $T$ meets the 45 degree line.

Theorem. Consider the space $(R^{k}, d)$ , where $d$ is the metric induced by any norm. Let $S \subset R^{k}$ , and let $T : S \to S$ . If $T$ is continuous and $S$ is both compact and convex, then $T$ has at least one fixed point in S.

Definition. Let $(S, ρ)$ be a metric space. $T : S \to S$ is a map. it is called

nonexpansive if $ρ (T x, T y) \leq ρ (x, y) \forall x, y \in S$
contracting if $ρ (T x, T y) < ρ (x, y) \forall x, y \in S, x \neq y$
uniformly contracting with modulus $λ \in [0, 1)$ if $ρ (T x, T y) \leq λ ρ (x, y) \forall x, y \in S$

Theorem (Banach contraction mapping). Let $T : S \to S$ , where $(S, ρ)$ is a complete metric space. If $T$ is a uniform contraction on $S$ with modulus $λ$ , then $T$ has a unique fixed point $x^{*} \in S$ . Moreover, for every $x \in S$ and $n \in N$ , we have $ρ (T^{n} x, x^{*}) \leq λ^{n} ρ (x, x^{*})$ , so $T^{n} x \to x^{*}$ as $n \to \infty$ .
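
The geometric error bound in this theorem can be seen by iterating a simple contraction. The sketch below uses the hypothetical map $T x = 0.5 x + 1$ on $R$ , which is uniformly contracting with modulus $λ = 0.5$ and has fixed point $x^{*} = 2$ ; the map, starting point and iteration count are all assumptions chosen for illustration.

```python
def iterate(T, x0, n):
    """Return the path x0, T x0, T^2 x0, ..., T^n x0."""
    path = [x0]
    x = x0
    for _ in range(n):
        x = T(x)
        path.append(x)
    return path

T = lambda x: 0.5 * x + 1.0    # |T x - T y| = 0.5 |x - y|
lam, x_star, x0 = 0.5, 2.0, 10.0

path = iterate(T, x0, 30)

# Check the bound rho(T^n x, x*) <= lam^n * rho(x, x*) along the path.
bound_holds = all(
    abs(xn - x_star) <= lam ** n * abs(x0 - x_star) + 1e-12
    for n, xn in enumerate(path)
)
print(bound_holds)
print(path[-1])    # close to the fixed point 2
```

Starting from any other `x0` gives the same limit, which is the uniqueness part of the theorem in action.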

Measure

Definition. A $σ$ -algebra (also $σ$ -field) is a collection $F$ of subsets of $Ω$ that

$Ω \in F \land \emptyset \in F$ : includes $Ω$ itself and the null set
$A \in F ⟹ Ω - A = : A^{C} \in F$ : is closed under complement
$A_{1}, A_{2}, \dots \in F ⟹ ⋃_{i = 1}^{\infty} A_{i} \in F$ : is closed under countable unions.

This is effectively the definition of the event space $S$ for a sample space $Ω$ .

Definition. A measure $μ$ on a set $X$ assigns a nonnegative value $μ (A)$ to many subsets of $X$ . For a collection $F$ of subsets of $Ω$ , a measure is a map

$μ : F \to [0, \infty]$

Given $A \in F, μ (A)$ is a measure of the ‘size’ of set $A$ .

A function $μ$ on a $σ -$ field $A$ of $X$ is a measure if

Null empty-set: $μ (\emptyset) = 0$
Non-Negativity: $\forall A \in A, 0 \leq μ (A) \leq \infty$ , i.e. $μ : A \to [0, \infty]$
Countable Additivity: If $A_{1}, A_{2}, \dots$ are disjoint elements of $A$ (i.e. $A_{i} \cap A_{j} = \emptyset \forall i \neq j$ ), $μ (⋃_{i = 1}^{\infty} A_{i}) = \sum_{i = 1}^{\infty} μ (A_{i})$

Existence ensured by Caratheodory’s Extension Theorem.

Examples of measures $μ$ :

If $X$ is countable, let $μ (A) = \# A =$ number of points in $A$ . This counting measure can be defined for any subset $A \subset X$ , so the $σ -$ field $A$ is the collection of all subsets of $X$ , i.e. the power set $2^{X}$ .
If $X = R^{k}$ , define $μ (A) = \int \cdots \int_{A} d x_{1} \cdots d x_{k}$

Definition. Given a topology on $Ω$ , a Borel $σ$ -field is a $σ$ field generated by the family of open subsets of $Ω$ , i.e. the smallest $σ$ field that contains all the open sets.

The Lebesgue measure $μ (B)$ can be defined implicitly for any set $B$ in a $σ -$ field $B$ called the Borel sets of $R^{n}$ . $B$ is the smallest $σ -$ field that contains all ‘rectangles’

$(a_{1}, b_{1}) \times \dots \times (a_{n}, b_{n}) : = \{x \in R^{n} : a_{i} < x_{i} < b_{i}, i = 1, \dots, n\}$

Result. Suppose $Ω = R$ . We say that $I \subset R$ is a bounded interval if $\exists a, b, a < b$ s.t. $I \in \{[a, b], (a, b), (a, b], [a, b)\}$ . Define $C^{1} : = \{I \subset R : I \text{ is a bounded interval}\}$ ; the smallest $σ -$ algebra that contains $C^{1}$ is denoted by $B^{1}$ and is called the Borel $σ -$ algebra.

Thus, a countable union of open intervals belongs to the Borel $σ$ -algebra. Since every open subset of $R$ can be written as a countable union of open intervals, it is therefore also a Borel set. The closed subsets are Borel sets since a closed set is the complement of an open set.

Definition. Basic problem: how to assign each subset of $R^{k}$ , i.e. each element of $P (R^{k})$ a real number that will represent its ‘size’.

With $R^{n}, n = 1, 2, 3$ , $μ (A)$ is the length, area, or volume of $A$ , respectively. $μ$ is a Lebesgue measure on $R^{k}$ .

Definition. If $A$ is a $σ -$ field of subsets of $X$ , the pair $(X, A)$ is called a measurable space, and if $μ$ is a measure on $A$ , the triple $(X, A, μ)$ is called a measure space.

Definition. A measure $μ$ is called a probability measure if $μ (X) = 1$ , and then the triple $(X, A, μ)$ is called a probability space.

Definition. If $(X, A)$ is a measurable space and $f$ is a real-valued function on $X$ , $f$ is measurable if

$f^{- 1} (B) : = {x \in X : f (x) \in B} \in A$

for every Borel set $B$ .

Integration

An integral is a map assigning a number to a function, where the number is viewed as the area/volume ‘under’ the function. Given a measure space $(Ω, F, μ)$ and measurable functions $f, g : Ω \to R$ , the integral $f \mapsto \int f d μ$ assigns a number to each function such that the following three properties hold

If $f \geq 0$ , then $\int f d μ \geq 0$
$\forall a \in R, \int a f d μ = a \int f d μ$
$\int (f + g) d μ = \int f d μ + \int g d μ$

Definition. Suppose $f$ is a bounded function defined on $[a, b]$ . An increasing sequence $P : = \{a = x_{0} < x_{1} < x_{2} < \dots < x_{N} = b\}$ defines a partition of the interval. The mesh size of the partition is defined to be

$∣ P ∣ = \max \{∣ x_{i} - x_{i - 1} ∣ : i = 1, \dots, N\}$

To each partition we associate two approximations of the area under the graph of $f$ , by the rules

$U (f, P) : = \sum_{j = 1}^{N} \sup_{x \in [x_{j - 1}, x_{j}]} f (x) (x_{j} - x_{j - 1})$

$L (f, P) : = \sum_{j = 1}^{N} \inf_{x \in [x_{j - 1}, x_{j}]} f (x) (x_{j} - x_{j - 1})$

These are called the upper and lower Riemann sums. For any partition, $U (f, P) \geq L (f, P)$ .

Definition. A bounded function $f$ defined on an interval $[a, b]$ is said to be Riemann integrable if

$\inf_{P} U (f, P) = \sup_{P} L (f, P)$

in which case this common value is called the Riemann integral and denoted $\int_{a}^{b} f (x) d x$ .

Suppose $f$ is piecewise continuous, defined on $[a, b]$ . Then, $f$ is Riemann integrable and

$\int_{a}^{b} f (x) d x = \lim_{N \to \infty} \sum_{j = 1}^{N} f (a + \frac{j}{N} (b - a)) \frac{b - a}{N}$
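
The limit above is a right-endpoint Riemann sum on an equally spaced partition: evaluate $f$ at $a + \frac{j}{N} (b - a)$ for $j = 1, \dots, N$ and multiply by the common width $(b - a) / N$ . A minimal sketch, assuming the test function $f (x) = x^{2}$ on $[0, 1]$ (whose integral is $1/3$ ); the function name `riemann_sum` is an illustrative choice.

```python
def riemann_sum(f, a, b, N):
    """Right-endpoint Riemann sum of f on [a, b] with N equal subintervals."""
    width = (b - a) / N
    return sum(f(a + j * width) for j in range(1, N + 1)) * width

f = lambda x: x * x

for N in (10, 100, 1000):
    print(N, riemann_sum(f, 0.0, 1.0, N))
```

The printed values decrease towards $1/3$ as $N$ grows, with an error of roughly $1 / (2 N)$ for this particular $f$ .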

Theorem. If $f$ is continuous on $[a, b]$ then $F (x) : = \int_{a}^{x} f (t) d t$ is differentiable on the open interval $(a, b)$ and $F^{'} (x) = f (x) \forall x \in (a, b)$ . $F$ is called the anti-derivative.

$\int_{a}^{b} f (t) d t = F (b) - F (a)$

Definition. Given a measurable space $(Ω, F)$ and a set $E \in F$ , we define an indicator function $f_{E} : Ω \to R$ by

$f_{E} (ω) = 1_{ω \in E}$

This function is measurable.

Definition. A simple function is any function $f$ of the form $f (ω) = \sum_{i = 1}^{n} a_{i} 1_{ω \in E_{i}}$ with $a_{i} \in R$ and $E_{1}, \dots, E_{n} \in F$ , where $\{E_{i}\}_{i = 1}^{n}$ is a finite partition of $Ω$ . A finite linear combination of measurable functions is measurable, which implies that $f$ is measurable. For such $f$ we define

$\int f d μ : = \sum_{i = 1}^{n} a_{i} μ (E_{i})$

Definition. For any measurable function $f$ , define $f^{+} : = max {f, 0}$ and $f^{-} : = max {- f, 0}$ , which are also measurable. We also have $f = f^{+} - f^{-}$ and $∣ f ∣ = f^{+} + f^{-}$ . When at least one of $\int f^{+} d μ$ or $\int f^{-} d μ$ is finite, we define the integral

$\int f d μ : = \int f^{+} d μ - \int f^{-} d μ$

When both $\int f^{+} d μ$ and $\int f^{-} d μ$ are finite, we say $f$ is integrable w.r.t. $μ$ . Lebesgue integrals intuitively slice the function horizontally, while Riemann integrals slice vertically.

Probability Theory

Definition. The triple $(Ω, S, P)$ is a probability space if it satisfies the following

Unitarity: $Pr (Ω) = 1$
Non Negativity: $\forall A \in S, Pr (A) \geq 0$ , with $Pr (A) \in R$ and $Pr (A) < \infty$
Countable Additivity: If $A_{1}, A_{2}, \dots \in S$ are pairwise disjoint [i.e. $\forall i \neq j, A_{i} \cap A_{j} = \emptyset$ ], then $P (⋃_{i = 1}^{\infty} A_{i}) = \sum_{i = 1}^{\infty} P (A_{i})$

Other properties for any event $A, B$

$A \subset B ⟹ Pr (A) \leq Pr (B)$
$Pr (A) \leq 1$
$Pr (A) = 1 - Pr (A^{c})$
$Pr (\emptyset) = 0$

Stated differently:

Definition. Let $P$ be a probability measure on a measurable space $(E, B)$ , so $(E, B, P)$ is a probability space. Sets $B \in B$ are called events, points $e \in E$ are called outcomes, and $P (B)$ is called the probability of B.

Let $(E, B)$ be a measurable space. Let $P : B \to [0, 1]$ be a ‘set’ function mapping the $σ -$ algebra of subsets of $E$ into the real line. We say $P$ is a probability measure if, for events $A, A_{1}, A_{2}, \dots$ in the $σ -$ algebra,

$0 \leq Pr (A) \leq 1$ : Events range from never happening to always happening
$Pr (E) = 1$ : Something must happen
$Pr (\emptyset) = 0$ : Nothing never happens
$Pr (A) + Pr (A^{c}) = 1$ : A must either happen or not happen
$Pr (\cup_{n = 1}^{\infty} A_{n}) = \sum_{n = 1}^{\infty} Pr (A_{n})$ : $σ -$ additivity for countable disjoint events
- Boole’s Inequality: $Pr (\cup_{n = 1}^{\infty} A_{n}) \leq \sum_{n = 1}^{\infty} Pr (A_{n})$ for any sequence of events
Monotonicity: for events $A, B; A \subseteq B ⟹ Pr (A) \leq Pr (B)$

Definition. A measurable function $X : Ω \to R$ , i.e. one s.t. $\forall r \in R, \{ω \in Ω : X (ω) \leq r\} \in S$ (the event space), is called a random variable. In other words, a random variable is a function from the sample space to the real line $R$ , and the probability of its value being in a given interval is well defined.

Definition. The probability measure $P_{X}$ living on $(R, B (R))$ such that, for any Borel set $A \in B (R)$ ,

$P_{X} (A) = P (\{e \in E : X (e) \in A\}) = : P (X \in A)$

is called the distribution of $X$ . The notation $X \sim Q$ is used to indicate that $X$ has distribution $Q$ , i.e. $P_{X} = Q$ .

Definition. The cumulative distribution function (CDF) of $X$ is the map $F_{X} : R \to [0, 1]$ such that

$F_{X} (x) = P (X \leq x) = P_{X} ((- \infty, x])$

for $x \in R$ .

Properties of a CDF $F (u)$ :

Boundary property: $lim_{u \to - \infty} F (u) = 0, lim_{u \to \infty} F (u) = 1$
Nondecreasing: $F (x) \leq F (y)$ if $x \leq y$
Right continuous: $lim_{u ↓ x} F (u) = F (x)$

Densities

Theorem. If a finite measure $P$ is absolutely continuous wrt a $σ -$ finite measure $μ$ , then $\exists$ a nonnegative measurable function $f$ s.t.

$P (A) = \int_{A} fd μ = : \int f 1_{A} d μ$

The function $f$ in this theorem is called the Radon-Nikodym derivative of $P$ with respect to $μ$ , or the density of $P$ with respect to $μ$ , denoted

$f = \frac{d P}{d μ}$

Definition. If a random variable $X$ has density $p$ wrt Lebesgue measure on $R$ , then $X$ or its distribution $P_{X}$ is called absolutely continuous with density $p$ . Then, from R-N,

$F_{X} (x) = P (X \leq x) = P_{X} ((- \infty, x]) = \int_{- \infty}^{x} p (u) d u$

Using the fundamental theorem of calculus, $p$ can be found from the CDF $F_{X}$ by differentiation, $p (x) = F_{X}^{'} (x)$ .

Definition. Let $X_{0}$ be a countable subset of $R$ . The measure $μ (B) : = \# (X_{0} \cap B)$ for Borel sets $B$ is called counting measure on $X_{0}$ . Then,

$\int fd μ = \sum_{x \in X_{0}} f (x)$

Suppose $X$ is a random variable s.t. $P (X \in X_{0}) = P_{X} (X_{0}) = 1$ . Then, $X$ is called a discrete random variable.

The density $p$ of $P_{X}$ w.r.t. $μ$ satisfies

$P (X \in A) = P_{X} (A) = \int_{A} p d μ = \sum_{x \in X_{0}} p (x) 1_{A} (x)$

In particular, if $A = \{y\}$ with $y \in X_{0}$ , then $X \in A ⟺ X = y$ , and so

$P (X = y) = \sum_{x \in X_{0}} p (x) 1_{\{y\}} (x) = p (y)$

The density $p$ is called the mass function for X.

Moments

Definition. If $X$ is a random variable on a probability space $(E, B, P)$ with distribution $P_{X}$ (and Lebesgue density $p$ , when one exists), then the expectation of $X$ is defined as

$E [X] : = \int X d P = \int x d P_{X} (x) = \int x p (x) d x$

For a discrete RV $X$ with $P (X \in X_{0}) = 1$ for a countable set $X_{0}$ , if $μ$ is counting measure on $X_{0}$ , and $p$ is the mass function given by $p (x) = P (X = x)$ ,

$E [X] : = \int x d P_{X} (x) = \int x p (x) d μ (x) = \sum_{x \in X_{0}} x p (x)$

Definition. The variance of a random variable $X$ with finite expectation is defined as

$V [X] = E [(X - E [X])^{2}]$

If $X$ is absolutely continuous with density $p$ ,

$V [X] = \int (x - E [X])^{2} p (x) d x$

If $X$ is discrete with mass function $p$ ,

$V [X] = \sum_{x \in X_{0}} (x - E [X])^{2} p (x)$
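
For a discrete random variable the expectation and variance are finite (or countable) sums over the support, weighted by the mass function. A minimal sketch, assuming a fair six-sided die as the example distribution and taking the variance to be the expected squared deviation from the mean; the dictionary representation of the mass function and the helper names are illustrative choices.

```python
from fractions import Fraction

def expectation(pmf):
    # E[X] = sum over the support of x * p(x)
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # V[X] = sum over the support of (x - E[X])^2 * p(x)
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Mass function of a fair die: p(x) = 1/6 for x = 1, ..., 6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

print(expectation(die))   # 7/2
print(variance(die))      # 35/12
```

Exact rational arithmetic via `fractions.Fraction` avoids rounding, so the results can be compared with the hand calculation directly.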

Random vectors

If $X_{1}, \dots, X_{n}$ are random variables, then the function $X : E \to R^{n}$ defined by

$X (e) = (X_{1} (e), \dots, X_{n} (e))^{'}, e \in E$

is called a random vector. The definitions above extend naturally to random vectors, e.g. the distribution $P_{X}$ of $X$ is

$P_{X} (B) = P (X \in B) : = P ({e \in E : X (e) \in B})$

for Borel sets $B \subset R^{n}$ . The expectation of a random vector $X$ is the vector of expectations

$E [X] = (E [X_{1}], \dots, E [X_{n}])^{'}$

A random vector is said to be absolutely continuous if the CDF can be written as

$F (x_{1}, x_{2}, \dots, x_{n}) = \int_{- \infty}^{x_{1}} \int_{- \infty}^{x_{2}} \dots \int_{- \infty}^{x_{n}} f (z_{1}, \dots, z_{n}) d z_{n} \dots d z_{1}$

Definition. A matrix $W$ is called a random matrix if the entries $W_{ij}$ are random variables.

Definition. The covariance of a random vector $X$ is the matrix of covariances of the variables in $X$

$[Cov (X)]_{ij} = Cov (X_{i}, X_{j})$

If $μ = E [X]$ and $(X - μ)^{'}$ is the transpose of the mean deviation, then

$Cov (X_{i}, X_{j}) : = E [(X_{i} - μ_{i}) (X_{j} - μ_{j})] = [E [(X - μ) (X - μ)^{'}]]_{ij}$

$Cov (X) = E [(X - μ) (X - μ)^{'}] = E [X X^{'}] - μ μ^{'}$
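
The identity $E [(X - μ) (X - μ)^{'}] = E [X X^{'}] - μ μ^{'}$ can be verified entry by entry for a random vector with finite support. The joint distribution below is a hypothetical example made up for this sketch, and exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution of X = (X_1, X_2): four equally likely points.
support = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}
n = 2

# Mean vector mu = E[X].
mu = [sum(x[i] * p for x, p in support.items()) for i in range(n)]

# Centered form: E[(X_i - mu_i)(X_j - mu_j)].
cov_centered = [
    [sum((x[i] - mu[i]) * (x[j] - mu[j]) * p for x, p in support.items())
     for j in range(n)]
    for i in range(n)
]

# Moment form: E[X_i X_j] - mu_i mu_j.
cov_moment = [
    [sum(x[i] * x[j] * p for x, p in support.items()) - mu[i] * mu[j]
     for j in range(n)]
    for i in range(n)
]

print(mu)
print(cov_centered)
print(cov_centered == cov_moment)
```

The resulting matrix is symmetric, with the variances of $X_{1}$ and $X_{2}$ on the diagonal.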

Product Measures and Independence

Let $(X, A, μ)$ and $(Y, B, ν)$ be $σ -$ finite measure spaces. Then $\exists$ a unique product measure $μ \times ν$ on $(X \times Y, A \lor B)$ such that $(μ \times ν) (A \times B) = μ (A) ν (B) \forall A \in A, B \in B$ . The $σ$ -field $A \lor B$ is defined formally as the smallest $σ -$ field containing all sets $A \times B$ with $A \in A$ , $B \in B$ .

Theorem (Fubini-Tonelli). For nonnegative or integrable $f$ , integration against the product measure $μ \times ν$ can be accomplished by iterated integration against $μ$ and $ν$ , in either order

$\int f (x, y) d (μ \times ν) = \int [\int f (x, y) d ν (y)] d μ (x) = \int [\int f (x, y) d μ (x)] d ν (y)$

Definition. Suppose $X_{i} : Ω \to R, 1 \leq i \leq m$ are random variables.

$X_{1}, X_{2}, \dots, X_{m}$ are independent if, for all Borel subsets $B_{1}, B_{2}, \dots, B_{m}$ of $R$ , it is true that

$Pr (X_{i} \in B_{i}, \forall 1 \leq i \leq m) = Pr (X_{1} \in B_{1}) \cdots Pr (X_{m} \in B_{m})$

Conditional Expectations

Definition. Given two r.v.s $X, Y$ with finite second moments, $E [Y ∣ X]$ is defined as a $σ (X) -$ measurable function $m (X)$ such that

$m (\cdot) \in \arg \min_{h} E_{P} [(Y - h (X))^{2}]$

For continuous $(X, Y) \in R^{2}$ with pdf $f (x, y)$ and any $y^{*} \in R$ , the conditional probability of the event $\{Y \leq y^{*}\}$ given $X = x$ , written $P_{Y ∣ X} (Y \leq y^{*} ∣ x)$ , is defined as a function satisfying

$\int_{- \infty}^{x^{*}} P_{Y ∣ X} (Y \leq y^{*} ∣ x) f_{X} (x) d x = P_{X Y} (X \leq x^{*}, Y \leq y^{*}), \forall x^{*} \in R$

The conditional CDF is

$F_{Y ∣ X} (y^{*} ∣ x) = P_{Y ∣ X} (Y \leq y^{*} ∣ x) = \int_{- \infty}^{y^{*}} \frac{f ( x , y )}{f _{X} ( x )} d y$

where the conditional density is $f (y ∣ x) = f (x, y) / f_{X} (x)$ .

Definition. Let $X \sim F_{X}$ and $Y \sim F_{Y}$. Define the class of ‘Bounded Lipschitz functions’ with Lipschitz constant 1 as

$$BL(1) := \{h : R \to R \text{ s.t. } ∣h(x) - h(y)∣ \leq ∣x - y∣ \ \forall x, y \in R \ \land \ \sup_{x \in R} ∣h(x)∣ \leq 1\}$$

Then the Bounded Lipschitz Distance is

$$d_{BL}(F_{X}, F_{Y}) := \sup_{h \in BL(1)} ∣E_{F_{X}}[h(X)] - E_{F_{Y}}[h(Y)]∣$$

$F_{X}$ and $F_{Y}$ are said to be ‘close’ if $d_{B L} (F_{X}, F_{Y})$ is small.

Order Statistics

Theorem. Suppose $X_{1}, X_{2}, \dots, X_{n}$ are i.i.d. r.v.s with distribution function $F(\cdot)$. For each $ω \in Ω$ define $\max(X_{1}, \dots, X_{n})(ω) = \max(X_{1}(ω), \dots, X_{n}(ω))$. We want the distribution of $\max(X_{1}, \dots, X_{n})$. By independence,

$$G(r) = Pr(\{ω \in Ω : \max(X_{1}, \dots, X_{n})(ω) \leq r\}) = Pr(\cap_{i=1}^{n} [X_{i} \leq r]) = \prod_{i=1}^{n} Pr(X_{i} \leq r) = \prod_{i=1}^{n} F(r) = F^{n}(r)$$

If $F$ has a density, $G$ has a density too

$g (r) = G^{'} (r) = n F^{n - 1} (r) f (r)$

More generally, the distribution function of the $m$-th order statistic $X_{(m)}$ (the $m$-th smallest of the $n$ draws) is given by $F_{(m)}$

$$F_{(m)}(t) = \sum_{i=m}^{n} \binom{n}{i} F(t)^{i} (1 - F(t))^{n-i}, \quad -\infty < t < \infty$$

Reference: Severini (2005), Chapter 7.
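The order-statistic formula can be checked numerically. The sketch below is illustrative only: the function name `order_stat_cdf` and the values `F_t = 0.7`, `n = 5` are assumptions made for the example, not part of the source. It evaluates $F_{(m)}$ at a point where the parent CDF equals `F_t` and confirms the two boundary cases: $m = n$ recovers $F^{n}$ (the maximum) and $m = 1$ gives $1 - (1 - F)^{n}$ (the minimum).

```python
from math import comb

def order_stat_cdf(F_t: float, m: int, n: int) -> float:
    """CDF of the m-th smallest of n i.i.d. draws, at a point where the parent CDF equals F_t."""
    return sum(comb(n, i) * F_t**i * (1 - F_t) ** (n - i) for i in range(m, n + 1))

F_t, n = 0.7, 5                          # hypothetical values for illustration
cdf_max = order_stat_cdf(F_t, n, n)      # m = n: distribution of the maximum
cdf_min = order_stat_cdf(F_t, 1, n)      # m = 1: distribution of the minimum

print(cdf_max, F_t**n)
print(cdf_min, 1 - (1 - F_t) ** n)
```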

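Result. The maximum of $n$ i.i.d. $U[0, 1]$ random variables has a distribution with density $g(r) = n r^{n-1}$ on $[0, 1]$ (set $F(r) = r$ and $f(r) = 1$ in the density of the maximum).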

Result. (Useful for Vickrey auctions) The second-highest of $m$ i.i.d. draws, denoted $Y^{2}$, has distribution function and density

$$F_{Y^{2}}(r) = F^{m}(r) + m(1 - F(r)) F^{m-1}(r)$$

$$f_{Y^{2}}(r) = m(m - 1)(1 - F(r)) F^{m-2}(r) f(r)$$

Linear Functions and Linear Algebra

Linear Functions

Definition. A function $f : X \to Y$ between two linear spaces $X, Y$ is linear if it preserves the linear structure, i.e. it satisfies the following properties for all $x_{1}, x_{2} \in X$ and all scalars $α$:

- Additivity: $f(x_{1} + x_{2}) = f(x_{1}) + f(x_{2})$
- Homogeneity: $f(α x_{1}) = α f(x_{1})$

- A linear and bijective $f : V \to W$ is called an isomorphism.
- A linear function that maps into $R$ is called a linear functional.
- A linear function that maps a space into itself is called a linear operator (endomorphism); if it is also bijective, it is an automorphism.

Every linear function between finite-dimensional spaces can be represented by a matrix once bases are fixed.

Theorem. Linear functions satisfy $f(α x_{1} + (1 - α) x_{2}) = α f(x_{1}) + (1 - α) f(x_{2})$. The two defining properties generalise as follows.

Additivity: $f(x_{1} + x_{2}) = f(x_{1}) + f(x_{2})$ generalises to
- Convex functions: $f(α x_{1} + (1 - α) x_{2}) \leq α f(x_{1}) + (1 - α) f(x_{2})$ for $α \in [0, 1]$, which generalises to
- Quasiconvex functions: $f(α x_{1} + (1 - α) x_{2}) \leq \max\{f(x_{1}), f(x_{2})\}$ for $α \in [0, 1]$

Homogeneity: $f(α x) = α f(x)$ generalises to
- Homogeneous functions: $f(α x) = α^{k} f(x), α > 0$, which generalises to
- Homothetic functions: $f(x_{1}) = f(x_{2}) ⟹ f(α x_{1}) = f(α x_{2}), α > 0$

Definition. An inner product on a vector space $V$ is a mapping $⟨\cdot, \cdot⟩ : V \times V \to R$ that satisfies, $\forall x, y, z \in V \land a \in R$

- $⟨x, x⟩ \geq 0$ and $⟨x, x⟩ = 0 ⟺ x = 0$
- $⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩$
- $⟨x, a y⟩ = a ⟨x, y⟩$
- $⟨x, y⟩ = ⟨y, x⟩$

On $R^{n}$, the standard inner product is the dot product

$x^{'} y = ⟨x, y⟩ := \sum_{i=1}^{n} x_{i} y_{i}$

This generalises to an inner product of two functions $u, v : [a, b] \to R$ where

$⟨u, v⟩ = \int_{a}^{b} u(x) v(x) dx$

An inner product defines a norm $∥v∥ = \sqrt{⟨v, v⟩}$. This gives us a restatement of the Cauchy-Schwarz inequality $∣⟨x, y⟩∣ \leq ∥x∥ ∥y∥$

Definition.

- $⟨x, y⟩ = 0 ⟺ x ⊥ y$ (orthogonality). Furthermore, if $∥x∥ = 1 = ∥y∥$, then they are said to be orthonormal.
- For unit vectors ($∥x∥ = ∥y∥ = 1$), $⟨x, y⟩ = \pm 1 ⟺ x$ is parallel to $y$.

Definition. A symmetric matrix $A \in R^{n \times n}$ is positive definite if $x^{'} A x > 0$ for all $x \in R^{n}$ with $x \neq 0$.

Definition. For matrices $A$ and $B$ with identical dimensions, the element-wise product

$$A ⊙ B := C, \quad c_{ij} = a_{ij} \times b_{ij} \ \forall i, j$$

is called the Hadamard product.

Definition. The Euclidean norm of a vector $x \in R^{N}$ is defined as $∥x∥ := \sqrt{⟨x, x⟩}$

Theorem. Angle between $u$ and $v$ is given by

$cos θ = \frac{⟨ u , v ⟩}{∥ u ∥ ∥ v ∥}$

Theorem (Cauchy-Schwarz). $∣⟨x, y⟩∣ \leq ∥x∥ ∥y∥$

Definition. $Trace (A) = \sum_{n = 1}^{N} a_{nn}$

For conformable matrices $A, B; tr (A B) = tr (B A)$

Definition. For a square matrix $A$, a scalar $λ$ and a non-zero vector $x$ that satisfy $A x = λ x$ constitute an eigenvalue and eigenvector respectively.

Definition. A square matrix $A \in R^{n \times n}$ is orthogonal iff its columns are orthonormal so that

$A A^{'} = I = A^{'} A ⟹ A^{- 1} = A^{'}$

Definition. For a linear map $Φ : V \to W$, we define the kernel/null space as

$$\ker(Φ) := Φ^{-1}(0_{W}) = \{v \in V : Φ(v) = 0_{W}\}$$

and the image/range

$$Im(Φ) := Φ(V) = \{w \in W ∣ \exists v \in V : Φ(v) = w\}$$

The dimension of the image is called the rank of $Φ$. The dimension of the kernel is called the nullity of $Φ$. If $V$ is finite dimensional,

$rank Φ + nullity Φ = \dim V$

This is the rank-nullity theorem.

A linear function $Φ : V \to W$ has full rank if $rank Φ = \min\{\dim V, \dim W\}$.

Definition. Nonsingular Matrices

A matrix $A \in R^{n \times n}$ with columns $a_{1}, \dots, a_{n}$ is non-singular (invertible) if it is one-to-one as a linear map. Equivalently,

$$A \text{ is one-to-one} ⟺ \{a_{1}, \dots, a_{n}\} \text{ is a basis of } R^{n} ⟺ \ker(A) = \{0\}$$

Projection

Definition. Let $V$ be a vector space and $U \subseteq V$ a subspace of $V$. A linear mapping $π : V \to U$ is called a projection if $π^{2} = π \circ π = π$. Since linear mappings between finite-dimensional spaces can be expressed by a transformation matrix, projections can be represented by a projection matrix $P_{π}$ with the property $P_{π}^{2} = P_{π}$. Orthogonal projection matrices are also symmetric; oblique projections need not be.

Result. We look at orthogonal projections of vectors $y \in R^{n}$ onto lower-dimensional subspaces $X \subseteq R^{n}$ with $\dim(X) = m \geq 1$. Assume $(x_{1}, \dots, x_{m})$ is an ordered basis of $X$ and, abusing notation, stack the basis vectors as the columns of the matrix $X = [x_{1}, \dots, x_{m}] \in R^{n \times m}$. Any projection onto the subspace can then be written as $π_{X}(y) = \sum_{i=1}^{m} λ_{i} x_{i} = X λ$ with $λ = [λ_{1}, \dots, λ_{m}]^{'} \in R^{m}$. The problem, then, is to find the coordinates $λ_{1}, \dots, λ_{m}$ of the projection with respect to this basis. The solution is the familiar OLS coefficient vector

$λ = (X^{'} X)^{- 1} X^{'} y$

where $(X^{'} X)^{-1} X^{'}$ is also called the pseudo-inverse of $X$, which can be computed as long as $X^{'} X$ is full rank (equivalently, $X$ has full column rank).

The projection matrix is therefore

$P_{π} = X (X^{'} X)^{- 1} X^{'}$
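A minimal NumPy sketch of the projection result, under the assumption of a randomly generated full-column-rank $X$ (the data, seed, and variable names are illustrative and not from the source). It builds $P_{π}$, and checks idempotence, symmetry, and orthogonality of the residual to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))      # columns: a basis of a 2-dimensional subspace of R^6
y = rng.normal(size=6)

# Coordinates of the projection (the OLS coefficient vector), via least squares
lam, *_ = np.linalg.lstsq(X, y, rcond=None)

# Projection matrix P = X (X'X)^{-1} X', using solve() rather than an explicit inverse
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y
resid = y - y_hat

print(np.allclose(P @ P, P), np.allclose(P, P.T), np.allclose(X.T @ resid, 0.0))
```

Using `np.linalg.solve` avoids forming $(X^{'} X)^{-1}$ explicitly, which is generally the safer numerical habit.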

Result (Gram-Schmidt orthogonalisation). Any basis $(b_{1}, \dots, b_{n})$ of an $n$-dimensional vector space $V$ can be transformed into an orthogonal basis $(u_{1}, \dots, u_{n})$ (orthonormal after dividing each $u_{k}$ by its norm) with $span[b_{1}, \dots, b_{n}] = span[u_{1}, \dots, u_{n}]$ as follows

$$u_{1} := b_{1}, \quad u_{k} := b_{k} - π_{span[u_{1}, \dots, u_{k-1}]}(b_{k}), \quad k = 2, \dots, n$$

where the $k$ th basis vector $b_{k}$ is projected onto the subspace spanned by the first $k - 1$ constructed orthogonal vectors $u_{1}, \dots, u_{k - 1}$ .

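This is the same idea as the Frisch-Waugh-Lovell (FWL) theorem in regression, which partials out earlier regressors by projection, but the Gram-Schmidt construction is older.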
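The recursion can be sketched in a few lines of NumPy. This is an illustrative implementation, not the source's: the function name `gram_schmidt` and the example matrix `B` are assumptions, it uses the "modified" variant (subtracting one projection at a time), and it normalises each vector so the output is orthonormal rather than merely orthogonal.

```python
import numpy as np

def gram_schmidt(B: np.ndarray) -> np.ndarray:
    """Orthonormalise the columns of B (assumed linearly independent)."""
    U = np.zeros_like(B, dtype=float)
    for k in range(B.shape[1]):
        u = B[:, k].astype(float)
        for j in range(k):
            # remove the component of b_k along the already-built direction u_j
            u = u - (U[:, j] @ u) * U[:, j]
        U[:, k] = u / np.linalg.norm(u)
    return U

B = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # hypothetical basis of R^3
U = gram_schmidt(B)
print(np.round(U.T @ U, 10))      # identity matrix up to rounding
```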

Matrix Decompositions

Definition. A square matrix $A \in R^{n \times n}$ admits an eigen-decomposition if it can be factorised as $A = Q Λ Q^{-1}$ where

- $Q$ is an $n \times n$ matrix whose $i$th column is the eigenvector $q_{i}$ of $A$ (if $A$ is symmetric, $Q$ can be chosen orthogonal, so that $Q^{-1} = Q^{'}$)
- $Λ$ is a diagonal matrix with the corresponding eigenvalues $Λ_{ii} = λ_{i}$

Theorem. If $Q, P$ are $N \times N$ orthogonal matrices, then

- $Q^{T} = Q^{-1}$ is also orthogonal
- $QP$ is orthogonal
- $\det(Q) \in \{-1, 1\}$

Result (Cholesky decomposition). If $A$ is symmetric positive definite, then it admits the factorisations

- $A = R^{T} R$ where $R$ is non-singular upper triangular
- $A = L L^{T}$ where $L$ is non-singular lower triangular

Result (QR decomposition). If $A$ is an $N \times K$ matrix with full column rank, there exists a factorisation $A = QR$ where

- $Q$ is an $N \times K$ matrix with orthonormal columns ($Q^{'} Q = I_{K}$)
- $R$ is $K \times K$ upper triangular and nonsingular (invertible)

Result. Computing $\hat{β} = (X^{'} X)^{-1} X^{'} y$ directly is often numerically unstable, so we can instead factorise $X = QR$. Then the OLS estimate can be written as $\hat{β} = R^{-1} Q^{'} y$. The homoscedastic variance is $V[\hat{β}] = σ^{2} (X^{'} X)^{-1} = σ^{2} (R^{'} R)^{-1}$.

Using the same decomposition, $\hat{y} = X \hat{β} = Q Q^{'} y$ .
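A short sketch of QR-based OLS on simulated data; the sample size, coefficients, and seed are assumptions made for illustration. It compares the QR solution with the normal-equations solution and checks $\hat{y} = Q Q^{'} y$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

Q, R = np.linalg.qr(X)                        # reduced QR: Q is N x K, R is K x K
beta_qr = np.linalg.solve(R, Q.T @ y)         # solves R beta = Q'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison
y_hat = Q @ (Q.T @ y)                         # fitted values without forming beta

print(beta_qr)
```

In production code a triangular solver would exploit the structure of $R$; the generic `solve` is used here only to keep the sketch dependency-free beyond NumPy.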

Definition. Any $n \times p$ matrix $Z$ may be written as

$Z = UΣ V^{'}$

where $U$ is a $n \times n$ orthogonal matrix, $V$ is a $p \times p$ orthogonal matrix, and $Σ$ is a $n \times p$ diagonal matrix with non-negative elements.

Result. For a square covariance matrix $X^{'} X$, if $X = U S V^{'}$, then

$$X^{'} X = V S^{'} U^{'} U S V^{'} = V D V^{'}$$

where $D = S^{2}$ contains the squared singular values. In other words,

$$U = evec(X X^{'}), \quad V = evec(X^{'} X), \quad S^{2} = eval(X^{'} X) = eval(X X^{'})$$

where the last equality refers to the non-zero eigenvalues.
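The link between the SVD of $X$ and the eigen-decomposition of $X^{'} X$ can be verified numerically. The random matrix and seed below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values in descending order
evals = np.linalg.eigvalsh(X.T @ X)[::-1]          # eigvalsh returns ascending order

# Squared singular values equal the eigenvalues of X'X, and V D V' rebuilds X'X
print(np.allclose(s**2, evals))
print(np.allclose(Vt.T @ np.diag(s**2) @ Vt, X.T @ X))
```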

Matrix Identities

For conformable matrices $A, B, C$ ,

$$A(B + C) = AB + AC$$

$$(A + B)^{⊤} = A^{⊤} + B^{⊤}$$

$$(AB)^{⊤} = B^{⊤} A^{⊤}$$

$$(AB)^{-1} = B^{-1} A^{-1} \quad \text{(for invertible } A, B \text{)}$$

$$tr(ABC) = tr(CAB) = tr(BCA)$$

Partitioned Matrices

Definition. It can be useful to partition a matrix as follows

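$$X = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}$$

Multiplying a partitioned matrix with a stacked vector $c$

$$Xc = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} \begin{bmatrix} c_{1} \\ c_{2} \end{bmatrix} = \begin{bmatrix} X_{11} c_{1} + X_{12} c_{2} \\ X_{21} c_{1} + X_{22} c_{2} \end{bmatrix}$$

Theorem.

$$\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}^{-1} = \begin{bmatrix} X_{11}^{-1} + X_{11}^{-1} X_{12} F_{2} X_{21} X_{11}^{-1} & -X_{11}^{-1} X_{12} F_{2} \\ -F_{2} X_{21} X_{11}^{-1} & F_{2} \end{bmatrix}$$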

where $F_{2} = (X_{22} - X_{21} (X_{11})^{- 1} X_{12})^{- 1}$
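The partitioned-inverse formula is easy to get wrong by a sign or an ordering, so a numerical check is useful. The sketch below assumes a randomly generated symmetric positive definite matrix (so that $X_{11}$ and the Schur complement are invertible) and a 2 + 3 block split; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(5, 5))
X = G @ G.T + np.eye(5)            # symmetric positive definite, hence all blocks behave

X11, X12 = X[:2, :2], X[:2, 2:]
X21, X22 = X[2:, :2], X[2:, 2:]

X11_inv = np.linalg.inv(X11)
F2 = np.linalg.inv(X22 - X21 @ X11_inv @ X12)   # inverse of the Schur complement

top = np.hstack([X11_inv + X11_inv @ X12 @ F2 @ X21 @ X11_inv, -X11_inv @ X12 @ F2])
bottom = np.hstack([-F2 @ X21 @ X11_inv, F2])
X_inv_blocks = np.vstack([top, bottom])

print(np.allclose(X_inv_blocks, np.linalg.inv(X)))
```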

Function Spaces

Almost all of this section is based on Larry Wasserman’s notes: functionspaces.pdf, and Racine (2013).

Definition. Let $U$ be any set, let $bU$ be the collection of all bounded functions $f : U \to R$ (i.e. $\sup_{x \in U} ∣f(x)∣ < \infty$), and define the sup metric

$$d_{\infty}(f, g) := ∥f - g∥_{\infty} := \sup_{x \in U} ∣f(x) - g(x)∣$$

Spaces of functions can be treated as linear vector spaces

Definition. The inner product of two functions $f$ and $g$ is $⟨f, g⟩ = \int f(x) g(x) dx$

which leads to a norm for functions

$∥ f ∥_{2}^{2} = \int f^{2} (x) d x$

Definition. An operator $O$ is a higher-order function that maps from one function to another. A derivative and integral are both operators.

Operators can have eigenvalues and eigenfunctions such that

$O f = λ f$

$e^{ax}$ is an eigenfunction for both differentiation (with eigenvalue $a$) and integration (with eigenvalue $1/a$, up to the constant of integration).

Definition. A Hilbert space $H$ is a complete ( $:=$ every Cauchy sequence in the space converges to a point in it) inner product space. Equivalently, it is a vector space endowed with an inner product and an associated norm and metric such that every Cauchy sequence has a limit in $H$.

Intuitively, it means it doesn’t have any ‘holes’ in it ( $Q$ is not a complete space because $\sqrt{2}$ is missing from it).

Every Hilbert space is a Banach space but the reverse is not true in general. In a Hilbert space, we write $f_{n} \to f$ if $∥f_{n} - f∥ \to 0$ as $n \to \infty$.

If $V$ is a Hilbert space and $L$ is a closed subspace, then for every $v \in V$ there exists a unique $y \in L$, called the projection of $v$ onto $L$, that minimises $∥v - z∥$ over $z \in L$. The set of elements orthogonal to every $z \in L$ is denoted $L^{⊥}$. Every $v \in V$ can be written as $v = w + z$ where $z$ is the projection of $v$ onto $L$ and $w \in L^{⊥}$.

Result. The set of random variables with finite second moments defined on a common probability space $(Ω, F, μ)$ is a Hilbert space with inner product $⟨X, Y⟩ = E[XY]$, associated norm $∥X∥ = \sqrt{E[X^{2}]}$ and metric $∥X - Y∥$.

Result. The space of Borel-measurable real functions $f$ on $R$ satisfying $\int_{-\infty}^{\infty} f(x)^{2} w(x) dx < \infty$ for a given density $w(x)$, with inner product $⟨f, g⟩ = \int_{-\infty}^{\infty} f(x) g(x) w(x) dx$, associated norm $∥f∥ = \sqrt{⟨f, f⟩}$ and metric $∥f - g∥$, is a Hilbert space.

Definition. If $L$ and $M$ are subspaces such that every $ℓ \in L$ is orthogonal to every $m \in M$, then we define the orthogonal sum as

$L \oplus M = \{ℓ + m : ℓ \in L, m \in M\}$

A set of vectors $\{e_{t}, t \in T\}$ is orthonormal if $⟨e_{s}, e_{t}⟩ = 0$ when $s \neq t$ and $∥e_{t}∥ = 1 \ \forall t \in T$. If, in addition, the only vector orthogonal to every $e_{t}$ is $0$, the set is called an orthonormal basis. Every Hilbert space has an orthonormal basis. A Hilbert space is said to be separable if there exists a countable orthonormal basis.

$L_{p}$ spaces

Let $F$ be a collection of functions $[a, b] \mapsto R$ . The $L_{p}$ norm on $F$ is defined by

$∥ f ∥_{p} = (\int_{a}^{b} ∣ f (x) ∣^{p} d x)^{1/ p}$

For $p = \infty$ , we define the sup norm $∥ f ∥_{\infty} = sup_{x} ∣ f (x) ∣$ .

The space $L_{p} (a, b)$ is defined as

$L_{p} (a, b) : = {f : [a, b] \to R : ∥ f ∥_{p} < \infty}$

Every $L_{p}$ space with $1 \leq p \leq \infty$ is a Banach space.

- Cauchy-Schwarz: $(\int f(x) g(x) dx)^{2} \leq \int f^{2}(x) dx \int g^{2}(x) dx$
- Minkowski: $∥f + g∥_{p} \leq ∥f∥_{p} + ∥g∥_{p}$ where $p \geq 1$
- Hölder: $∥fg∥_{1} \leq ∥f∥_{p} ∥g∥_{q}$ where $p, q \geq 1$ and $(1/p) + (1/q) = 1$

Result. Functions where $∥f∥_{2}^{2} < \infty$ are said to be square-integrable, and the space of square-integrable functions is called $L_{2}$. Many familiar results from vector spaces carry over into $L_{2}$.

$L_{2}(a, b)$ is a Hilbert space. The inner product between two functions $f, g \in L_{2}(a, b)$ is $\int_{a}^{b} f(x) g(x) dx$ and the norm of $f$ satisfies $∥f∥^{2} = \int_{a}^{b} f^{2}(x) dx$. With this inner product, $L_{2}(a, b)$ is a separable Hilbert space; that is, we can find a countable orthonormal basis $ϕ_{1}, ϕ_{2}, \dots$ with $∥ϕ_{j}∥ = 1 \ \forall j$ and $\int_{a}^{b} ϕ_{i}(x) ϕ_{j}(x) dx = 0 \ \forall i \neq j$. It follows that if $f \in L_{2}(a, b)$,

$f(x) = \sum_{j=1}^{\infty} θ_{j} ϕ_{j}(x)$, where $θ_{j} = \int_{a}^{b} f(x) ϕ_{j}(x) dx$

are the coefficients. Parseval’s identity states that $\int_{a}^{b} f^{2}(x) dx = \sum_{j=1}^{\infty} θ_{j}^{2}$.

In terms of this basis, $L_{2}(a, b)$ is the closed span

$$\left\{\sum_{j=1}^{\infty} a_{j} ϕ_{j}(x) : (a_{j})_{j \geq 1} \in R^{\infty}, \ \sum_{j=1}^{\infty} a_{j}^{2} < \infty\right\}$$

The projection of $f = \sum_{j=1}^{\infty} θ_{j} ϕ_{j}(x)$ onto the span of $\{ϕ_{1}, \dots, ϕ_{n}\}$ is $f_{n} = \sum_{j=1}^{n} θ_{j} ϕ_{j}(x)$, which we call the $n$-term linear approximation of $f$.
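The $n$-term approximation and Parseval's identity can be illustrated numerically. The sketch below makes several assumptions not in the source: the target function $f(x) = x(1 - x)$ on $[0, 1]$, a trigonometric basis scaled by $\sqrt{2}$ so that it is orthonormal, and a simple midpoint rule for the integrals. It reports how the squared $L_{2}$ error falls as more terms are kept.

```python
import numpy as np

M = 20_000
x = (np.arange(M) + 0.5) / M                 # midpoint grid on [0, 1]

def integrate(values):
    """Midpoint-rule approximation of an integral over [0, 1]."""
    return float(values.mean())

def phi(j, x):
    """Orthonormal trigonometric basis: phi_1 = 1, phi_2k ~ sin, phi_2k+1 ~ cos."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    wave = np.sin if j % 2 == 0 else np.cos
    return np.sqrt(2.0) * wave(2 * np.pi * k * x)

f = x * (1 - x)                              # hypothetical target function
theta = {j: integrate(f * phi(j, x)) for j in range(1, 22)}

def approx_error(n):
    """Squared L2 error of the n-term linear approximation f_n."""
    f_n = sum(theta[j] * phi(j, x) for j in range(1, n + 1))
    return integrate((f - f_n) ** 2)

errors = [approx_error(n) for n in (1, 5, 21)]
parseval_gap = integrate(f**2) - sum(t**2 for t in theta.values())
print(errors, parseval_gap)
```

The gap between $\int f^{2}$ and the truncated sum of $θ_{j}^{2}$ is the energy in the discarded coefficients, which is why it shrinks together with the approximation error.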

Definition. A sequence of functions $ψ_{1}, ψ_{2}, \dots$ can be considered a basis. An orthonormal basis is one that admits the expansion

$f = \sum_{j = 1}^{\infty} ⟨ f, ψ_{j} ⟩ ψ_{j}$

Monomials $1, x, x^{2}, \dots$ span a dense subset of $L_{2}$ on $[0, 1]$ (on $R$ this requires a suitably decaying weight, since monomials are not square-integrable over $R$), but they aren’t orthogonal.

Result. A popular basis for $L_{2}$ on $[0, 1]$ is the set of sines and cosines, which may be written as $ϕ_{1} = 1$, $ϕ_{2k} = \sin 2kπx$, $ϕ_{2k+1} = \cos 2kπx$. Coefficients in this expansion are referred to as the Fourier coefficients of the original function.

A cosine basis on $[0, 1]$ is

$$ϕ_{0}(x) = 1, \quad ϕ_{j}(x) = \sqrt{2} \cos(2πjx), \quad j = 1, 2, \dots$$

The Legendre basis on $(-1, 1)$ is

$$P_{0}(x) = 1, \quad P_{1}(x) = x, \quad P_{2}(x) = \frac{1}{2}(3x^{2} - 1), \quad P_{3}(x) = \frac{1}{2}(5x^{3} - 3x), \dots$$

The Haar basis on $[0, 1]$ consists of functions $\{ϕ(x), ψ_{jk}(x) : j = 0, 1, \dots, k = 0, 1, \dots, 2^{j} - 1\}$ where

$$ϕ(x) = \begin{cases} 1 & \text{if } 0 \leq x < 1 \\ 0 & \text{otherwise} \end{cases}$$

$ψ_{jk}(x) = 2^{j/2} ψ(2^{j} x - k)$ and

$$ψ(x) = \begin{cases} -1 & \text{if } 0 \leq x \leq \frac{1}{2} \\ 1 & \text{if } \frac{1}{2} < x \leq 1 \end{cases}$$

This is a doubly indexed set of functions, so when $f$ is expanded in this basis we write

$$f(x) = α ϕ(x) + \sum_{j=0}^{\infty} \sum_{k=0}^{2^{j} - 1} β_{jk} ψ_{jk}(x)$$

where $α = \int_{0}^{1} f(x) ϕ(x) dx$ and $β_{jk} = \int_{0}^{1} f(x) ψ_{jk}(x) dx$. The Haar basis is an example of a wavelet basis.

Definition. Let $β$ be a positive integer and let $T \subset R$. The Hölder space $H(β, L)$ is the set of functions $g : T \to R$ such that

$$∣g^{(β - 1)}(y) - g^{(β - 1)}(x)∣ \leq L ∣x - y∣, \quad \text{for all } x, y \in T$$

The special case $β = 1$ is sometimes called the Lipschitz space. If $β = 2$ then we have

$$∣g^{'}(x) - g^{'}(y)∣ \leq L ∣x - y∣, \quad \text{for all } x, y \in T$$

Roughly speaking, this means that the functions have bounded second derivatives.

Multivariate version

There is also a multivariate version of Hölder spaces. Let $T \subset R^{d}$. Given a vector $s = (s_{1}, \dots, s_{d})$, define $∣s∣ = s_{1} + \dots + s_{d}$, $s! = s_{1}! \cdots s_{d}!$, $x^{s} = x_{1}^{s_{1}} \cdots x_{d}^{s_{d}}$ and

$$D^{s} = \frac{\partial^{s_{1} + \dots + s_{d}}}{\partial x_{1}^{s_{1}} \cdots \partial x_{d}^{s_{d}}}$$

The Hölder class $H(β, L)$ is the set of functions $g : T \to R$ such that

$$∣D^{s} g(x) - D^{s} g(y)∣ \leq L ∥x - y∥^{β - ∣s∣}$$

for all $x, y$ and all $s$ such that $∣ s ∣ = β - 1$ .

Definition. A Sobolev space is a space of functions possessing sufficiently many derivatives for some application domain. Formally,

Let $f$ be integrable on every bounded interval. Then $f$ is weakly differentiable if there exists a function $f^{'}$ that is integrable on every bounded interval, such that $\int_{x}^{y} f^{'}(s) ds = f(y) - f(x)$ whenever $x \leq y$. We call $f^{'}$ the weak derivative of $f$. Let $D^{j} f$ denote the $j$th weak derivative of $f$.

The Sobolev space of order $m$ is defined by

$$W_{m, p} = \{f \in L_{p}(0, 1) : D^{m} f \in L_{p}(0, 1)\}$$

The Sobolev ball of order $m$ and radius $c$ is defined by

$$W_{m, p}(c) = \{f : f \in W_{m, p}, ∥D^{m} f∥_{p} \leq c\}$$

Definition. A Mercer kernel is a continuous function $K : [a, b] \times [a, b] \to R$ such that $K (x, y) = K (y, x),$ and such that $K$ is positive semidefinite, meaning that

$$\sum_{i=1}^{n} \sum_{j=1}^{n} K(x_{i}, x_{j}) c_{i} c_{j} \geq 0$$

for all finite sets of points $x_{1}, \dots, x_{n} \in [a, b]$ and all real numbers $c_{1}, \dots, c_{n} .$ The function

$K (x, y) = \sum_{k = 1}^{m - 1} \frac{1}{k !} x^{k} y^{k} + \int_{0}^{x \land y} \frac{( x - u ) ^{m - 1} ( y - u ) ^{m - 1}}{( m - 1 ) ! ^{2}} d u$

is an example of a Mercer kernel. The most commonly used kernel is the Gaussian kernel

$K (x, y) = e^{- \frac{∥ x - y ∥ ^{2}}{σ ^{2}}}$

Definition.

Given a kernel $K$, let $K_{x}(\cdot)$ be the function obtained by fixing the first coordinate. That is, $K_{x}(y) = K(x, y)$. For the Gaussian kernel, $K_{x}$ is a Normal, centered at $x$. We can create functions by taking linear combinations of the kernel:

$$f(x) = \sum_{j=1}^{k} α_{j} K_{x_{j}}(x)$$

Let $H_{0}$ denote all such functions:

$$H_{0} = \left\{f : f(x) = \sum_{j=1}^{k} α_{j} K_{x_{j}}(x)\right\}$$

Given two such functions $f(x) = \sum_{j=1}^{k} α_{j} K_{x_{j}}(x)$ and $g(x) = \sum_{j=1}^{m} β_{j} K_{y_{j}}(x)$ we define an inner product

$$⟨f, g⟩ = ⟨f, g⟩_{K} = \sum_{i} \sum_{j} α_{i} β_{j} K(x_{i}, y_{j})$$

In general, $f$ (and $g$) might be representable in more than one way. You can check that $⟨f, g⟩_{K}$ is independent of how $f$ (or $g$) is represented. The inner product defines a norm:

$$∥f∥_{K} = \sqrt{⟨f, f⟩} = \sqrt{\sum_{j} \sum_{k} α_{j} α_{k} K(x_{j}, x_{k})} = \sqrt{α^{T} K α}$$

where $α = (α_{1}, \dots, α_{k})^{T}$ and $K$ is the $k \times k$ matrix with $K_{jk} = K(x_{j}, x_{k})$.

The Reproducing Property

Let $f(x) = \sum_{i} α_{i} K_{x_{i}}(x)$. Note the following crucial property:

$$⟨f, K_{x}⟩ = \sum_{i} α_{i} K(x_{i}, x) = f(x)$$

This follows from the definition of $⟨f, g⟩$ where we take $g = K_{x}$. This implies that

$$⟨K_{x}, K_{x}⟩ = K(x, x)$$

This is called the reproducing property. It also implies that $K_{x}$ is the representer of the evaluation functional.

The completion of $H_{0}$ with respect to $∥ \cdot ∥_{K}$ is denoted by $H_{K}$ and is called the RKHS generated by $K$ .
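The kernel inner product and the reproducing property can be made concrete with a small sketch. The centres `xs`, coefficients `alpha`, evaluation point `x0`, and helper names are hypothetical choices for illustration; the kernel is the Gaussian kernel with $σ = 1$ in one dimension.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-(x - y)^2 / sigma^2); broadcasts over arrays."""
    return np.exp(-((x - y) ** 2) / sigma**2)

def inner(alpha, xs, beta, ys):
    """<f, g>_K for f = sum_i alpha_i K_{xs_i} and g = sum_j beta_j K_{ys_j}."""
    return float(alpha @ gauss_kernel(xs[:, None], ys[None, :]) @ beta)

xs = np.array([-1.0, 0.0, 2.0])        # hypothetical centres x_j
alpha = np.array([0.5, -1.0, 2.0])     # hypothetical coefficients alpha_j

def f(x):
    """Evaluate f(x) = sum_j alpha_j K(x_j, x)."""
    return float(alpha @ gauss_kernel(xs, x))

x0 = 0.3
reproduced = inner(alpha, xs, np.array([1.0]), np.array([x0]))   # <f, K_{x0}>
norm_sq = inner(alpha, xs, alpha, xs)                            # alpha' K alpha

print(reproduced, f(x0), norm_sq)
```

Taking the inner product of $f$ with $K_{x_0}$ returns $f(x_0)$, and the squared norm equals the quadratic form $α^{T} K α$, which is non-negative because the kernel is positive semidefinite.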

Evaluation Functionals. A key property of RKHS’s is the behavior of the evaluation functional. The evaluation functional $δ_{x}$ assigns a real number to each function. It is defined by $δ_{x} f = f(x)$. In general, the evaluation functional is not continuous. This means we can have $f_{n} \to f$ but $δ_{x} f_{n}$ does not converge to $δ_{x} f$. For example, in $L_{2}(0, 1)$ let $f(x) = 0$ and $f_{n}(x) = \sqrt{n} I(x < 1/n^{2})$. Then $∥f_{n} - f∥ = 1/\sqrt{n} \to 0$. But $δ_{0} f_{n} = \sqrt{n}$, which does not converge to $δ_{0} f = 0$. Intuitively, this is because Hilbert spaces can contain very unsmooth functions.

But in an $RKHS$ , the evaluation functional is continuous. Intuitively, this means that the functions in the space are well-behaved. To see this, suppose that $f_{n} \to f .$ Then

$$δ_{x} f_{n} = ⟨f_{n}, K_{x}⟩ \to ⟨f, K_{x}⟩ = f(x) = δ_{x} f$$

so the evaluation functional is continuous.

A Hilbert space is a RKHS if and only if the evaluation functionals are continuous.

Theorem (Representer theorem). Let $ℓ$ be a loss function depending on $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ and on $f(X_{1}), \dots, f(X_{n})$. Let $f$ minimise $ℓ + g(∥f∥_{K}^{2})$ over $H_{K}$, where $g$ is any monotone increasing function. Then $f$ has the form $f(x) = \sum_{i=1}^{n} α_{i} K(X_{i}, x)$ for some $α_{1}, \dots, α_{n}$.
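As one concrete instance of the theorem, take squared-error loss and $g(t) = λt$, which gives kernel ridge regression. This choice of loss, the simulated data, and the penalty `lam` are assumptions made for the sketch, not something specified in the source. Under them the coefficients solve $(K + λI) α = y$, and the fitted function is a finite kernel expansion over the training points.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel evaluated on all pairs of the 1-d arrays a and b."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

rng = np.random.default_rng(5)
X_train = np.linspace(0.0, 3.0, 12)
y_train = np.sin(2 * X_train) + 0.1 * rng.normal(size=X_train.size)
lam = 0.1                                     # hypothetical penalty weight

K = gauss_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(X_train.size), y_train)

def f_hat(x_new):
    """Fitted function: sum_i alpha_i K(X_i, x)."""
    return gauss_kernel(np.atleast_1d(x_new), X_train) @ alpha

def objective(a):
    """Squared-error loss plus lam * ||f||_K^2 for coefficient vector a."""
    return float(np.sum((y_train - K @ a) ** 2) + lam * a @ K @ a)

print(f_hat(1.0), objective(alpha))
```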

Calculus and Optimisation

Calculus

Definition. The derivative of a function $f$ at point $x$, when defined, is the slope of the tangent to the function at $x$.

$\frac{\partial f}{\partial x} = lim_{h \to 0} \frac{f ( x + h ) - f ( x )}{h}$

Definition. For a function $f : R^{n} \to R$, we define the gradient

$$\nabla_{x} f = \begin{bmatrix} \frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \\ \vdots \\ \frac{\partial f}{\partial x_{n}} \end{bmatrix}$$

which collects the partial derivatives in a column vector.

The matrix of second-order partial derivatives of $f$ is called the Hessian, denoted by $\nabla^{2}(x)$

$$\nabla^{2}(x) = \begin{bmatrix} \frac{\partial}{\partial x_{1}} \frac{\partial f}{\partial x_{1}} & \dots & \frac{\partial}{\partial x_{n}} \frac{\partial f}{\partial x_{1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_{1}} \frac{\partial f}{\partial x_{n}} & \dots & \frac{\partial}{\partial x_{n}} \frac{\partial f}{\partial x_{n}} \end{bmatrix}$$

For a vector-valued function $f : R^{n} \to R^{m}$, we can construct the Jacobian, which collects all $m \times n$ partial derivatives.

$$\frac{\partial f}{\partial x} = \begin{bmatrix} \nabla f_{1}(x)^{'} \\ \nabla f_{2}(x)^{'} \\ \vdots \\ \nabla f_{m}(x)^{'} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_{1}(x)}{\partial x_{1}} & \frac{\partial f_{1}(x)}{\partial x_{2}} & \dots & \frac{\partial f_{1}(x)}{\partial x_{n}} \\ \frac{\partial f_{2}(x)}{\partial x_{1}} & \frac{\partial f_{2}(x)}{\partial x_{2}} & \dots & \frac{\partial f_{2}(x)}{\partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_{m}(x)}{\partial x_{1}} & \frac{\partial f_{m}(x)}{\partial x_{2}} & \dots & \frac{\partial f_{m}(x)}{\partial x_{n}} \end{bmatrix}$$

Theorem. An analytic function $f : R \to R$ admits a Taylor expansion around $a$ (valid within its radius of convergence) such that

$$f(x) = f(a) + \frac{f^{'}(a)}{1!}(x - a) + \frac{f^{''}(a)}{2!}(x - a)^{2} + \dots = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^{n}$$

For a function with multiple arguments $f : R^{k} \to R$, the second-order Taylor expansion around the point $x_{0}$ is

$$f(x) \approx f(x_{0}) + (x - x_{0})^{'} \nabla f(x_{0}) + \frac{1}{2}(x - x_{0})^{'} \nabla^{2}(x_{0}) (x - x_{0})$$

Theorem. Let $f (x)$ have continuous first and second order partial derivatives in the $ε -$ neighbourhood of the optimum $x_{0}$ .

If $\nabla f (x_{0}) = 0$ and $\nabla^{2} (x_{0})$ is positive definite, then $x_{0}$ is a local minimum. If $\nabla f (x_{0}) = 0$ and $\nabla^{2} (x_{0})$ is negative definite, then $x_{0}$ is a local maximum.
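In practice, definiteness of the Hessian at a stationary point is usually checked through its eigenvalues. The sketch below is a minimal illustration; the function name `classify_stationary_point`, the tolerance, and the example $f(x, y) = x^{2} + xy + y^{2}$ (stationary at the origin, Hessian $[[2, 1], [1, 2]]$) are assumptions for the example.

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-10):
    """Classify a stationary point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.all(eig > tol):
        return "local minimum"       # positive definite
    if np.all(eig < -tol):
        return "local maximum"       # negative definite
    return "inconclusive or saddle"  # indefinite or semi-definite

H = np.array([[2.0, 1.0], [1.0, 2.0]])   # Hessian of x^2 + x*y + y^2
print(classify_stationary_point(H))
print(classify_stationary_point(-H))
print(classify_stationary_point([[1.0, 0.0], [0.0, -1.0]]))
```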

Theorem. Let $λ_{1}, \dots, λ_{m}$ be nonnegative real numbers, and suppose $x_{0}$ maximises the Lagrangian $M (x, λ)$

$M (x, λ) = f (x) - \sum_{j = 1}^{m} λ_{j} g_{j} (x)$

Then, $x_{0}$ maximises $f (x)$ subject to constraints ( $x \in S$ )

$g_{j} (x) \leq g_{j} (x_{0}), j = 1, \dots m$

Theorem (Inverse function theorem). If $U \subseteq R^{d}$ is open, $φ : U \to R^{d}$ is continuously differentiable, and the derivative $D_{φ_{a}}$ at $a \in U$ is invertible, then there exist open sets $U^{'}, V^{'}$ such that $a \in U^{'} \subseteq U$, $φ(a) \in V^{'}$ and $φ : U^{'} \to V^{'}$ is bijective. Further, the inverse function $ψ : V^{'} \to U^{'}$ is differentiable.

Theorem (Implicit function theorem). Let $U \subseteq R^{d+1}$ be a domain and $f : U \to R$ be a continuously differentiable function. If $x \in R^{d}$ and $y \in R$, we’ll concatenate the two vectors and write $(x, y) \in R^{d+1}$.

Suppose $c = f(a, b)$ and $\partial_{y} f(a, b) \neq 0$. Then there exist a neighbourhood $U^{'} ∋ a$ and a differentiable function $g : U^{'} \to R$ such that $g(a) = b$ and $f(x, g(x)) = c \ \forall x \in U^{'}$.

Further, there exists a neighbourhood $V^{'} ∋ b$ such that $\{(x, y) ∣ x \in U^{'}, y \in V^{'}, f(x, y) = c\} = \{(x, g(x)) ∣ x \in U^{'}\}$. In other words, for every $x \in U^{'}$, $f(x, y) = c$ has a unique solution $y = g(x) \in V^{'}$.

Theorem. Let $f : R^{2} \to R$ be differentiable and consider the implicitly defined curve

$$Γ := \{(x, y) \in R^{2} ∣ f(x, y) = c\}$$

(i.e. a level set of $f$). Pick $(a, b) \in Γ$ and suppose $\partial_{y} f(a, b) \neq 0$. By the implicit function theorem (IFT), we know the $y$-coordinate of this curve can locally be expressed as a differentiable function of $x$. Directly differentiating $f(x, y) = c$ w.r.t. $x$ gives

$$\partial_{x} f + \partial_{y} f \frac{dy}{dx} = 0 ⟺ \frac{dy}{dx} = -\frac{\partial_{x} f(a, b)}{\partial_{y} f(a, b)}$$
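As a worked illustration (the function and point are chosen here for the example and are not from the source), take the circle $f(x, y) = x^{2} + y^{2}$ with $c = 25$ and the point $(a, b) = (3, 4)$. Then $\partial_{x} f = 2x$ and $\partial_{y} f = 2y$, with $\partial_{y} f(3, 4) = 8 \neq 0$, so

$$\frac{dy}{dx}\bigg|_{(3, 4)} = -\frac{2 \cdot 3}{2 \cdot 4} = -\frac{3}{4}$$

This agrees with differentiating the explicit local solution $y = g(x) = \sqrt{25 - x^{2}}$ directly:

$$g^{'}(x) = -\frac{x}{\sqrt{25 - x^{2}}}, \quad g^{'}(3) = -\frac{3}{4}$$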

Differentiation Rules

| Rule | $f(x)$ | $f^{'}(x)$ |
| --- | --- | --- |
| Power rule | $x^{a}$ | $a x^{a-1}$ |
| Exponential rule | $e^{x}$ | $e^{x}$ |
| Log rule | $\log x$ | $\frac{1}{x}$ |
| Linear rule | $(af + bg)$ | $a \frac{\partial f}{\partial x} + b \frac{\partial g}{\partial x}$ |
| Product rule | $(f \cdot g)$ | $f^{'}(x) g(x) + f(x) g^{'}(x)$ |
| Quotient rule | $\frac{f}{g}$ | $\frac{f^{'}(x) g(x) - f(x) g^{'}(x)}{(g(x))^{2}}$ |
| Chain rule | $f(g(x))$ | $\frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$ |

Matrix Derivatives

Let $a, x \in R^{n}$ , and $A$ be a conformable matrix

- $\frac{\partial a^{'} x}{\partial x} = a$
- $\frac{\partial a^{'} x}{\partial x^{'}} = a^{'}$

- $\frac{\partial}{\partial x^{'}} A x = A$
- $\frac{\partial}{\partial x} A x = A^{'}$
- $\frac{\partial}{\partial x} x^{'} A x = (A + A^{'}) x$
- $\frac{\partial}{\partial A} x^{'} A x = x x^{'}$
- $\frac{\partial}{\partial A} \log ∣A∣ = (A^{'})^{-1}$
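Matrix-derivative identities are easy to sanity-check with finite differences. The sketch below tests $\frac{\partial}{\partial x} x^{'} A x = (A + A^{'}) x$ on a randomly generated, non-symmetric $A$; the data, seed, and step size `h` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))       # deliberately not symmetric
x = rng.normal(size=3)

def quad(v):
    """Quadratic form v'Av."""
    return v @ A @ v

analytic = (A + A.T) @ x

# Central finite differences along each coordinate direction
h = 1e-6
numeric = np.array([(quad(x + h * e) - quad(x - h * e)) / (2 * h) for e in np.eye(3)])

print(np.allclose(numeric, analytic, atol=1e-6))
```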

General Results from Optimisation Theory

References: Luenberger (1997), Rustagi (2014).

Projection Theorem - In $R^{k}$ , the shortest line from a point to the plane is furnished by the perpendicular from the point to the plane. Core idea carries through to higher dimensions and infinite-dimensional Hilbert Space
Hahn-Banach Theorem: given a sphere and a point not in the sphere, there exists a hyperplane separating the point and the sphere.
Duality: The shortest distance from a point to a convex set is equal to the maximum of the distances from the point to a hyperplane separating the point from the convex set.
Differentials: Set derivative of the objective function to zero.

Linear Programming

Maximise

$$\max_{x} Z = c^{⊤} x$$

subject to

$$Ax \leq b, \quad x \geq 0$$

where $x \in R^{n}$ is the choice vector, $c \in R^{n}$ is a given vector, $A \in R^{m \times n}$ is a known matrix of constants, and $b \in R^{m}$ is a vector of constants.

Definition. By introducing $m$ slack variables $y = (y_{1}, \dots, y_{m})^{⊤}$ with $y \geq 0$, one for every inequality, we can convert every linear programme into its standard form (with $c$ replaced by $-c$ when the original problem is a maximisation)

$$\min_{x} Z = c^{⊤} x + 0^{⊤} y \quad \text{subject to} \quad Ax + Iy = b, \quad x \geq 0, \ y \geq 0$$

Definition. Primal

$$\min_{x} c^{⊤} x \quad \text{s.t.} \quad Ax \geq b, \quad x \geq 0$$

Dual

$$\max_{y} b^{⊤} y \quad \text{s.t.} \quad A^{⊤} y \leq c, \quad y \geq 0, \quad y \in R^{m}$$

Theorem. A feasible solution $x_{0}$ to the primal is optimal if and only if there exists a feasible solution $y_{0}$ to the dual problem such that

$c^{⊤} x_{0} = b^{⊤} y_{0}$

Standard solution algorithms include Dantzig’s simplex method and Karmarkar’s algorithm.
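The duality theorem can be illustrated on a tiny problem. The sketch below is not the simplex method or Karmarkar's algorithm: it is a brute-force vertex enumeration that only works in two dimensions, and the helper `solve_lp_2d` and the numbers in `A`, `b`, `c` are hypothetical. It solves a primal and its dual and shows that the optimal values coincide.

```python
import numpy as np
from itertools import combinations

def solve_lp_2d(cost, G, h, tol=1e-9):
    """Minimise cost @ z over {z in R^2 : G z <= h} by checking every vertex.

    Assumes the minimum is attained at a vertex, which holds for the toy data below.
    """
    best_val, best_z = np.inf, None
    for i, j in combinations(range(len(h)), 2):
        rows = [i, j]
        if abs(np.linalg.det(G[rows])) < tol:      # parallel constraints: no vertex
            continue
        z = np.linalg.solve(G[rows], h[rows])
        if np.all(G @ z <= h + tol) and cost @ z < best_val:
            best_val, best_z = float(cost @ z), z
    return best_val, best_z

# Hypothetical toy data
A = np.array([[1.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
c = np.array([2.0, 3.0])
I2 = np.eye(2)

# Primal: min c'x s.t. Ax >= b, x >= 0   (rewritten as -Ax <= -b, -x <= 0)
primal_val, x_opt = solve_lp_2d(c, np.vstack([-A, -I2]), np.concatenate([-b, np.zeros(2)]))

# Dual: max b'y s.t. A'y <= c, y >= 0    (solved as min -b'y)
neg_dual_val, y_opt = solve_lp_2d(-b, np.vstack([A.T, -I2]), np.concatenate([c, np.zeros(2)]))
dual_val = -neg_dual_val

print(primal_val, dual_val, x_opt, y_opt)
```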

Nonlinear Optimisation

$$\text{minimise [maximise]}_{x} \ f(x) \quad \text{s.t.} \quad g_{i}(x) \leq a_{i}, \ i = 1, \dots, k, \quad x \geq 0$$

Saddle Point

Suppose we have $x, y \in R^{n}$ and $ϕ(\cdot, \cdot)$ is a real-valued function. Then $(x_{0}, y_{0})$, with $x_{0} \geq 0, y_{0} \geq 0$, is a saddle point of $ϕ(x, y)$ if

$ϕ(x_{0}, y) \leq ϕ(x_{0}, y_{0}) \leq ϕ(x, y_{0})$

$\forall x, y \geq 0$. At a saddle point,

$$ϕ(x_{0}, y_{0}) = \min_{x} \max_{y} ϕ(x, y) = \max_{y} \min_{x} ϕ(x, y)$$

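Definition. A quadratic programme optimises a quadratic objective $Q$ subject to linear constraints:

$$Q = a^{⊤} x - \frac{1}{2} x^{⊤} B x \quad \text{s.t.} \quad C^{⊤} x \leq d, \quad x \geq 0$$

where $a \in R^{n}$, $B \in R^{n \times n}$ is symmetric, positive definite, $C \in R^{n \times k}$ is a matrix of constraint coefficients, and $d \in R^{k}$.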

Constrained Maximisation

Theorem (Lagrange multipliers). We want to maximise $f(x)$ s.t. $g(x) = c$ (the constraint set is implicitly defined by $S := \{g = c\}$). Suppose $\nabla g(x) \neq 0 \ \forall x \in S$. If $f$ attains a constrained local maximum (or minimum) at $a$ on the surface $S$, then $\exists λ \in R$ s.t. $\nabla f(a) = λ \nabla g(a)$.

Consider a generic problem of the form

Definition. $\max_{x_{1}, x_{2} \in R} f(x_{1}, x_{2})$ s.t. $g(x_{1}, x_{2}) = b$

First, write the Lagrangian $L(x_{1}, x_{2}, λ) = f(x_{1}, x_{2}) + λ [g(x_{1}, x_{2}) - b]$.

Differentiating w.r.t. $x_{1}, x_{2}, λ$ yields the FOCs

$$\begin{aligned} [x_{1}] &: \frac{\partial L}{\partial x_{1}} = f_{1}(x_{1}, x_{2}) + λ g_{1}(x_{1}, x_{2}) = 0 \\ [x_{2}] &: \frac{\partial L}{\partial x_{2}} = f_{2}(x_{1}, x_{2}) + λ g_{2}(x_{1}, x_{2}) = 0 \\ [λ] &: \frac{\partial L}{\partial λ} = g(x_{1}, x_{2}) - b = 0 \end{aligned}$$

which gives us three (potentially nonlinear) equations with three unknowns $(x_{1}, x_{2}, λ)$ , that can be solved simultaneously.

To check sufficiency, the second-order condition analogue uses the determinant of the bordered Hessian matrix

$$BH(x_{1}, x_{2}, λ) = \begin{bmatrix} 0 & -g_{1} & -g_{2} \\ -g_{1} & f_{11} & f_{12} \\ -g_{2} & f_{21} & f_{22} \end{bmatrix}$$

with all derivatives evaluated at $(x_{1}, x_{2})$. If $\det BH > 0$, then the Hessian is negative definite along the directions allowed by the constraint, which implies that the $(x_{1}^{*}, x_{2}^{*})$ that solves the system is indeed a local maximum.

Definition. The Hessian of a $C^{2}$ [twice continuously differentiable] function $f : R^{d} \to R$ is defined by the matrix

$$Hf = \begin{bmatrix} \partial_{1} \partial_{1} f & \partial_{2} \partial_{1} f & \dots & \partial_{d} \partial_{1} f \\ \partial_{1} \partial_{2} f & \partial_{2} \partial_{2} f & \dots & \partial_{d} \partial_{2} f \\ \vdots & \vdots & & \vdots \\ \partial_{1} \partial_{d} f & \partial_{2} \partial_{d} f & \dots & \partial_{d} \partial_{d} f \end{bmatrix}$$

For a symmetric $d \times d$ matrix $A$ (such as $Hf$):

- If $(Av) \cdot v \leq 0 \ \forall v \in R^{d}$, $A$ is said to be negative semi-definite.
- If $(Av) \cdot v < 0 \ \forall v \in R^{d}, v \neq 0$, $A$ is said to be negative definite.
- If $(Av) \cdot v \geq 0 \ \forall v \in R^{d}$, $A$ is said to be positive semi-definite.
- If $(Av) \cdot v > 0 \ \forall v \in R^{d}, v \neq 0$, $A$ is said to be positive definite.

Numerical Optimisation

Root-finding

We want to evaluate the roots of the equation

$y = f (x) = 0, x \in R$

Assume the inverse of $f$ , denoted $f^{- 1}$ exists.

$x = f^{- 1} (y) = g (y)$

Finding the root of $f (x) = 0$ is equivalent to evaluating $g (0) = x$ .

The canonical Newton-Raphson iteration is

$x_{i + 1} = x_{i} - \frac{f ( x _{i} )}{f ^{'} ( x _{i} )}$
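The iteration can be sketched in plain Python. The function name `newton_raphson`, the stopping rule, and the example $f(x) = x^{2} - 2$ (whose positive root is $\sqrt{2}$) are assumptions for illustration; a robust implementation would also guard against $f^{'}(x_{i}) = 0$.

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find a root of f by iterating x <- x - f(x) / f'(x) from the start value x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:       # stop once the update is negligible
            return x
    raise RuntimeError("Newton-Raphson did not converge")

root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```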

Quasi-Newton

General version of update rule:

$$θ_{k+1} = θ_{k} - λ_{k} \cdot A_{k} \frac{\partial ℓ}{\partial θ}(θ_{k})$$

Step length $λ = 1$ for both N-R and BHHH.

Definition (Newton-Raphson). Set $A_{k} = (\nabla^{2}(θ_{k}))^{-1}$, the inverse Hessian.

Update rule (scalar case, applied to the first-order condition $f^{'}(x) = 0$):

$$x_{i+1} = x_{i} - \frac{f^{'}(x_{i})}{f^{''}(x_{i})}$$

For a log-likelihood,

$$θ_{k+1} = θ_{k} - \left(\frac{\partial^{2} ℓ}{\partial θ \partial θ^{'}}(θ_{k})\right)^{-1} \frac{\partial ℓ}{\partial θ}(θ_{k}) \equiv θ_{k} - (H(θ_{k}))^{-1} s(θ_{k})$$
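A minimal sketch of the Newton-Raphson update for a log-likelihood, using an exponential model with rate $θ$ as an assumed example (the model, data, and starting value are not from the source). Here $ℓ(θ) = n \log θ - θ \sum_{i} x_{i}$, so the score is $n/θ - \sum_{i} x_{i}$, the Hessian is $-n/θ^{2}$, and the MLE has the closed form $1/\bar{x}$, which makes the result easy to verify.

```python
def newton_mle_exponential(x, theta0, tol=1e-12, max_iter=100):
    """Newton-Raphson for the rate of an exponential model: theta <- theta - H^{-1} s."""
    n, total = len(x), sum(x)
    theta = theta0
    for _ in range(max_iter):
        score = n / theta - total        # s(theta) = dl/dtheta
        hessian = -n / theta**2          # H(theta) = d2l/dtheta2
        step = score / hessian
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")

data = [0.5, 1.0, 1.5, 2.0]              # hypothetical sample with mean 1.25
theta_hat = newton_mle_exponential(data, theta0=0.5)
print(theta_hat)                         # closed-form MLE is 1 / 1.25
```

With step length 1 the iteration converges only from starting values close enough to the optimum (here, below $2/\bar{x}$), which is one reason practical implementations add a line search.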

Definition (BHHH). Uses the information-matrix equality. Set $A_{k}$ to be the inverse of the average outer product of the scores, which approximates the inverse of the negative Hessian:

$$A_{k} := \left(\frac{1}{N} \sum_{i=1}^{N} \frac{\partial ℓ_{i}}{\partial θ}(θ_{k}) \frac{\partial ℓ_{i}}{\partial θ^{'}}(θ_{k})\right)^{-1}$$
