This note is a Markdown migration of the policy-gradient RL notes from the Quarto source.


Reinforcement learning (RL) algorithms are often grouped into two broad families:

  • Value-based methods learn a value function (e.g., $Q^\pi(s, a)$ or $V^\pi(s)$) and define a policy implicitly by acting (approximately) greedily with respect to it (e.g., Q-learning, DQN).

  • Policy-gradient methods directly parameterize a stochastic policy and optimize its value by gradient ascent (e.g., REINFORCE, actor-critic, PPO/TRPO).

This note focuses on policy-gradient methods because (i) they have a clean causal-inference interpretation (policies as stochastic interventions / propensity models and value functions as outcome models), (ii) they handle continuous actions naturally, and (iii) modern deep RL practice is dominated by stable actor-critic and trust-region variants.

These notes are primarily for the author’s own reference but may serve as a useful bridge for others who approach RL from a causal-inference direction and frequently scratch their heads at the startling similarities between concepts and the distressing differences in notation and priorities. They rely heavily on chapters 14–15 of Stefan Wager’s excellent causal inference textbook, Murphy 2025, some hesitant references to Sutton and Barto, and zero-shot elaborations from a council of frontier LLMs.

1. Problem Setup

Markov Decision Process (MDP).
An MDP is the tuple

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),$$

where

  • $\mathcal{S}$ is a (finite or continuous) state space,
  • $\mathcal{A}$ an action space,
  • $P(s' \mid s, a)$ the transition kernel,
  • $r(s, a)$ the (possibly stochastic) immediate reward,
  • $\gamma \in [0, 1)$ a discount factor.

A policy $\pi_\theta(a \mid s)$ is a conditional distribution over actions given states, parameterized by $\theta$. We will treat the policy as the treatment assignment rule in a causal-inference sense.

Causal-inference mapping.

  • State $s$ ↔ covariates $X$.
  • Action $a$ ↔ treatment $A$.
  • Reward $r$ ↔ outcome $Y$.
  • Policy $\pi_\theta(a \mid s)$ ↔ propensity score $e(x) = P(A = 1 \mid X = x)$.
  • Value function $Q^\pi(s, a)$ ↔ outcome model $\mu(x, a) = \mathbb{E}[Y \mid X = x, A = a]$.

Policy-based vs value-based (quick CI-friendly orientation).

  • Value-based RL (Q-learning, DQN): learn an optimal outcome model $Q^\star(s, a)$ via Bellman optimality and pick actions via $a = \arg\max_{a'} Q^\star(s, a')$. In causal terms, this is “estimate counterfactual mean outcomes, then choose the best treatment” (a plug-in optimal regime learner), with the complication that $Q^\star$ is defined by a fixed point and involves future decisions.
  • Policy gradients (REINFORCE, PPO): directly update a parametric propensity model $\pi_\theta(a \mid s)$ to increase $J(\theta)$. In causal terms, this is “move the assignment mechanism along a smooth path that increases expected welfare,” and the gradient weights changes in propensities by long-run causal contrasts (advantages).
  • Actor-critic: uses a learned value function mainly as a regression adjustment / control variate for the policy update rather than as the sole object to maximize.

Why policy gradients are popular (first order): both families struggle when there is little overlap (rare actions in relevant states) because learning requires extrapolation. Value-based methods amplify this via repeated bootstrapping and $\max$ operators (the “deadly triad”: function approximation + bootstrapping + off-policy data), while policy-gradient methods can be stabilized by staying close to the current/behavior policy (trust regions, incremental updates, entropy regularization) and avoid explicit $\arg\max_a$ over continuous actions. That said, value-based methods remain extremely strong in small, discrete action spaces and can be more sample-efficient because they reuse off-policy data.

Trajectory. A trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ follows the dynamics induced by $\pi_\theta$ and $P$. The probability density of a trajectory under policy $\pi_\theta$ is

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),$$

where $\rho_0$ is the initial state distribution.

Return.
The discounted return from time $t$ is

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

We seek the policy objective

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[G_0],$$

where

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$$

is the state-value function and

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

is the action-value function, so that $J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}[V^{\pi_\theta}(s_0)]$.
In causal language, $J(\theta)$ is the average counterfactual reward if we set the “treatment rule” to $\pi_\theta$.

It is often convenient to define the discounted state-occupancy measure

$$d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi),$$

or its normalized version $\bar d^\pi(s) = (1 - \gamma)\, d^\pi(s)$ (a proper distribution). With $r(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$,

$$J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a).$$

1.1 A Causal Model for Sequential Decisions (and Where It Breaks)

Write the history (information set) at time $t$ as

$$H_t = (s_0, a_0, r_0, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t),$$

and view a policy $\pi = (\pi_0, \pi_1, \dots)$ with decision rules $\pi_t(a_t \mid H_t)$ as a dynamic treatment regime (for Markov policies, this reduces to $\pi(a_t \mid s_t)$).

To interpret policies causally, it is helpful to introduce policy potential outcomes: $\tau(\pi)$ denotes the counterfactual trajectory generated if actions were assigned according to $\pi$. Then the causal estimand is

$$J(\pi) = \mathbb{E}\big[G_0(\pi)\big],$$

the mean counterfactual return under regime $\pi$.
Identification from logged data hinges on sequential versions of the usual causal assumptions:

  • Consistency. If the realized action rule matches $\pi$, then realized states/rewards equal the corresponding potential outcomes.
  • Positivity / overlap. For any history that occurs under the target policy $\pi$, the behavior policy $\pi_b$ assigns positive probability to any action that $\pi$ might choose.
  • Sequential ignorability (no unobserved confounding). Roughly, $a_t \perp\!\!\!\perp \big(s_{t+1}(\bar a), r_t(\bar a), \dots\big) \mid H_t$ for all action histories $\bar a$, meaning that given the recorded history, the action is “as-if randomized” with respect to future potential outcomes.

Under these assumptions, the sequential g-formula identifies the counterfactual trajectory distribution under $\pi$ as exactly the interventional rollout distribution we wrote earlier,

$$p_\pi(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_t(a_t \mid H_t)\, P(s_{t+1} \mid s_t, a_t),$$

and therefore

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi}[G_0].$$
This is the core bridge: RL “policy evaluation” is g-computation for a dynamic treatment regime, and off-policy methods are sequential IPW/DR for that same estimand.

Endogeneity in RL = missing state / hidden confounding.
If there is an unobserved variable $u_t$ that affects both action choice and rewards/transitions, then conditioning on $s_t$ (or even $H_t$) may not block backdoor paths. In econometric language, actions become endogenous and $\pi_b(a_t \mid s_t)$ is not a valid propensity score. This is mainly a problem for offline (observational) evaluation and learning: importance weighting and doubly-robust estimators are generally biased without sequential ignorability given the logged state/history. (By contrast, if you can actively randomize actions online, policy-gradient learning remains viable even with partial observability—what breaks is the ability to make causal off-policy statements from observational logs without extra assumptions.)

Where explicit randomization enters.

  • In online RL / experimentation, the learner intervenes on $a_t$ by sampling from a known (often stochastic) policy. This built-in randomization is the analogue of running a sequential randomized experiment and is what makes likelihood-ratio gradients and on-policy evaluation straightforward.

  • In offline RL, you inherit the (possibly endogenous) behavior policy that generated the data. Valid OPE/off-policy learning requires either (i) a credible ignorability story given the logged history/state and overlap, or (ii) additional structure (proxies, instruments/encouragement, partial observability models) beyond standard MDP assumptions.

In both cases, you must know or estimate action probabilities (logged propensities) to use IPW/DR style estimators; if the data do not record (or cannot credibly reconstruct) $\pi_b(a_t \mid H_t)$, off-policy causal claims become much harder.

Analogue of IV/encouragement designs.
The closest RL analogue to IV is to use exogenous variation that shifts actions but does not directly affect rewards except through actions—e.g., randomized “nudges”/recommendations, known exploration noise, or randomized constraints on feasible actions—together with assumptions that let you separate causal effects from confounding (often formalized via confounded MDPs or POMDPs with proxy variables).

1.2 Contextual Bandits as the “Static” Special Case

A contextual bandit is the one-step (or no-transition) special case of an MDP: observe a context/state $s$, choose an action $a$, and observe an outcome/reward $r$. There is no state evolution driven by actions (or you can think of $\gamma = 0$).

This is exactly the standard causal setup with covariates $X = s$, treatment $A = a$, and outcome $Y = r$, except that in bandits the data may be collected adaptively (the assignment rule can depend on past data). Two key links:

  • Policy gradient collapses to familiar score-function / AIPW structure.
    With $T = 1$, the REINFORCE estimator is just $\nabla_\theta \log \pi_\theta(a \mid s)\,\big(r - b(s)\big)$, and off-policy evaluation uses the single-step weight $\pi_\theta(a \mid s)/\pi_b(a \mid s)$.
  • Exploration = explicit randomization.
    Bandit algorithms (e.g., $\epsilon$-greedy, Thompson sampling) deliberately randomize actions to learn; this is the experimental-design piece that provides overlap/positivity. The analogue in multi-step RL is stochastic policies (or injected noise) during learning.

Bandit/RL logs are often adaptively collected (the assignment rule changes over time as data arrive). Identification via IPW/DR still goes through when propensities are known and ignorability holds given the logged context/history, but inference (standard errors) typically needs martingale/online-learning arguments rather than i.i.d. sampling.
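
To make the single-step link concrete, here is a minimal simulation sketch — the uniform context, the behavior propensity of 0.3, the threshold target policy, and the outcome model are all invented for illustration — checking that one-step IPW recovers the g-computation value of a target regime:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome model E[r | s, a]: action 1 pays s, action 0 pays 0.5.
def mu(s, a):
    return np.where(a == 1, s, 0.5)

n = 200_000
s = rng.uniform(0, 1, n)

# Behavior policy: constant randomization gives overlap (positivity).
pb = np.full(n, 0.3)               # P(a = 1 | s) in the logged data
a = rng.binomial(1, pb)
r = mu(s, a) + rng.normal(0, 0.1, n)

# Target policy: treat exactly when s > 0.5 (a deterministic regime).
pi = (s > 0.5).astype(float)       # pi(a = 1 | s)

# Single-step IPW: weight each logged reward by pi(a|s) / pb(a|s).
w = np.where(a == 1, pi / pb, (1 - pi) / (1 - pb))
J_ipw = np.mean(w * r)

# Ground truth by g-computation: E[pi * mu(s,1) + (1 - pi) * mu(s,0)] = 0.625.
J_true = np.mean(pi * mu(s, 1) + (1 - pi) * mu(s, 0))
```

The same weights with $\pi$ replaced by a stochastic policy give the off-policy evaluation formula quoted in the bullet above.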


2. Policy-Gradient Theorem

The policy-gradient theorem (Sutton et al., 2000) gives a closed-form expression for the gradient of $J(\theta)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \tag{PGT}$$

Equivalently, using the discounted occupancy measure $d^{\pi_\theta}$,

$$\nabla_\theta J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a).$$

2.1 Contextual Bandit Special Case (Why the Gradient Is “Treatment Effect × Policy Sensitivity”)

If you set the horizon to one step ($T = 1$), RL becomes a contextual bandit: observe $s$, pick $a \sim \pi_\theta(\cdot \mid s)$, observe reward $r$, and stop. Let the conditional mean outcome be

$$\mu(s, a) = \mathbb{E}[r \mid s, a].$$

Then the value of a stochastic policy is just the mixture of counterfactual means:

$$J(\theta) = \mathbb{E}_s\left[\sum_a \pi_\theta(a \mid s)\, \mu(s, a)\right].$$

Differentiating w.r.t. the policy parameters (not w.r.t. the discrete action) gives

$$\nabla_\theta J(\theta) = \mathbb{E}_s\left[\sum_a \nabla_\theta \pi_\theta(a \mid s)\, \mu(s, a)\right] = \mathbb{E}_{s,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \mu(s, a)\big],$$

which is exactly the one-step (bandit) form of the policy-gradient theorem. Since $\mu$ is unknown, on-policy sampling replaces it with the realized reward $r$, yielding the unbiased score-function estimator $\nabla_\theta \log \pi_\theta(a \mid s)\, r$ (REINFORCE for $T = 1$).

For a binary action $a \in \{0, 1\}$ this becomes especially transparent:

$$J(\theta) = \mathbb{E}_s\big[\pi_\theta(1 \mid s)\, \mu(s, 1) + \big(1 - \pi_\theta(1 \mid s)\big)\, \mu(s, 0)\big],$$

so

$$\nabla_\theta J(\theta) = \mathbb{E}_s\big[\nabla_\theta \pi_\theta(1 \mid s)\, \big(\mu(s, 1) - \mu(s, 0)\big)\big].$$

The term $\mu(s, 1) - \mu(s, 0)$ is exactly the conditional average treatment effect (CATE) at state $s$, and $\nabla_\theta \pi_\theta(1 \mid s)$ is how much your parametric assignment rule can change the propensity at $s$.

If you parameterize $\pi_\theta(1 \mid s) = \sigma(\theta^\top \phi(s))$ (logistic propensity),

$$\nabla_\theta \pi_\theta(1 \mid s) = \sigma\big(\theta^\top \phi(s)\big)\Big(1 - \sigma\big(\theta^\top \phi(s)\big)\Big)\, \phi(s),$$

so gradient ascent increases the log-odds of action $1$ in states where action $1$ has higher counterfactual value than action $0$, and decreases it where it has lower value. In multi-step RL, the CATE is replaced by the long-run contrast $Q^\pi(s, 1) - Q^\pi(s, 0)$ (or an advantage).
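
The binary-action identity can be checked numerically. In this sketch the outcome model is invented ($\mu(s,1) = s$, $\mu(s,0) = 0.5$, so the CATE is $s - 0.5$) and the policy is a one-parameter logistic propensity; the closed-form gradient $\mathbb{E}[\nabla_\theta \pi_\theta(1 \mid s)\,\mathrm{CATE}(s)]$ agrees with the score-function Monte-Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical one-step bandit: context s ~ U(0,1), mean outcomes mu(s, a).
def mu(s, a):
    return np.where(a == 1, s, 0.5)     # CATE(s) = s - 0.5

theta = 0.7
n = 500_000
s = rng.uniform(0, 1, n)
p1 = sigmoid(theta * s)                 # logistic propensity pi_theta(1 | s)

# Closed form: grad J = E[dpi/dtheta * (mu(s,1) - mu(s,0))],
# with dpi/dtheta = p1 * (1 - p1) * s for this parameterization.
grad_closed = np.mean(p1 * (1 - p1) * s * (s - 0.5))

# Score-function (REINFORCE) Monte Carlo: sample a ~ pi_theta, use realized r.
a = rng.binomial(1, p1)
r = mu(s, a) + rng.normal(0, 0.1, n)
score = np.where(a == 1, 1 - p1, -p1) * s   # d log pi(a|s) / d theta
grad_mc = np.mean(score * r)
```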

2.2 Incremental Propensity-Score Interventions (Odds Tilts) as Policy Paths

Econometrics/causal inference often targets incremental propensity-score interventions (Kennedy and coauthors): instead of setting “everyone treated” vs “everyone untreated”, define a stochastic intervention that nudges treatment odds by a fixed factor $\delta$ and study how welfare changes as you turn the knob. In RL terms, this is a one-dimensional path of policies through the behavior policy, and differentiating along that path is a policy gradient in that specific direction.

For binary action with behavior propensity $\pi_b(s) = P(a = 1 \mid s)$, an odds-tilt by $\delta > 0$ defines the counterfactual policy

$$\pi_\delta(1 \mid s) = \frac{\delta\, \pi_b(s)}{\delta\, \pi_b(s) + 1 - \pi_b(s)}.$$

Let $J(\delta) = \mathbb{E}_s\big[\pi_\delta(1 \mid s)\, \mu(s, 1) + \big(1 - \pi_\delta(1 \mid s)\big)\, \mu(s, 0)\big]$ denote the value of the tilted policy. In the one-step bandit case, differentiating gives

$$\frac{dJ(\delta)}{d\delta} = \mathbb{E}_s\left[\frac{\partial \pi_\delta(1 \mid s)}{\partial \delta}\, \big(\mu(s, 1) - \mu(s, 0)\big)\right], \qquad \frac{\partial \pi_\delta(1 \mid s)}{\partial \delta} = \frac{\pi_b(s)\big(1 - \pi_b(s)\big)}{\big(\delta\, \pi_b(s) + 1 - \pi_b(s)\big)^2}.$$

This is exactly “CATE sensitivity of the incremental intervention.” Equivalently, in score form,

$$\frac{dJ(\delta)}{d\delta} = \mathbb{E}_{s,\, a \sim \pi_\delta}\left[\frac{\partial \log \pi_\delta(a \mid s)}{\partial \delta}\, \mu(s, a)\right].$$
Two practical takeaways that mirror offline RL practice:

  • Incremental interventions stay close to $\pi_b$, improving overlap and stabilizing importance weights (contrast with large policy shifts).
  • This is conceptually similar to trust-region policy updates (e.g., PPO/TRPO): control policy divergence, then optimize the resulting welfare curve (or take a gradient step at $\delta = 1$).

In multi-step RL, the same idea can be applied per time step by tilting each logged decision rule; then $dJ/d\delta$ involves discounted sums of score terms times long-run causal contrasts (advantages), making the connection to policy gradients direct.
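
A short numerical sketch of the odds-tilt path — the behavior propensity and CATE below are chosen arbitrarily — confirming that the closed-form derivative $dJ/d\delta$ matches a finite-difference check on the tilted value curve:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical behavior propensity and CATE over contexts.
s = rng.uniform(0, 1, 100_000)
pb = 0.2 + 0.6 * s                     # behavior propensity pi_b(s)
cate = s - 0.5                         # mu(s,1) - mu(s,0)

def tilt(pb, delta):
    """Odds-tilted propensity: multiply the treatment odds by delta."""
    return delta * pb / (delta * pb + 1 - pb)

def dtilt(pb, delta):
    """Closed-form derivative of the tilted propensity in delta."""
    return pb * (1 - pb) / (delta * pb + 1 - pb) ** 2

delta = 1.5
# dJ/ddelta = E[d pi_delta / d delta * CATE(s)]
dJ = np.mean(dtilt(pb, delta) * cate)

# Finite-difference check; the mu(s,0) term is constant in delta, so it drops
# out of differences and J can be reduced to E[pi_delta * CATE].
eps = 1e-5
J = lambda d: np.mean(tilt(pb, d) * cate)
dJ_fd = (J(delta + eps) - J(delta - eps)) / (2 * eps)
```

At $\delta = 1$ the tilt is the identity, which is why a gradient step "at $\delta = 1$" corresponds to a local policy-gradient update through $\pi_b$.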

Derivation

The key trick is the likelihood-ratio (a.k.a. score-function) identity

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$

Then

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, G_0(\tau)\big],$$

where, because the dynamics terms in $p_\theta(\tau)$ do not depend on $\theta$,

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Now decompose the return around time $t$:

$$G_0 = \sum_{k < t} \gamma^k r_k + \gamma^t G_t.$$

The “past” term drops out because it does not depend on $a_t$: the score has conditional mean zero, $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid H_t\big] = 0$, so

$$\mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k < t} \gamma^k r_k\Big] = 0.$$

So

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right].$$

Finally, condition on $(s_t, a_t)$ and use $\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)$ to replace $G_t$ by $Q^{\pi_\theta}(s_t, a_t)$, yielding (PGT).

Analogy to causal inference.

  • The score $\nabla_\theta \log \pi_\theta(a \mid s)$ is the analogue of the score of a propensity-score model $\nabla_\theta \log e_\theta(a \mid x)$.
  • The factor $Q^{\pi_\theta}(s, a)$ is the counterfactual reward under treatment $a$, analogous to $\mathbb{E}[Y \mid X = x, A = a]$.
  • The expectation over trajectories is the analogue of the expectation over the joint distribution of $(X, A, Y)$.
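
The likelihood-ratio identity is easy to verify numerically for a softmax distribution over three actions (the parameter vector and the per-action “rewards” below are arbitrary): the exact gradient of $\mathbb{E}_{a \sim p_\theta}[f(a)]$ matches the score-function Monte-Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check: d/dtheta E_{a ~ p_theta}[f(a)] = E[d log p_theta(a)/dtheta * f(a)].
theta = np.array([0.2, -0.1, 0.4])
f = np.array([1.0, 3.0, -2.0])          # arbitrary "reward" per action

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

p = probs(theta)

# Exact gradient via the softmax Jacobian d p_a/d theta_b = p_a (1[a=b] - p_b).
grad_exact = p * f - p * np.dot(p, f)

# Score-function Monte Carlo estimate.
n = 2_000_000
a = rng.choice(3, size=n, p=p)
score = np.eye(3)[a] - p                # d log p_theta(a)/d theta for softmax
grad_mc = (score * f[a][:, None]).mean(axis=0)
```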

3. REINFORCE (Monte-Carlo Policy Gradient)

The classic REINFORCE algorithm (Williams, 1992) estimates the gradient by sampling trajectories and replacing $Q^{\pi_\theta}(s_t, a_t)$ by the Monte-Carlo return $G_t$ observed along the trajectory:

$$\hat g = \sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t.$$
Why it works:

  • $G_t$ is an unbiased (but high-variance) estimator of $Q^{\pi_\theta}(s_t, a_t)$.
  • The estimator $\hat g$ is thus unbiased for $\nabla_\theta J(\theta)$.

Variance Reduction: Baselines.
Adding a baseline $b(s_t)$ that does not depend on the action yields

$$\hat g = \sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big).$$

Because $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$, subtracting $b(s_t)$ leaves the expectation unchanged while reducing variance.

Analogy: $b(s)$ is like a regression adjustment in causal inference: it captures the part of the outcome that is independent of the treatment, reducing noise in the estimate.

A common choice is the state-value function $V^{\pi_\theta}(s_t)$ as baseline, turning the term $G_t - b(s_t)$ into an advantage estimate $\hat A_t = G_t - V^{\pi_\theta}(s_t)$.
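
A one-step sketch of why baselines help, with a softmax policy over two arms and made-up reward means that share a large common level: both estimators target the same gradient, but subtracting a baseline removes the common level and collapses the variance. (Estimating the baseline from the same batch, as below, introduces an $O(1/n)$ bias that is negligible here.)

```python
import numpy as np

rng = np.random.default_rng(4)

# One-step illustration (T = 1): two actions, rewards with a large shared level.
theta = np.array([0.0, 0.0])
means = np.array([10.0, 10.5])          # hypothetical arm means

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

p = probs(theta)                        # = [0.5, 0.5]
n = 200_000
a = rng.choice(2, size=n, p=p)
r = means[a] + rng.normal(0, 1.0, n)
score = np.eye(2)[a] - p                # grad log pi(a) for softmax

g_plain = score * r[:, None]            # raw REINFORCE terms
b = r.mean()                            # constant baseline (value estimate)
g_base = score * (r - b)[:, None]       # baseline-adjusted terms

grad_plain = g_plain.mean(axis=0)
grad_base = g_base.mean(axis=0)
var_ratio = g_base.var(axis=0).sum() / g_plain.var(axis=0).sum()
```

Both means estimate the true gradient $(-0.125, 0.125)$; the variance ratio is roughly the reward variance divided by the squared reward level, about 1%.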


4. Actor-Critic: Using a Learned Critic

The main limitation of REINFORCE is the high variance of $G_t$. Actor-critic methods use a critic $V_w(s)$ (or $Q_w(s, a)$) to estimate $V^\pi$ (or $Q^\pi$) from data, reducing variance.

Basic actor-critic update:

  • Critic update (e.g., TD(0)):

$$\delta_t = r_t + \gamma\, V_w(s_{t+1}) - V_w(s_t);$$

    minimize $\delta_t^2$ w.r.t. critic parameters $w$.

  • Actor update (policy gradient with critic):

$$\theta \leftarrow \theta + \alpha\, \gamma^t\, \hat A_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

    where a common one-step advantage estimate is $\hat A_t = \delta_t$ (TD(0)), and longer-horizon choices include GAE.

The critic acts like a control variate: it approximates $Q^\pi$ (or $V^\pi$) so that the gradient uses a low-variance advantage estimate instead of the raw return $G_t$.

Connection to causal inference.

  • The TD error $\delta_t$ is analogous to the residual $Y - \hat\mu(X, A)$ in a regression-adjusted estimator of a treatment effect.
  • The critic approximates the conditional mean outcome (counterfactual reward), giving a smoother estimate.
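
A minimal sketch of the critic half of this loop: tabular TD(0) on a made-up two-state chain with a fixed policy, checked against the exact value function $V = (I - \gamma P)^{-1} r$. The actor update would then use the same $\delta_t$ as its advantage estimate.

```python
import numpy as np

rng = np.random.default_rng(5)

gamma = 0.9
P = np.array([[0.7, 0.3],               # state transitions under the fixed policy
              [0.4, 0.6]])
r = np.array([1.0, 0.0])                # expected reward per state

# Exact solution of the Bellman evaluation equation.
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

V = np.zeros(2)
alpha = 0.01
s = 0
for _ in range(200_000):
    s2 = rng.choice(2, p=P[s])
    delta = r[s] + gamma * V[s2] - V[s]  # TD error = critic residual
    V[s] += alpha * delta                # TD(0) stochastic-approximation update
    s = s2
```

With a constant step size the iterates hover near `V_exact` with $O(\alpha)$ noise; a decaying step size would converge exactly.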

5. Off-Policy Gradient Estimation (Importance Sampling)

In practice we often have data collected under a behavior policy $\pi_b$ (or a set of policies) and wish to evaluate or improve a different target policy $\pi_\theta$. The importance-sampling (IS) correction gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_b}\left[\sum_{t=0}^{\infty} \gamma^t \left(\prod_{k=0}^{t} \frac{\pi_\theta(a_k \mid s_k)}{\pi_b(a_k \mid s_k)}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \tag{IS-PG}$$

The cumulative ratio $\rho_{0:t} = \prod_{k=0}^{t} \pi_\theta(a_k \mid s_k)/\pi_b(a_k \mid s_k)$ is the importance weight up to time $t$, directly analogous to the inverse probability weight $1/P(A = a \mid X)$ in causal inference. (If $Q^{\pi_\theta}$ is replaced by a realized return, the rewards after time $t$ must themselves be reweighted.)

Why the product appears (and why it stops at $t$).
Because the environment dynamics $P$ (and reward model $r$) do not depend on $\theta$, the trajectory density ratio is

$$\frac{p_\theta(\tau)}{p_{\pi_b}(\tau)} = \prod_{t \ge 0} \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}.$$

In the $t$-th summand, factors for times $k > t$ can be dropped without changing the expectation: conditional on $s_{t+1}$,

$$\mathbb{E}_{a_{t+1} \sim \pi_b(\cdot \mid s_{t+1})}\left[\frac{\pi_\theta(a_{t+1} \mid s_{t+1})}{\pi_b(a_{t+1} \mid s_{t+1})}\right] = \sum_a \pi_\theta(a \mid s_{t+1}) = 1,$$

and similarly for later steps. This yields the per-decision (prefix) weight in (IS-PG).

Variance issues.
IS weights can explode when the target policy is far from the behavior policy. Remedies include:

  • Weight clipping / truncation (e.g., importance-ratio clipping in PPO).
  • Per-decision IS (use prefix weights, and/or truncate after a fixed horizon).
  • Self-normalized IS (divide by sum of weights).
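
A small worked example of prefix weights on a single logged trajectory (the propensities and rewards below are invented): trajectory-wise IPW applies the full product $\rho_{0:T-1}$ to every reward, while per-decision IS weights each reward only by its own prefix.

```python
import numpy as np

# Target/behavior probabilities of the logged actions (hypothetical values).
pi_probs = np.array([0.9, 0.5, 0.8])
pb_probs = np.array([0.6, 0.5, 0.4])
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.99

rho = pi_probs / pb_probs               # per-step ratios: [1.5, 1.0, 2.0]
prefix = np.cumprod(rho)                # rho_{0:t}: the weight "stops at t"

disc = gamma ** np.arange(len(rewards))
J_traj = prefix[-1] * np.sum(disc * rewards)   # trajectory-wise IPW value term
J_pdis = np.sum(disc * prefix * rewards)       # per-decision (prefix) version
```

Both are unbiased over many trajectories; the per-decision version simply avoids multiplying early rewards by ratios from later decisions.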

6. Natural Policy Gradients (NPG)

The vanilla gradient (PGT) treats $\theta$ as living in Euclidean space. In RL, however, the policy manifold is naturally endowed with the Fisher information matrix $F(\theta)$:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big].$$

The natural gradient is defined as

$$\tilde\nabla_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta).$$
Interpretation.

  • $F(\theta)$ acts as a metric tensor on the parameter space, turning the Euclidean gradient into a steepest-ascent direction under the KL-divergence metric.
  • In causal terms, it is analogous to weighting by the inverse covariance of the propensity score gradient, similar to weighted least squares.

Derivation (trust-region viewpoint).
For a small parameter step $\Delta\theta$, a second-order approximation gives

$$\mathrm{KL}\big(\pi_\theta \,\big\|\, \pi_{\theta + \Delta\theta}\big) \approx \tfrac{1}{2}\, \Delta\theta^\top F(\theta)\, \Delta\theta,$$

while $J(\theta + \Delta\theta) \approx J(\theta) + \nabla_\theta J(\theta)^\top \Delta\theta$. Maximizing the linearized improvement subject to a KL “trust region” constraint,

$$\max_{\Delta\theta}\; \nabla_\theta J(\theta)^\top \Delta\theta \quad \text{s.t.} \quad \tfrac{1}{2}\, \Delta\theta^\top F(\theta)\, \Delta\theta \le \epsilon,$$

yields $\Delta\theta \propto F(\theta)^{-1} \nabla_\theta J(\theta)$, i.e. the natural-gradient direction.
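
A tiny closed-form illustration: for a stateless Bernoulli policy with logit parameter $\theta$, the Fisher information is $\sigma(\theta)\big(1 - \sigma(\theta)\big)$, so the natural gradient undoes the vanishing of the vanilla gradient as the policy saturates (the reward means here are made up).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stateless policy pi_theta(1) = sigmoid(theta); hypothetical arm means mu0, mu1.
mu0, mu1 = 0.0, 1.0

def grads(theta):
    p = sigmoid(theta)
    g = p * (1 - p) * (mu1 - mu0)       # vanilla gradient of J(theta)
    F = p * (1 - p)                     # Fisher information of the Bernoulli logit
    return g, g / F                     # (vanilla, natural)

g_mid, ng_mid = grads(0.0)              # p = 0.5: vanilla gradient 0.25
g_sat, ng_sat = grads(4.0)              # p ~ 0.98: vanilla gradient nearly 0
```

The natural gradient equals $\mu_1 - \mu_0$ everywhere: a constant-size step in the "right" geometry, regardless of how deterministic the policy has become.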

Practical algorithms.

  • TRPO (Schulman et al., 2015) approximates the natural gradient by solving a constrained optimization problem that keeps the KL divergence small.
  • PPO (Schulman et al., 2017) uses a surrogate objective with a clipped ratio, which implicitly regularizes the step size similar to natural gradients.

7. Deterministic Policy Gradients (DPG) & Deep Deterministic Policy Gradient (DDPG)

When actions are continuous and the policy is deterministic, $a = \mu_\theta(s)$, the policy-gradient theorem simplifies to

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\Big[\nabla_\theta \mu_\theta(s)\; \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\Big]. \tag{DPG}$$
Derivation (chain rule).
If the policy is deterministic, then (up to the usual occupancy-measure subtlety) you can write

$$J(\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\big[Q^{\mu_\theta}\big(s, \mu_\theta(s)\big)\big].$$

Differentiating through the action argument and applying the chain rule gives (DPG).

Key points:

  • The gradient no longer involves a log-probability term; instead we back-propagate through the deterministic mapping $s \mapsto \mu_\theta(s)$.
  • The critic supplies $\nabla_a Q^{\mu_\theta}(s, a)$ (often via a neural network).
  • The expectation is over the state distribution under the current policy.
  • In practice, deterministic methods still rely on explicit exploration noise (e.g., in DDPG/TD3) to collect informative data; without randomization there is no overlap for off-policy evaluation and learning becomes brittle.

Analogy.

  • The deterministic policy is a hard treatment assignment rule (a deterministic dynamic regime).
  • The gradient uses the local sensitivity of the expected reward to small perturbations in the action, similar to the derivative of a regression model with respect to treatment dose.
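
The chain rule in (DPG) can be sanity-checked on a toy problem with a known critic; all choices below are illustrative (a quadratic $Q$ peaked at $a = s$, a one-parameter linear policy, and a fixed grid standing in for the state distribution):

```python
import numpy as np

# Known critic: Q(s, a) = -(a - s)^2, maximized at a = s.
def Q(s, a):
    return -(a - s) ** 2

def dQ_da(s, a):
    return -2.0 * (a - s)

theta = 0.3
states = np.linspace(0.1, 1.0, 50)      # stand-in for the state distribution

# DPG: backpropagate dQ/da through d mu_theta/d theta = s.
a = theta * states
grad_dpg = np.mean(states * dQ_da(states, a))

# Finite-difference check on J(theta) = E_s[Q(s, mu_theta(s))].
eps = 1e-6
J = lambda th: np.mean(Q(states, th * states))
grad_fd = (J(theta + eps) - J(theta - eps)) / (2 * eps)
```

The gradient is positive because $\theta = 0.3$ undershoots the optimal dose response $a = s$ (i.e., $\theta = 1$), so the update pushes actions toward the critic's peak.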

8. Generalized Advantage Estimation (GAE)

A powerful variance-reduction trick is to combine multi-step TD errors into a smoothed advantage:

$$\hat A_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l},$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.

  • $\lambda \in [0, 1]$ trades off bias (lower $\lambda$) vs variance (higher $\lambda$).
  • When $\lambda = 1$, GAE reduces to the Monte-Carlo advantage $G_t - V(s_t)$; when $\lambda = 0$, it reduces to the one-step TD residual $\delta_t$.

Causal parallel.
GAE is like using a regression baseline plus a sequence of residual corrections: it blends one-step TD residuals to trade off bias and variance. This is not literally a doubly-robust estimator, but it plays a similar “control variate / residualization” role in policy-gradient updates.
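
A minimal implementation of the backward GAE recursion $\hat A_t = \delta_t + \gamma\lambda\, \hat A_{t+1}$ for a finite episode, checking the two limiting cases (the rewards and critic values are arbitrary):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for a finite episode.

    `values` has length len(rewards) + 1, with values[-1] the terminal value
    (0 for a true terminal state)."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):        # A_t = delta_t + gamma * lam * A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.2, 0.1, 0.0])  # arbitrary critic values, terminal = 0
gamma = 0.9

adv0 = gae(rewards, values, gamma, lam=0.0)  # reduces to one-step TD errors
adv1 = gae(rewards, values, gamma, lam=1.0)  # reduces to G_t - V(s_t)
```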


9. Trust-Region Methods: TRPO & PPO

Trust-Region Policy Optimization (TRPO).
Instead of a simple gradient step, TRPO solves

$$\max_\theta\; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat A(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_s\Big[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s)\big)\Big] \le \epsilon.$$
The KL constraint defines a trust region in policy space, ensuring that updates are conservative.

Proximal Policy Optimization (PPO).
PPO replaces the exact constrained problem with a surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\Big(\rho_t(\theta)\, \hat A_t,\; \mathrm{clip}\big(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat A_t\Big)\Big],$$

where $\rho_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$.

The clipping term enforces a soft trust region. PPO is computationally simpler yet empirically very robust.

Causal interpretation.
The surrogate objective can be seen as a robust estimator that limits the influence of large importance weights (which correspond to extreme propensity-score ratios). The KL constraint is akin to a regularization on the propensity score distribution to avoid drastic shifts that would inflate variance.


10. Off-Policy Evaluation (OPE) and Doubly Robust Estimators

Inverse-Probability Weighted (IPW) estimator.
Given trajectories of (truncated) horizon $T$ from a behavior policy $\pi_b$, define the per-step likelihood ratio

$$\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}.$$

If the policies are history-dependent, replace $s_t$ by the full history $H_t$ in these ratios. The (trajectory-wise) IPW estimate of $J(\theta)$ is

$$\hat J_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left(\prod_{t=0}^{T-1} \rho_t^{(i)}\right) G_0^{(i)},$$

where $G_0^{(i)} = \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$.

In practice, a common lower-variance alternative is per-decision importance sampling (PDIS):

$$\hat J_{\mathrm{PDIS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \left(\prod_{k=0}^{t} \rho_k^{(i)}\right) r_t^{(i)}.$$
Doubly Robust (DR) estimator.
Combines a model estimate $\hat Q$ of $Q^{\pi_\theta}$ (with $\hat V(s) = \sum_a \pi_\theta(a \mid s)\, \hat Q(s, a)$) with IPW:

$$\hat J_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \Big[\rho_{0:t}^{(i)} \big(r_t^{(i)} - \hat Q(s_t^{(i)}, a_t^{(i)})\big) + \rho_{0:t-1}^{(i)}\, \hat V(s_t^{(i)})\Big], \qquad \rho_{0:-1} \equiv 1.$$
If either the model is correct or the logged behavior propensities (hence the importance ratios) are correct, the estimator remains unbiased.

RL analogue.
The actor-critic update with a learned value function is a close cousin of DR: the critic plays the role of an outcome model, and the policy-ratio terms (explicitly in off-policy methods, implicitly in on-policy sampling) play the role of propensity weighting. Variance is reduced when the critic is accurate.
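
A simulated comparison on a toy horizon-two problem (binary actions, invented reward model `reward_t = a_t + noise`, no discounting) showing that trajectory-wise IPW and PDIS are both unbiased for $J(\pi_\theta)$ here, with PDIS having lower variance:

```python
import numpy as np

rng = np.random.default_rng(6)

# Behavior policy plays a = 1 w.p. 0.5; target policy w.p. 0.8, independently
# at each of the two steps, so the true value is J = 2 * 0.8 = 1.6.
n, T = 200_000, 2
pb, pi = 0.5, 0.8
a = rng.binomial(1, pb, size=(n, T))
r = a + rng.normal(0, 0.1, size=(n, T))

rho = np.where(a == 1, pi / pb, (1 - pi) / (1 - pb))   # per-step ratios
prefix = np.cumprod(rho, axis=1)                       # rho_{0:t}

est_ipw = prefix[:, -1] * r.sum(axis=1)                # full-product weights
est_pdis = (prefix * r).sum(axis=1)                    # per-decision weights
J_ipw, J_pdis = est_ipw.mean(), est_pdis.mean()
```

PDIS wins because the first reward is not multiplied by the (mean-one but noisy) ratio of the second decision.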


11. Summary & Takeaways

| RL Concept | Causal-Inference Analogue | Key Mathematical Point |
| --- | --- | --- |
| Policy $\pi_\theta(a \mid s)$ | Propensity score $P(A = a \mid X = x)$ | Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ |
| Return $G_t$ | Counterfactual outcome $Y(a)$ | Monte-Carlo estimate of $Q^\pi(s, a)$ |
| Baseline $b(s)$ | Regression adjustment | Control variate reducing variance |
| Off-policy IS weights | Inverse probability weights | Cumulative ratio $\prod_k \pi_\theta/\pi_b$ |
| Critic $V_w$, $Q_w$ | Outcome model $\mathbb{E}[Y \mid A, X]$ | Learned regression adjustment |
| Natural gradient | Weighted least squares / Fisher metric | Preconditioner $F(\theta)^{-1}$ |
| Deterministic policy | Hard treatment assignment | Gradient through deterministic mapping |
| Trust-region (KL) | Robustness / regularization of propensity | Constrained optimization |
| GAE | Multi-step DR / bias-variance tradeoff | Exponential weighting of TD errors |

Key equations to remember

  1. Policy Gradient Theorem
$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

  2. REINFORCE with Baseline
$$\hat g = \textstyle\sum_t \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)$$

  3. Actor-Critic Update
$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

  4. Importance-Sampling Policy Gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_b}\Big[\textstyle\sum_t \gamma^t\, \rho_{0:t}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

  5. Natural Gradient
$$\tilde\nabla_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$

  6. Deterministic Policy Gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_s\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\Big]$$

  7. GAE
$$\hat A_t^{\mathrm{GAE}(\gamma, \lambda)} = \textstyle\sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}$$

These formulas form the backbone of most modern deep RL systems. The causal-inference perspective simply reframes them: the policy is a treatment rule, returns are counterfactual outcomes, IS weights are propensity ratios, and baselines/critics are regression adjustments that reduce variance.


12. Suggested Further Reading

| Topic | Classic Papers | Modern Extensions |
| --- | --- | --- |
| Policy-gradient theorem | Sutton et al. 2000 | N/A |
| REINFORCE | Williams 1992 | Actor-Critic, A2C, DDPG |
| Natural gradients | Amari 1998 | TRPO, K-FAC |
| Deterministic policy gradients | Silver et al. 2014 | DDPG, TD3 |
| Trust-region methods | Schulman et al. 2015 | PPO, ACKTR |
| Variance reduction | Schulman et al. 2016 (GAE) | N/A |
| Off-policy evaluation | Precup et al. 2000 | DR, MAGIC, Retrace (Munos et al. 2016) |