Policy‑Gradient Methods – A bridge from Causal Inference

Author

Apoorva Lal and several LLMs

Published

December 30, 2025

Reinforcement learning (RL) algorithms are often grouped into two broad families: value‑based methods, which learn value functions (e.g., Q^\star) and act greedily with respect to them, and policy‑based methods, which parameterize the policy directly and optimize it.

This note focuses on policy‑gradient methods because (i) they have a clean causal‑inference interpretation (policies as stochastic interventions / propensity models and Q,V as outcome models), (ii) they handle continuous actions naturally, and (iii) modern deep RL practice is dominated by stable actor‑critic and trust‑region variants.

These notes are primarily for the author’s own reference, but they may serve as a useful bridge for others who approach RL from a causal‑inference direction and frequently scratch their heads at the startling similarities between concepts and the distressing differences in notation and priorities. They rely heavily on chapters 14–15 of Stefan Wager’s excellent causal inference textbook, Murphy (2025), some hesitant references to Sutton and Barto, and zero‑shot elaborations from a council of frontier LLMs.

1. Problem Setup

Markov Decision Process (MDP).
An MDP is the tuple
\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma), where
- \mathcal{S} is a (finite or continuous) state space,
- \mathcal{A} an action space,
- P(s'|s,a) the transition kernel,
- R(s,a) the (possibly stochastic) immediate reward,
- \gamma\in[0,1) a discount factor.

A policy \pi_\theta(a|s) is a conditional distribution over actions given states, parameterized by \theta\in\mathbb{R}^d. We will treat the policy as the treatment assignment rule in a causal‑inference sense.

Causal‑inference mapping.
- State s \leftrightarrow covariates X.
- Action a \leftrightarrow treatment W.
- Reward r \leftrightarrow outcome Y.
- Policy \pi_\theta(a|s) \leftrightarrow propensity score P(W=a\mid X=s).
- Q^\pi(s,a)=\mathbb{E}[G_t\mid s_t=s,a_t=a] \leftrightarrow outcome model \mu(a,x).

Policy‑based vs value‑based (quick CI‑friendly orientation).

  • Value‑based RL (Q‑learning, DQN): learn an optimal outcome model Q^\star(s,a) via Bellman optimality and pick actions via a\in\arg\max_a Q^\star(s,a). In causal terms, this is “estimate counterfactual mean outcomes, then choose the best treatment” (a plug‑in optimal regime learner), with the complication that Q^\star is defined by a fixed point and involves future decisions.
  • Policy gradients (REINFORCE, PPO): directly update a parametric propensity model \pi_\theta(a\mid s) to increase J(\theta). In causal terms, this is “move the assignment mechanism along a smooth path that increases expected welfare,” and the gradient weights changes in propensities by long‑run causal contrasts (advantages).
  • Actor–critic: uses Q/V mainly as a regression adjustment / control variate for the policy update rather than as the sole object to maximize.

Why policy gradients are popular (first order): both families struggle when there is little overlap (rare actions in relevant states) because learning requires extrapolation. Value‑based methods amplify this via repeated bootstrapping and \max operators (the “deadly triad”: function approximation + bootstrapping + off‑policy data), while policy‑gradient methods can be stabilized by staying close to the current/behavior policy (trust regions, incremental updates, entropy regularization) and avoid explicit \arg\max_a over continuous actions. That said, value‑based methods remain extremely strong in small, discrete action spaces and can be more sample‑efficient because they reuse off‑policy data.

Trajectory. A trajectory \tau=(s_0,a_0,r_0,s_1,a_1,r_1,\dots) follows the dynamics induced by \pi_\theta and P. The probability density of a trajectory under policy \pi_\theta is

p_\theta(\tau)=\rho_0(s_0)\prod_{t\ge 0}\pi_\theta(a_t|s_t)\,P(s_{t+1}|s_t,a_t)\,p(r_t|s_t,a_t),

where \rho_0 is the initial state distribution.

Return.
The discounted return from time t is G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}. We seek the policy objective J(\theta)=\mathbb{E}_{\tau\sim p_\theta}[G_0] =\mathbb{E}_{s_0\sim \rho_0}\bigl[V^\pi(s_0)\bigr], where V^\pi(s)=\mathbb{E}\bigl[G_t\mid s_t=s\bigr] is the state‑value function and Q^\pi(s,a)=\mathbb{E}\bigl[G_t \mid s_t=s, a_t=a\bigr].

In causal language, J(\theta) is the average counterfactual reward if we set the “treatment rule” to \pi_\theta.

It is often convenient to define the discounted state‑occupancy measure d^\pi(s)=\sum_{t\ge 0}\gamma^t\,\Pr_\pi(s_t=s), or its normalized version \rho^\pi(s)=(1-\gamma)d^\pi(s) (a proper distribution). With \bar r(s,a)=\mathbb{E}[r_t\mid s_t=s,a_t=a], J(\theta)=\sum_{s} d^\pi(s)\sum_{a}\pi_\theta(a|s)\,\bar r(s,a) =\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho^\pi,a\sim\pi_\theta}\bigl[\bar r(s,a)\bigr].
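
To make this identity concrete, here is a minimal numerical sketch (not from the source; the MDP is randomly generated and all names are illustrative) that computes d^\pi by matrix inversion and checks that \sum_s d^\pi(s)\sum_a\pi(a|s)\bar r(s,a) equals \mathbb{E}_{s_0\sim\rho_0}[V^\pi(s_0)]:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
r_bar = rng.normal(size=(nS, nA))               # mean reward r_bar(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # policy pi(a | s)
rho0 = rng.dirichlet(np.ones(nS))               # initial state distribution

# State-to-state kernel and per-state mean reward under pi.
P_pi = np.einsum("sa,sap->sp", pi, P)           # P_pi[s, s']
r_pi = np.einsum("sa,sa->s", pi, r_bar)         # E_{a~pi}[r_bar | s]

# V^pi solves (I - gamma P_pi) V = r_pi; J = E_{s0~rho0}[V^pi(s0)].
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
J_value = rho0 @ V

# Discounted occupancy d^pi = rho0^T (I - gamma P_pi)^{-1}; J = sum_s d^pi(s) r_pi(s).
d_pi = rho0 @ np.linalg.inv(np.eye(nS) - gamma * P_pi)
J_occupancy = d_pi @ r_pi

print(J_value, J_occupancy)   # the two expressions agree
```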

1.1 A Causal Model for Sequential Decisions (and Where It Breaks)

Write the history (information set) at time t as H_t=(s_0,a_0,r_0,\dots,s_t), and view a policy as a dynamic treatment regime \pi(a_t\mid H_t) (for Markov policies, this reduces to \pi(a_t\mid s_t)).

To interpret policies causally, it is helpful to introduce policy potential outcomes: (s_t^\pi, a_t^\pi, r_t^\pi)_{t\ge 0} denotes the counterfactual trajectory generated if actions were assigned according to \pi. Then the causal estimand is J(\pi)=\mathbb{E}\Bigl[\sum_{t\ge 0}\gamma^t r_t^\pi\Bigr].

Identification from logged data hinges on sequential versions of the usual causal assumptions:

  • Consistency. If the realized action rule matches \pi, then realized states/rewards equal the corresponding potential outcomes.
  • Positivity / overlap. For any history h_t that occurs under the target policy, the behavior policy assigns positive probability to any action that \pi might choose.
  • Sequential ignorability (no unobserved confounding). Roughly, (r_t^{\bar a_t}, s_{t+1}^{\bar a_t}) \;\perp\!\!\!\perp\; a_t \mid H_t for all action histories \bar a_t, meaning that given the recorded history, the action is “as‑if randomized” with respect to future potential outcomes.

Under these assumptions, the sequential g‑formula identifies the counterfactual trajectory distribution under \pi as exactly the interventional rollout distribution we wrote earlier, p_\pi(\tau)=\rho_0(s_0)\prod_{t\ge 0}\pi(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)\,p(r_t\mid s_t,a_t), and therefore J(\pi)=\mathbb{E}_{\tau\sim p_\pi}\Bigl[\sum_{t\ge 0}\gamma^t r_t\Bigr]. This is the core bridge: RL “policy evaluation” is g‑computation for a dynamic treatment regime, and off‑policy methods are sequential IPW/DR for that same estimand.

Endogeneity in RL = missing state / hidden confounding.
If there is an unobserved variable u_t that affects both action choice and rewards/transitions, then conditioning on s_t (or even H_t) may not block backdoor paths. In econometric language, actions become endogenous and \pi(a\mid s) is not a valid propensity score. This is mainly a problem for offline (observational) evaluation and learning: importance weighting and doubly‑robust estimators are generally biased without sequential ignorability given the logged state/history. (By contrast, if you can actively randomize actions online, policy‑gradient learning remains viable even with partial observability—what breaks is the ability to make causal off‑policy statements from observational logs without extra assumptions.)

Where explicit randomization enters.

  • In online RL / experimentation, the learner intervenes on a_t by sampling from a known (often stochastic) policy. This built‑in randomization is the analogue of running a sequential randomized experiment and is what makes likelihood‑ratio gradients and on‑policy evaluation straightforward.

  • In offline RL, you inherit the (possibly endogenous) behavior policy that generated the data. Valid OPE/off‑policy learning requires either (i) a credible ignorability story given the logged history/state and overlap, or (ii) additional structure (proxies, instruments/encouragement, partial observability models) beyond standard MDP assumptions.

In both cases, you must know or estimate action probabilities (logged propensities) to use IPW/DR style estimators; if the data do not record (or cannot credibly reconstruct) \pi_b(a_t\mid s_t), off‑policy causal claims become much harder.

Analogue of IV/encouragement designs.
The closest RL analogue to IV is to use exogenous variation that shifts actions but does not directly affect rewards except through actions—e.g., randomized “nudges”/recommendations, known exploration noise, or randomized constraints on feasible actions—together with assumptions that let you separate causal effects from confounding (often formalized via confounded MDPs or POMDPs with proxy variables).

1.2 Contextual Bandits as the “Static” Special Case

A contextual bandit is the one‑step (or no‑transition) special case of an MDP: observe a context/state s, choose an action a\sim\pi(\cdot\mid s), and observe an outcome/reward r. There is no state evolution driven by actions (or you can think of T=1).

This is exactly the standard causal setup with covariates X, treatment W, and outcome Y, except that in bandits the data may be collected adaptively (the assignment rule can depend on past data). Two key links:

  • Policy gradient collapses to familiar score‑function / AIPW structure.
    With T=1, the REINFORCE estimator is just \nabla_\theta \log \pi_\theta(a\mid s)\,r, and off‑policy evaluation uses the single‑step weight \pi(a\mid s)/\pi_b(a\mid s).
  • Exploration = explicit randomization.
    Bandit algorithms (e.g., \epsilon‑greedy, Thompson sampling) deliberately randomize actions to learn; this is the experimental design piece that provides overlap/positivity. The analogue in multi‑step RL is stochastic policies (or injected noise) during learning.

Bandit/RL logs are often adaptively collected (the assignment rule changes over time as data arrive). Identification via IPW/DR still goes through when propensities are known and ignorability holds given the logged context/history, but inference (standard errors) typically needs martingale/online‑learning arguments rather than i.i.d. sampling.


2. Policy‑Gradient Theorem

The policy‑gradient theorem (Sutton et al., 2000) gives a closed‑form expression for the gradient of J:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\!\Bigl[ \sum_{t\ge 0}\gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q^\pi(s_t,a_t) \Bigr]. \tag{PGT}

Equivalently, using the discounted occupancy measure d^\pi(s)=\sum_{t\ge 0}\gamma^t\Pr_\pi(s_t=s), \nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\pi,\,a\sim \pi_\theta(\cdot\mid s)}\bigl[\nabla_\theta \log \pi_\theta(a\mid s)\,Q^\pi(s,a)\bigr].

2.1 Contextual Bandit Special Case (Why the Gradient Is “Treatment Effect × Policy Sensitivity”)

If you set the horizon to one step (T=1), RL becomes a contextual bandit: observe s, pick a\in\mathcal{A}, observe reward r, and stop. Let the conditional mean outcome be \mu(s,a)=\mathbb{E}[r\mid s,a]. Then the value of a stochastic policy is just the mixture of counterfactual means: J(\theta)=\mathbb{E}_{s}\Bigl[\sum_{a\in\mathcal{A}}\pi_\theta(a\mid s)\,\mu(s,a)\Bigr]. Differentiating w.r.t. the policy parameters \theta (not w.r.t. the discrete action) gives \nabla_\theta J(\theta) =\mathbb{E}_{s}\Bigl[\sum_{a\in\mathcal{A}}\nabla_\theta \pi_\theta(a\mid s)\,\mu(s,a)\Bigr] =\mathbb{E}_{s,a\sim\pi_\theta}\bigl[\nabla_\theta \log \pi_\theta(a\mid s)\,\mu(s,a)\bigr], which is exactly the one‑step (bandit) form of the policy‑gradient theorem. Since \mu(s,a) is unknown, on‑policy sampling replaces it with the realized reward r, yielding the unbiased score‑function estimator \nabla_\theta \log \pi_\theta(a\mid s)\,r (REINFORCE for T=1).

For a binary action a\in\{0,1\} this becomes especially transparent: J(\theta)=\mathbb{E}_{s}\bigl[\pi_\theta(1\mid s)\mu(s,1)+(1-\pi_\theta(1\mid s))\mu(s,0)\bigr], so \nabla_\theta J(\theta)=\mathbb{E}_{s}\Bigl[\nabla_\theta \pi_\theta(1\mid s)\,\bigl(\mu(s,1)-\mu(s,0)\bigr)\Bigr]. The term \mu(s,1)-\mu(s,0) is exactly the conditional treatment effect (CATE) at state s, and \nabla_\theta \pi_\theta(1\mid s) is how much your parametric assignment rule can change the propensity at s.

If you parameterize \pi_\theta(1\mid s)=\sigma(\theta^\top f(s)) (logistic propensity), \nabla_\theta \log \pi_\theta(a\mid s) = (a-\pi_\theta(1\mid s))\,f(s), so gradient ascent increases the log‑odds of action 1 in states where action 1 has higher counterfactual value than action 0, and decreases it where it has lower value. In multi‑step RL, the CATE is replaced by the long‑run contrast Q^\pi(s,1)-Q^\pi(s,0) (or an advantage).
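
The following sketch (purely illustrative; the feature map and outcome means \mu(s,a) are made up) checks the binary‑action identity numerically for a logistic policy: the exact gradient \mathbb{E}_s[\nabla_\theta\pi_\theta(1\mid s)(\mu(s,1)-\mu(s,0))] agrees with the Monte‑Carlo score‑function estimate \mathbb{E}[\nabla_\theta\log\pi_\theta(a\mid s)\,r].

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200_000, 3

S = rng.normal(size=(n, d))                     # states / covariates
f = lambda s: s                                 # identity feature map (assumption)
theta = np.array([0.5, -0.3, 0.1])

# Made-up conditional mean outcomes mu(s, 0) and mu(s, 1).
mu0 = S @ np.array([0.2, 0.1, -0.4])
mu1 = mu0 + np.tanh(S[:, 0])                    # CATE = tanh(s_1)

p1 = 1.0 / (1.0 + np.exp(-(f(S) @ theta)))      # pi_theta(1 | s)

# Exact gradient: E_s[ grad_theta pi(1|s) * (mu1 - mu0) ], with grad pi = p(1-p) f(s).
grad_exact = np.mean((p1 * (1 - p1) * (mu1 - mu0))[:, None] * f(S), axis=0)

# Score-function estimate: sample a ~ pi_theta, observe noisy reward, average score * r.
a = rng.binomial(1, p1)
r = np.where(a == 1, mu1, mu0) + rng.normal(scale=0.1, size=n)
grad_score = np.mean(((a - p1) * r)[:, None] * f(S), axis=0)   # score = (a - p1) f(s)

print(grad_exact)
print(grad_score)   # agrees with grad_exact up to Monte Carlo noise
```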

2.2 Incremental Propensity‑Score Interventions (Odds Tilts) as Policy Paths

Econometrics/causal inference often targets incremental propensity‑score interventions (Kennedy and coauthors): instead of setting “everyone treated” vs “everyone untreated”, define a stochastic intervention that nudges treatment odds by a fixed factor and study how welfare changes as you turn the knob. In RL terms, this is a one‑dimensional path of policies through the behavior policy, and differentiating J along that path is a policy gradient in that specific direction.

For binary action a\in\{0,1\} with behavior propensity e(s)=\pi_b(1\mid s), an odds‑tilt by \delta>0 defines the counterfactual policy \pi_\delta(1\mid s)=\frac{\delta\,e(s)}{1-e(s)+\delta\,e(s)} \quad\Longleftrightarrow\quad \operatorname{logit}\pi_\delta(1\mid s)=\operatorname{logit}e(s)+\log\delta. Let \eta=\log\delta. In the one‑step bandit case, J(\eta)=\mathbb{E}_s\bigl[\pi_\eta(1\mid s)\mu(s,1)+(1-\pi_\eta(1\mid s))\mu(s,0)\bigr], and differentiating gives \frac{d}{d\eta}J(\eta) =\mathbb{E}_s\Bigl[\pi_\eta(1\mid s)\bigl(1-\pi_\eta(1\mid s)\bigr)\,\bigl(\mu(s,1)-\mu(s,0)\bigr)\Bigr]. This is exactly “CATE \times sensitivity of the incremental intervention.” Equivalently, in score form, \frac{d}{d\eta}J(\eta) =\mathbb{E}_{s,\,a\sim \pi_\eta(\cdot\mid s)}\bigl[(a-\pi_\eta(1\mid s))\,\mu(s,a)\bigr] =\mathbb{E}_{s,\,a\sim \pi_\eta(\cdot\mid s)}\bigl[\tfrac{d}{d\eta}\log \pi_\eta(a\mid s)\;\mu(s,a)\bigr].
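
A small numerical sketch (with a made‑up behavior propensity e(s) and outcome means, purely for illustration) comparing the analytic derivative \tfrac{d}{d\eta}J(\eta)=\mathbb{E}[\pi_\eta(1-\pi_\eta)(\mu(s,1)-\mu(s,0))] against a finite‑difference derivative of J(\eta) along the odds‑tilt path:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

s = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-0.7 * s))           # behavior propensity e(s) (assumption)
mu0 = 0.3 * s
mu1 = mu0 + 1.0 + 0.5 * s                    # CATE = 1 + 0.5 s

def J(eta):
    """Value of the odds-tilted policy logit pi_eta = logit e + eta (one-step bandit)."""
    p = 1.0 / (1.0 + np.exp(-(np.log(e / (1 - e)) + eta)))
    return np.mean(p * mu1 + (1 - p) * mu0)

eta = 0.4                                     # tilt delta = exp(0.4)
p_eta = 1.0 / (1.0 + np.exp(-(np.log(e / (1 - e)) + eta)))

dJ_analytic = np.mean(p_eta * (1 - p_eta) * (mu1 - mu0))
dJ_numeric = (J(eta + 1e-4) - J(eta - 1e-4)) / 2e-4

print(dJ_analytic, dJ_numeric)                # match to numerical precision
```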

Two practical takeaways that mirror offline RL practice:

  • Incremental interventions stay close to \pi_b, improving overlap and stabilizing importance weights (contrast with large policy shifts).
  • This is conceptually similar to trust‑region policy updates (e.g., PPO/TRPO): control policy divergence, then optimize the resulting welfare curve J(\eta) (or take a gradient step at \eta=0).

In multi‑step RL, the same idea can be applied per time step by tilting each logged decision rule; dJ/d\eta then involves discounted sums of score terms times long‑run causal contrasts (advantages), making the connection to policy gradients direct.

Derivation

The key trick is the likelihood‑ratio (a.k.a. score function) identity \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\sum_{t\ge 0}\nabla_\theta \log \pi_\theta(a_t|s_t). Then \nabla_\theta J = \nabla_\theta \int p_\theta(\tau)G_0\,d\tau = \int \nabla_\theta p_\theta(\tau)G_0\,d\tau = \mathbb{E}_{\tau}\!\bigl[ G_0 \sum_{t}\nabla_\theta \log \pi_\theta(a_t|s_t) \bigr]. Now decompose G_0 around time t: G_0=\sum_{k=0}^{t-1}\gamma^k r_k + \gamma^t G_t. The “past” term drops out because it does not depend on a_t: \mathbb{E}\Bigl[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\sum_{k=0}^{t-1}\gamma^k r_k\Bigr] = \mathbb{E}\Bigl[\sum_{k=0}^{t-1}\gamma^k r_k\;\mathbb{E}\bigl[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\mid H_t\bigr]\Bigr] =0. So \nabla_\theta J =\mathbb{E}_{\tau}\!\Bigl[\sum_{t\ge 0}\gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t\Bigr]. Finally, condition on (s_t,a_t) and use Q^\pi(s_t,a_t)=\mathbb{E}[G_t\mid s_t,a_t] to replace G_t by Q^\pi, yielding (PGT).

Analogy to causal inference.
- The score \nabla_\theta \log \pi_\theta(a|s) is the analogue of the score of a propensity‑score model P_\theta(W=a\mid X=s).
- The factor Q^\pi(s,a) is the counterfactual reward under treatment a, analogous to \mathbb{E}[Y(a)\mid S=s].
- The expectation over trajectories is the analogue of the expectation over the joint distribution of (S,A,Y).


3. REINFORCE (Monte‑Carlo Policy Gradient)

The classic REINFORCE algorithm (Williams, 1992) estimates the gradient by sampling trajectories and replacing Q^\pi(s,a) by the Monte‑Carlo return G_t observed along the trajectory:

\widehat{\nabla}_\theta J = \sum_{t=0}^{T-1} \gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t.

Why it works:
- G_t is an unbiased (but high‑variance) estimator of Q^\pi(s_t,a_t).
- The estimator is thus unbiased for \nabla_\theta J.

Variance Reduction: Baselines.
Adding a baseline b(s_t) that does not depend on the action yields

\widehat{\nabla}_\theta J = \sum_{t=0}^{T-1}\gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,\bigl[G_t-b(s_t)\bigr]. \tag{REINFORCE‑B}

Because \mathbb{E}_{a_t\sim\pi_\theta(\cdot\mid s_t)}\bigl[\nabla_\theta \log \pi_\theta(a_t|s_t)\bigr]=0, subtracting b(s_t) leaves the expectation unchanged while reducing variance: \mathbb{E}\bigl[\gamma^t\nabla_\theta \log \pi_\theta(a_t|s_t)\,b(s_t)\bigr] =\mathbb{E}\bigl[\gamma^t b(s_t)\,\mathbb{E}_{a_t\sim\pi_\theta(\cdot\mid s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t)]\bigr] =0. Analogy: b(s) is like a regression adjustment in causal inference: it captures the part of the outcome that is independent of the treatment, reducing noise in the estimate.

A common choice is the state‑value function V^\pi(s) as baseline, turning the term into an advantage estimate A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s).
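
As a minimal sketch (function and variable names are illustrative, not from the source), the REINFORCE‑B estimator for a single sampled trajectory, given per‑step score gradients, rewards, and baseline values:

```python
import numpy as np

def reinforce_gradient(score_grads, rewards, baselines, gamma=0.99):
    """REINFORCE-with-baseline estimate from one trajectory.

    score_grads : list of grad_theta log pi(a_t | s_t), each an array of shape (d,)
    rewards     : list of r_t
    baselines   : list of b(s_t) (e.g. a value-function estimate)
    Returns sum_t gamma^t * score_t * (G_t - b(s_t)).
    """
    T = len(rewards)
    # Discounted returns-to-go G_t, computed backwards.
    G = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    grad = np.zeros_like(score_grads[0])
    for t in range(T):
        grad += (gamma ** t) * score_grads[t] * (G[t] - baselines[t])
    return grad

# Toy usage with made-up inputs (d = 2, T = 3).
scores = [np.array([0.1, -0.2]), np.array([0.0, 0.3]), np.array([-0.1, 0.1])]
print(reinforce_gradient(scores, rewards=[1.0, 0.0, 2.0], baselines=[0.8, 0.7, 1.5]))
```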


4. Actor–Critic: Using a Learned Critic

The main limitation of REINFORCE is the high variance of G_t. Actor‑critic methods use a critic to estimate Q^\pi (or V^\pi) from data, reducing variance.

Basic actor‑critic update:

  • Critic update (e.g., TD(0)):
    \delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t), minimize \mathbb{E}[\delta_t^2] w.r.t. critic parameters \phi.

  • Actor update (policy gradient with critic):
    \widehat{\nabla}_\theta J = \sum_{t} \gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,\bigl[\hat A_t\bigr], where a common one‑step advantage estimate is \hat A_t=\delta_t (TD(0)), and longer‑horizon choices include GAE.

The critic acts like a control variate: it approximates Q^\pi so that the gradient uses a low‑variance estimate \delta_t instead of G_t.
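
A minimal sketch of one actor–critic step with a linear critic V_\phi(s)=\phi^\top x(s) and a TD(0) advantage estimate; the feature maps, step sizes, and function names are illustrative assumptions, not from the source.

```python
import numpy as np

def actor_critic_step(phi, score_grad, x_s, x_s_next, r, gamma,
                      alpha_critic=0.05, alpha_actor=0.01, discount_t=1.0):
    """One TD(0) actor-critic update with a linear critic V_phi(s) = phi @ x(s).

    score_grad : grad_theta log pi_theta(a_t | s_t) for the sampled action (array).
    x_s, x_s_next : critic features of s_t and s_{t+1}.
    discount_t : gamma^t factor for the actor term (often dropped in practice).
    Returns the updated critic parameters and the actor gradient step.
    """
    # TD error delta_t = r_t + gamma V(s_{t+1}) - V(s_t); also the advantage estimate.
    delta = r + gamma * phi @ x_s_next - phi @ x_s
    # Critic: semi-gradient step on 0.5 * delta^2, treating the target as fixed.
    phi = phi + alpha_critic * delta * x_s
    # Actor: policy-gradient step weighted by the TD-error advantage.
    actor_step = alpha_actor * discount_t * delta * score_grad
    return phi, actor_step

# Toy usage with made-up numbers.
phi = np.zeros(3)
phi, step = actor_critic_step(phi, np.array([0.2, -0.1]),
                              x_s=np.array([1.0, 0.5, 0.0]),
                              x_s_next=np.array([0.0, 1.0, 0.5]),
                              r=1.0, gamma=0.99)
print(phi, step)
```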

Connection to causal inference.
- The TD error \delta_t is analogous to the residual in a regression‑adjusted estimator of treatment effect.
- The critic approximates the conditional mean outcome (counterfactual reward), giving a smoother estimate.


5. Off‑Policy Gradient Estimation (Importance Sampling)

In practice we often have data collected under a behavior policy \pi_{\theta_b} (or a set of policies) and wish to evaluate or improve a different target policy \pi_\theta. The importance‑sampling (IS) correction gives

\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_{\theta_b}} \biggl[ \sum_{t}\gamma^t\Bigl(\prod_{i=0}^t \frac{\pi_\theta(a_i|s_i)}{\pi_{\theta_b}(a_i|s_i)}\Bigr) \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^\pi(s_t,a_t) \biggr]. \tag{IS‑PG}

The cumulative ratio \prod_{i=0}^t \frac{\pi_\theta(a_i|s_i)}{\pi_{\theta_b}(a_i|s_i)} is the importance weight up to time t, directly analogous to the inverse probability weight in causal inference (w=\frac{P(A=a|S)}{P_{\text{behavior}}(A=a|S)}).

Why the product appears (and why it stops at t).
Because the environment dynamics P(s_{t+1}\mid s_t,a_t) (and reward model p(r_t\mid s_t,a_t)) do not depend on \theta, the trajectory density ratio is \frac{p_\theta(\tau)}{p_{\theta_b}(\tau)} =\prod_{i\ge 0}\frac{\pi_\theta(a_i\mid s_i)}{\pi_{\theta_b}(a_i\mid s_i)}. In the t‑th summand of (IS‑PG), nothing depends on the trajectory beyond time t, so later ratio factors can be dropped without changing the expectation: conditional on H_{t+1}, \mathbb{E}_{a_{t+1}\sim \pi_{\theta_b}(\cdot\mid s_{t+1})}\!\Bigl[\frac{\pi_\theta(a_{t+1}\mid s_{t+1})}{\pi_{\theta_b}(a_{t+1}\mid s_{t+1})}\Bigr]=\sum_{a_{t+1}}\pi_\theta(a_{t+1}\mid s_{t+1})=1, and similarly for later steps. This yields the prefix weight in (IS‑PG). If the unknown Q^\pi(s_t,a_t) is replaced by the logged Monte‑Carlo return, rewards after time t depend on later actions and must carry their own weights: the t‑th summand becomes \nabla_\theta \log \pi_\theta(a_t\mid s_t)\sum_{k\ge t}\gamma^k w_{0:k}\,r_k with w_{0:k}=\prod_{i=0}^{k}\frac{\pi_\theta(a_i\mid s_i)}{\pi_{\theta_b}(a_i\mid s_i)} (per‑decision importance sampling). Using only the prefix weight w_{0:t} with the raw return G_t is a common lower‑variance approximation that effectively substitutes Q^{\pi_{\theta_b}} for Q^\pi.
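
A small sketch (names and inputs are illustrative) of this per‑decision computation for one logged trajectory: prefix weights w_{0:t} are accumulated in log space, and each reward carries the weight up to its own time index.

```python
import numpy as np

def per_decision_is_gradient(score_grads, rewards, logp_target, logp_behavior, gamma=0.99):
    """Per-decision importance-sampled policy gradient for one logged trajectory.

    Implements  sum_t score_t * sum_{k>=t} gamma^k w_{0:k} r_k,
    where w_{0:k} = prod_{i<=k} pi_theta(a_i|s_i) / pi_b(a_i|s_i).
    """
    T = len(rewards)
    # Prefix weights, accumulated in log space for numerical stability.
    w = np.exp(np.cumsum(np.asarray(logp_target) - np.asarray(logp_behavior)))
    # Weighted discounted rewards gamma^k w_{0:k} r_k and their suffix sums.
    weighted = (gamma ** np.arange(T)) * w * np.asarray(rewards)
    suffix = np.cumsum(weighted[::-1])[::-1]     # suffix[t] = sum_{k>=t} gamma^k w_{0:k} r_k
    grad = np.zeros_like(score_grads[0])
    for t in range(T):
        grad += score_grads[t] * suffix[t]
    return grad

# Toy usage with made-up per-step quantities (T = 3, d = 2).
scores = [np.array([0.1, 0.0]), np.array([-0.2, 0.1]), np.array([0.0, 0.3])]
print(per_decision_is_gradient(scores, rewards=[1.0, 0.5, 0.0],
                               logp_target=[-0.5, -1.0, -0.7],
                               logp_behavior=[-0.6, -0.9, -0.7]))
```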

Variance issues.
IS weights can explode when the target policy is far from the behavior policy. Remedies include:

  • Weight clipping / truncation (e.g., importance‑ratio clipping in PPO).
  • Per‑decision IS (use prefix weights, and/or truncate after a fixed horizon).
  • Self‑normalized IS (divide by sum of weights).

6. Natural Policy Gradients (NPG)

The vanilla gradient (PGT) treats \nabla_\theta J in Euclidean space. In RL, however, the policy manifold is naturally endowed with the Fisher information matrix F(\theta):

F(\theta) = \mathbb{E}_{s\sim\rho^\pi}\!\bigl[ \operatorname{Cov}_{a\sim\pi_\theta}\bigl[\nabla_\theta \log \pi_\theta(a|s)\bigr] \bigr].

The natural gradient is defined as

\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\nabla_\theta J(\theta).

Interpretation.
- F(\theta)^{-1} acts as a metric tensor on the parameter space, turning the Euclidean gradient into a steepest ascent direction under the KL divergence metric.
- In causal terms, it is analogous to weighting by the inverse covariance of the propensity score gradient, similar to weighted least squares.

Derivation (trust‑region viewpoint).
For a small parameter step \Delta\theta, a second‑order approximation gives \mathrm{KL}\bigl(\pi_{\theta}\,\|\,\pi_{\theta+\Delta\theta}\bigr)\approx \tfrac{1}{2}\Delta\theta^\top F(\theta)\Delta\theta, while J(\theta+\Delta\theta)\approx J(\theta)+\nabla_\theta J(\theta)^\top \Delta\theta. Maximizing the linearized improvement subject to a KL “trust region” constraint, \max_{\Delta\theta}\;\nabla_\theta J(\theta)^\top \Delta\theta \quad\text{s.t.}\quad \tfrac{1}{2}\Delta\theta^\top F(\theta)\Delta\theta \le \epsilon, yields \Delta\theta \propto F(\theta)^{-1}\nabla_\theta J(\theta), i.e. the natural gradient direction.
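
A minimal sketch for a single‑state softmax policy (purely illustrative): for logit parameters, the Fisher matrix has the closed form \operatorname{diag}(\pi)-\pi\pi^\top, which is singular (adding a constant to the logits changes nothing), so a small damping term is added before solving for the natural‑gradient direction, as is common in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
nA = 4
theta = rng.normal(size=nA)                    # softmax logits (single state)
q = np.array([1.0, 0.2, -0.5, 0.7])            # made-up Q(a) values

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Vanilla gradient of J(theta) = sum_a pi(a) q(a): E_{a~pi}[grad log pi(a) * q(a)].
# For softmax logits, grad_theta log pi(a) = e_a - pi.
grad = pi * (q - pi @ q)

# Fisher matrix F = E_{a~pi}[(e_a - pi)(e_a - pi)^T] = diag(pi) - pi pi^T.
F = np.diag(pi) - np.outer(pi, pi)

# F is singular for softmax logits, so add a small damping term before solving.
nat_grad = np.linalg.solve(F + 1e-3 * np.eye(nA), grad)

print(grad)
print(nat_grad)     # same ascent direction, rescaled by the Fisher metric
```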

Practical algorithms.
- TRPO (Schulman et al., 2015) approximates the natural gradient by solving a constrained optimization problem that keeps the KL divergence small.
- PPO (Schulman et al., 2017) uses a surrogate objective with a clipped ratio, which implicitly regularizes the step size similar to natural gradients.


7. Deterministic Policy Gradients (DPG) & Deep Deterministic Policy Gradient (DDPG)

When actions are continuous and the policy is deterministic \mu_\theta(s), the policy‑gradient theorem simplifies to

\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\rho^\mu} \bigl[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \bigr]. \tag{DPG}

Derivation (chain rule).
If the policy is deterministic, then (up to the usual occupancy‑measure subtlety) you can write J(\theta)=\mathbb{E}_{s\sim\rho^\mu}\bigl[Q^\mu(s,\mu_\theta(s))\bigr]. Differentiating through the action argument and applying the chain rule gives (DPG).
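
A minimal sketch of this chain rule (fixed state sample, hand‑coded quadratic critic, linear deterministic policy; all of these choices are assumptions for illustration), checked against a finite‑difference derivative of \mathbb{E}_s[Q(s,\mu_\theta(s))]:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100_000, 3

S = rng.normal(size=(n, d))
theta = np.array([0.2, -0.1, 0.4])             # deterministic policy mu_theta(s) = theta @ s

# Hand-coded critic Q(s, a) = -(a - g(s))^2, with g(s) a made-up "best action".
g = S @ np.array([0.5, 0.5, -0.2])
J = lambda th: -np.mean((S @ th - g) ** 2)      # J(theta) = E_s[Q(s, mu_theta(s))]

# DPG chain rule: grad J = E_s[ grad_theta mu_theta(s) * dQ/da at a = mu_theta(s) ],
# with grad_theta mu_theta(s) = s and dQ/da = -2 (a - g(s)).
dQ_da = -2.0 * (S @ theta - g)
grad_dpg = np.mean(S * dQ_da[:, None], axis=0)

# Finite-difference check, coordinate by coordinate.
eps = 1e-5
grad_fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

print(grad_dpg)
print(grad_fd)     # agree up to numerical error
```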

Key points:

  • The gradient no longer involves a log‑probability term; instead we back‑propagate through the deterministic mapping \mu_\theta.
  • The critic supplies \nabla_a Q (often via a neural network).
  • The expectation is over the state distribution under the current policy.
  • In practice, deterministic methods still rely on explicit exploration noise (e.g., in DDPG/TD3) to collect informative data; without randomization there is no overlap for off‑policy evaluation and learning becomes brittle.

Analogy.
- The deterministic policy is a hard treatment assignment rule (a deterministic dynamic regime).
- The gradient uses the local sensitivity of the expected reward to small perturbations in the action, similar to the derivative of a regression model with respect to treatment dose.


8. Generalized Advantage Estimation (GAE)

A powerful variance‑reduction trick is to combine multi‑step TD errors into a smoothed advantage:

\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, where \delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t).

  • \lambda \in [0,1] trades off bias (lower \lambda) vs variance (higher \lambda).
  • When \lambda=1, GAE reduces to the Monte‑Carlo advantage G_t - V_\phi(s_t); when \lambda=0, it reduces to the one‑step TD residual \delta_t.
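
A minimal sketch of the GAE recursion \hat A_t=\delta_t+\gamma\lambda\,\hat A_{t+1} for one truncated trajectory (function name and inputs are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one (finite, truncated) trajectory.

    rewards : array of r_0 .. r_{T-1}
    values  : array of V(s_0) .. V(s_T)  (one longer: includes the bootstrap value)
    Returns A_t = sum_l (gamma*lam)^l delta_{t+l}, delta_t = r_t + gamma V(s_{t+1}) - V(s_t).
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):        # backward recursion A_t = delta_t + gamma*lam*A_{t+1}
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# Toy usage; with lam=1 this reduces to G_t - V(s_t), with lam=0 to the one-step delta_t.
print(gae([1.0, 0.0, 2.0], [0.5, 0.4, 1.0, 0.0], gamma=0.9, lam=0.95))
```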

Causal parallel.
GAE is like using a regression baseline plus a sequence of residual corrections: it blends one‑step TD residuals to trade off bias and variance. This is not literally a doubly‑robust estimator, but it plays a similar “control variate / residualization” role in policy‑gradient updates.


9. Trust‑Region Methods: TRPO & PPO

Trust‑Region Policy Optimization (TRPO).
Instead of a simple gradient step, TRPO solves

\max_{\theta}\; \mathbb{E}\bigl[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A_{\theta_{\text{old}}}(s,a)\bigr] \quad\text{s.t.}\quad \mathbb{E}\bigl[ D_{\text{KL}}\bigl(\pi_{\theta_{\text{old}}}(\cdot|s)\;\|\;\pi_\theta(\cdot|s)\bigr)\bigr] \le \delta.

The KL constraint defines a trust region in policy space, ensuring that updates are conservative.

Proximal Policy Optimization (PPO).
PPO replaces the exact constrained problem with a surrogate objective:

L^{\text{CLIP}}(\theta)= \mathbb{E}\bigl[ \min\bigl(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t\bigr) \bigr], where r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}.

The clipping term enforces a soft trust region. PPO is computationally simpler yet empirically very robust.
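
A minimal sketch of the clipped surrogate for a batch of logged transitions (names are illustrative; in practice this objective is maximized with minibatch gradient ascent over several epochs):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate L^CLIP averaged over a batch of logged (s, a) pairs.

    logp_new   : log pi_theta(a_t|s_t) under the current policy parameters
    logp_old   : log pi_theta_old(a_t|s_t) under the policy that collected the data
    advantages : advantage estimates A_t (e.g. from GAE)
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))     # r_t(theta)
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise min bounds how much a single large ratio can improve the objective.
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers: the objective goes flat in the ratio once it leaves
# the [1-eps, 1+eps] band in the direction favored by the advantage.
print(ppo_clip_objective(logp_new=[-0.3, -1.2, -0.8],
                         logp_old=[-0.5, -1.0, -0.8],
                         advantages=[1.0, -0.5, 0.2]))
```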

Causal interpretation.
The surrogate objective can be seen as a robust estimator that limits the influence of large importance weights (which correspond to extreme propensity‑score ratios). The KL constraint is akin to a regularization on the propensity score distribution to avoid drastic shifts that would inflate variance.


10. Off‑Policy Evaluation (OPE) and Doubly Robust Estimators

Inverse‑Probability Weighted (IPW) estimator.
Given trajectories \{\tau_i\}_{i=1}^N of (truncated) horizon T from a behavior policy \pi_b, define the per‑step likelihood ratio \rho_{i,t}=\frac{\pi(a_{i,t}\mid s_{i,t})}{\pi_b(a_{i,t}\mid s_{i,t})} \quad\text{and}\quad w_{i,0:t}=\prod_{k=0}^{t}\rho_{i,k}. If the policies are history‑dependent, replace s_{i,t} by the full history H_{i,t} in these ratios. The (trajectory‑wise) IPW estimate of J(\pi) is

\widehat{J}_{\text{IPW}}(\pi) = \frac{1}{N}\sum_{i=1}^N w_{i,0:T-1}\,G_i, where G_i=\sum_{t=0}^{T-1}\gamma^t r_{i,t}.

In practice, a common lower‑variance alternative is per‑decision importance sampling (PDIS): \widehat{J}_{\text{PDIS}}(\pi) =\frac{1}{N}\sum_{i=1}^N \sum_{t=0}^{T-1}\gamma^t\, w_{i,0:t}\, r_{i,t}.

Doubly Robust (DR) estimator.
Combines an outcome model \hat{Q} of Q^\pi (and its implied state value \hat{V}(s)=\sum_a \pi(a\mid s)\,\hat{Q}(s,a)) with IPW:

\widehat{J}_{\text{DR}}(\pi) = \frac{1}{N}\sum_{i=1}^N \Bigl[ \hat{V}(s_{i,0}) + \sum_{t=0}^{T-1}\gamma^t\, w_{i,0:t}\, \bigl(r_{i,t} + \gamma \hat{V}(s_{i,t+1}) - \hat{Q}(s_{i,t},a_{i,t})\bigr) \Bigr].

If either the outcome model \hat{Q} is correct or the logged behavior propensities (hence the importance ratios) are correct, the estimator is consistent for J(\pi).
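
A small sketch (data layout and names are illustrative) computing the trajectory‑wise IPW, per‑decision IS, and DR estimates for a list of logged trajectories with known behavior propensities and fitted \hat Q, \hat V:

```python
import numpy as np

def ope_estimates(trajectories, gamma=0.99):
    """Trajectory-wise IPW, per-decision IS (PDIS), and DR estimates of J(pi).

    Each trajectory is a dict with per-step arrays:
      'logp_target', 'logp_behavior', 'rewards', 'q_hat', 'v_hat', 'v_next_hat'
    where q_hat[t] = Q_hat(s_t, a_t), v_hat[t] = V_hat(s_t), v_next_hat[t] = V_hat(s_{t+1}).
    """
    ipw, pdis, dr = [], [], []
    for traj in trajectories:
        r = np.asarray(traj['rewards'])
        T = len(r)
        disc = gamma ** np.arange(T)
        # Prefix weights w_{0:t}, accumulated in log space.
        w = np.exp(np.cumsum(np.asarray(traj['logp_target'])
                             - np.asarray(traj['logp_behavior'])))
        ipw.append(w[-1] * np.sum(disc * r))                           # full-trajectory weight
        pdis.append(np.sum(disc * w * r))                              # per-decision weights
        q, v, v_next = (np.asarray(traj[k]) for k in ('q_hat', 'v_hat', 'v_next_hat'))
        dr.append(v[0] + np.sum(disc * w * (r + gamma * v_next - q)))  # DR correction term
    return np.mean(ipw), np.mean(pdis), np.mean(dr)

# Toy usage with one made-up two-step trajectory.
traj = dict(logp_target=[-0.4, -0.9], logp_behavior=[-0.5, -0.8],
            rewards=[1.0, 0.5], q_hat=[1.4, 0.5], v_hat=[1.3, 0.45],
            v_next_hat=[0.45, 0.0])
print(ope_estimates([traj], gamma=0.9))
```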

RL analogue.
The actor‑critic update with a learned value function is a close cousin of DR: the critic plays the role of an outcome model, and the policy‑ratio terms (explicitly in off‑policy methods, implicitly in on‑policy sampling) play the role of propensity weighting. Variance is reduced when the critic is accurate.


11. Summary & Takeaways

| RL Concept | Causal‑Inference Analogue | Key Mathematical Point |
|---|---|---|
| Policy \pi_\theta(a\mid s) | Propensity score P(A=a\mid S=s) | Likelihood‑ratio gradient \nabla_\theta \log \pi |
| Return G_t | Counterfactual outcome Y(a) | Monte‑Carlo estimate of Q^\pi |
| Baseline b(s) | Regression adjustment | Control variate reducing variance |
| Off‑policy IS weights | Inverse probability weights | Cumulative ratio \prod_{k\le t}\pi/\pi_b |
| Critic Q^\pi | Outcome model \mathbb{E}[Y\mid A,S] | Control variate / doubly robust |
| Natural gradient | Weighted least squares / Fisher metric | Preconditioner F^{-1} |
| Deterministic policy | Hard treatment assignment | Gradient through deterministic mapping |
| Trust‑region (KL) | Robustness / regularization of propensity | Constrained optimization |
| GAE | Multi‑step DR / bias–variance tradeoff | Exponential weighting of TD errors |

Key equations to remember

  1. Policy Gradient Theorem
    \nabla_\theta J = \mathbb{E}\!\bigl[ \sum_t \gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t) Q^\pi(s_t,a_t) \bigr].

  2. REINFORCE with Baseline
    \widehat{\nabla}_\theta J = \sum_t \gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,[G_t - b(s_t)].

  3. Actor–Critic Update
    \widehat{\nabla}_\theta J = \sum_t \gamma^t\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,\hat A_t.

  4. Importance‑Sampling Policy Gradient
    \widehat{\nabla}_\theta J = \sum_t \gamma^t\Bigl(\prod_{i=0}^t \frac{\pi_\theta(a_i|s_i)}{\pi_b(a_i|s_i)}\Bigr) \nabla_\theta \log \pi_\theta(a_t|s_t)\,Q^\pi(s_t,a_t).

  5. Natural Gradient
    \tilde{\nabla}_\theta J = F(\theta)^{-1}\nabla_\theta J.

  6. Deterministic Policy Gradient
    \nabla_\theta J = \mathbb{E}_{s\sim\rho^\mu}\!\bigl[ \nabla_\theta \mu_\theta(s)\,\nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)} \bigr].

  7. GAE
    \hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}.

These formulas form the backbone of most modern deep RL systems. The causal‑inference perspective simply reframes them: the policy is a treatment rule, returns are counterfactual outcomes, IS weights are propensity ratios, and baselines/critics are regression adjustments that reduce variance.


12. Suggested Further Reading

| Topic | Classic Papers | Modern Extensions |
|---|---|---|
| Policy‑gradient theorem | Sutton et al. 2000 | N/A |
| REINFORCE | Williams 1992 | Actor‑Critic, A2C, DDPG |
| Natural gradients | Amari 1998 | TRPO, K‑FAC |
| Deterministic policy gradients | Silver et al. 2014 | DDPG, TD3 |
| Trust‑region methods | Schulman et al. 2015 | PPO, ACKTR |
| Variance reduction (GAE) | Schulman et al. 2016 | Generalized Advantage Estimation |
| Off‑policy evaluation | Munos et al. 2008 | DR, MAGIC, OPE |