This note is a Markdown migration of the policy-gradient RL notes from the Quarto source.


Reinforcement learning (RL) algorithms are often grouped into two broad families:

  • Value-based methods learn a value function (e.g., $Q^\pi(s, a)$ or $V^\pi(s)$) and define a policy implicitly by acting (approximately) greedily with respect to it (e.g., Q-learning, DQN).

  • Policy-gradient methods directly parameterize a stochastic policy and optimize its value by gradient ascent (e.g., REINFORCE, actor-critic, PPO/TRPO).

This note focuses on policy-gradient methods because (i) they have a clean causal-inference interpretation (policies as stochastic interventions / propensity models and value functions as outcome models), (ii) they handle continuous actions naturally, and (iii) modern deep RL practice is dominated by stable actor-critic and trust-region variants.

These notes are primarily for the author’s own reference but may serve as a useful bridge for others who approach RL from a causal-inference direction and frequently scratch their heads at the startling similarities between concepts and the distressing differences in notation and priorities. They rely heavily on chapters 14–15 of Stefan Wager’s excellent causal inference textbook, Murphy 2025, some hesitant references to Sutton and Barto, and zero-shot elaborations from a council of frontier LLMs.

1. Problem Setup

Markov Decision Process (MDP).
An MDP is the tuple

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma),$$

where

  • $\mathcal{S}$ is a (finite or continuous) state space,
  • $\mathcal{A}$ an action space,
  • $P(s' \mid s, a)$ the transition kernel,
  • $r(s, a)$ the (possibly stochastic) immediate reward,
  • $\gamma \in [0, 1)$ a discount factor.

A policy $\pi_\theta(a \mid s)$ is a conditional distribution over actions given states, parameterized by $\theta$. We will treat the policy as the treatment assignment rule in a causal-inference sense.

Causal-inference mapping.

  • State $s$ ↔ covariates $X$.
  • Action $a$ ↔ treatment $A$.
  • Reward $r$ ↔ outcome $Y$.
  • Policy $\pi_\theta(a \mid s)$ ↔ propensity score $e(x) = P(A = 1 \mid X = x)$.
  • Value function $Q^\pi(s, a)$ ↔ outcome model $\mu(x, a) = \mathbb{E}[Y \mid X = x, A = a]$.

Policy-based vs value-based (quick CI-friendly orientation).

  • Value-based RL (Q-learning, DQN): learn an optimal outcome model $Q^\star(s, a)$ via Bellman optimality and pick actions via $a = \arg\max_{a'} Q^\star(s, a')$. In causal terms, this is “estimate counterfactual mean outcomes, then choose the best treatment” (a plug-in optimal regime learner), with the complication that $Q^\star$ is defined by a fixed point and involves future decisions.
  • Policy gradients (REINFORCE, PPO): directly update a parametric propensity model $\pi_\theta(a \mid s)$ to increase $J(\theta)$. In causal terms, this is “move the assignment mechanism along a smooth path that increases expected welfare,” and the gradient weights changes in propensities by long-run causal contrasts (advantages).
  • Actor-critic: uses a learned value function mainly as a regression adjustment / control variate for the policy update rather than as the sole object to maximize.

Why policy gradients are popular (first order): both families struggle when there is little overlap (rare actions in relevant states) because learning requires extrapolation. Value-based methods amplify this via repeated bootstrapping and $\max$ operators (the “deadly triad”: function approximation + bootstrapping + off-policy data), while policy-gradient methods can be stabilized by staying close to the current/behavior policy (trust regions, incremental updates, entropy regularization) and avoid explicit $\arg\max_a$ over continuous actions. That said, value-based methods remain extremely strong in small, discrete action spaces and can be more sample-efficient because they reuse off-policy data.

Trajectory. A trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ follows the dynamics induced by $\pi_\theta$ and $P$. The probability density of a trajectory under policy $\pi_\theta$ is

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t),$$

where $\rho_0$ is the initial state distribution.

Return.
The discounted return from time $t$ is

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

We seek the policy objective

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[G_0],$$

where

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$$

is the state-value function and

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$$

is the action-value function, so that $J(\theta) = \mathbb{E}_{s_0 \sim \rho_0}[V^{\pi_\theta}(s_0)]$.
In causal language, $J(\theta)$ is the average counterfactual reward if we set the “treatment rule” to $\pi_\theta$.

It is often convenient to define the discounted state-occupancy measure

$$d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi),$$

or its normalized version $\bar d^\pi(s) = (1 - \gamma)\, d^\pi(s)$ (a proper distribution). With $r(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$,

$$J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s, a).$$

1.1 A Causal Model for Sequential Decisions (and Where It Breaks)

Write the history (information set) at time $t$ as

$$H_t = (s_0, a_0, r_0, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t),$$

and view a policy $\pi = (\pi_0, \pi_1, \dots)$ with decision rules $\pi_t(a_t \mid H_t)$ as a dynamic treatment regime (for Markov policies, this reduces to $\pi(a_t \mid s_t)$).

To interpret policies causally, it is helpful to introduce policy potential outcomes: $\tau(\pi)$ denotes the counterfactual trajectory generated if actions were assigned according to $\pi$. Then the causal estimand is

$$J(\pi) = \mathbb{E}\big[G_0(\pi)\big],$$

the mean counterfactual return under regime $\pi$.
Identification from logged data hinges on sequential versions of the usual causal assumptions:

  • Consistency. If the realized action rule matches $\pi$, then realized states/rewards equal the corresponding potential outcomes.
  • Positivity / overlap. For any history that occurs under the target policy $\pi$, the behavior policy $\pi_b$ assigns positive probability to any action that $\pi$ might choose.
  • Sequential ignorability (no unobserved confounding). Roughly, $a_t \perp\!\!\!\perp \big(s_{t+1}(\bar a), r_t(\bar a), \dots\big) \mid H_t$ for all action histories $\bar a$, meaning that given the recorded history, the action is “as-if randomized” with respect to future potential outcomes.

Under these assumptions, the sequential g-formula identifies the counterfactual trajectory distribution under $\pi$ as exactly the interventional rollout distribution we wrote earlier,

$$p_\pi(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_t(a_t \mid H_t)\, P(s_{t+1} \mid s_t, a_t),$$

and therefore

$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi}[G_0].$$
This is the core bridge: RL “policy evaluation” is g-computation for a dynamic treatment regime, and off-policy methods are sequential IPW/DR for that same estimand.

Endogeneity in RL = missing state / hidden confounding.
If there is an unobserved variable $u_t$ that affects both action choice and rewards/transitions, then conditioning on $s_t$ (or even $H_t$) may not block backdoor paths. In econometric language, actions become endogenous and $\pi_b(a_t \mid s_t)$ is not a valid propensity score. This is mainly a problem for offline (observational) evaluation and learning: importance weighting and doubly-robust estimators are generally biased without sequential ignorability given the logged state/history. (By contrast, if you can actively randomize actions online, policy-gradient learning remains viable even with partial observability—what breaks is the ability to make causal off-policy statements from observational logs without extra assumptions.)

Where explicit randomization enters.

  • In online RL / experimentation, the learner intervenes on $a_t$ by sampling from a known (often stochastic) policy. This built-in randomization is the analogue of running a sequential randomized experiment and is what makes likelihood-ratio gradients and on-policy evaluation straightforward.

  • In offline RL, you inherit the (possibly endogenous) behavior policy that generated the data. Valid OPE/off-policy learning requires either (i) a credible ignorability story given the logged history/state and overlap, or (ii) additional structure (proxies, instruments/encouragement, partial observability models) beyond standard MDP assumptions.

In both cases, you must know or estimate action probabilities (logged propensities) to use IPW/DR style estimators; if the data do not record (or cannot credibly reconstruct) $\pi_b(a_t \mid H_t)$, off-policy causal claims become much harder.

Analogue of IV/encouragement designs.
The closest RL analogue to IV is to use exogenous variation that shifts actions but does not directly affect rewards except through actions—e.g., randomized “nudges”/recommendations, known exploration noise, or randomized constraints on feasible actions—together with assumptions that let you separate causal effects from confounding (often formalized via confounded MDPs or POMDPs with proxy variables).

1.2 Contextual Bandits as the “Static” Special Case

A contextual bandit is the one-step (or no-transition) special case of an MDP: observe a context/state $s$, choose an action $a$, and observe an outcome/reward $r$. There is no state evolution driven by actions (or you can think of $\gamma = 0$).

This is exactly the standard causal setup with covariates $X = s$, treatment $A = a$, and outcome $Y = r$, except that in bandits the data may be collected adaptively (the assignment rule can depend on past data). Two key links:

  • Policy gradient collapses to familiar score-function / AIPW structure.
    With $T = 1$, the REINFORCE estimator is just $\nabla_\theta \log \pi_\theta(a \mid s)\,\big(r - b(s)\big)$, and off-policy evaluation uses the single-step weight $\pi_\theta(a \mid s)/\pi_b(a \mid s)$.
  • Exploration = explicit randomization.
    Bandit algorithms (e.g., $\epsilon$-greedy, Thompson sampling) deliberately randomize actions to learn; this is the experimental-design piece that provides overlap/positivity. The analogue in multi-step RL is stochastic policies (or injected noise) during learning.

Bandit/RL logs are often adaptively collected (the assignment rule changes over time as data arrive). Identification via IPW/DR still goes through when propensities are known and ignorability holds given the logged context/history, but inference (standard errors) typically needs martingale/online-learning arguments rather than i.i.d. sampling.
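
To make the single-step link concrete, here is a minimal simulation sketch — the uniform context, the behavior propensity of 0.3, the threshold target policy, and the outcome model are all invented for illustration — checking that one-step IPW recovers the g-computation value of a target regime:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome model E[r | s, a]: action 1 pays s, action 0 pays 0.5.
def mu(s, a):
    return np.where(a == 1, s, 0.5)

n = 200_000
s = rng.uniform(0, 1, n)

# Behavior policy: constant randomization gives overlap (positivity).
pb = np.full(n, 0.3)               # P(a = 1 | s) in the logged data
a = rng.binomial(1, pb)
r = mu(s, a) + rng.normal(0, 0.1, n)

# Target policy: treat exactly when s > 0.5 (a deterministic regime).
pi = (s > 0.5).astype(float)       # pi(a = 1 | s)

# Single-step IPW: weight each logged reward by pi(a|s) / pb(a|s).
w = np.where(a == 1, pi / pb, (1 - pi) / (1 - pb))
J_ipw = np.mean(w * r)

# Ground truth by g-computation: E[pi * mu(s,1) + (1 - pi) * mu(s,0)] = 0.625.
J_true = np.mean(pi * mu(s, 1) + (1 - pi) * mu(s, 0))
```

The same weights with $\pi$ replaced by a stochastic policy give the off-policy evaluation formula quoted in the bullet above.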


2. Policy-Gradient Theorem

The policy-gradient theorem (Sutton et al., 2000) gives a closed-form expression for the gradient of $J(\theta)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \tag{PGT}$$

Equivalently, using the discounted occupancy measure $d^{\pi_\theta}$,

$$\nabla_\theta J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a).$$

2.1 Contextual Bandit Special Case (Why the Gradient Is “Treatment Effect × Policy Sensitivity”)

If you set the horizon to one step ($T = 1$), RL becomes a contextual bandit: observe $s$, pick $a \sim \pi_\theta(\cdot \mid s)$, observe reward $r$, and stop. Let the conditional mean outcome be

$$\mu(s, a) = \mathbb{E}[r \mid s, a].$$

Then the value of a stochastic policy is just the mixture of counterfactual means:

$$J(\theta) = \mathbb{E}_s\left[\sum_a \pi_\theta(a \mid s)\, \mu(s, a)\right].$$

Differentiating w.r.t. the policy parameters (not w.r.t. the discrete action) gives

$$\nabla_\theta J(\theta) = \mathbb{E}_s\left[\sum_a \nabla_\theta \pi_\theta(a \mid s)\, \mu(s, a)\right] = \mathbb{E}_{s,\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \mu(s, a)\big],$$

which is exactly the one-step (bandit) form of the policy-gradient theorem. Since $\mu$ is unknown, on-policy sampling replaces it with the realized reward $r$, yielding the unbiased score-function estimator $\nabla_\theta \log \pi_\theta(a \mid s)\, r$ (REINFORCE for $T = 1$).

For a binary action $a \in \{0, 1\}$ this becomes especially transparent:

$$J(\theta) = \mathbb{E}_s\big[\pi_\theta(1 \mid s)\, \mu(s, 1) + \big(1 - \pi_\theta(1 \mid s)\big)\, \mu(s, 0)\big],$$

so

$$\nabla_\theta J(\theta) = \mathbb{E}_s\big[\nabla_\theta \pi_\theta(1 \mid s)\, \big(\mu(s, 1) - \mu(s, 0)\big)\big].$$

The term $\mu(s, 1) - \mu(s, 0)$ is exactly the conditional average treatment effect (CATE) at state $s$, and $\nabla_\theta \pi_\theta(1 \mid s)$ is how much your parametric assignment rule can change the propensity at $s$.

If you parameterize $\pi_\theta(1 \mid s) = \sigma(\theta^\top \phi(s))$ (logistic propensity),

$$\nabla_\theta \pi_\theta(1 \mid s) = \sigma\big(\theta^\top \phi(s)\big)\Big(1 - \sigma\big(\theta^\top \phi(s)\big)\Big)\, \phi(s),$$

so gradient ascent increases the log-odds of action $1$ in states where action $1$ has higher counterfactual value than action $0$, and decreases it where it has lower value. In multi-step RL, the CATE is replaced by the long-run contrast $Q^\pi(s, 1) - Q^\pi(s, 0)$ (or an advantage).
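
The binary-action identity can be checked numerically. In this sketch the outcome model is invented ($\mu(s,1) = s$, $\mu(s,0) = 0.5$, so the CATE is $s - 0.5$) and the policy is a one-parameter logistic propensity; the closed-form gradient $\mathbb{E}[\nabla_\theta \pi_\theta(1 \mid s)\,\mathrm{CATE}(s)]$ agrees with the score-function Monte-Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical one-step bandit: context s ~ U(0,1), mean outcomes mu(s, a).
def mu(s, a):
    return np.where(a == 1, s, 0.5)     # CATE(s) = s - 0.5

theta = 0.7
n = 500_000
s = rng.uniform(0, 1, n)
p1 = sigmoid(theta * s)                 # logistic propensity pi_theta(1 | s)

# Closed form: grad J = E[dpi/dtheta * (mu(s,1) - mu(s,0))],
# with dpi/dtheta = p1 * (1 - p1) * s for this parameterization.
grad_closed = np.mean(p1 * (1 - p1) * s * (s - 0.5))

# Score-function (REINFORCE) Monte Carlo: sample a ~ pi_theta, use realized r.
a = rng.binomial(1, p1)
r = mu(s, a) + rng.normal(0, 0.1, n)
score = np.where(a == 1, 1 - p1, -p1) * s   # d log pi(a|s) / d theta
grad_mc = np.mean(score * r)
```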

2.2 Incremental Propensity-Score Interventions (Odds Tilts) as Policy Paths

Econometrics/causal inference often targets incremental propensity-score interventions (Kennedy and coauthors): instead of setting “everyone treated” vs “everyone untreated”, define a stochastic intervention that nudges treatment odds by a fixed factor $\delta$ and study how welfare changes as you turn the knob. In RL terms, this is a one-dimensional path of policies through the behavior policy, and differentiating along that path is a policy gradient in that specific direction.

For binary action with behavior propensity $\pi_b(s) = P(a = 1 \mid s)$, an odds-tilt by $\delta > 0$ defines the counterfactual policy

$$\pi_\delta(1 \mid s) = \frac{\delta\, \pi_b(s)}{\delta\, \pi_b(s) + 1 - \pi_b(s)}.$$

Let $J(\delta) = \mathbb{E}_s\big[\pi_\delta(1 \mid s)\, \mu(s, 1) + \big(1 - \pi_\delta(1 \mid s)\big)\, \mu(s, 0)\big]$ denote the value of the tilted policy. In the one-step bandit case, differentiating gives

$$\frac{dJ(\delta)}{d\delta} = \mathbb{E}_s\left[\frac{\partial \pi_\delta(1 \mid s)}{\partial \delta}\, \big(\mu(s, 1) - \mu(s, 0)\big)\right], \qquad \frac{\partial \pi_\delta(1 \mid s)}{\partial \delta} = \frac{\pi_b(s)\big(1 - \pi_b(s)\big)}{\big(\delta\, \pi_b(s) + 1 - \pi_b(s)\big)^2}.$$

This is exactly “CATE sensitivity of the incremental intervention.” Equivalently, in score form,

$$\frac{dJ(\delta)}{d\delta} = \mathbb{E}_{s,\, a \sim \pi_\delta}\left[\frac{\partial \log \pi_\delta(a \mid s)}{\partial \delta}\, \mu(s, a)\right].$$
Two practical takeaways that mirror offline RL practice:

  • Incremental interventions stay close to $\pi_b$, improving overlap and stabilizing importance weights (contrast with large policy shifts).
  • This is conceptually similar to trust-region policy updates (e.g., PPO/TRPO): control policy divergence, then optimize the resulting welfare curve (or take a gradient step at $\delta = 1$).

In multi-step RL, the same idea can be applied per time step by tilting each logged decision rule; then $dJ/d\delta$ involves discounted sums of score terms times long-run causal contrasts (advantages), making the connection to policy gradients direct.
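
A short numerical sketch of the odds-tilt path — the behavior propensity and CATE below are chosen arbitrarily — confirming that the closed-form derivative $dJ/d\delta$ matches a finite-difference check on the tilted value curve:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical behavior propensity and CATE over contexts.
s = rng.uniform(0, 1, 100_000)
pb = 0.2 + 0.6 * s                     # behavior propensity pi_b(s)
cate = s - 0.5                         # mu(s,1) - mu(s,0)

def tilt(pb, delta):
    """Odds-tilted propensity: multiply the treatment odds by delta."""
    return delta * pb / (delta * pb + 1 - pb)

def dtilt(pb, delta):
    """Closed-form derivative of the tilted propensity in delta."""
    return pb * (1 - pb) / (delta * pb + 1 - pb) ** 2

delta = 1.5
# dJ/ddelta = E[d pi_delta / d delta * CATE(s)]
dJ = np.mean(dtilt(pb, delta) * cate)

# Finite-difference check; the mu(s,0) term is constant in delta, so it drops
# out of differences and J can be reduced to E[pi_delta * CATE].
eps = 1e-5
J = lambda d: np.mean(tilt(pb, d) * cate)
dJ_fd = (J(delta + eps) - J(delta - eps)) / (2 * eps)
```

At $\delta = 1$ the tilt is the identity, which is why a gradient step "at $\delta = 1$" corresponds to a local policy-gradient update through $\pi_b$.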

Derivation

The key trick is the likelihood-ratio (a.k.a. score-function) identity

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$

Then

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, G_0(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, G_0(\tau)\big],$$

where, because the dynamics terms in $p_\theta(\tau)$ do not depend on $\theta$,

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Now decompose the return around time $t$:

$$G_0 = \sum_{k < t} \gamma^k r_k + \gamma^t G_t.$$

The “past” term drops out because it does not depend on $a_t$: the score has conditional mean zero, $\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid H_t\big] = 0$, so

$$\mathbb{E}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k < t} \gamma^k r_k\Big] = 0.$$

So

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right].$$

Finally, condition on $(s_t, a_t)$ and use $\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)$ to replace $G_t$ by $Q^{\pi_\theta}(s_t, a_t)$, yielding (PGT).

Analogy to causal inference.

  • The score $\nabla_\theta \log \pi_\theta(a \mid s)$ is the analogue of the score of a propensity-score model $\nabla_\theta \log e_\theta(a \mid x)$.
  • The factor $Q^{\pi_\theta}(s, a)$ is the counterfactual reward under treatment $a$, analogous to $\mathbb{E}[Y \mid X = x, A = a]$.
  • The expectation over trajectories is the analogue of the expectation over the joint distribution of $(X, A, Y)$.
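
The likelihood-ratio identity is easy to verify numerically for a softmax distribution over three actions (the parameter vector and the per-action “rewards” below are arbitrary): the exact gradient of $\mathbb{E}_{a \sim p_\theta}[f(a)]$ matches the score-function Monte-Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check: d/dtheta E_{a ~ p_theta}[f(a)] = E[d log p_theta(a)/dtheta * f(a)].
theta = np.array([0.2, -0.1, 0.4])
f = np.array([1.0, 3.0, -2.0])          # arbitrary "reward" per action

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

p = probs(theta)

# Exact gradient via the softmax Jacobian d p_a/d theta_b = p_a (1[a=b] - p_b).
grad_exact = p * f - p * np.dot(p, f)

# Score-function Monte Carlo estimate.
n = 2_000_000
a = rng.choice(3, size=n, p=p)
score = np.eye(3)[a] - p                # d log p_theta(a)/d theta for softmax
grad_mc = (score * f[a][:, None]).mean(axis=0)
```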

3. REINFORCE (Monte-Carlo Policy Gradient)

The classic REINFORCE algorithm (Williams, 1992) estimates the gradient by sampling trajectories and replacing $Q^{\pi_\theta}(s_t, a_t)$ by the Monte-Carlo return $G_t$ observed along the trajectory:

$$\hat g = \sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t.$$
Why it works:

  • $G_t$ is an unbiased (but high-variance) estimator of $Q^{\pi_\theta}(s_t, a_t)$.
  • The estimator $\hat g$ is thus unbiased for $\nabla_\theta J(\theta)$.

Variance Reduction: Baselines.
Adding a baseline $b(s_t)$ that does not depend on the action yields

$$\hat g = \sum_{t=0}^{T-1} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big).$$

Because $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\big] = 0$, subtracting $b(s_t)$ leaves the expectation unchanged while reducing variance.

Analogy: $b(s)$ is like a regression adjustment in causal inference: it captures the part of the outcome that is independent of the treatment, reducing noise in the estimate.

A common choice is the state-value function $V^{\pi_\theta}(s_t)$ as baseline, turning the term $G_t - b(s_t)$ into an advantage estimate $\hat A_t = G_t - V^{\pi_\theta}(s_t)$.
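
A one-step sketch of why baselines help, with a softmax policy over two arms and made-up reward means that share a large common level: both estimators target the same gradient, but subtracting a baseline removes the common level and collapses the variance. (Estimating the baseline from the same batch, as below, introduces an $O(1/n)$ bias that is negligible here.)

```python
import numpy as np

rng = np.random.default_rng(4)

# One-step illustration (T = 1): two actions, rewards with a large shared level.
theta = np.array([0.0, 0.0])
means = np.array([10.0, 10.5])          # hypothetical arm means

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

p = probs(theta)                        # = [0.5, 0.5]
n = 200_000
a = rng.choice(2, size=n, p=p)
r = means[a] + rng.normal(0, 1.0, n)
score = np.eye(2)[a] - p                # grad log pi(a) for softmax

g_plain = score * r[:, None]            # raw REINFORCE terms
b = r.mean()                            # constant baseline (value estimate)
g_base = score * (r - b)[:, None]       # baseline-adjusted terms

grad_plain = g_plain.mean(axis=0)
grad_base = g_base.mean(axis=0)
var_ratio = g_base.var(axis=0).sum() / g_plain.var(axis=0).sum()
```

Both means estimate the true gradient $(-0.125, 0.125)$; the variance ratio is roughly the reward variance divided by the squared reward level, about 1%.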


4. Actor-Critic: Using a Learned Critic

The main limitation of REINFORCE is the high variance of $G_t$. Actor-critic methods use a critic $V_w(s)$ (or $Q_w(s, a)$) to estimate $V^\pi$ (or $Q^\pi$) from data, reducing variance.

Basic actor-critic update:

  • Critic update (e.g., TD(0)):

$$\delta_t = r_t + \gamma\, V_w(s_{t+1}) - V_w(s_t);$$

    minimize $\delta_t^2$ w.r.t. critic parameters $w$.

  • Actor update (policy gradient with critic):

$$\theta \leftarrow \theta + \alpha\, \gamma^t\, \hat A_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

    where a common one-step advantage estimate is $\hat A_t = \delta_t$ (TD(0)), and longer-horizon choices include GAE.

The critic acts like a control variate: it approximates $Q^\pi$ (or $V^\pi$) so that the gradient uses a low-variance advantage estimate instead of the raw return $G_t$.

Connection to causal inference.

  • The TD error $\delta_t$ is analogous to the residual $Y - \hat\mu(X, A)$ in a regression-adjusted estimator of a treatment effect.
  • The critic approximates the conditional mean outcome (counterfactual reward), giving a smoother estimate.
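
A minimal sketch of the critic half of this loop: tabular TD(0) on a made-up two-state chain with a fixed policy, checked against the exact value function $V = (I - \gamma P)^{-1} r$. The actor update would then use the same $\delta_t$ as its advantage estimate.

```python
import numpy as np

rng = np.random.default_rng(5)

gamma = 0.9
P = np.array([[0.7, 0.3],               # state transitions under the fixed policy
              [0.4, 0.6]])
r = np.array([1.0, 0.0])                # expected reward per state

# Exact solution of the Bellman evaluation equation.
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

V = np.zeros(2)
alpha = 0.01
s = 0
for _ in range(200_000):
    s2 = rng.choice(2, p=P[s])
    delta = r[s] + gamma * V[s2] - V[s]  # TD error = critic residual
    V[s] += alpha * delta                # TD(0) stochastic-approximation update
    s = s2
```

With a constant step size the iterates hover near `V_exact` with $O(\alpha)$ noise; a decaying step size would converge exactly.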

5. Off-Policy Gradient Estimation (Importance Sampling)

In practice we often have data collected under a behavior policy $\pi_b$ (or a set of policies) and wish to evaluate or improve a different target policy $\pi_\theta$. The importance-sampling (IS) correction gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_b}\left[\sum_{t=0}^{\infty} \gamma^t \left(\prod_{k=0}^{t} \frac{\pi_\theta(a_k \mid s_k)}{\pi_b(a_k \mid s_k)}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \tag{IS-PG}$$

The cumulative ratio $\rho_{0:t} = \prod_{k=0}^{t} \pi_\theta(a_k \mid s_k)/\pi_b(a_k \mid s_k)$ is the importance weight up to time $t$, directly analogous to the inverse probability weight $1/P(A = a \mid X)$ in causal inference. (If $Q^{\pi_\theta}$ is replaced by a realized return, the rewards after time $t$ must themselves be reweighted.)

Why the product appears (and why it stops at $t$).
Because the environment dynamics $P$ (and reward model $r$) do not depend on $\theta$, the trajectory density ratio is

$$\frac{p_\theta(\tau)}{p_{\pi_b}(\tau)} = \prod_{t \ge 0} \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}.$$

In the $t$-th summand, factors for times $k > t$ can be dropped without changing the expectation: conditional on $s_{t+1}$,

$$\mathbb{E}_{a_{t+1} \sim \pi_b(\cdot \mid s_{t+1})}\left[\frac{\pi_\theta(a_{t+1} \mid s_{t+1})}{\pi_b(a_{t+1} \mid s_{t+1})}\right] = \sum_a \pi_\theta(a \mid s_{t+1}) = 1,$$

and similarly for later steps. This yields the per-decision (prefix) weight in (IS-PG).

Variance issues.
IS weights can explode when the target policy is far from the behavior policy. Remedies include:

  • Weight clipping / truncation (e.g., importance-ratio clipping in PPO).
  • Per-decision IS (use prefix weights, and/or truncate after a fixed horizon).
  • Self-normalized IS (divide by sum of weights).
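
A small worked example of prefix weights on a single logged trajectory (the propensities and rewards below are invented): trajectory-wise IPW applies the full product $\rho_{0:T-1}$ to every reward, while per-decision IS weights each reward only by its own prefix.

```python
import numpy as np

# Target/behavior probabilities of the logged actions (hypothetical values).
pi_probs = np.array([0.9, 0.5, 0.8])
pb_probs = np.array([0.6, 0.5, 0.4])
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.99

rho = pi_probs / pb_probs               # per-step ratios: [1.5, 1.0, 2.0]
prefix = np.cumprod(rho)                # rho_{0:t}: the weight "stops at t"

disc = gamma ** np.arange(len(rewards))
J_traj = prefix[-1] * np.sum(disc * rewards)   # trajectory-wise IPW value term
J_pdis = np.sum(disc * prefix * rewards)       # per-decision (prefix) version
```

Both are unbiased over many trajectories; the per-decision version simply avoids multiplying early rewards by ratios from later decisions.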

6. Natural Policy Gradients (NPG)

The vanilla gradient (PGT) treats $\theta$ as living in Euclidean space. In RL, however, the policy manifold is naturally endowed with the Fisher information matrix $F(\theta)$:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\big].$$

The natural gradient is defined as

$$\tilde\nabla_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta).$$
Interpretation.

  • $F(\theta)$ acts as a metric tensor on the parameter space, turning the Euclidean gradient into a steepest-ascent direction under the KL-divergence metric.
  • In causal terms, it is analogous to weighting by the inverse covariance of the propensity score gradient, similar to weighted least squares.

Derivation (trust-region viewpoint).
For a small parameter step $\Delta\theta$, a second-order approximation gives

$$\mathrm{KL}\big(\pi_\theta \,\big\|\, \pi_{\theta + \Delta\theta}\big) \approx \tfrac{1}{2}\, \Delta\theta^\top F(\theta)\, \Delta\theta,$$

while $J(\theta + \Delta\theta) \approx J(\theta) + \nabla_\theta J(\theta)^\top \Delta\theta$. Maximizing the linearized improvement subject to a KL “trust region” constraint,

$$\max_{\Delta\theta}\; \nabla_\theta J(\theta)^\top \Delta\theta \quad \text{s.t.} \quad \tfrac{1}{2}\, \Delta\theta^\top F(\theta)\, \Delta\theta \le \epsilon,$$

yields $\Delta\theta \propto F(\theta)^{-1} \nabla_\theta J(\theta)$, i.e. the natural-gradient direction.
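
A tiny closed-form illustration: for a stateless Bernoulli policy with logit parameter $\theta$, the Fisher information is $\sigma(\theta)\big(1 - \sigma(\theta)\big)$, so the natural gradient undoes the vanishing of the vanilla gradient as the policy saturates (the reward means here are made up).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stateless policy pi_theta(1) = sigmoid(theta); hypothetical arm means mu0, mu1.
mu0, mu1 = 0.0, 1.0

def grads(theta):
    p = sigmoid(theta)
    g = p * (1 - p) * (mu1 - mu0)       # vanilla gradient of J(theta)
    F = p * (1 - p)                     # Fisher information of the Bernoulli logit
    return g, g / F                     # (vanilla, natural)

g_mid, ng_mid = grads(0.0)              # p = 0.5: vanilla gradient 0.25
g_sat, ng_sat = grads(4.0)              # p ~ 0.98: vanilla gradient nearly 0
```

The natural gradient equals $\mu_1 - \mu_0$ everywhere: a constant-size step in the "right" geometry, regardless of how deterministic the policy has become.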

Practical algorithms.

  • TRPO (Schulman et al., 2015) approximates the natural gradient by solving a constrained optimization problem that keeps the KL divergence small.
  • PPO (Schulman et al., 2017) uses a surrogate objective with a clipped ratio, which implicitly regularizes the step size similar to natural gradients.

7. Deterministic Policy Gradients (DPG) & Deep Deterministic Policy Gradient (DDPG)

When actions are continuous and the policy is deterministic, $a = \mu_\theta(s)$, the policy-gradient theorem simplifies to

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\Big[\nabla_\theta \mu_\theta(s)\; \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\Big]. \tag{DPG}$$
Derivation (chain rule).
If the policy is deterministic, then (up to the usual occupancy-measure subtlety) you can write

$$J(\theta) = \mathbb{E}_{s \sim d^{\mu_\theta}}\big[Q^{\mu_\theta}\big(s, \mu_\theta(s)\big)\big].$$

Differentiating through the action argument and applying the chain rule gives (DPG).

Key points:

  • The gradient no longer involves a log-probability term; instead we back-propagate through the deterministic mapping $s \mapsto \mu_\theta(s)$.
  • The critic supplies $\nabla_a Q^{\mu_\theta}(s, a)$ (often via a neural network).
  • The expectation is over the state distribution under the current policy.
  • In practice, deterministic methods still rely on explicit exploration noise (e.g., in DDPG/TD3) to collect informative data; without randomization there is no overlap for off-policy evaluation and learning becomes brittle.

Analogy.

  • The deterministic policy is a hard treatment assignment rule (a deterministic dynamic regime).
  • The gradient uses the local sensitivity of the expected reward to small perturbations in the action, similar to the derivative of a regression model with respect to treatment dose.
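
The chain rule in (DPG) can be sanity-checked on a toy problem with a known critic; all choices below are illustrative (a quadratic $Q$ peaked at $a = s$, a one-parameter linear policy, and a fixed grid standing in for the state distribution):

```python
import numpy as np

# Known critic: Q(s, a) = -(a - s)^2, maximized at a = s.
def Q(s, a):
    return -(a - s) ** 2

def dQ_da(s, a):
    return -2.0 * (a - s)

theta = 0.3
states = np.linspace(0.1, 1.0, 50)      # stand-in for the state distribution

# DPG: backpropagate dQ/da through d mu_theta/d theta = s.
a = theta * states
grad_dpg = np.mean(states * dQ_da(states, a))

# Finite-difference check on J(theta) = E_s[Q(s, mu_theta(s))].
eps = 1e-6
J = lambda th: np.mean(Q(states, th * states))
grad_fd = (J(theta + eps) - J(theta - eps)) / (2 * eps)
```

The gradient is positive because $\theta = 0.3$ undershoots the optimal dose response $a = s$ (i.e., $\theta = 1$), so the update pushes actions toward the critic's peak.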

8. Generalized Advantage Estimation (GAE)

A powerful variance-reduction trick is to combine multi-step TD errors into a smoothed advantage:

$$\hat A_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l},$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.

  • $\lambda \in [0, 1]$ trades off bias (lower $\lambda$) vs variance (higher $\lambda$).
  • When $\lambda = 1$, GAE reduces to the Monte-Carlo advantage $G_t - V(s_t)$; when $\lambda = 0$, it reduces to the one-step TD residual $\delta_t$.

Causal parallel.
GAE is like using a regression baseline plus a sequence of residual corrections: it blends one-step TD residuals to trade off bias and variance. This is not literally a doubly-robust estimator, but it plays a similar “control variate / residualization” role in policy-gradient updates.
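
A minimal implementation of the backward GAE recursion $\hat A_t = \delta_t + \gamma\lambda\, \hat A_{t+1}$ for a finite episode, checking the two limiting cases (the rewards and critic values are arbitrary):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for a finite episode.

    `values` has length len(rewards) + 1, with values[-1] the terminal value
    (0 for a true terminal state)."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):        # A_t = delta_t + gamma * lam * A_{t+1}
        acc = deltas[t] + gamma * lam * acc
        adv[t] = acc
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.2, 0.1, 0.0])  # arbitrary critic values, terminal = 0
gamma = 0.9

adv0 = gae(rewards, values, gamma, lam=0.0)  # reduces to one-step TD errors
adv1 = gae(rewards, values, gamma, lam=1.0)  # reduces to G_t - V(s_t)
```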


9. Trust-Region Methods: TRPO & PPO

Trust-Region Policy Optimization (TRPO).
Instead of a simple gradient step, TRPO solves

$$\max_\theta\; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat A(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_s\Big[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s)\big)\Big] \le \epsilon.$$
The KL constraint defines a trust region in policy space, ensuring that updates are conservative.

Proximal Policy Optimization (PPO).
PPO replaces the exact constrained problem with a surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\Big(\rho_t(\theta)\, \hat A_t,\; \mathrm{clip}\big(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat A_t\Big)\Big],$$

where $\rho_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$.

The clipping term enforces a soft trust region. PPO is computationally simpler yet empirically very robust.

Causal interpretation.
The surrogate objective can be seen as a robust estimator that limits the influence of large importance weights (which correspond to extreme propensity-score ratios). The KL constraint is akin to a regularization on the propensity score distribution to avoid drastic shifts that would inflate variance.


10. Off-Policy Evaluation (OPE) and Doubly Robust Estimators

Inverse-Probability Weighted (IPW) estimator.
Given trajectories of (truncated) horizon $T$ from a behavior policy $\pi_b$, define the per-step likelihood ratio

$$\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}.$$

If the policies are history-dependent, replace $s_t$ by the full history $H_t$ in these ratios. The (trajectory-wise) IPW estimate of $J(\theta)$ is

$$\hat J_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n} \left(\prod_{t=0}^{T-1} \rho_t^{(i)}\right) G_0^{(i)},$$

where $G_0^{(i)} = \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$.

In practice, a common lower-variance alternative is per-decision importance sampling (PDIS):

$$\hat J_{\mathrm{PDIS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \left(\prod_{k=0}^{t} \rho_k^{(i)}\right) r_t^{(i)}.$$
Doubly Robust (DR) estimator.
Combines a model estimate $\hat Q$ of $Q^{\pi_\theta}$ (with $\hat V(s) = \sum_a \pi_\theta(a \mid s)\, \hat Q(s, a)$) with IPW:

$$\hat J_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \Big[\rho_{0:t}^{(i)} \big(r_t^{(i)} - \hat Q(s_t^{(i)}, a_t^{(i)})\big) + \rho_{0:t-1}^{(i)}\, \hat V(s_t^{(i)})\Big], \qquad \rho_{0:-1} \equiv 1.$$
If either the model is correct or the logged behavior propensities (hence the importance ratios) are correct, the estimator remains unbiased.

RL analogue.
The actor-critic update with a learned value function is a close cousin of DR: the critic plays the role of an outcome model, and the policy-ratio terms (explicitly in off-policy methods, implicitly in on-policy sampling) play the role of propensity weighting. Variance is reduced when the critic is accurate.
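
A simulated comparison on a toy horizon-two problem (binary actions, invented reward model `reward_t = a_t + noise`, no discounting) showing that trajectory-wise IPW and PDIS are both unbiased for $J(\pi_\theta)$ here, with PDIS having lower variance:

```python
import numpy as np

rng = np.random.default_rng(6)

# Behavior policy plays a = 1 w.p. 0.5; target policy w.p. 0.8, independently
# at each of the two steps, so the true value is J = 2 * 0.8 = 1.6.
n, T = 200_000, 2
pb, pi = 0.5, 0.8
a = rng.binomial(1, pb, size=(n, T))
r = a + rng.normal(0, 0.1, size=(n, T))

rho = np.where(a == 1, pi / pb, (1 - pi) / (1 - pb))   # per-step ratios
prefix = np.cumprod(rho, axis=1)                       # rho_{0:t}

est_ipw = prefix[:, -1] * r.sum(axis=1)                # full-product weights
est_pdis = (prefix * r).sum(axis=1)                    # per-decision weights
J_ipw, J_pdis = est_ipw.mean(), est_pdis.mean()
```

PDIS wins because the first reward is not multiplied by the (mean-one but noisy) ratio of the second decision.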


11. Summary & Takeaways

| RL Concept | Causal-Inference Analogue | Key Mathematical Point |
| --- | --- | --- |
| Policy $\pi_\theta(a \mid s)$ | Propensity score $P(A = a \mid X = x)$ | Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ |
| Return $G_t$ | Counterfactual outcome $Y(a)$ | Monte-Carlo estimate of $Q^\pi(s, a)$ |
| Baseline $b(s)$ | Regression adjustment | Control variate reducing variance |
| Off-policy IS weights | Inverse probability weights | Cumulative ratio $\prod_k \pi_\theta/\pi_b$ |
| Critic $V_w$, $Q_w$ | Outcome model $\mathbb{E}[Y \mid A, X]$ | Learned regression adjustment |
| Natural gradient | Weighted least squares / Fisher metric | Preconditioner $F(\theta)^{-1}$ |
| Deterministic policy | Hard treatment assignment | Gradient through deterministic mapping |
| Trust-region (KL) | Robustness / regularization of propensity | Constrained optimization |
| GAE | Multi-step DR / bias-variance tradeoff | Exponential weighting of TD errors |

Key equations to remember

  1. Policy Gradient Theorem
$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

  2. REINFORCE with Baseline
$$\hat g = \textstyle\sum_t \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)$$

  3. Actor-Critic Update
$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad \theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

  4. Importance-Sampling Policy Gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_b}\Big[\textstyle\sum_t \gamma^t\, \rho_{0:t}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

  5. Natural Gradient
$$\tilde\nabla_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$

  6. Deterministic Policy Gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_s\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\Big]$$

  7. GAE
$$\hat A_t^{\mathrm{GAE}(\gamma, \lambda)} = \textstyle\sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}$$

These formulas form the backbone of most modern deep RL systems. The causal-inference perspective simply reframes them: the policy is a treatment rule, returns are counterfactual outcomes, IS weights are propensity ratios, and baselines/critics are regression adjustments that reduce variance.


12. Suggested Further Reading

| Topic | Classic Papers | Modern Extensions |
| --- | --- | --- |
| Policy-gradient theorem | Sutton et al. 2000 | N/A |
| REINFORCE | Williams 1992 | Actor-Critic, A2C, DDPG |
| Natural gradients | Amari 1998 | TRPO, K-FAC |
| Deterministic policy gradients | Silver et al. 2014 | DDPG, TD3 |
| Trust-region methods | Schulman et al. 2015 | PPO, ACKTR |
| Variance reduction | Schulman et al. 2016 (GAE) | N/A |
| Off-policy evaluation | Precup et al. 2000 | DR, MAGIC, Retrace (Munos et al. 2016) |