Policy Gradient Methods
Goal of reinforcement learning
Objective: $\theta^* = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$, where we write $J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]$ and $r(\tau) = \sum_t r(s_t, a_t)$.
Direct Policy Gradient
Using gradient ascent (instead of descent, since we are maximizing): $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
We can expand $J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau$. Using the identity $p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$ and the factorization $\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^T \big(\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big)$, only the policy terms depend on $\theta$.
So, the gradient becomes:
$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$
REINFORCE algorithm:
- Sample trajectories $\{\tau^i\}_{i=1}^N$ by running the policy $\pi_\theta(a_t \mid s_t)$.
- Estimate the gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_t r(s_t^i, a_t^i)\right)$.
- Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
- Repeat until convergence (a minimal sketch of this loop is given below).
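A minimal REINFORCE sketch in PyTorch, assuming trajectories have already been collected by running the current policy in some environment; the layer sizes, learning rate, and the `trajectories` format are illustrative assumptions, not part of the original notes.

```python
# Minimal REINFORCE gradient step (PyTorch sketch).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # logits for 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(trajectories):
    """trajectories: list of (states, actions, rewards) tensors for N sampled episodes."""
    loss = 0.0
    for states, actions, rewards in trajectories:
        logits = policy(states)                                         # shape (T, 2)
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        total_return = rewards.sum()                                    # r(tau) = sum_t r(s_t, a_t)
        # Surrogate objective: its gradient equals the estimator
        # (sum_t grad log pi(a_t|s_t)) * r(tau). Negate because optimizers minimize.
        loss = loss - log_probs.sum() * total_return
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                    # theta <- theta + alpha * grad J
```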
Comparison to Maximum Likelihood
- Maximum Likelihood (Imitation Learning): You collect data of "correct" actions $a_t^i$ and states $s_t^i$. The gradient increases the log-probability of observed actions, regardless of how they were gathered: $\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$.
- Policy Gradient: We weight the same log-probabilities by the trajectory reward $r(\tau^i)$ (the "quality" of the trajectory). High-reward trajectories get pushed to higher probability; low-reward trajectories get suppressed.
Discrete vs. Continuous Actions
Discrete Action Example
- The policy $\pi_\theta(a_t \mid s_t)$ outputs probabilities for each discrete action (e.g., turn left or turn right).
- $\log \pi_\theta(a_t \mid s_t)$ is just the log of that probability; its gradient is found by backprop through the network that outputs the class probabilities.
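A tiny sketch of this in PyTorch; the layer sizes, sampled state, and chosen action index are arbitrary assumptions for illustration.

```python
# Sketch: log pi_theta(a|s) for a discrete policy and its gradient via backprop.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(4, 2)                      # logits for two actions: [left, right]
s = torch.randn(1, 4)                      # one observed state
a = 1                                      # index of the action that was taken

log_probs = F.log_softmax(net(s), dim=-1)  # log pi_theta(. | s)
log_pi_a = log_probs[0, a]                 # log pi_theta(a | s)
log_pi_a.backward()                        # grad_theta log pi_theta(a|s) lands in net.weight.grad
```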
Continuous Action Example (Gaussian Policies)
- For continuous control (e.g., a humanoid robot), define $\pi_\theta(a_t \mid s_t) = \mathcal{N}\big(f_{\mathrm{NN}}(s_t),\ \Sigma\big)$.
- Log-probability under a Gaussian (dropping constants): $\log \pi_\theta(a_t \mid s_t) = -\frac{1}{2}\left\| f_{\mathrm{NN}}(s_t) - a_t \right\|_\Sigma^2 + \text{const}$.
- Gradient: $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\Sigma^{-1}\big(f_{\mathrm{NN}}(s_t) - a_t\big)\frac{df_{\mathrm{NN}}}{d\theta}$.
Hence, the policy gradient can be seen as a weighted version of the maximum likelihood gradient, where the weight is the reward. Also, $f_{\mathrm{NN}}$ can be a neural network, and the gradient can be computed via backpropagation with respect to $\theta$ (i.e., the network parameters).
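A sketch of such a Gaussian policy head in PyTorch; the mean network `f_nn`, the fixed unit covariance, and the placeholder reward weight are assumptions made for illustration.

```python
# Sketch: Gaussian policy for continuous actions; log-prob and backprop through f_NN.
import torch
import torch.nn as nn

f_nn = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 3))  # mean of the action
log_std = torch.zeros(3)                       # fixed Sigma = I for simplicity

s = torch.randn(1, 10)
mean = f_nn(s)
dist = torch.distributions.Normal(mean, log_std.exp())
a = dist.sample()                              # a ~ N(f_NN(s), Sigma)
log_pi = dist.log_prob(a).sum(dim=-1)          # log N(a; f_NN(s), Sigma), summed over action dims
# Weighted maximum-likelihood view: scaling log_pi by the return and backpropagating
# gives this (s, a) pair's contribution to the policy gradient.
(log_pi * 1.0).sum().backward()                # 1.0 stands in for the reward weight
```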
Intuition: “Good Stuff More Likely, Bad Stuff Less Likely”
- We interpret $\nabla_\theta \log p_\theta(\tau)$ as "how to nudge the policy parameters to increase the probability of the observed trajectory."
- Multiplying by the reward $r(\tau)$ means that high-reward trajectories push the policy to increase their probability; low-reward trajectories push the policy away from them.
In short, REINFORCE (the basic policy gradient method) is a formal version of trial-and-error learning.
Partial Observability
- If our agent only sees observations $o_t$ instead of full states $s_t$, we can still write the same policy gradient formula: just replace $\pi_\theta(a_t \mid s_t)$ by $\pi_\theta(a_t \mid o_t)$.
- The derivation of the policy gradient never explicitly used the Markov property. So standard policy gradient extends easily to partially observed MDPs (POMDPs).
High Variance Issue
- Problem: The basic policy gradient estimator can have high variance.
- Adding a constant to the rewards does not change the optimal policy in theory. However, in finite samples, adding or subtracting constants or shifting reward magnitudes can drastically change the gradient signals collected.
- We can get large swings in the gradient direction from just a few sampled trajectories with positive or negative rewards, especially if the sample size is small or the reward scale is poorly tuned.
Exploiting Causality and Reward-to-Go
Rewriting the Sum of Rewards
Often, $r(\tau)$ is written as a sum of per-step rewards, $r(\tau) = \sum_{t=1}^T r(s_t, a_t)$.
Then,
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_{t=1}^T r(s_t^i, a_t^i)\right)$
We can distribute the sums:
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \left(\sum_{t'=1}^{T} r(s_{t'}^i, a_{t'}^i)\right)$
Causality
- Key fact: The action at time $t$ cannot affect rewards at time $t' < t$.
- In expectation, any rewards before time $t$ "cancel out" because they do not depend on the action at time $t$.
- This implies we can remove the past rewards and keep only future ones:
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \left(\sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)\right)$
The truncated sum $\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)$ is called the reward-to-go.
Removing terms that cannot be influenced reduces the variance of the gradient estimator (the sum has fewer terms, hence often smaller variance).
- Look at the proof of this claim from openai's gym
- Look at this SO answer too
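A small helper, as a sketch, for computing the reward-to-go defined above from one trajectory's per-step rewards (NumPy; the example rewards are made up).

```python
# Sketch: reward-to-go Q_hat_t = sum_{t'=t}^{T} r_{t'} for a single trajectory.
import numpy as np

def reward_to_go(rewards):
    """rewards: array of per-step rewards r(s_t, a_t) for t = 1..T."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running += rewards[t]
        rtg[t] = running
    return rtg

print(reward_to_go(np.array([1.0, 0.0, 2.0])))  # -> [3. 2. 2.]
```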
Baselines
Subtracting a Baseline
To reduce variance further, one can subtract a constant baseline $b$ from the reward:
$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau^i)\,\big(r(\tau^i) - b\big)$
Subtracting $b$ does not change the gradient in expectation, so it remains an unbiased estimator.
Proof Sketch: The extra term is $E_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right]$.
Using $p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$, the second term becomes
$\int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b \, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\, \nabla_\theta \int p_\theta(\tau)\, d\tau = b\, \nabla_\theta 1 = 0.$
Choosing the Baseline
A common baseline is the average return, $b = \frac{1}{N}\sum_{i=1}^N r(\tau^i)$, so that trajectories better than average get positive updates and those worse than average get negative updates. More sophisticated baselines can be derived from learned value functions, etc.
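A sketch of the average-return baseline, assuming the per-trajectory returns have already been computed (the numbers are made up for illustration).

```python
# Sketch: subtract the mean return before weighting the grad-log-prob terms.
import numpy as np

returns = np.array([10.0, 12.0, 7.0, 11.0])      # r(tau_i) for N = 4 sampled trajectories
baseline = returns.mean()                        # b = (1/N) sum_i r(tau_i)
advantages = returns - baseline                  # positive if better than average
# Each trajectory's sum_t grad log pi(a_t|s_t) is then weighted by advantages[i].
print(baseline, advantages)
```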
Analyzing Variance and the Optimal Baseline
Variance of the Gradient Estimator
Define $g(\tau) = \nabla_\theta \log p_\theta(\tau)$. With a baseline $b$, the estimator is:
$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[g(\tau)\,\big(r(\tau) - b\big)\right]$
The variance can be written as
$\mathrm{Var} = E\left[\big(g(\tau)(r(\tau)-b)\big)^2\right] - \Big(E\left[g(\tau)(r(\tau)-b)\right]\Big)^2$
Since the second term equals $\big(E[g(\tau)\, r(\tau)]\big)^2$ (the baseline is unbiased) and therefore does not depend on $b$, we minimize
$E\left[g(\tau)^2\,\big(r(\tau)-b\big)^2\right]$
w.r.t. $b$.
Optimal Scalar Baseline
Let $g(\tau)^2$ denote the squared magnitude (or coordinate-wise square) of $g(\tau)$. Setting the derivative of
$E\left[g(\tau)^2\,\big(r(\tau)-b\big)^2\right]$
to zero w.r.t. $b$ yields:
$b = \frac{E\left[g(\tau)^2\, r(\tau)\right]}{E\left[g(\tau)^2\right]}$
This is the optimal scalar baseline in terms of variance reduction. It depends on $g(\tau)^2$, which reweights the reward by the gradient's magnitude. In practice, one often uses simpler baselines (e.g., average return) or a learned value network instead, due to implementation complexity.
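A sketch estimating this optimal scalar baseline from samples; `g_sq` stands in for the squared gradient magnitudes $\|\nabla_\theta \log p_\theta(\tau^i)\|^2$, and all numbers are made up.

```python
# Sketch: variance-optimal scalar baseline b = E[g^2 r] / E[g^2], estimated from samples.
import numpy as np

g_sq = np.array([0.5, 2.0, 1.0, 0.3])            # squared gradient magnitudes per trajectory
returns = np.array([10.0, 12.0, 7.0, 11.0])      # r(tau_i)

b_opt = (g_sq * returns).mean() / g_sq.mean()    # reweights the rewards by gradient magnitude
b_avg = returns.mean()                           # the simpler average-return baseline
print(b_opt, b_avg)
```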
Policy Gradient as an On-Policy Algorithm
- Maximizing the Objective: We start with the typical RL objective for a policy parameterized by $\theta$. We want:
$\theta^* = \arg\max_\theta J(\theta),$
where
$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right].$
Here $\tau$ denotes a full trajectory $(s_1, a_1, s_2, a_2, \dots, s_T, a_T)$. The trajectory distribution under policy $\pi_\theta$ is
$p_\theta(\tau) = p(s_1)\prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$
- Policy Gradient Theorem: The gradient of $J(\theta)$ can be written using the log-derivative trick:
$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right]$
This is the on-policy gradient form because the expectation is explicitly under $p_\theta(\tau)$.
- Why It Is On-Policy
- In practice, to compute the expectation $E_{\tau \sim p_\theta(\tau)}[\cdot]$, we sample trajectories using the current (latest) policy $\pi_\theta$.
- Each time $\theta$ changes (even if the neural network only changes a little), new samples must be collected from the new $p_\theta(\tau)$.
- Therefore, data from old policies cannot directly be reused for the gradient estimate, making the algorithm on-policy.
- Inefficiency of On-Policy Sampling
- When using deep neural networks, gradient updates are typically small because large steps may destabilize learning.
- Consequently, we might need many small steps, and each step requires new samples collected under the updated policy.
- This can be extremely sample-inefficient if samples are expensive to collect (e.g., in real-world robotics or complex simulators).
- REINFORCE Algorithm (Classical On-Policy Example)
- Sample $\{\tau^i\}$ from $p_\theta(\tau)$ by running the current policy.
- Compute $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)\left(\sum_t r(s_t^i, a_t^i)\right)$.
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
We cannot skip the data collection step (step 1) each time, since it must reflect the latest $\theta$.
Transitioning to Off-Policy Learning
- Motivation: We want to reuse samples obtained from a different distribution (say $\bar{p}(\tau)$), which might come from:
- Old policies (experience replay).
- Other agents or demonstrations.
- Policies that differ from our current $\pi_\theta$.
- Importance Sampling: The technique we use to handle expectations under one distribution while having samples from another is called importance sampling.
Quick Review of Importance Sampling
- Suppose we want $E_{x \sim p(x)}[f(x)]$.
- We only have samples from $q(x)$. We can multiply and divide by $q(x)$:
$E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\, f(x)\right]$
- This is an unbiased estimator of the expectation under $p$. The ratio $\frac{p(x)}{q(x)}$ is the importance weight.
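A quick numerical illustration of importance sampling; the particular $p$, $q$, and $f$ below are arbitrary choices, and NumPy/SciPy are assumed available.

```python
# Sketch: estimate E_{x~p}[f(x)] from samples drawn from q via importance weights p/q.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
f = lambda x: x ** 2
p = norm(loc=1.0, scale=1.0)        # target distribution
q = norm(loc=0.0, scale=2.0)        # sampling distribution

x = q.rvs(size=100_000, random_state=rng)
weights = p.pdf(x) / q.pdf(x)       # importance weights p(x)/q(x)
estimate = np.mean(weights * f(x))  # unbiased estimate of E_p[x^2] = 1 + 1^2 = 2
print(estimate)
```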
Importance Sampling for Policy Gradients
- We want to compute $J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]$, but our samples come from some other distribution $\bar{p}(\tau)$.
- By importance sampling:
$J(\theta) = E_{\tau \sim \bar{p}(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\, r(\tau)\right]$
- Likewise, for the gradient:
$\nabla_\theta J(\theta) = E_{\tau \sim \bar{p}(\tau)}\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$
- Trajectory Distribution Ratios: If $p_\theta(\tau)$ and $\bar{p}(\tau)$ differ only by the policy (i.e., same MDP, same transitions $p(s_{t+1} \mid s_t, a_t)$, same initial state distribution $p(s_1)$), then
$\frac{p_\theta(\tau)}{\bar{p}(\tau)} = \prod_{t=1}^T \frac{\pi_\theta(a_t \mid s_t)}{\bar{\pi}(a_t \mid s_t)},$
because the terms for initial states and transitions cancel out (sketched in code below).
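A sketch of that cancellation in code: given the per-step action probabilities under the two policies (made-up numbers), the trajectory importance weight is just the product of their ratios.

```python
# Sketch: trajectory importance weight as a product of per-step action-probability ratios.
import numpy as np

pi_new = np.array([0.9, 0.8, 0.7])       # pi_theta(a_t | s_t) along a sampled trajectory
pi_old = np.array([0.6, 0.7, 0.9])       # bar{pi}(a_t | s_t), the behavior policy

traj_weight = np.prod(pi_new / pi_old)   # p_theta(tau) / bar{p}(tau)
print(traj_weight)
```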
Deriving the Policy Gradient with Importance Sampling
- New Parameters
- We might have old samples from $p_\theta(\tau)$ but wish to estimate $J(\theta')$ for new parameters $\theta'$.
- Then
$J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, r(\tau)\right].$
- Taking the Gradient
- The only $\theta'$-dependent part is $p_{\theta'}(\tau)$.
- Using the identity $p_{\theta'}(\tau)\, \nabla_{\theta'} \log p_{\theta'}(\tau) = \nabla_{\theta'} p_{\theta'}(\tau)$:
- Substituting the identity back in, we get
$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right]$
- This is exactly the off-policy policy gradient form with importance weights $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$.
- Local Case
- If $\theta' = \theta$, then $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = 1$.
- This recovers the usual on-policy gradient.
- Full Off-Policy Gradient
- In the general off-policy setting (when $\theta' \neq \theta$), the ratio is:
$\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^T \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$
- The final gradient has three main factors in the expectation:
- The importance ratio (product of $\frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$).
- The policy log-derivative terms $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
- The rewards (or sums of rewards) $\sum_t r(s_t, a_t)$.
Off-Policy Policy Gradient
We want to optimize
$J(\theta') = E_{\tau \sim p_{\theta'}(\tau)}\left[r(\tau)\right]$
But now we assume we have trajectories drawn from $p_\theta(\tau)$ for some old parameter $\theta$, and we want to evaluate or update a different parameter $\theta'$. By importance sampling, the gradient of $J(\theta')$ can be written:
$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\, \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\right]$
Because both $p_{\theta'}(\tau)$ and $p_\theta(\tau)$ factorize into the same initial state and transition probabilities, their ratio simplifies to
$\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} = \prod_{t=1}^T \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}$
Hence:
$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\left(\prod_{t=1}^T \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right)\left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right]$
What About Causality?
In many policy gradient derivations, we factor the weights to reflect the fact that future actions do not affect the probability of past states or rewards. That is, the "current" weight at time $t$ should not include importance ratios from future time steps. Formally, one can rewrite:
$\nabla_{\theta'} J(\theta') = E_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\left(\prod_{t''=t}^{t'} \frac{\pi_{\theta'}(a_{t''} \mid s_{t''})}{\pi_\theta(a_{t''} \mid s_{t''})}\right)\right)\right]$
- Future actions don't affect the current weight: the ratio from $t+1$ to $T$ is omitted in the multiplier of $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
- If we ignore the additional ratio that would weight the future returns (i.e., drop the product over $t''$ on the reward terms entirely), we end up with a policy iteration-style approach, which is no longer the exact policy gradient but can still guarantee improvement under certain conditions (to be seen in a later lecture).
Causality and Exponential Variance Issues
- Causality
- Typically, when deriving policy gradients, we note that future actions do not affect the probability of past rewards.
- This leads to factoring out certain terms or ignoring future states in the immediate log-derivative factor.
- Exponential in $T$
- A direct importance sample over entire trajectories leads to a weight whose behavior is exponential in $T$.
- If each per-step importance ratio is (say) less than 1, multiplying many such ratios shrinks the weight to zero exponentially. This can cause huge variance in the gradient estimator.
- Ignoring Certain Ratios
- Sometimes, algorithms ignore the ratio of state marginals and only keep the ratio of action probabilities at each visited state.
- Doing so avoids the full product over all time steps.
- Strictly, this is not the exact gradient, but under certain conditions (e.g., $\theta'$ close to $\theta$), the approximation error is bounded and the method remains a reasonable off-policy approach.
- Policy Iteration Connection
- If you omit the importance weights on rewards entirely, you get a policy iteration-like update.
- It is not the true gradient, but in some contexts can still guarantee policy improvement (not necessarily via gradient ascent, but via a different theoretical argument).
- We will see more details about this in a later lecture.
First-Order Approximation and Practical Insights
- First-Order Approximation
- We rewrite the off-policy objective in a form that is similar to the on-policy objective but includes importance ratios at each time step.
- Directly multiplying the full ratios over all time steps $t = 1, \dots, T$ often becomes intractable or suffers extremely high variance.
- Key Insight
- Ignoring the ratio of state marginals drastically reduces variance from potentially exponential in $T$ to something more manageable.
- Although it introduces bias, if updates between $\theta$ and $\theta'$ are small, the additional error can be bounded.
- Summary
- On-policy policy gradient: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{Q}_{i,t}$, with samples drawn from the current policy $\pi_\theta$.
- Off-policy policy gradient (importance sampling): $\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(a_t^i \mid s_t^i)}{\pi_\theta(a_t^i \mid s_t^i)}\, \nabla_{\theta'} \log \pi_{\theta'}(a_t^i \mid s_t^i)\, \hat{Q}_{i,t}$ (optionally ignoring the ratio of state distributions), as sketched below.
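A sketch of the resulting first-order surrogate loss in PyTorch, where the per-step log-probabilities under $\theta'$ and $\theta$ and the rewards-to-go are assumed to be precomputed tensors.

```python
# Sketch: first-order off-policy surrogate with per-step importance ratios
# pi_theta'(a_t|s_t)/pi_theta(a_t|s_t); state-marginal ratios are ignored.
import torch

def off_policy_surrogate(logp_new, logp_old, reward_to_go):
    """logp_new requires grad w.r.t. theta'; logp_old and reward_to_go are fixed data."""
    ratio = torch.exp(logp_new - logp_old.detach())       # per-step importance ratio
    # Differentiating ratio * Q_hat w.r.t. theta' yields ratio * grad log pi_theta' * Q_hat.
    return -(ratio * reward_to_go.detach()).mean()        # negate because optimizers minimize
```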
Covariant / Natural Policy Gradient
Motivation
- Vanilla gradient ascent steps are sensitive to how different parameters scale: some parameters can drastically affect the policy distribution, while others barely change it.
- Choosing a single learning rate can be problematic.
Constraint-Based View of Gradient Updates
- Ordinary first-order ascent can be seen as:
$\theta' \leftarrow \arg\max_{\theta'}\ (\theta' - \theta)^\top \nabla_\theta J(\theta)\quad \text{s.t. } \|\theta' - \theta\|^2 \le \epsilon$
- This restricts $\theta'$ to be within an $\epsilon$-ball in parameter space.
- But what if we constrain how much the distribution changes instead of the raw parameters?
- Impose $D(\pi_{\theta'}, \pi_\theta) \le \epsilon$, where $D$ is, for example, the KL divergence.
- This ensures steps that are small in policy space, rather than in parameter space.
KL Divergence and Fisher Information
- The KL divergence can be approximated by a second-order Taylor expansion around $\theta' = \theta$:
$D_{\mathrm{KL}}\big(\pi_{\theta'} \,\|\, \pi_\theta\big) \approx \frac{1}{2}\,(\theta' - \theta)^\top F\,(\theta' - \theta)$
- $F$ is the Fisher information matrix:
$F = E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^\top\right]$
- Constraining $(\theta' - \theta)^\top F\,(\theta' - \theta) \le \epsilon$ leads to a rescaled gradient update.
Natural Gradient Update
- Solving the Lagrangian for that constraint + linearized objective yields:
$\theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta)$
- The $F^{-1}$ factor effectively "preconditions" the gradient to account for how each parameter affects the policy distribution.
- The figure from Peters & Schaal shows that switching from the blue (vanilla) vector field to the red (natural gradient) vector field makes updates point much more directly to the optimum.
- Convergence is faster, and tuning the step size is typically easier.
Implementation Notes
- $F$ is an expectation under $\pi_\theta$, so we estimate it by sampling from the current policy:
$F \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)^\top$
- Then we solve $F x = \nabla_\theta J(\theta)$ via, e.g., conjugate gradient, which only needs Fisher-vector products rather than $F$ itself (see the sketch after this list).
- Trust Region Policy Optimization (TRPO) uses the same ideas but fixes the KL budget $\epsilon$ and solves for the step size $\alpha$ accordingly. See Schulman et al. (2015).
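A sketch of the conjugate gradient solve for $F x = \nabla_\theta J(\theta)$. In practice the Fisher-vector product would come from differentiating the sampled KL divergence; here an explicit toy matrix stands in for it.

```python
# Sketch: solve F x = g with conjugate gradient, using only Fisher-vector products
# so that F is never formed or inverted explicitly.
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given a function fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x = 0 initially)
    p = g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                     # approx F^{-1} g, the natural gradient direction

# Toy check with a made-up 2x2 Fisher matrix:
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -1.0])
x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g, atol=1e-6))
```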
Additional Remarks and References
- Actor-Critic methods can reduce variance further by using learned value functions.
- Natural gradient techniques appear in TRPO, PPO, and other advanced policy gradient algorithms.
Some references:
- Williams (1992): introduced REINFORCE.
- Peters & Schaal (2008): natural policy gradient with excellent visuals.
- Schulman et al. (2015): TRPO uses a KL constraint with conjugate gradient to find natural gradient steps.
- Schulman et al. (2017): PPO refines these trust region ideas with a simpler clipped objective.
Conclusion
Vanilla policy gradients can be numerically unstable in continuous or high-dimensional settings, because different parameters can dramatically differ in how much they change the policy. Multiplying the gradient by $F^{-1}$, the inverse Fisher information, produces the natural (covariant) gradient, which resolves poor conditioning by stepping in directions that correspond to small changes in the distribution. This often converges more rapidly and robustly, forming the basis for many state-of-the-art RL methods like TRPO and PPO.