
Bhavit Sharma

Policy Gradient Methods

Goal of reinforcement learning

Objective:

$$
\begin{align*}
\theta^* &= \arg \max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \\
r(\tau) &= \sum_{t} r(s_t, a_t) \\
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} r(s_{i,t}, a_{i,t}), \qquad N = \text{number of trajectories} \\
\theta^* &= \arg \max_\theta J(\theta)
\end{align*}
$$

Direct Policy Gradient

$$
\begin{align*}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \\
&= \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau \\
&\quad \text{(using the log-derivative identity } \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \text{)} \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, r(\tau) \right]
\end{align*}
$$

Using gradient ascent (instead of descent lol 😅)

We can expand $p_\theta(\tau) = p(s_1) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Taking $\log$ and applying $\nabla_\theta$ (the initial-state and transition terms do not depend on $\theta$, so their gradients vanish):

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

So, the gradient becomes:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=1}^T r(s_t, a_t) \right) \right]
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)
$$

$$
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
$$

REINFORCE algorithm:

  1. Sample trajectories $\{\tau^i\}$ by running the policy $\pi_\theta(a_t \mid s_t)$.
  2. $\nabla_\theta J(\theta) \approx \sum_i \Bigl(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Bigr)\Bigl(\sum_t r(s_t^i, a_t^i)\Bigr)$
  3. $\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)$
  4. Repeat until convergence (a code sketch of this loop follows below).
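
To make the loop concrete, here is a minimal sketch of REINFORCE for a discrete-action environment in PyTorch. The environment (`CartPole-v1`), network sizes, learning rate, and the use of a single trajectory per update are illustrative assumptions, not something prescribed by these notes:

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Categorical policy network: observation -> action logits (sizes are illustrative).
env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):                      # step 4: repeat until convergence
    # Step 1: sample a trajectory with the current policy.
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Step 2: (sum_t grad log pi) * (sum_t r); minimize the negative to ascend J.
    loss = -torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # step 3: theta <- theta + alpha * grad J
```

Averaging this loss over several sampled trajectories before each update recovers the $\frac{1}{N}\sum_i$ estimator written above; a single trajectory ($N = 1$) keeps the sketch short at the cost of extra variance.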

Comparison to Maximum Likelihood

  • Maximum Likelihood (Imitation Learning): You collect data of "correct" actions $a_{i,t}$ and states $s_{i,t}$. The gradient increases the log-probability of observed actions, regardless of how they were gathered:

    $$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\, \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}).$$
  • Policy Gradient: We weight the same log-probabilities by the trajectory reward (the “quality” of the trajectory). High-reward trajectories get pushed to higher probability; low-reward trajectories get suppressed.

Discrete vs. Continuous Actions

Discrete Action Example

  • $\pi_\theta(a_t \mid s_t)$ outputs probabilities for each discrete action (e.g., turn left or turn right).
  • $\log \pi_\theta$ is just the log of that probability; the gradient is found by backprop through the network that outputs the class probabilities.

Continuous Action Example (Gaussian Policies)

  • For continuous control (e.g., a humanoid robot), define $\pi_\theta(a_t \mid s_t) = \mathcal{N}\bigl(f_{\theta}(s_t),\,\Sigma\bigr).$
  • Log-probability under a Gaussian (dropping constants): $\log \pi_\theta(a_t \mid s_t) = -\tfrac{1}{2}\,\|\,f_{\theta}(s_t) - a_t\,\|_{\Sigma}^2 + \text{const}.$
  • Gradient (the $\tfrac{1}{2}$ cancels against the factor of 2 from differentiating the quadratic): $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\Sigma^{-1}\bigl(f_{\theta}(s_t) - a_t\bigr)\,\frac{d f_{\theta}(s_t)}{d\theta}.$

Hence, the policy gradient can be seen as a weighted version of the maximum likelihood gradient, where the weight is the reward. Also, $f_{\theta}$ can be a neural network, and the gradient can be computed via backpropagation with respect to $\theta$ (i.e., the network parameters).
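
As a small illustration of that last point, here is a sketch of obtaining $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ from autograd for a Gaussian policy; the mean network `f_theta`, the dimensions, and the fixed covariance below are made up for the example:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
f_theta = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
Sigma = 0.1 * torch.eye(act_dim)   # fixed covariance (an assumption; it could also be learned)

s_t = torch.randn(obs_dim)         # a sampled state
a_t = torch.randn(act_dim)         # the action that was actually taken

dist = torch.distributions.MultivariateNormal(loc=f_theta(s_t), covariance_matrix=Sigma)
log_prob = dist.log_prob(a_t)      # log pi_theta(a_t | s_t), constants included

log_prob.backward()                # backprop: each parameter p of f_theta now holds p.grad,
                                   # its entry of grad_theta log pi_theta(a_t | s_t), matching
                                   # -Sigma^{-1}(f_theta(s_t) - a_t) df_theta/dtheta above.
```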

Intuition: “Good Stuff More Likely, Bad Stuff Less Likely”

  • We interpret $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ as "how to nudge the policy parameters to increase the probability of the observed trajectory."
  • Multiplying by the reward $\sum_t r(s_t, a_t)$ means that high-reward trajectories push the policy to increase their probability, while low-reward trajectories push the policy away from them.

In short, REINFORCE (the basic policy gradient method) is a formal version of trial-and-error learning.

Partial Observability

  • If our agent only sees observations $o_t$ instead of full states $s_t$, we can still write the same policy gradient formula; just replace $s_t$ with $o_t$.
  • The derivation of the policy gradient never explicitly used the Markov property, so the standard policy gradient extends easily to partially observed MDPs (POMDPs).

High Variance Issue

  • Problem: The basic policy gradient estimator can have high variance.
  • Adding a constant to the rewards $r(s,a)$ does not change the optimal policy in theory. However, in finite samples, adding or subtracting constants or shifting reward magnitudes can drastically change the gradient signals collected.
  • We can get large swings in the gradient direction from just a few sampled trajectories with positive or negative rewards, especially if the sample size is small or the reward scale is poorly tuned.

Exploiting Causality and Reward-to-Go

Rewriting the Sum of Rewards

Often, $r(\tau)$ is written as

$$
r(\tau) = \sum_{t=1}^T r(s_t, a_t).
$$

Then,

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \Bigl(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\Bigr)\Bigl(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Bigr).
$$

We can distribute the sums:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t}) \Bigl(\sum_{t'=1}^T r(s_{i,t'}, a_{i,t'})\Bigr).
$$

Causality

  • Key fact: The action at time $t$ cannot affect rewards at earlier times $t' < t$.
  • In expectation, any rewards before $t$ "cancel out" because they do not depend on the action at $t$.
  • This implies we can remove the past rewards and keep only future ones:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t}) \Bigl(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\Bigr).
$$

The truncated sum $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ is called the reward-to-go; it is often denoted $\hat{Q}_{i,t}$ (this notation appears again in the off-policy summary below).

Removing terms that cannot be influenced reduces the variance of the gradient estimator (the sum has fewer terms, hence often smaller variance).
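
A tiny helper that computes the reward-to-go for one sampled trajectory; the optional discount factor `gamma` is an assumption (a common extension), not something used in the undiscounted sums above:

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Reward-to-go: sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every t."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate backwards from the end
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] give reward-to-go [3, 2, 2] with gamma = 1.
print(reward_to_go([1.0, 0.0, 2.0]))
```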


Baselines

Subtracting a Baseline

To reduce variance further, one can subtract a constant baseline $b$ from the reward:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i)\,\bigl[r(\tau_i) - b\bigr].
$$

Subtracting $b$ does not change the gradient in expectation, so it remains an unbiased estimator.

Proof Sketch:

$$
\mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\,(r(\tau)-b)\bigr] = \mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\,r(\tau)\bigr] - b\,\mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\bigr].
$$

Using $p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$, the second term becomes

$$
b \int \nabla_\theta p_\theta(\tau)\,d\tau = b\,\nabla_\theta \int p_\theta(\tau)\,d\tau = b\,\nabla_\theta (1) = 0.
$$

Choosing the Baseline

A common baseline is average reward, so that trajectories better than average get positive updates and those worse than average get negative updates. More sophisticated baselines can be derived from learned value functions, etc.
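
A toy numerical sketch of what the average-return baseline does to the estimator; the returns and per-trajectory "score" vectors below are random stand-ins, not outputs of a real environment:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
returns = rng.normal(loc=10.0, scale=3.0, size=N)   # r(tau_i), all positive on average
scores = rng.normal(size=(N, d))                     # stand-ins for grad_theta log p_theta(tau_i)

grad_plain = (scores * returns[:, None]).mean(axis=0)
baseline = returns.mean()                            # average return as the baseline b
grad_baselined = (scores * (returns - baseline)[:, None]).mean(axis=0)

# Both estimate the same expectation; across repeated batches the baselined
# version typically has much lower variance because (r - b) is centered.
print(grad_plain)
print(grad_baselined)
```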


Analyzing Variance and the Optimal Baseline

Variance of the Gradient Estimator

Define $g(\tau) = \nabla_\theta \log p_\theta(\tau)$. With a baseline $b$, the estimator is:

$$
g(\tau)\,\bigl(r(\tau) - b\bigr).
$$

The variance can be written as

$$
\mathrm{Var}\Bigl(g(\tau)\,\bigl(r(\tau) - b\bigr)\Bigr) = \mathbb{E}\Bigl[\bigl(g(\tau)\,(r(\tau) - b)\bigr)^2\Bigr] - \Bigl(\mathbb{E}\bigl[g(\tau)\,(r(\tau) - b)\bigr]\Bigr)^2.
$$

Since $\mathbb{E}\bigl[g(\tau)\,(r(\tau)-b)\bigr]$ does not depend on $b$ (it is the unbiased gradient shown above), we minimize

$$
\mathbb{E}\Bigl[\bigl(g(\tau)\,(r(\tau) - b)\bigr)^2\Bigr]
$$

with respect to $b$.

Optimal Scalar Baseline

Let $g(\tau)^2$ denote the squared magnitude (or the coordinate-wise square). Setting the derivative of

$$
\mathbb{E}\bigl[g(\tau)^2\,(r(\tau) - b)^2\bigr]
$$

with respect to $b$ to zero, i.e. $\frac{d}{db}\,\mathbb{E}\bigl[g(\tau)^2 (r(\tau) - b)^2\bigr] = -2\,\mathbb{E}\bigl[g(\tau)^2\,r(\tau)\bigr] + 2b\,\mathbb{E}\bigl[g(\tau)^2\bigr] = 0$, yields:

$$
b^* = \frac{\mathbb{E}\bigl[g(\tau)^2\,r(\tau)\bigr]}{\mathbb{E}\bigl[g(\tau)^2\bigr]}.
$$

This is the optimal scalar baseline in terms of variance reduction. It depends on $g(\tau)^2$, which reweights the reward by the gradient's magnitude. In practice, one often uses simpler baselines (e.g., the average return) or a learned value network instead of $b^*$ due to implementation complexity.
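
Continuing the same toy setup, the per-coordinate optimal baseline would be estimated from samples like this (again with random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 5
returns = rng.normal(loc=10.0, scale=3.0, size=N)   # r(tau_i)
scores = rng.normal(size=(N, d))                     # stand-ins for g(tau_i), one row per trajectory

g_sq = scores ** 2                                   # coordinate-wise g(tau)^2
b_star = (g_sq * returns[:, None]).mean(axis=0) / g_sq.mean(axis=0)   # E[g^2 r] / E[g^2]
print(b_star)                                        # one baseline value per parameter coordinate
```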


Policy Gradient as an On-Policy Algorithm

  1. Maximizing the Objective: We start with the typical RL objective for a policy parameterized by $\theta$. We want:

    $$\theta^* = \arg\max_{\theta} \; J(\theta),$$

    where

    $$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \bigl[r(\tau)\bigr].$$

    Here $\tau$ denotes a full trajectory $(s_1, a_1, s_2, a_2, \dots)$. The trajectory distribution under policy $\pi_{\theta}(a \mid s)$ is

    $$p_{\theta}(\tau) = p(s_1)\,\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\,p(s_{t+1}\mid s_t, a_t).$$
  2. Policy Gradient Theorem: The gradient of $J(\theta)$ can be written using the log-derivative trick:

    $$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[\nabla_{\theta} \log p_{\theta}(\tau)\; r(\tau)\Bigr].$$

    This is the on-policy gradient form because the expectation is taken explicitly under $p_{\theta}(\tau)$.

  3. Why It Is On-Policy

    • In practice, to compute the expectation $\mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\cdot]$, we sample trajectories using the current (latest) policy $\pi_{\theta}$.
    • Each time $\theta$ changes (even if the neural network only changes a little), new samples must be collected from the new $\pi_{\theta}$.
    • Therefore, data from old policies cannot directly be reused for the gradient estimate, making the algorithm on-policy.
  4. Inefficiency of On-Policy Sampling

    • When using deep neural networks, gradient updates are typically small because large steps may destabilize learning.
    • Consequently, we might need many small steps, and each step requires new samples collected under the updated policy.
    • This can be extremely sample-inefficient if samples are expensive to collect (e.g., in real-world robotics or complex simulators).
  5. REINFORCE Algorithm (Classical On-Policy Example)

    1. Sample $\{\tau^i\}$ from $\pi_{\theta}(a_t \mid s_t)$ by running the current policy.
    2. Compute $\nabla_{\theta}J(\theta) \approx \sum_i \Bigl(\sum_t \nabla_{\theta} \log \pi_{\theta}(a_t^i \mid s_t^i)\Bigr)\Bigl(\sum_t r(s_t^i, a_t^i)\Bigr).$
    3. Update $\theta \leftarrow \theta + \alpha \,\nabla_{\theta}J(\theta).$

    We cannot skip the data-collection step (step 1) on any iteration, since the samples must reflect the latest $\theta$.


Transitioning to Off-Policy Learning

  1. Motivation: We want to reuse samples obtained from a different distribution (say $\tilde{p}(\tau)$), which might come from:

    • Old policies (experience replay).
    • Other agents or demonstrations.
    • Policies that differ from our current $\pi_{\theta}$.
  2. Importance Sampling: The technique for handling expectations under one distribution while only having samples from another is called importance sampling.

Quick Review of Importance Sampling

  • Suppose we want $\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x)\,f(x)\,dx.$
  • We only have samples from $q(x)$. We can multiply and divide by $q(x)$: $\int p(x)\,f(x)\,dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\,dx = \mathbb{E}_{x\sim q(x)}\Bigl[\frac{p(x)}{q(x)}\,f(x)\Bigr].$
  • This is an unbiased estimator of the expectation under $p$. The ratio $\frac{p(x)}{q(x)}$ is the importance weight. (A quick numerical check follows below.)
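
A quick numerical check of the identity, with two arbitrary Gaussians standing in for $p$ and $q$ and $f(x) = x^2$ (all of these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(1, 1), proposal q = N(0, 2); f(x) = x^2, so E_p[f] = 1 + 1^2 = 2.
x = rng.normal(0.0, 2.0, size=200_000)                       # samples from q
weights = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 2.0)    # importance weights p(x)/q(x)

print(np.mean(weights * x ** 2))                             # ≈ 2, the expectation under p
```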

Importance Sampling for Policy Gradients

  • We want to compute $J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \bigl[r(\tau)\bigr],$ but our samples come from some other distribution $\tilde{p}(\tau)$.
  • By importance sampling: $J(\theta) = \mathbb{E}_{\tau \sim \tilde{p}(\tau)} \Bigl[\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}\,r(\tau)\Bigr].$
  • Likewise, for the gradient: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \tilde{p}(\tau)} \Bigl[\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}\;\nabla_{\theta}\log p_{\theta}(\tau)\;r(\tau)\Bigr].$
  • Trajectory Distribution Ratios: If $\tilde{p}(\tau)$ and $p_{\theta}(\tau)$ differ only by the policy (i.e., same MDP, same transitions $p(s_{t+1}\mid s_t,a_t)$, same initial-state distribution $p(s_1)$), then $\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)} = \frac{\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)}{\prod_{t=1}^{T} \tilde{\pi}(a_t \mid s_t)},$ because the initial-state and transition terms cancel (the cancellation is spelled out below).
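
Writing out the ratio with the trajectory factorization from earlier makes the cancellation explicit:

$$
\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}
= \frac{p(s_1)\,\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
       {p(s_1)\,\prod_{t=1}^{T} \tilde{\pi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
= \prod_{t=1}^{T} \frac{\pi_{\theta}(a_t \mid s_t)}{\tilde{\pi}(a_t \mid s_t)}.
$$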

Deriving the Policy Gradient with Importance Sampling

  1. New Parameters $\theta'$

    • We might have old samples from $p_{\theta}(\tau)$ but wish to estimate $J(\theta')$.
    • Then $J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \;r(\tau)\Bigr].$
  2. Taking the Gradient $\nabla_{\theta'}$

    • The only $\theta'$-dependent part is $p_{\theta'}(\tau)$, so differentiating inside the expectation gives $\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[ \frac{\nabla_{\theta'} p_{\theta'}(\tau)}{p_{\theta}(\tau)}\;r(\tau) \Bigr].$
    • Substituting the identity $p_{\theta'}(\tau)\,\nabla_{\theta'} \log p_{\theta'}(\tau) = \nabla_{\theta'} p_{\theta'}(\tau)$ back in, we get $\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[ \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \;\nabla_{\theta'} \log p_{\theta'}(\tau)\; r(\tau) \Bigr].$
    • This is exactly the off-policy policy gradient form with importance weights $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}$.
  3. Local Case $\theta' = \theta$

    • If $\theta' = \theta$, then $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = 1$.
    • This recovers the usual on-policy gradient.
  4. Full Off-Policy Gradient

    • In the general off-policy setting (when $\theta' \neq \theta$), the ratio is: $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}.$
    • The final gradient has three main factors in the expectation:
      1. The importance ratio (the product of $\pi_{\theta'} / \pi_{\theta}$ terms).
      2. The policy log-derivative terms $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
      3. The rewards $r(\tau)$, i.e., the sum $\sum_t r(s_t,a_t)$.

Off-Policy Policy Gradient

We want to optimize

$$
\theta^* = \arg\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)].
$$

But now we assume we have trajectories $\tau$ drawn from $p_\theta(\tau)$ for some old parameter vector $\theta$, and we want to evaluate or update a different parameter vector $\theta'$. By importance sampling, the gradient of $J(\theta')$ can be written:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} \;\nabla_{\theta'} \log p_{\theta'}(\tau) \;r(\tau) \Bigr] \quad \text{for } \theta' \neq \theta.
$$

Because both $p_{\theta'}(\tau)$ and $p_\theta(\tau)$ factorize into the same initial-state and transition probabilities, their ratio simplifies to

$$
\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}.
$$

Hence:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \Bigl(\prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} \Bigr) \Bigl(\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\Bigr) \Bigl(\sum_{t=1}^{T} r(s_t, a_t)\Bigr) \Bigr].
$$

What About Causality?

In many policy gradient derivations, we factor out the future terms to reflect the fact that future actions do not affect the probability of past states or rewards. That is, the "current" weight at time $t$ typically should not include importance ratios from time steps $t+1$ onward. Formally, one can rewrite:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \;\Bigl(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_{\theta}(a_{t'} \mid s_{t'})} \Bigr) \Bigl(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Bigr) \Bigr].
$$
  • Future actions don't affect the current weight: the ratio over steps $t+1$ through $T$ is omitted from the multiplier of $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
  • If we ignore the additional ratio that would weight the future returns (i.e., ignore the ratio on the reward terms entirely), we end up with a policy iteration-style approach, which is no longer the exact policy gradient but can still guarantee improvement under certain conditions (to be seen in a later lecture).

Causality and Exponential Variance Issues

  1. Causality

    • Typically, when deriving policy gradients, we note that future actions do not affect the probability of past rewards.
    • This leads to factoring out certain terms or ignoring future states in the immediate log-derivative factor.
  2. Exponential in $T$

    • A direct importance sample over entire trajectories involves a product of $T$ per-step ratios, i.e., a factor that behaves exponentially in $T$.
    • If each importance ratio is (say) $\leq 1$, multiplying many such ratios shrinks the weight toward zero exponentially fast. This can cause huge variance in the gradient estimator (see the numerical sketch after this list).
  3. Ignoring Certain Ratios

    • Sometimes, algorithms ignore the ratio of state marginals and only keep the ratio of action probabilities at each visited state.
    • Doing so avoids the full product over all time steps.
    • Strictly speaking, this is not the exact gradient, but under certain conditions (e.g., $\theta'$ close to $\theta$), the approximation error is bounded and the method remains a reasonable off-policy approach.
  4. Policy Iteration Connection

    • If you omit the importance weights on rewards entirely, you get a policy iteration-like update.
    • It is not the true gradient, but in some contexts can still guarantee policy improvement (not necessarily via gradient ascent, but via a different theoretical argument).
    • We will see more details about this in a later lecture.
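
A small numerical illustration of this exponential-in-$T$ behavior: each per-step ratio below has mean exactly one, yet the product over a long horizon collapses toward zero for most trajectories while a few trajectories receive enormous weight (the per-step ratio distribution is a made-up stand-in for $\pi_{\theta'}/\pi_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_traj, sigma = 100, 100_000, 0.2

# Per-step ratios: log-normal, parameterized so that each ratio has mean exactly 1.
step_ratios = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(n_traj, T))
traj_weights = step_ratios.prod(axis=1)          # trajectory-level importance weight

print(traj_weights.mean())      # ≈ 1: the weighted estimator is still unbiased
print(np.median(traj_weights))  # ≈ exp(-T * sigma^2 / 2) ≈ 0.14: most weights collapse
print(traj_weights.std())       # large, growing exponentially with T: a few trajectories dominate
```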

First-Order Approximation and Practical Insights

  1. First-Order Approximation

    • We rewrite the off-policy objective in a form that is similar to the on-policy objective but includes importance ratios at each time step.
    • Directly multiplying the full ratios over all $T$ time steps often becomes intractable or leads to extremely high variance.
  2. Key Insight

    • Ignoring the ratio of state marginals drastically reduces the variance, from potentially exponential in $T$ to something more manageable.
    • Although it introduces bias, if the update from $\theta$ to $\theta'$ is small, the additional error can be bounded.
  3. Summary

    • On-policy policy gradient: $\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta}\,\log \pi_{\theta}(a_{i,t}\mid s_{i,t}) \;\hat{Q}_{i,t},$ with $(s_{i,t}, a_{i,t}) \sim \pi_{\theta}$ and $\hat{Q}_{i,t}$ the reward-to-go from earlier.
    • Off-policy policy gradient (importance sampling): $\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(a_{i,t}\mid s_{i,t})}{\pi_{\theta}(a_{i,t}\mid s_{i,t})} \,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}\mid s_{i,t}) \;\hat{Q}_{i,t},$ with the samples drawn from $\pi_{\theta}$ (optionally ignoring the ratio of state distributions). A sketch of this per-timestep surrogate follows below.
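
A sketch of how this per-timestep importance-weighted estimator is typically implemented as a surrogate loss; the arguments are placeholder tensors you would fill with batched data collected under $\pi_\theta$, and clipping the ratio (as PPO does) is a further refinement not shown here:

```python
import torch

def off_policy_pg_loss(new_log_probs, old_log_probs, q_hat):
    """Surrogate loss for the per-timestep importance-weighted estimator.

    new_log_probs: log pi_theta'(a_t | s_t), differentiable w.r.t. theta'
    old_log_probs: log pi_theta(a_t | s_t), recorded when the data was collected
    q_hat:         reward-to-go estimates Q_hat_{i,t}

    Differentiating ratio * Q_hat w.r.t. theta' gives
    (pi_theta'/pi_theta) * grad log pi_theta' * Q_hat, i.e. the estimator above.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())   # pi_theta' / pi_theta per step
    return -(ratio * q_hat).mean()   # negated because optimizers minimize
```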

Covariant / Natural Policy Gradient

Motivation

  • Vanilla gradient-ascent steps are sensitive to how the different parameters scale: some parameters (like the exploration parameter $\sigma$ in the Peters & Schaal example referenced below) can drastically affect the policy distribution, while others (like the mean parameter $k$) affect it less.
  • Choosing a single learning rate $\alpha$ for all of them can be problematic.

Constraint-Based View of Gradient Updates

  • Ordinary first-order ascent can be seen as:
$$
\theta' = \arg\max_{\theta'} \; (\theta' - \theta)^\mathsf{T} \,\nabla_\theta J(\theta) \quad \text{subject to} \quad \|\theta' - \theta\|^2 \le \epsilon.
$$
  • This restricts $\theta'$ to an $\epsilon$-ball in parameter space.

  • But what if we constrain how much the distribution changes instead of the raw parameters?

    • Impose $D(\pi_{\theta'}, \pi_{\theta}) \le \epsilon$, where $D$ is, for example, the KL divergence.
    • This ensures steps that are small in policy space, rather than in parameter space.

KL Divergence and Fisher Information

  • The KL divergence can be approximated by a second-order Taylor expansion around $\theta' = \theta$ (up to a constant factor of $\tfrac{1}{2}$ that can be absorbed into $\epsilon$):
$$
D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta'}) \approx (\theta' - \theta)^\mathsf{T} \, F \, (\theta' - \theta),
$$
  • $F$ is the Fisher information matrix:
$$
F = \mathbb{E}_{\pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s) \,\nabla_\theta \log \pi_\theta(a \mid s)^\mathsf{T} \bigr].
$$
  • Constraining $(\theta' - \theta)^\mathsf{T} F (\theta' - \theta) \le \epsilon$ leads to a rescaled gradient update.

Natural Gradient Update

  • Solving the Lagrangian of the linearized objective under that constraint yields:
$$
\theta \leftarrow \theta + \alpha \, F^{-1} \nabla_\theta J(\theta).
$$
  • This $F^{-1}$ factor effectively "preconditions" the gradient to account for how strongly each parameter affects the policy distribution.

  • The figure from Peters & Schaal shows that switching from the blue (vanilla) vector field to the red (natural gradient) vector field makes updates point much more directly to the optimum $(k^*, \sigma^*)$.

  • Convergence is faster, and tuning the step size is typically easier.

Implementation Notes

  • $F$ is an expectation under $\pi_\theta$, so we estimate it by sampling $(s_i, a_i)$ from the current policy:
$$
F \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i \mid s_i) \;\nabla_\theta \log \pi_\theta(a_i \mid s_i)^\mathsf{T}.
$$
  • Then we solve for $F^{-1} \nabla_\theta J(\theta)$ via, e.g., the conjugate gradient method, so that $F$ never has to be inverted explicitly (a toy numerical sketch follows below).
  • Trust Region Policy Optimization (TRPO) uses the same ideas but fixes the KL budget $\epsilon$ and solves for the step size $\alpha$ accordingly. See Schulman et al. (2015).
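
A toy sketch of the estimate-and-precondition step, with random numbers standing in for the per-sample score vectors and for $\nabla_\theta J(\theta)$; real implementations use conjugate gradient with Fisher-vector products instead of forming and solving $F$ directly:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 512, 6

scores = rng.normal(size=(N, d))           # stand-ins for grad_theta log pi_theta(a_i | s_i)
vanilla_grad = rng.normal(size=d)          # stand-in for the estimated grad_theta J(theta)

F = scores.T @ scores / N                  # Fisher estimate: average outer product
F += 1e-3 * np.eye(d)                      # small damping keeps the solve well conditioned

natural_grad = np.linalg.solve(F, vanilla_grad)   # F^{-1} grad J, without explicit inversion
alpha = 0.05
theta_step = alpha * natural_grad          # theta <- theta + alpha * F^{-1} grad J
print(theta_step)
```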

Additional Remarks and References

  • Actor-Critic methods can reduce variance further by using learned value functions.
  • Natural gradient techniques appear in TRPO, PPO, and other advanced policy gradient algorithms.

Some references:

  • Williams (1992): introduced REINFORCE.
  • Peters & Schaal (2008): natural policy gradient with excellent visuals.
  • Schulman et al. (2015): TRPO uses a KL constraint with conjugate gradient to find natural gradient steps.
  • Schulman et al. (2017): PPO refines these trust region ideas with a simpler clipped surrogate objective.

Conclusion

Vanilla policy gradients can be numerically unstable in continuous or high-dimensional settings, because different parameters can differ dramatically in how much they change the policy. Multiplying by $F^{-1}$, the inverse of the Fisher information matrix, produces the natural (covariant) gradient, which resolves this poor conditioning by stepping in directions that correspond to small changes in the distribution. This often converges more rapidly and robustly, forming the basis for many state-of-the-art RL methods like TRPO and PPO.
