
Bhavit Sharma

Policy Gradient Methods

Goal of reinforcement learning

Objective:

$$
\begin{align*}
\theta^* &= \arg \max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \\
r(\tau) &= \sum_{t} r(s_t, a_t) \\
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} r(s_{i,t}, a_{i,t}), \qquad N = \text{number of trajectories} \\
\theta^* &= \arg \max_\theta J(\theta)
\end{align*}
$$

Direct Policy Gradient

$$
\begin{align*}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] \\
&= \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau \\
&\quad \text{(using the log-derivative identity } \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \text{)} \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, r(\tau) \right]
\end{align*}
$$

Using gradient ascent (instead of descent lol 😅)

We can expand $p_\theta(\tau) = p(s_1) \prod_{t} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Taking $\log$ and applying $\nabla_\theta$ (the initial-state and transition terms do not depend on $\theta$, so their gradients vanish):

$$
\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

So, the gradient becomes:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=1}^T r(s_t, a_t) \right) \right]
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)
$$

$$
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
$$

REINFORCE algorithm:

  1. Sample trajectories $\{\tau^i\}$ by running the policy $\pi_\theta(a_t \mid s_t)$.
  2. $\nabla_\theta J(\theta) \approx \sum_i \Bigl(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Bigr)\Bigl(\sum_t r(s_t^i, a_t^i)\Bigr)$
  3. $\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)$
  4. Repeat until convergence (a code sketch of this loop follows below).
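
To make the loop concrete, here is a minimal sketch of REINFORCE for a discrete-action environment in PyTorch. The environment (`CartPole-v1`), network sizes, learning rate, and the use of a single trajectory per update are illustrative assumptions, not something prescribed by these notes:

```python
import torch
import torch.nn as nn
import gymnasium as gym

# Categorical policy network: observation -> action logits (sizes are illustrative).
env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):                      # step 4: repeat until convergence
    # Step 1: sample a trajectory with the current policy.
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Step 2: (sum_t grad log pi) * (sum_t r); minimize the negative to ascend J.
    loss = -torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # step 3: theta <- theta + alpha * grad J
```

Averaging this loss over several sampled trajectories before each update recovers the $\frac{1}{N}\sum_i$ estimator written above; a single trajectory ($N = 1$) keeps the sketch short at the cost of extra variance.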

Comparison to Maximum Likelihood

  • Maximum Likelihood (Imitation Learning): You collect data of "correct" actions $a_{i,t}$ and states $s_{i,t}$. The gradient increases the log-probability of observed actions, regardless of how they were gathered:

    $$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\, \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}).$$
  • Policy Gradient: We weight the same log-probabilities by the trajectory reward (the “quality” of the trajectory). High-reward trajectories get pushed to higher probability; low-reward trajectories get suppressed.

Discrete vs. Continuous Actions

Discrete Action Example

  • $\pi_\theta(a_t \mid s_t)$ outputs probabilities for each discrete action (e.g., turn left or turn right).
  • $\log \pi_\theta$ is just the log of that probability; the gradient is found by backprop through the network that outputs the class probabilities.

Continuous Action Example (Gaussian Policies)

  • For continuous control (e.g., a humanoid robot), define $\pi_\theta(a_t \mid s_t) = \mathcal{N}\bigl(f_{\theta}(s_t),\,\Sigma\bigr).$
  • Log-probability under a Gaussian (dropping constants): $\log \pi_\theta(a_t \mid s_t) = -\tfrac{1}{2}\,\|\,f_{\theta}(s_t) - a_t\,\|_{\Sigma}^2 + \text{const}.$
  • Gradient (the $\tfrac{1}{2}$ cancels against the factor of 2 from differentiating the quadratic): $\nabla_\theta \log \pi_\theta(a_t \mid s_t) = -\Sigma^{-1}\bigl(f_{\theta}(s_t) - a_t\bigr)\,\frac{d f_{\theta}(s_t)}{d\theta}.$

Hence, the policy gradient can be seen as a weighted version of the maximum likelihood gradient, where the weight is the reward. Also, $f_{\theta}$ can be a neural network, and the gradient can be computed via backpropagation with respect to $\theta$ (i.e., the network parameters).
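
As a small illustration of that last point, here is a sketch of obtaining $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ from autograd for a Gaussian policy; the mean network `f_theta`, the dimensions, and the fixed covariance below are made up for the example:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
f_theta = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
Sigma = 0.1 * torch.eye(act_dim)   # fixed covariance (an assumption; it could also be learned)

s_t = torch.randn(obs_dim)         # a sampled state
a_t = torch.randn(act_dim)         # the action that was actually taken

dist = torch.distributions.MultivariateNormal(loc=f_theta(s_t), covariance_matrix=Sigma)
log_prob = dist.log_prob(a_t)      # log pi_theta(a_t | s_t), constants included

log_prob.backward()                # backprop: each parameter p of f_theta now holds p.grad,
                                   # its entry of grad_theta log pi_theta(a_t | s_t), matching
                                   # -Sigma^{-1}(f_theta(s_t) - a_t) df_theta/dtheta above.
```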

Intuition: “Good Stuff More Likely, Bad Stuff Less Likely”

  • We interpret $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ as "how to nudge the policy parameters to increase the probability of the observed trajectory."
  • Multiplying by the reward $\sum_t r(s_t, a_t)$ means that high-reward trajectories push the policy to increase their probability, while low-reward trajectories push the policy away from them.

In short, REINFORCE (the basic policy gradient method) is a formal version of trial-and-error learning.

Partial Observability

  • If our agent only sees observations $o_t$ instead of full states $s_t$, we can still write the same policy gradient formula; just replace $s_t$ with $o_t$.
  • The derivation of the policy gradient never explicitly used the Markov property, so the standard policy gradient extends easily to partially observed MDPs (POMDPs).

High Variance Issue

  • Problem: The basic policy gradient estimator can have high variance.
  • Adding a constant to the rewards $r(s,a)$ does not change the optimal policy in theory. However, in finite samples, adding or subtracting constants or shifting reward magnitudes can drastically change the gradient signals collected.
  • We can get large swings in the gradient direction from just a few sampled trajectories with positive or negative rewards, especially if the sample size is small or the reward scale is poorly tuned.

Exploiting Causality and Reward-to-Go

Rewriting the Sum of Rewards

Often, $r(\tau)$ is written as

$$
r(\tau) = \sum_{t=1}^T r(s_t, a_t).
$$

Then,

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \Bigl(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\Bigr)\Bigl(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Bigr).
$$

We can distribute the sums:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t}) \Bigl(\sum_{t'=1}^T r(s_{i,t'}, a_{i,t'})\Bigr).
$$

Causality

  • Key fact: The action at time $t$ cannot affect rewards at earlier times $t' < t$.
  • In expectation, any rewards before $t$ "cancel out" because they do not depend on the action at $t$.
  • This implies we can remove the past rewards and keep only future ones:
$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t}) \Bigl(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\Bigr).
$$

The truncated sum $\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ is called the reward-to-go; it is often denoted $\hat{Q}_{i,t}$ (this notation appears again in the off-policy summary below).

Removing terms that cannot be influenced reduces the variance of the gradient estimator (the sum has fewer terms, hence often smaller variance).
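
A tiny helper that computes the reward-to-go for one sampled trajectory; the optional discount factor `gamma` is an assumption (a common extension), not something used in the undiscounted sums above:

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Reward-to-go: sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every t."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate backwards from the end
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards [1, 0, 2] give reward-to-go [3, 2, 2] with gamma = 1.
print(reward_to_go([1.0, 0.0, 2.0]))
```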


Baselines

Subtracting a Baseline

To reduce variance further, one can subtract a constant baseline $b$ from the reward:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i)\,\bigl[r(\tau_i) - b\bigr].
$$

Subtracting $b$ does not change the gradient in expectation, so it remains an unbiased estimator.

Proof Sketch:

$$
\mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\,(r(\tau)-b)\bigr] = \mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\,r(\tau)\bigr] - b\,\mathbb{E}\bigl[\nabla_\theta \log p_\theta(\tau)\bigr].
$$

Using $p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)$, the second term becomes

$$
b \int \nabla_\theta p_\theta(\tau)\,d\tau = b\,\nabla_\theta \int p_\theta(\tau)\,d\tau = b\,\nabla_\theta (1) = 0.
$$

Choosing the Baseline

A common baseline is average reward, so that trajectories better than average get positive updates and those worse than average get negative updates. More sophisticated baselines can be derived from learned value functions, etc.
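
A toy numerical sketch of what the average-return baseline does to the estimator; the returns and per-trajectory "score" vectors below are random stand-ins, not outputs of a real environment:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
returns = rng.normal(loc=10.0, scale=3.0, size=N)   # r(tau_i), all positive on average
scores = rng.normal(size=(N, d))                     # stand-ins for grad_theta log p_theta(tau_i)

grad_plain = (scores * returns[:, None]).mean(axis=0)
baseline = returns.mean()                            # average return as the baseline b
grad_baselined = (scores * (returns - baseline)[:, None]).mean(axis=0)

# Both estimate the same expectation; across repeated batches the baselined
# version typically has much lower variance because (r - b) is centered.
print(grad_plain)
print(grad_baselined)
```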


Analyzing Variance and the Optimal Baseline

Variance of the Gradient Estimator

Define $g(\tau) = \nabla_\theta \log p_\theta(\tau)$. With a baseline $b$, the estimator is:

$$
g(\tau)\,\bigl(r(\tau) - b\bigr).
$$

The variance can be written as

$$
\mathrm{Var}\Bigl(g(\tau)\,\bigl(r(\tau) - b\bigr)\Bigr) = \mathbb{E}\Bigl[\bigl(g(\tau)\,(r(\tau) - b)\bigr)^2\Bigr] - \Bigl(\mathbb{E}\bigl[g(\tau)\,(r(\tau) - b)\bigr]\Bigr)^2.
$$

Since $\mathbb{E}\bigl[g(\tau)\,(r(\tau)-b)\bigr]$ does not depend on $b$ (it is the unbiased gradient shown above), we minimize

$$
\mathbb{E}\Bigl[\bigl(g(\tau)\,(r(\tau) - b)\bigr)^2\Bigr]
$$

with respect to $b$.

Optimal Scalar Baseline

Let $g(\tau)^2$ denote the squared magnitude (or the coordinate-wise square). Setting the derivative of

$$
\mathbb{E}\bigl[g(\tau)^2\,(r(\tau) - b)^2\bigr]
$$

with respect to $b$ to zero, i.e. $\frac{d}{db}\,\mathbb{E}\bigl[g(\tau)^2 (r(\tau) - b)^2\bigr] = -2\,\mathbb{E}\bigl[g(\tau)^2\,r(\tau)\bigr] + 2b\,\mathbb{E}\bigl[g(\tau)^2\bigr] = 0$, yields:

$$
b^* = \frac{\mathbb{E}\bigl[g(\tau)^2\,r(\tau)\bigr]}{\mathbb{E}\bigl[g(\tau)^2\bigr]}.
$$

This is the optimal scalar baseline in terms of variance reduction. It depends on $g(\tau)^2$, which reweights the reward by the gradient's magnitude. In practice, one often uses simpler baselines (e.g., the average return) or a learned value network instead of $b^*$ due to implementation complexity.
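
Continuing the same toy setup, the per-coordinate optimal baseline would be estimated from samples like this (again with random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 5
returns = rng.normal(loc=10.0, scale=3.0, size=N)   # r(tau_i)
scores = rng.normal(size=(N, d))                     # stand-ins for g(tau_i), one row per trajectory

g_sq = scores ** 2                                   # coordinate-wise g(tau)^2
b_star = (g_sq * returns[:, None]).mean(axis=0) / g_sq.mean(axis=0)   # E[g^2 r] / E[g^2]
print(b_star)                                        # one baseline value per parameter coordinate
```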


Policy Gradient as an On-Policy Algorithm

  1. Maximizing the Objective: We start with the typical RL objective for a policy parameterized by $\theta$. We want:

    $$\theta^* = \arg\max_{\theta} \; J(\theta),$$

    where

    $$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \bigl[r(\tau)\bigr].$$

    Here $\tau$ denotes a full trajectory $(s_1, a_1, s_2, a_2, \dots)$. The trajectory distribution under policy $\pi_{\theta}(a \mid s)$ is

    $$p_{\theta}(\tau) = p(s_1)\,\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\,p(s_{t+1}\mid s_t, a_t).$$
  2. Policy Gradient Theorem: The gradient of $J(\theta)$ can be written using the log-derivative trick:

    $$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[\nabla_{\theta} \log p_{\theta}(\tau)\; r(\tau)\Bigr].$$

    This is the on-policy gradient form because the expectation is taken explicitly under $p_{\theta}(\tau)$.

  3. Why It Is On-Policy

    • In practice, to compute the expectation $\mathbb{E}_{\tau \sim p_{\theta}(\tau)}[\cdot]$, we sample trajectories using the current (latest) policy $\pi_{\theta}$.
    • Each time $\theta$ changes (even if the neural network only changes a little), new samples must be collected from the new $\pi_{\theta}$.
    • Therefore, data from old policies cannot directly be reused for the gradient estimate, making the algorithm on-policy.
  4. Inefficiency of On-Policy Sampling

    • When using deep neural networks, gradient updates are typically small because large steps may destabilize learning.
    • Consequently, we might need many small steps, and each step requires new samples collected under the updated policy.
    • This can be extremely sample-inefficient if samples are expensive to collect (e.g., in real-world robotics or complex simulators).
  5. REINFORCE Algorithm (Classical On-Policy Example)

    1. Sample $\{\tau^i\}$ from $\pi_{\theta}(a_t \mid s_t)$ by running the current policy.
    2. Compute $\nabla_{\theta}J(\theta) \approx \sum_i \Bigl(\sum_t \nabla_{\theta} \log \pi_{\theta}(a_t^i \mid s_t^i)\Bigr)\Bigl(\sum_t r(s_t^i, a_t^i)\Bigr).$
    3. Update $\theta \leftarrow \theta + \alpha \,\nabla_{\theta}J(\theta).$

    We cannot skip the data-collection step (step 1) on any iteration, since the samples must reflect the latest $\theta$.


Transitioning to Off-Policy Learning

  1. Motivation: We want to reuse samples obtained from a different distribution (say $\tilde{p}(\tau)$), which might come from:

    • Old policies (experience replay).
    • Other agents or demonstrations.
    • Policies that differ from our current $\pi_{\theta}$.
  2. Importance Sampling: The technique for handling expectations under one distribution while only having samples from another is called importance sampling.

Quick Review of Importance Sampling

  • Suppose we want $\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x)\,f(x)\,dx.$
  • We only have samples from $q(x)$. We can multiply and divide by $q(x)$: $\int p(x)\,f(x)\,dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\,dx = \mathbb{E}_{x\sim q(x)}\Bigl[\frac{p(x)}{q(x)}\,f(x)\Bigr].$
  • This is an unbiased estimator of the expectation under $p$. The ratio $\frac{p(x)}{q(x)}$ is the importance weight. (A quick numerical check follows below.)
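
A quick numerical check of the identity, with two arbitrary Gaussians standing in for $p$ and $q$ and $f(x) = x^2$ (all of these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p = N(1, 1), proposal q = N(0, 2); f(x) = x^2, so E_p[f] = 1 + 1^2 = 2.
x = rng.normal(0.0, 2.0, size=200_000)                       # samples from q
weights = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 2.0)    # importance weights p(x)/q(x)

print(np.mean(weights * x ** 2))                             # ≈ 2, the expectation under p
```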

Importance Sampling for Policy Gradients

  • We want to compute $J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \bigl[r(\tau)\bigr],$ but our samples come from some other distribution $\tilde{p}(\tau)$.
  • By importance sampling: $J(\theta) = \mathbb{E}_{\tau \sim \tilde{p}(\tau)} \Bigl[\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}\,r(\tau)\Bigr].$
  • Likewise, for the gradient: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \tilde{p}(\tau)} \Bigl[\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}\;\nabla_{\theta}\log p_{\theta}(\tau)\;r(\tau)\Bigr].$
  • Trajectory Distribution Ratios: If $\tilde{p}(\tau)$ and $p_{\theta}(\tau)$ differ only by the policy (i.e., same MDP, same transitions $p(s_{t+1}\mid s_t,a_t)$, same initial-state distribution $p(s_1)$), then $\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)} = \frac{\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)}{\prod_{t=1}^{T} \tilde{\pi}(a_t \mid s_t)},$ because the initial-state and transition terms cancel (the cancellation is spelled out below).
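
Writing out the ratio with the trajectory factorization from earlier makes the cancellation explicit:

$$
\frac{p_{\theta}(\tau)}{\tilde{p}(\tau)}
= \frac{p(s_1)\,\prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
       {p(s_1)\,\prod_{t=1}^{T} \tilde{\pi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
= \prod_{t=1}^{T} \frac{\pi_{\theta}(a_t \mid s_t)}{\tilde{\pi}(a_t \mid s_t)}.
$$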

Deriving the Policy Gradient with Importance Sampling

  1. New Parameters $\theta'$

    • We might have old samples from $p_{\theta}(\tau)$ but wish to estimate $J(\theta')$.
    • Then $J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \;r(\tau)\Bigr].$
  2. Taking the Gradient $\nabla_{\theta'}$

    • The only $\theta'$-dependent part is $p_{\theta'}(\tau)$, so differentiating inside the expectation gives $\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[ \frac{\nabla_{\theta'} p_{\theta'}(\tau)}{p_{\theta}(\tau)}\;r(\tau) \Bigr].$
    • Substituting the identity $p_{\theta'}(\tau)\,\nabla_{\theta'} \log p_{\theta'}(\tau) = \nabla_{\theta'} p_{\theta'}(\tau)$ back in, we get $\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \Bigl[ \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \;\nabla_{\theta'} \log p_{\theta'}(\tau)\; r(\tau) \Bigr].$
    • This is exactly the off-policy policy gradient form with importance weights $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}$.
  3. Local Case $\theta' = \theta$

    • If $\theta' = \theta$, then $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = 1$.
    • This recovers the usual on-policy gradient.
  4. Full Off-Policy Gradient

    • In the general off-policy setting (when $\theta' \neq \theta$), the ratio is: $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}.$
    • The final gradient has three main factors in the expectation:
      1. The importance ratio (the product of $\pi_{\theta'} / \pi_{\theta}$ terms).
      2. The policy log-derivative terms $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
      3. The rewards $r(\tau)$, i.e., the sum $\sum_t r(s_t,a_t)$.

Off-Policy Policy Gradient

We want to optimize

$$
\theta^* = \arg\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)].
$$

But now we assume we have trajectories $\tau$ drawn from $p_\theta(\tau)$ for some old parameter vector $\theta$, and we want to evaluate or update a different parameter vector $\theta'$. By importance sampling, the gradient of $J(\theta')$ can be written:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \frac{p_{\theta'}(\tau)}{p_\theta(\tau)} \;\nabla_{\theta'} \log p_{\theta'}(\tau) \;r(\tau) \Bigr] \quad \text{for } \theta' \neq \theta.
$$

Because both $p_{\theta'}(\tau)$ and $p_\theta(\tau)$ factorize into the same initial-state and transition probabilities, their ratio simplifies to

$$
\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}.
$$

Hence:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \Bigl(\prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)} \Bigr) \Bigl(\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\Bigr) \Bigl(\sum_{t=1}^{T} r(s_t, a_t)\Bigr) \Bigr].
$$

What About Causality?

In many policy gradient derivations, we factor out the future terms to reflect the fact that future actions do not affect the probability of past states or rewards. That is, the "current" weight at time $t$ typically should not include importance ratios from time steps $t+1$ onward. Formally, one can rewrite:

$$
\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_\theta(\tau)} \Bigl[ \sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \;\Bigl(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_{\theta}(a_{t'} \mid s_{t'})} \Bigr) \Bigl(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\Bigr) \Bigr].
$$
  • Future actions don't affect the current weight: the ratio over steps $t+1$ through $T$ is omitted from the multiplier of $\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)$.
  • If we ignore the additional ratio that would weight the future returns (i.e., ignore the ratio on the reward terms entirely), we end up with a policy iteration-style approach, which is no longer the exact policy gradient but can still guarantee improvement under certain conditions (to be seen in a later lecture).

Causality and Exponential Variance Issues

  1. Causality

    • Typically, when deriving policy gradients, we note that future actions do not affect the probability of past rewards.
    • This leads to factoring out certain terms or ignoring future states in the immediate log-derivative factor.
  2. Exponential in $T$

    • A direct importance sample over entire trajectories involves a product of $T$ per-step ratios, i.e., a factor that behaves exponentially in $T$.
    • If each importance ratio is (say) $\leq 1$, multiplying many such ratios shrinks the weight toward zero exponentially fast. This can cause huge variance in the gradient estimator (see the numerical sketch after this list).
  3. Ignoring Certain Ratios

    • Sometimes, algorithms ignore the ratio of state marginals and only keep the ratio of action probabilities at each visited state.
    • Doing so avoids the full product over all time steps.
    • Strictly speaking, this is not the exact gradient, but under certain conditions (e.g., $\theta'$ close to $\theta$), the approximation error is bounded and the method remains a reasonable off-policy approach.
  4. Policy Iteration Connection

    • If you omit the importance weights on rewards entirely, you get a policy iteration-like update.
    • It is not the true gradient, but in some contexts can still guarantee policy improvement (not necessarily via gradient ascent, but via a different theoretical argument).
    • We will see more details about this in a later lecture.
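
A small numerical illustration of this exponential-in-$T$ behavior: each per-step ratio below has mean exactly one, yet the product over a long horizon collapses toward zero for most trajectories while a few trajectories receive enormous weight (the per-step ratio distribution is a made-up stand-in for $\pi_{\theta'}/\pi_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_traj, sigma = 100, 100_000, 0.2

# Per-step ratios: log-normal, parameterized so that each ratio has mean exactly 1.
step_ratios = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(n_traj, T))
traj_weights = step_ratios.prod(axis=1)          # trajectory-level importance weight

print(traj_weights.mean())      # ≈ 1: the weighted estimator is still unbiased
print(np.median(traj_weights))  # ≈ exp(-T * sigma^2 / 2) ≈ 0.14: most weights collapse
print(traj_weights.std())       # large, growing exponentially with T: a few trajectories dominate
```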

First-Order Approximation and Practical Insights

  1. First-Order Approximation

    • We rewrite the off-policy objective in a form that is similar to the on-policy objective but includes importance ratios at each time step.
    • Directly multiplying the full ratios over all $T$ time steps often becomes intractable or leads to extremely high variance.
  2. Key Insight

    • Ignoring the ratio of state marginals drastically reduces the variance, from potentially exponential in $T$ to something more manageable.
    • Although it introduces bias, if the update from $\theta$ to $\theta'$ is small, the additional error can be bounded.
  3. Summary

    • On-policy policy gradient: $\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta}\,\log \pi_{\theta}(a_{i,t}\mid s_{i,t}) \;\hat{Q}_{i,t},$ with $(s_{i,t}, a_{i,t}) \sim \pi_{\theta}$ and $\hat{Q}_{i,t}$ the reward-to-go from earlier.
    • Off-policy policy gradient (importance sampling): $\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \frac{\pi_{\theta'}(a_{i,t}\mid s_{i,t})}{\pi_{\theta}(a_{i,t}\mid s_{i,t})} \,\nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}\mid s_{i,t}) \;\hat{Q}_{i,t},$ with the samples drawn from $\pi_{\theta}$ (optionally ignoring the ratio of state distributions). A sketch of this per-timestep surrogate follows below.
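
A sketch of how this per-timestep importance-weighted estimator is typically implemented as a surrogate loss; the arguments are placeholder tensors you would fill with batched data collected under $\pi_\theta$, and clipping the ratio (as PPO does) is a further refinement not shown here:

```python
import torch

def off_policy_pg_loss(new_log_probs, old_log_probs, q_hat):
    """Surrogate loss for the per-timestep importance-weighted estimator.

    new_log_probs: log pi_theta'(a_t | s_t), differentiable w.r.t. theta'
    old_log_probs: log pi_theta(a_t | s_t), recorded when the data was collected
    q_hat:         reward-to-go estimates Q_hat_{i,t}

    Differentiating ratio * Q_hat w.r.t. theta' gives
    (pi_theta'/pi_theta) * grad log pi_theta' * Q_hat, i.e. the estimator above.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())   # pi_theta' / pi_theta per step
    return -(ratio * q_hat).mean()   # negated because optimizers minimize
```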

Covariant / Natural Policy Gradient

Motivation

  • Vanilla gradient-ascent steps are sensitive to how the different parameters scale: some parameters (like the exploration parameter $\sigma$ in the Peters & Schaal example referenced below) can drastically affect the policy distribution, while others (like the mean parameter $k$) affect it less.
  • Choosing a single learning rate $\alpha$ for all of them can be problematic.

Constraint-Based View of Gradient Updates

  • Ordinary first-order ascent can be seen as:
$$
\theta' = \arg\max_{\theta'} \; (\theta' - \theta)^\mathsf{T} \,\nabla_\theta J(\theta) \quad \text{subject to} \quad \|\theta' - \theta\|^2 \le \epsilon.
$$
  • This restricts $\theta'$ to an $\epsilon$-ball in parameter space.

  • But what if we constrain how much the distribution changes instead of the raw parameters?

    • Impose $D(\pi_{\theta'}, \pi_{\theta}) \le \epsilon$, where $D$ is, for example, the KL divergence.
    • This ensures steps that are small in policy space, rather than in parameter space.

KL Divergence and Fisher Information

  • The KL divergence can be approximated by a second-order Taylor expansion around $\theta' = \theta$ (up to a constant factor of $\tfrac{1}{2}$ that can be absorbed into $\epsilon$):
$$
D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta'}) \approx (\theta' - \theta)^\mathsf{T} \, F \, (\theta' - \theta),
$$
  • $F$ is the Fisher information matrix:
$$
F = \mathbb{E}_{\pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s) \,\nabla_\theta \log \pi_\theta(a \mid s)^\mathsf{T} \bigr].
$$
  • Constraining $(\theta' - \theta)^\mathsf{T} F (\theta' - \theta) \le \epsilon$ leads to a rescaled gradient update.

Natural Gradient Update

  • Solving the Lagrangian of the linearized objective under that constraint yields:
$$
\theta \leftarrow \theta + \alpha \, F^{-1} \nabla_\theta J(\theta).
$$
  • This $F^{-1}$ factor effectively "preconditions" the gradient to account for how strongly each parameter affects the policy distribution.

  • The figure from Peters & Schaal shows that switching from the blue (vanilla) vector field to the red (natural gradient) vector field makes updates point much more directly to the optimum $(k^*, \sigma^*)$.

  • Convergence is faster, and tuning the step size is typically easier.

Implementation Notes

  • $F$ is an expectation under $\pi_\theta$, so we estimate it by sampling $(s_i, a_i)$ from the current policy:
$$
F \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(a_i \mid s_i) \;\nabla_\theta \log \pi_\theta(a_i \mid s_i)^\mathsf{T}.
$$
  • Then we solve for $F^{-1} \nabla_\theta J(\theta)$ via, e.g., the conjugate gradient method, so that $F$ never has to be inverted explicitly (a toy numerical sketch follows below).
  • Trust Region Policy Optimization (TRPO) uses the same ideas but fixes the KL budget $\epsilon$ and solves for the step size $\alpha$ accordingly. See Schulman et al. (2015).
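
A toy sketch of the estimate-and-precondition step, with random numbers standing in for the per-sample score vectors and for $\nabla_\theta J(\theta)$; real implementations use conjugate gradient with Fisher-vector products instead of forming and solving $F$ directly:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 512, 6

scores = rng.normal(size=(N, d))           # stand-ins for grad_theta log pi_theta(a_i | s_i)
vanilla_grad = rng.normal(size=d)          # stand-in for the estimated grad_theta J(theta)

F = scores.T @ scores / N                  # Fisher estimate: average outer product
F += 1e-3 * np.eye(d)                      # small damping keeps the solve well conditioned

natural_grad = np.linalg.solve(F, vanilla_grad)   # F^{-1} grad J, without explicit inversion
alpha = 0.05
theta_step = alpha * natural_grad          # theta <- theta + alpha * F^{-1} grad J
print(theta_step)
```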

Additional Remarks and References

  • Actor-Critic methods can reduce variance further by using learned value functions.
  • Natural gradient techniques appear in TRPO, PPO, and other advanced policy gradient algorithms.

Some references:

  • Williams (1992): introduced REINFORCE.
  • Peters & Schaal (2008): natural policy gradient with excellent visuals.
  • Schulman et al. (2015): TRPO uses a KL constraint with conjugate gradient to find natural gradient steps.
  • Schulman et al. (2017): PPO refines these trust region ideas with a simpler clipped surrogate objective.

Conclusion

Vanilla policy gradients can be numerically unstable in continuous or high-dimensional settings, because different parameters can differ dramatically in how much they change the policy. Multiplying by $F^{-1}$, the inverse of the Fisher information matrix, produces the natural (covariant) gradient, which resolves this poor conditioning by stepping in directions that correspond to small changes in the distribution. This often converges more rapidly and robustly, forming the basis for many state-of-the-art RL methods like TRPO and PPO.
