
Bhavit Sharma

n-step Bootstrapping

n-step TD Prediction

Consider a sequence of states and rewards $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, S_{t+n-1}, R_{t+n}$, where $n \ge 1$. The $n$-step return is defined as

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$

This is a very natural generalization of the TD(0) return $G_t^{(1)} = R_{t+1} + \gamma V_t(S_{t+1})$.

Update equation:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\bigl[G_{t:t+n} - V_{t+n-1}(S_t)\bigr]$$
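
Below is a minimal tabular sketch of this procedure (mine, not from the book): it assumes integer states, an environment exposing `env.reset()`, `env.step(action)` returning `(next_state, reward, done)`, a hypothetical `env.num_states` attribute, and a `policy` function mapping states to actions.

```python
import numpy as np

def n_step_td_prediction(env, policy, n=4, alpha=0.1, gamma=0.99, num_episodes=100):
    """Tabular n-step TD prediction following the update equation above (a sketch)."""
    V = np.zeros(env.num_states)                 # env.num_states is an assumed attribute
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]                          # rewards[t] holds R_t, so index 0 is unused
        T = float("inf")
        t = 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                      # the time whose estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: discounted rewards plus a bootstrap from V(S_{tau+n})
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```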

Error-reduction property

The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$: we sum the discounted rewards from $t+1$ to $t+n$ and then add the discounted value of the state at $t+n$ to form the "$n$-step return".

Explanation: This statement formalizes why the expected $n$-step return is guaranteed to be "better" (in a worst-state sense) than the old value estimate $V_{t+n-1}$. Concretely, it says

$$\max_s \Bigl\lvert \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \Bigr\rvert \;\le\; \gamma^n \max_s \Bigl\lvert V_{t+n-1}(s) - v_\pi(s) \Bigr\rvert.$$
  1. Worst-state error: the expression $\max_s \bigl\lvert \cdot \bigr\rvert$ means we look at the largest possible difference (error) across all states $s$.
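
As a purely illustrative instance of the bound (numbers chosen here, not from the text): with $\gamma = 0.9$ and $n = 4$,

$$\max_s \bigl\lvert \mathbb{E}_\pi[G_{t:t+4} \mid S_t = s] - v_\pi(s) \bigr\rvert \;\le\; 0.9^4 \max_s \bigl\lvert V_{t+3}(s) - v_\pi(s) \bigr\rvert \approx 0.66 \max_s \bigl\lvert V_{t+3}(s) - v_\pi(s) \bigr\rvert,$$

so larger $n$ (or smaller $\gamma$) shrinks the worst-case error factor further toward zero.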

n-step Sarsa

A simple generalization to action values is:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$$

The natural update equation is:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr]$$

Recall that for normal (one-step) Sarsa, the update equation is:

$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha\bigl[R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)\bigr]$$
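
Here is a rough sketch of a single $n$-step Sarsa update in code (the episode buffers and the `Q[s, a]` indexing are assumptions made for illustration):

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, T, alpha, gamma):
    """One n-step Sarsa update at time tau, following the equations above (a sketch).

    states[i], actions[i] are S_i, A_i and rewards[i] is R_i (rewards[0] unused);
    Q is a dict keyed by (s, a) or a 2-D array indexed as Q[s, a]."""
    # n-step return: discounted rewards R_{tau+1} .. R_{min(tau+n, T)}
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # bootstrap from Q(S_{tau+n}, A_{tau+n}) if the episode has not ended by then
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    # the update equation: move Q(S_tau, A_tau) toward the n-step target
    Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
    return G
```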

n-step Off-policy Learning

For $n$-step TD, the update equation will simply be:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1}\bigl[G_{t:t+n} - V_{t+n-1}(S_t)\bigr]$$

where $\rho_{t:h} = \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$.
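
A sketch of how $\rho_{t:h}$ and the off-policy update might look in code, assuming `pi(a, s)` and `b(a, s)` return action probabilities (these names and signatures are illustrative, not from the text):

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

def off_policy_n_step_td_update(V, states, actions, rewards, pi, b,
                                tau, n, T, alpha, gamma):
    """V(S_tau) <- V(S_tau) + alpha * rho_{tau:tau+n-1} * (G_{tau:tau+n} - V(S_tau))."""
    rho = importance_ratio(pi, b, states, actions, tau, tau + n - 1, T)
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * V[states[tau + n]]
    V[states[tau]] += alpha * rho * (G - V[states[tau]])
```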

Similarly, for $n$-step Sarsa, the update equation will be:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n}\bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr]$$

We use $\rho_{t+1:t+n}$ here because $A_t$ has already been taken, so the return only depends on the subsequent actions $A_{t+1}, \ldots, A_{t+n}$. In the case of $V$, by contrast, the return from $S_t$ to $S_{t+n}$ depends on the actions $A_t, A_{t+1}, \ldots, A_{t+n-1}$, so the ratio starts at $t$.
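
In terms of the hypothetical `importance_ratio` helper sketched above, the only difference is the slice of the product:

```python
def rho_for_V_and_Q(pi, b, states, actions, tau, n, T):
    """Importance-sampling slices used by the two updates above (illustrative)."""
    # V update: the target depends on A_tau, ..., A_{tau+n-1}  ->  rho_{tau:tau+n-1}
    rho_v = importance_ratio(pi, b, states, actions, tau, tau + n - 1, T)
    # Q update: A_tau is already fixed, only A_{tau+1}, ..., A_{tau+n} are reweighted
    rho_q = importance_ratio(pi, b, states, actions, tau + 1, tau + n, T)
    return rho_v, rho_q
```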

Per-decision Methods with Control Variates

The original problem: naive off-policy returns can have high variance. When using off-policy learning, we use a behavior policy $b$ to generate the data, but we want to evaluate a target policy $\pi$. The importance sampling ratio $\rho_t = \dfrac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ corrects for the difference in the probabilities of the actions taken by the two policies. If $\pi$ chooses an action with a much higher probability than $b$, the importance sampling ratio will be large and the return will be scaled up, which can lead to high variance in the returns.

Similarly, if $\pi$ chooses an action with a much lower probability than $b$, the importance sampling ratio will be small and the return will be scaled down. This can also lead to high variance in the returns.

For $n$ steps ending at horizon $h$, the $n$-step return can be written as

$$G_{t:h} = R_{t+1} + \gamma G_{t+1:h} \qquad \text{for } t < h < T$$

One classical variance-reduction trick is to add and subtract a "baseline" whose expected value is zero under the same distribution. Applying this idea to the off-policy recursion gives the per-decision return with a control variate:

$$G_{t:h} = \rho_t\bigl(R_{t+1} + \gamma G_{t+1:h}\bigr) + (1 - \rho_t)V_{h-1}(S_t) \qquad \text{for } t < h \le T$$

The second term, $(1 - \rho_t)V_{h-1}(S_t)$, is the control variate. Because $\mathbb{E}_b[\rho_t] = 1$, it has zero expected value and so does not bias the target; but when $\rho_t$ is small (e.g., zero), the target shrinks toward the current estimate $V_{h-1}(S_t)$ rather than toward zero, which reduces variance.
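
A recursive sketch of this control-variate return (the buffer layout, the base case $G_{h:h} = V(S_h)$, and the convention that $V$ of a terminal state is zero are assumptions made for illustration):

```python
def cv_return_state(rewards, states, rhos, V, t, h, gamma):
    """Per-decision n-step return with a control variate for state values (a sketch):

        G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t)

    with base case G_{h:h} = V(S_h); V of a terminal state is taken to be 0.
    rhos[k] holds pi(A_k | S_k) / b(A_k | S_k)."""
    if t == h:
        return V[states[t]]
    inner = rewards[t + 1] + gamma * cv_return_state(rewards, states, rhos, V,
                                                     t + 1, h, gamma)
    return rhos[t] * inner + (1.0 - rhos[t]) * V[states[t]]
```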

Why Doesn't This Introduce Bias?

The key reason is that the baseline has zero mean once you factor in the importance sampling correction. In more explicit derivations (e.g., Sutton & Barto, Chapter 7), one shows that

$$\rho_t \bigl[R_{t+1} + \gamma G_{t+1:h} - V_{h-1}(S_t)\bigr] + V_{h-1}(S_t)$$

has the same expectation as the purely corrected return

$$\rho_t \bigl(R_{t+1} + \gamma G_{t+1:h}\bigr).$$

Equivalently, we add a term whose expected value is zero under the behavior policy $b$. Thus the update remains an unbiased estimator of the target-policy return, yet typically with much smaller variance.
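
Spelling out the zero-mean step: under the behavior policy $b$,

$$\mathbb{E}_b[\rho_t \mid S_t] = \sum_a b(a \mid S_t)\,\frac{\pi(a \mid S_t)}{b(a \mid S_t)} = \sum_a \pi(a \mid S_t) = 1,$$

so $\mathbb{E}_b\bigl[(1 - \rho_t)V_{h-1}(S_t) \mid S_t\bigr] = 0$ and the control variate leaves the expected target unchanged.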

For action values, the off-policy definition of the $n$-step return is a little different because the first action plays no role in importance sampling: it has already been taken, so the importance weighting applies only to the rewards following it. Writing $\hat{V}_{h-1}(s) \doteq \sum_a \pi(a \mid s)\, Q_{h-1}(s, a)$ for the expected approximate value under $\pi$, we have

$$G_{t:h} = R_{t+1} + \gamma \bigl(\rho_{t+1} G_{t+1:h} + \hat{V}_{h-1}(S_{t+1}) - \rho_{t+1} Q_{h-1}(S_{t+1}, A_{t+1})\bigr)$$
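
A recursive sketch of this action-value return (the `pi(a, s)` signature, the `action_space` list, and the base case $G_{h:h} = Q_{h-1}(S_h, A_h)$ for $h < T$ are assumptions made for illustration):

```python
def cv_return_action(rewards, states, actions, rhos, Q, pi, action_space, t, h, gamma):
    """Per-decision return with a control variate for action values (a sketch).

    pi(a, s) is assumed to return the target policy's probability of a in s,
    Q is indexed as Q[s, a], and rhos[k] holds pi(A_k|S_k) / b(A_k|S_k)."""
    def v_hat(s):
        # expected approximate value under pi: V_hat(s) = sum_a pi(a|s) Q(s, a)
        return sum(pi(a, s) * Q[s, a] for a in action_space)

    if t == h:                                # assumed base case: G_{h:h} = Q(S_h, A_h)
        return Q[states[t], actions[t]]
    s1, a1 = states[t + 1], actions[t + 1]
    G_next = cv_return_action(rewards, states, actions, rhos, Q, pi,
                              action_space, t + 1, h, gamma)
    # G_{t:h} = R_{t+1} + gamma * (rho_{t+1} G_{t+1:h} + V_hat(S_{t+1})
    #                              - rho_{t+1} Q(S_{t+1}, A_{t+1}))
    return rewards[t + 1] + gamma * (rhos[t + 1] * G_next + v_hat(s1)
                                     - rhos[t + 1] * Q[s1, a1])
```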

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

A good way to see the tree‐backup n‐step return is that it “splits” the backup at each intermediate state into two parts:

  1. All other actions (besides the one actually taken, i.e., $A_{t+1}$): it uses the current estimate $Q_{t+n-1}(S_{t+1}, a)$, weighted by $\pi(a \mid S_{t+1})$.
  2. The action actually taken, i.e., $A_{t+1}$: instead of its current estimate, it recursively goes deeper with $G_{t+1:t+n}$, weighted by $\pi(A_{t+1} \mid S_{t+1})$.

Here's the equation:

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi\bigl(a \mid S_{t+1}\bigr)\, Q_{t+n-1}\bigl(S_{t+1}, a\bigr) + \gamma\, \pi\bigl(A_{t+1} \mid S_{t+1}\bigr)\, G_{t+1:t+n}$$

and $G_{T-1:t+n} = R_T$.

The update equation remains the same as for $n$-step Sarsa; only the way the target (and hence the error) is calculated differs:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr]$$
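
A recursive sketch of computing the tree-backup return (the `pi(a, s)` signature and the `action_space` list are assumptions; `rewards[T]` is taken to hold $R_T$):

```python
def tree_backup_return(rewards, states, actions, Q, pi, action_space,
                       t, steps_left, T, gamma):
    """n-step tree-backup return G_{t:t+n}; call with steps_left = n (a sketch).

    pi(a, s) is assumed to return the target policy's probability of a in s,
    and Q is indexed as Q[s, a]."""
    if t == T - 1:                        # boundary case: G_{T-1:t+n} = R_T
        return rewards[T]
    s1 = states[t + 1]
    if steps_left == 1:                   # one-step case: expectation over all actions
        expected = sum(pi(a, s1) * Q[s1, a] for a in action_space)
        return rewards[t + 1] + gamma * expected
    a1 = actions[t + 1]
    # all actions other than the one actually taken use the current Q estimate
    others = sum(pi(a, s1) * Q[s1, a] for a in action_space if a != a1)
    # the taken action recurses one level deeper
    G_next = tree_backup_return(rewards, states, actions, Q, pi, action_space,
                                t + 1, steps_left - 1, T, gamma)
    return rewards[t + 1] + gamma * others + gamma * pi(a1, s1) * G_next
```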
