n-step Bootstrapping
TD Prediction
Consider a sequence of states and rewards $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$, where $T$ is the terminal time step. The $n$-step return is defined as

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n.$$
This is a very natural generalization of the full return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$: instead of summing rewards all the way to termination, we truncate after $n$ steps and bootstrap from the current value estimate.
Update equation:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T.$$
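A minimal sketch of this update in Python, assuming a tabular value function `V` (a list or array indexed by state) and one recorded episode; the function and variable names are illustrative, not from these notes.

```python
def n_step_td_update(V, states, rewards, t, n, alpha, gamma):
    """One n-step TD update of V(S_t).

    states:  [S_0, S_1, ..., S_T]   (integer state indices)
    rewards: [R_1, R_2, ..., R_T]   (rewards[k] is R_{k+1})
    """
    T = len(rewards)                     # terminal time step
    end = min(t + n, T)                  # truncate at the end of the episode

    # Discounted sum of the rewards R_{t+1}, ..., R_{end}
    G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, end + 1))

    # If the episode has not terminated by t+n, bootstrap from V(S_{t+n})
    if t + n < T:
        G += gamma ** n * V[states[t + n]]

    # Move V(S_t) toward the n-step return
    V[states[t]] += alpha * (G - V[states[t]])
```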
Error-reduction property
The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond step $t+n$: we sum the rewards from $t+1$ to $t+n$, and then add the discounted value of the state at $t+n$ to create the "$n$-step return".
The error-reduction property states that

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+n} \mid S_t = s \right] - v_\pi(s) \right| \;\le\; \gamma^{n} \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|.$$

Explanation: This statement formalizes why expected $n$-step returns are guaranteed to be "better" (in a worst-state sense) than the old value estimate $V_{t+n-1}$. Concretely, it says:
- Worst-State Error: the expression $\max_s \left| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \right|$ means we look at the largest possible difference (error) across all states $s$.
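For a concrete instance (numbers chosen purely for illustration), with $\gamma = 0.9$ and $n = 3$ the bound reads

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+3} \mid S_t = s \right] - v_\pi(s) \right| \;\le\; 0.9^{3} \max_s \left| V_{t+2}(s) - v_\pi(s) \right| = 0.729 \max_s \left| V_{t+2}(s) - v_\pi(s) \right|,$$

so the worst-state error of the expected 3-step target is at most about 73% of the worst-state error of the current estimate.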
Sarsa
The simple generalization to action values is:

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n,$$

with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.
The natural update equation is:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T.$$
Recall that for one-step Sarsa, the update equation is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].$$
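A matching sketch of the $n$-step Sarsa update under the same illustrative episode format, with a tabular `Q[state][action]` (names are assumptions for the example):

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, alpha, gamma):
    """One n-step Sarsa update of Q(S_t, A_t).

    states:  [S_0, ..., S_T], actions: [A_0, ..., A_{T-1}],
    rewards: [R_1, ..., R_T]  (rewards[k] is R_{k+1}).
    """
    T = len(rewards)
    end = min(t + n, T)

    # Discounted sum of the rewards R_{t+1}, ..., R_{end}
    G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, end + 1))

    # Bootstrap from Q(S_{t+n}, A_{t+n}) if the episode is still running at t+n
    if t + n < T:
        G += gamma ** n * Q[states[t + n]][actions[t + n]]

    Q[states[t]][actions[t]] += alpha * (G - Q[states[t]][actions[t]])
```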
Off-policy Learning
For $n$-step TD, the update equation will simply be:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha\, \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T,$$
where the importance sampling ratio $\rho_{t:h}$ is defined as

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Similarly, for $n$-step Sarsa, the update equation will be:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T.$$
We're using $\rho_{t+1:t+n}$ here because the first action $A_t$ has already been taken (it is part of the state–action pair being updated), so the ratio only needs to correct for the subsequent actions $A_{t+1}, \ldots, A_{t+n}$; in the state-value case we instead use $\rho_{t:t+n-1}$, which covers the actions $A_t, \ldots, A_{t+n-1}$.
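A small sketch of the ratio computation and where each update applies it; `pi_prob(a, s)` and `b_prob(a, s)` are assumed action-probability functions, and the commented updates reuse the hypothetical `V`/`Q` tables from the sketches above, with `G` the corresponding $n$-step target.

```python
def importance_ratio(pi_prob, b_prob, states, actions, start, end, T):
    """rho_{start:end} = product over k = start .. min(end, T-1) of pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(start, min(end, T - 1) + 1):
        rho *= pi_prob(actions[k], states[k]) / b_prob(actions[k], states[k])
    return rho

# n-step TD:    the ratio covers A_t, ..., A_{t+n-1}
#   rho = importance_ratio(pi_prob, b_prob, states, actions, t, t + n - 1, T)
#   V[states[t]] += alpha * rho * (G - V[states[t]])
#
# n-step Sarsa: A_t is already taken, so the ratio covers A_{t+1}, ..., A_{t+n}
#   rho = importance_ratio(pi_prob, b_prob, states, actions, t + 1, t + n, T)
#   Q[states[t]][actions[t]] += alpha * rho * (G - Q[states[t]][actions[t]])
```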
Per-decision Methods with Control Variates
The original problem: naive off-policy returns can have high variance. In off-policy learning, we use a behaviour policy $b$ to generate the data, but we want to evaluate a target policy $\pi$. The importance sampling ratio is used to correct for the difference in the probabilities of actions taken by the two policies. If $\pi$ chooses an action with a much higher probability than $b$ does, the importance sampling ratio will be high, and the return will be scaled up. This can lead to high variance in the returns.
Similarly, if $\pi$ chooses an action with a much lower probability than $b$, the importance sampling ratio will be low, and the return will be scaled down. This can also lead to high variance in the returns.
For the $n$ steps ending at horizon $h = t + n$, the off-policy $n$-step return can be written recursively as

$$G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t)\, V_{h-1}(S_t),$$

where $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ and the recursion bottoms out at $G_{h:h} \doteq V_{h-1}(S_h)$.
One classical variance‐reduction trick is to add and subtract a “baseline” whose expected value is zero under the same distribution.
The second term, $(1 - \rho_t)\, V_{h-1}(S_t)$, is the control variate. Its expected value is zero, so it does not change the expected target, but it reduces variance: in particular, when $\rho_t = 0$ the target collapses to $V_{h-1}(S_t)$ rather than to zero, so the estimate does not shrink toward zero.
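A minimal sketch of this recursion in Python, computing $G_{t:h}$ backwards from the horizon; the tabular `V`, the `pi_prob`/`b_prob` probability functions, and the episode format are illustrative assumptions, not part of the notes.

```python
def off_policy_return_v(V, pi_prob, b_prob, states, actions, rewards, t, h, gamma):
    """Per-decision off-policy return with a control variate:
       G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t),
       with G_{h:h} = V(S_h).  V of a terminal state is assumed to be 0."""
    T = len(rewards)                      # terminal time step
    if t == h or t == T:                  # bottom of the recursion
        return V[states[t]]
    rho_t = pi_prob(actions[t], states[t]) / b_prob(actions[t], states[t])
    G_next = off_policy_return_v(V, pi_prob, b_prob, states, actions, rewards,
                                 t + 1, h, gamma)
    # rewards[t] is R_{t+1}; when rho_t == 0 the target collapses to V(S_t)
    return rho_t * (rewards[t] + gamma * G_next) + (1.0 - rho_t) * V[states[t]]
```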
Why Doesn't This Introduce Bias?
The key reason is that the baseline has zero mean once you factor in the importance sampling correction. In more explicit derivations (e.g., Sutton & Barto, Chapter 7), one shows that

$$\rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t)\, V_{h-1}(S_t)$$

has the same expectation as the purely corrected return

$$\rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right).$$

Equivalently, we add something whose expected value is zero when actions are drawn from $b$ and weighted by the ratio $\pi/b$. Thus the update remains an unbiased estimator of the target-policy return, yet typically with much smaller variance.
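The zero-mean claim itself is a brief check (actions drawn from the behaviour policy $b$, conditioning on $S_t$):

$$\mathbb{E}_b\!\left[ \rho_t \right] = \sum_a b(a \mid S_t)\, \frac{\pi(a \mid S_t)}{b(a \mid S_t)} = \sum_a \pi(a \mid S_t) = 1 \quad\Longrightarrow\quad \mathbb{E}_b\!\left[ (1 - \rho_t)\, V_{h-1}(S_t) \right] = 0.$$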
For action values, the off-policy definition of the $n$-step return is a little different because the first action doesn't play a role in importance sampling: it has already been taken, so the importance weighting is applied only to the actions following the first one. So,

$$G_{t:h} \doteq R_{t+1} + \gamma \Big( \rho_{t+1} \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \bar{V}_{h-1}(S_{t+1}) \Big),$$

where $\bar{V}_{h-1}(S_{t+1}) \doteq \sum_a \pi(a \mid S_{t+1})\, Q_{h-1}(S_{t+1}, a)$ is the expected approximate value of $S_{t+1}$, with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$ and $G_{T-1:T} \doteq R_T$.
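A matching sketch for the action-value recursion, under the same illustrative assumptions (tabular `Q` stored as a dict of action→value per state):

```python
def off_policy_return_q(Q, pi_prob, b_prob, states, actions, rewards, t, h, gamma):
    """Action-value per-decision return: the first action A_t is not corrected;
       importance sampling starts at rho_{t+1} inside the recursion."""
    T = len(rewards)
    if t + 1 == T:                        # last step: G_{T-1:T} = R_T
        return rewards[t]
    s1, a1 = states[t + 1], actions[t + 1]
    # Expected approximate value of S_{t+1} under the target policy pi
    v_bar = sum(pi_prob(a, s1) * q for a, q in Q[s1].items())
    if t + 1 == h:                        # bottom of the recursion: G_{h:h} = Q(S_h, A_h)
        G_next = Q[s1][a1]
    else:
        G_next = off_policy_return_q(Q, pi_prob, b_prob, states, actions, rewards,
                                     t + 1, h, gamma)
    rho_next = pi_prob(a1, s1) / b_prob(a1, s1)
    return rewards[t] + gamma * (rho_next * (G_next - Q[s1][a1]) + v_bar)
```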
Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm
A good way to see the tree‐backup n‐step return is that it “splits” the backup at each intermediate state into two parts:
- All other actions $a \neq A_{t+1}$ (besides the one actually taken). For these, it uses the current estimates $Q_{t+n-1}(S_{t+1}, a)$, each weighted by the target-policy probability $\pi(a \mid S_{t+1})$.
- The action actually taken, i.e. $A_{t+1}$. For this one, the backup recursively goes deeper with $G_{t+1:t+n}$, weighted by $\pi(A_{t+1} \mid S_{t+1})$.
Here's the equation (for $t < T - 1$ and $n \ge 2$):

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},$$
and the recursion bottoms out at the one-step case

$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a),$$

with $G_{T-1:t+n} \doteq R_T$ at the last step before termination.
The update equation remains the same as for $n$-step Sarsa (only the way of calculating the target, and hence the error, is different):

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$
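A minimal sketch of the tree-backup return computed recursively; `pi_prob` and the tabular `Q` (dict of action→value per state) are illustrative assumptions, as in the earlier sketches.

```python
def tree_backup_return(Q, pi_prob, states, actions, rewards, t, n, gamma):
    """n-step tree-backup return G_{t:t+n} from a recorded episode.

    Untaken actions at each level contribute their current estimates, weighted
    by pi; only the action actually taken is expanded one level deeper.
    """
    T = len(rewards)
    if t + 1 == T:                        # G_{T-1:t+n} = R_T
        return rewards[t]
    s1 = states[t + 1]
    if n == 1:                            # one-step case: bootstrap on all actions
        return rewards[t] + gamma * sum(pi_prob(a, s1) * q for a, q in Q[s1].items())
    a1 = actions[t + 1]
    # Leaves: all actions other than the one taken, weighted by pi(a | S_{t+1})
    untaken = sum(pi_prob(a, s1) * q for a, q in Q[s1].items() if a != a1)
    # Recurse one level deeper along the action actually taken
    deeper = tree_backup_return(Q, pi_prob, states, actions, rewards,
                                t + 1, n - 1, gamma)
    return rewards[t] + gamma * (untaken + pi_prob(a1, s1) * deeper)
```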