n-step Bootstrapping
TD Prediction
Consider a sequence of states and rewards $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$, where $T$ is the terminal time step. The $n$-step return is defined as

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n.$$
This is a very natural generalization of the full return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$: instead of summing rewards all the way to termination, we truncate after $n$ steps and bootstrap from the current value estimate.
Update equation:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T.$$
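A minimal sketch of this update in Python, assuming a tabular value function `V` (a list or array indexed by state) and one recorded episode; the function and variable names are illustrative, not from these notes.

```python
def n_step_td_update(V, states, rewards, t, n, alpha, gamma):
    """One n-step TD update of V(S_t).

    states:  [S_0, S_1, ..., S_T]   (integer state indices)
    rewards: [R_1, R_2, ..., R_T]   (rewards[k] is R_{k+1})
    """
    T = len(rewards)                     # terminal time step
    end = min(t + n, T)                  # truncate at the end of the episode

    # Discounted sum of the rewards R_{t+1}, ..., R_{end}
    G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, end + 1))

    # If the episode has not terminated by t+n, bootstrap from V(S_{t+n})
    if t + n < T:
        G += gamma ** n * V[states[t + n]]

    # Move V(S_t) toward the n-step return
    V[states[t]] += alpha * (G - V[states[t]])
```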
Error-reduction property
The $n$-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond step $t+n$: we sum the rewards from $t+1$ to $t+n$, and then add the discounted value of the state at $t+n$ to create the "$n$-step return".
The error-reduction property states that

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+n} \mid S_t = s \right] - v_\pi(s) \right| \;\le\; \gamma^{n} \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|.$$

Explanation: This statement formalizes why expected $n$-step returns are guaranteed to be "better" (in a worst-state sense) than the old value estimate $V_{t+n-1}$. Concretely, it says:
- Worst-State Error: the expression $\max_s \left| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \right|$ means we look at the largest possible difference (error) across all states $s$.
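For a concrete instance (numbers chosen purely for illustration), with $\gamma = 0.9$ and $n = 3$ the bound reads

$$\max_s \left| \mathbb{E}_\pi\!\left[ G_{t:t+3} \mid S_t = s \right] - v_\pi(s) \right| \;\le\; 0.9^{3} \max_s \left| V_{t+2}(s) - v_\pi(s) \right| = 0.729 \max_s \left| V_{t+2}(s) - v_\pi(s) \right|,$$

so the worst-state error of the expected 3-step target is at most about 73% of the worst-state error of the current estimate.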
Sarsa
The simple generalization to action values is:

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n,$$

with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.
The natural update equation is:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T.$$
Recall that for one-step Sarsa, the update equation is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].$$
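A matching sketch of the $n$-step Sarsa update under the same illustrative episode format, with a tabular `Q[state][action]` (names are assumptions for the example):

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, alpha, gamma):
    """One n-step Sarsa update of Q(S_t, A_t).

    states:  [S_0, ..., S_T], actions: [A_0, ..., A_{T-1}],
    rewards: [R_1, ..., R_T]  (rewards[k] is R_{k+1}).
    """
    T = len(rewards)
    end = min(t + n, T)

    # Discounted sum of the rewards R_{t+1}, ..., R_{end}
    G = sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, end + 1))

    # Bootstrap from Q(S_{t+n}, A_{t+n}) if the episode is still running at t+n
    if t + n < T:
        G += gamma ** n * Q[states[t + n]][actions[t + n]]

    Q[states[t]][actions[t]] += alpha * (G - Q[states[t]][actions[t]])
```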
Off-policy Learning
For $n$-step TD, the update equation will simply be:

$$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha\, \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \le t < T,$$
where the importance sampling ratio $\rho_{t:h}$ is defined as

$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Similarly, for $n$-step Sarsa, the update equation will be:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \le t < T.$$
We're using $\rho_{t+1:t+n}$ here because the first action $A_t$ has already been taken (it is part of the state–action pair being updated), so the ratio only needs to correct for the subsequent actions $A_{t+1}, \ldots, A_{t+n}$; in the state-value case we instead use $\rho_{t:t+n-1}$, which covers the actions $A_t, \ldots, A_{t+n-1}$.
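A small sketch of the ratio computation and where each update applies it; `pi_prob(a, s)` and `b_prob(a, s)` are assumed action-probability functions, and the commented updates reuse the hypothetical `V`/`Q` tables from the sketches above, with `G` the corresponding $n$-step target.

```python
def importance_ratio(pi_prob, b_prob, states, actions, start, end, T):
    """rho_{start:end} = product over k = start .. min(end, T-1) of pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(start, min(end, T - 1) + 1):
        rho *= pi_prob(actions[k], states[k]) / b_prob(actions[k], states[k])
    return rho

# n-step TD:    the ratio covers A_t, ..., A_{t+n-1}
#   rho = importance_ratio(pi_prob, b_prob, states, actions, t, t + n - 1, T)
#   V[states[t]] += alpha * rho * (G - V[states[t]])
#
# n-step Sarsa: A_t is already taken, so the ratio covers A_{t+1}, ..., A_{t+n}
#   rho = importance_ratio(pi_prob, b_prob, states, actions, t + 1, t + n, T)
#   Q[states[t]][actions[t]] += alpha * rho * (G - Q[states[t]][actions[t]])
```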
Per-decision Methods with Control Variates
The original problem: naive off-policy returns can have high variance. In off-policy learning, we use a behaviour policy $b$ to generate the data, but we want to evaluate a target policy $\pi$. The importance sampling ratio is used to correct for the difference in the probabilities of actions taken by the two policies. If $\pi$ chooses an action with a much higher probability than $b$ does, the importance sampling ratio will be high, and the return will be scaled up. This can lead to high variance in the returns.
Similarly, if $\pi$ chooses an action with a much lower probability than $b$, the importance sampling ratio will be low, and the return will be scaled down. This can also lead to high variance in the returns.
For the $n$ steps ending at horizon $h = t + n$, the off-policy $n$-step return can be written recursively as

$$G_{t:h} \doteq \rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t)\, V_{h-1}(S_t),$$

where $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ and the recursion bottoms out at $G_{h:h} \doteq V_{h-1}(S_h)$.
One classical variance‐reduction trick is to add and subtract a “baseline” whose expected value is zero under the same distribution.
The second term, $(1 - \rho_t)\, V_{h-1}(S_t)$, is the control variate. Its expected value is zero, so it does not change the expected target, but it reduces variance: in particular, when $\rho_t = 0$ the target collapses to $V_{h-1}(S_t)$ rather than to zero, so the estimate does not shrink toward zero.
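A minimal sketch of this recursion in Python, computing $G_{t:h}$ backwards from the horizon; the tabular `V`, the `pi_prob`/`b_prob` probability functions, and the episode format are illustrative assumptions, not part of the notes.

```python
def off_policy_return_v(V, pi_prob, b_prob, states, actions, rewards, t, h, gamma):
    """Per-decision off-policy return with a control variate:
       G_{t:h} = rho_t * (R_{t+1} + gamma * G_{t+1:h}) + (1 - rho_t) * V(S_t),
       with G_{h:h} = V(S_h).  V of a terminal state is assumed to be 0."""
    T = len(rewards)                      # terminal time step
    if t == h or t == T:                  # bottom of the recursion
        return V[states[t]]
    rho_t = pi_prob(actions[t], states[t]) / b_prob(actions[t], states[t])
    G_next = off_policy_return_v(V, pi_prob, b_prob, states, actions, rewards,
                                 t + 1, h, gamma)
    # rewards[t] is R_{t+1}; when rho_t == 0 the target collapses to V(S_t)
    return rho_t * (rewards[t] + gamma * G_next) + (1.0 - rho_t) * V[states[t]]
```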
Why Doesn't This Introduce Bias?
The key reason is that the baseline has zero mean once you factor in the importance sampling correction. In more explicit derivations (e.g., Sutton & Barto, Chapter 7), one shows that

$$\rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right) + (1 - \rho_t)\, V_{h-1}(S_t)$$

has the same expectation as the purely corrected return

$$\rho_t \left( R_{t+1} + \gamma G_{t+1:h} \right).$$

Equivalently, we add something whose expected value is zero when actions are drawn from $b$ and weighted by the ratio $\pi/b$. Thus the update remains an unbiased estimator of the target-policy return, yet typically with much smaller variance.
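The zero-mean claim itself is a brief check (actions drawn from the behaviour policy $b$, conditioning on $S_t$):

$$\mathbb{E}_b\!\left[ \rho_t \right] = \sum_a b(a \mid S_t)\, \frac{\pi(a \mid S_t)}{b(a \mid S_t)} = \sum_a \pi(a \mid S_t) = 1 \quad\Longrightarrow\quad \mathbb{E}_b\!\left[ (1 - \rho_t)\, V_{h-1}(S_t) \right] = 0.$$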
For action values, the off-policy definition of the $n$-step return is a little different because the first action doesn't play a role in importance sampling: it has already been taken, so the importance weighting is applied only to the actions following the first one. So,

$$G_{t:h} \doteq R_{t+1} + \gamma \Big( \rho_{t+1} \big( G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1}) \big) + \bar{V}_{h-1}(S_{t+1}) \Big),$$

where $\bar{V}_{h-1}(S_{t+1}) \doteq \sum_a \pi(a \mid S_{t+1})\, Q_{h-1}(S_{t+1}, a)$ is the expected approximate value of $S_{t+1}$, with $G_{h:h} \doteq Q_{h-1}(S_h, A_h)$ if $h < T$ and $G_{T-1:T} \doteq R_T$.
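A matching sketch for the action-value recursion, under the same illustrative assumptions (tabular `Q` stored as a dict of action→value per state):

```python
def off_policy_return_q(Q, pi_prob, b_prob, states, actions, rewards, t, h, gamma):
    """Action-value per-decision return: the first action A_t is not corrected;
       importance sampling starts at rho_{t+1} inside the recursion."""
    T = len(rewards)
    if t + 1 == T:                        # last step: G_{T-1:T} = R_T
        return rewards[t]
    s1, a1 = states[t + 1], actions[t + 1]
    # Expected approximate value of S_{t+1} under the target policy pi
    v_bar = sum(pi_prob(a, s1) * q for a, q in Q[s1].items())
    if t + 1 == h:                        # bottom of the recursion: G_{h:h} = Q(S_h, A_h)
        G_next = Q[s1][a1]
    else:
        G_next = off_policy_return_q(Q, pi_prob, b_prob, states, actions, rewards,
                                     t + 1, h, gamma)
    rho_next = pi_prob(a1, s1) / b_prob(a1, s1)
    return rewards[t] + gamma * (rho_next * (G_next - Q[s1][a1]) + v_bar)
```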
Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm
A good way to see the tree‐backup n‐step return is that it “splits” the backup at each intermediate state into two parts:
- All other actions $a \neq A_{t+1}$ (besides the one actually taken). For these, it uses the current estimates $Q_{t+n-1}(S_{t+1}, a)$, each weighted by the target-policy probability $\pi(a \mid S_{t+1})$.
- The action actually taken, i.e. $A_{t+1}$. For this one, the backup recursively goes deeper with $G_{t+1:t+n}$, weighted by $\pi(A_{t+1} \mid S_{t+1})$.
Here's the equation (for $t < T - 1$ and $n \ge 2$):

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},$$
and the recursion bottoms out at the one-step case

$$G_{t:t+1} \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a),$$

with $G_{T-1:t+n} \doteq R_T$ at the last step before termination.
The update equation remains the same as for $n$-step Sarsa (only the way of calculating the target, and hence the error, is different):

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$
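A minimal sketch of the tree-backup return computed recursively; `pi_prob` and the tabular `Q` (dict of action→value per state) are illustrative assumptions, as in the earlier sketches.

```python
def tree_backup_return(Q, pi_prob, states, actions, rewards, t, n, gamma):
    """n-step tree-backup return G_{t:t+n} from a recorded episode.

    Untaken actions at each level contribute their current estimates, weighted
    by pi; only the action actually taken is expanded one level deeper.
    """
    T = len(rewards)
    if t + 1 == T:                        # G_{T-1:t+n} = R_T
        return rewards[t]
    s1 = states[t + 1]
    if n == 1:                            # one-step case: bootstrap on all actions
        return rewards[t] + gamma * sum(pi_prob(a, s1) * q for a, q in Q[s1].items())
    a1 = actions[t + 1]
    # Leaves: all actions other than the one taken, weighted by pi(a | S_{t+1})
    untaken = sum(pi_prob(a, s1) * q for a, q in Q[s1].items() if a != a1)
    # Recurse one level deeper along the action actually taken
    deeper = tree_backup_return(Q, pi_prob, states, actions, rewards,
                                t + 1, n - 1, gamma)
    return rewards[t] + gamma * (untaken + pi_prob(a1, s1) * deeper)
```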