Monte Carlo Methods
Importance Sampling
We have the importance-sampling ratio for a trajectory, using behavior policy $b$ and target policy $\pi$, as:

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

The transition probabilities appear in both numerator and denominator and cancel, so the ratio depends only on the two policies and not on the MDP's dynamics.

Now, we have to show that $\mathbb{E}_b\!\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s)$, i.e. we take the expectation as a summation over all the possible trajectories.

So, we have:

$$\mathbb{E}_b\!\left[\rho_{t:T-1} G_t \mid S_t = s\right] = \sum_{\tau} \Pr\{\tau \mid S_t = s, b\}\, \rho_{t:T-1}\, G_t = \sum_{\tau} \Pr\{\tau \mid S_t = s, \pi\}\, G_t = v_\pi(s)$$

where the sum runs over all possible trajectories $\tau$ starting from $s$ at time $t$, and the middle step uses the fact that $\rho_{t:T-1}$ is exactly the ratio of the probabilities of the trajectory under $\pi$ and under $b$.
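To make this concrete, here is a minimal sketch of how ordinary and weighted importance-sampling estimates of $v_\pi(s)$ could be computed from logged episodes. The episode format and the `target_prob` / `behavior_prob` helpers are assumptions made for illustration, not something fixed by the text above.

```python
import numpy as np

def is_estimates(episodes, target_prob, behavior_prob, state, gamma=1.0):
    """Ordinary and weighted importance-sampling estimates of v_pi(state).

    episodes: list of episodes generated by the behavior policy b, each a
              list of (S_k, A_k, R_{k+1}) triples.
    target_prob(a, s), behavior_prob(a, s): action probabilities under pi and b.
    """
    scaled_returns, ratios = [], []
    for episode in episodes:
        T = len(episode)
        for t, (s, _, _) in enumerate(episode):
            if s != state:                        # every-visit bookkeeping
                continue
            G, rho = 0.0, 1.0                     # return G_t and ratio rho_{t:T-1}
            for k in range(t, T):
                s_k, a_k, r_k1 = episode[k]
                G += gamma ** (k - t) * r_k1
                rho *= target_prob(a_k, s_k) / behavior_prob(a_k, s_k)
            scaled_returns.append(rho * G)
            ratios.append(rho)
    ordinary = np.mean(scaled_returns)                  # sum(rho * G) / |T(s)|
    weighted = np.sum(scaled_returns) / np.sum(ratios)  # sum(rho * G) / sum(rho)
    return ordinary, weighted
```

The ordinary estimate is unbiased but can have very high (even infinite) variance; the weighted estimate is biased but usually has much lower variance.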
To see more about importance sampling, particularly in the context of variance reduction, see (Hsieh 2020).
*Discounting-aware Importance Sampling
These are methods used to significantly reduce the variance of off-policy estimators.
Why is the variance high with the existing importance sampling? Consider the case where episodes are long and $\gamma$ is significantly less than $1$. For concreteness, let's say that episodes last 100 steps and that $\gamma = 0$. Then, the return from time $0$ will be just $G_0 = R_1$, while the importance-sampling ratio will be a product of 100 factors:

$$\rho_{0:99} = \frac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)} \frac{\pi(A_1 \mid S_1)}{b(A_1 \mid S_1)} \cdots \frac{\pi(A_{99} \mid S_{99})}{b(A_{99} \mid S_{99})}$$

The return is scaled by all 100 factors, but technically we only need the first one, $\frac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)}$, as all the other factors correspond to decisions made after the reward $R_1$ and are independent of it (each has expectation $1$). This is where the variance comes from: the product of the remaining 99 factors does not change the expected value, but it can be extremely large or extremely small.
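A small numerical illustration of this (a hypothetical two-action setup chosen only for this example: uniform behavior policy, a target policy that prefers one action): both the full 100-factor ratio and the single needed factor have expectation $1$, but their variances differ astronomically.

```python
import numpy as np

rng = np.random.default_rng(0)
pi, b = np.array([0.9, 0.1]), np.array([0.5, 0.5])   # hypothetical target / behavior policies
n_steps, n_episodes = 100, 10_000

actions = rng.choice(2, size=(n_episodes, n_steps), p=b)   # actions drawn from b
factors = pi[actions] / b[actions]                          # per-step pi/b ratios

full_rho = factors.prod(axis=1)    # rho_{0:99}: what ordinary IS multiplies R_1 by
first_rho = factors[:, 0]          # rho_{0:0}: the only factor R_1 actually needs

second_moment = (pi ** 2 / b).sum()                          # E_b[(pi/b)^2] = 1.64 here
print("true variance of full product :", second_moment ** n_steps - 1)   # ~3e21
print("true variance of first factor :", second_moment - 1)              # 0.64
print("sample mean of full product (should be 1):", full_rho.mean())
print("sample mean of first factor (should be 1):", first_rho.mean())
```

In practice the sample mean of the full product lands far from $1$, because its distribution is so heavy-tailed that a reasonable number of episodes almost never sees the rare, enormous products that carry the expectation.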
Let's define the flat partial return (flat meaning undiscounted, partial meaning it stops at a horizon $h$ before termination $T$):

$$\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T$$

We can also show that the full return can be written as a weighted sum of flat partial returns:

$$G_t = (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar{G}_{t:h} + \gamma^{T-t-1} \bar{G}_{t:T}$$
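As a quick sanity check of this decomposition (a worked example, not part of the original derivation), take $t = 0$ and $T = 3$:

$$\begin{aligned}
(1-\gamma)\,\bar{G}_{0:1} + (1-\gamma)\gamma\,\bar{G}_{0:2} + \gamma^{2}\,\bar{G}_{0:3}
  &= (1-\gamma)R_1 + (1-\gamma)\gamma\,(R_1 + R_2) + \gamma^{2}(R_1 + R_2 + R_3) \\
  &= \big[(1-\gamma) + (1-\gamma)\gamma + \gamma^{2}\big] R_1 + \big[(1-\gamma)\gamma + \gamma^{2}\big] R_2 + \gamma^{2} R_3 \\
  &= R_1 + \gamma R_2 + \gamma^{2} R_3 = G_0 .
\end{aligned}$$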
Thus we arrive at an ordinary importance-sampling estimator in which each flat partial return is scaled by a correspondingly truncated ratio:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{|\mathcal{T}(s)|}$$

Notice that every individual term uses a truncated ratio $\rho_{t:h-1}$ with fewer factors (and hence lower variance) than the original full-trajectory ratio $\rho_{t:T-1}$.

The corresponding weighted importance-sampling estimator is:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{\sum_{t \in \mathcal{T}(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \right)}$$
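A rough sketch of how these discounting-aware estimates might be computed (same assumed episode format and probability helpers as before; this is an illustration, not a reference implementation):

```python
import numpy as np

def discounting_aware_estimates(episodes, target_prob, behavior_prob, state, gamma):
    """Discounting-aware ordinary/weighted IS estimates of v_pi(state) (sketch).

    Each flat partial return G_bar_{t:h} is weighted only by the truncated
    ratio rho_{t:h-1} instead of the full rho_{t:T-1}.
    """
    numerators, denominators = [], []
    for episode in episodes:
        T = len(episode)
        for t, (s, _, _) in enumerate(episode):
            if s != state:
                continue
            rho, g_flat = 1.0, 0.0      # running rho_{t:h-1} and G_bar_{t:h}
            num = den = 0.0
            for h in range(t + 1, T + 1):
                s_k, a_k, r_k1 = episode[h - 1]
                rho *= target_prob(a_k, s_k) / behavior_prob(a_k, s_k)
                g_flat += r_k1
                # interior horizons get weight (1-gamma)*gamma^(h-t-1);
                # the final horizon h = T gets weight gamma^(T-t-1)
                w = gamma ** (h - t - 1) * ((1 - gamma) if h < T else 1.0)
                num += w * rho * g_flat
                den += w * rho
            numerators.append(num)
            denominators.append(den)
    ordinary = np.mean(numerators)                        # divide by |T(s)|
    weighted = np.sum(numerators) / np.sum(denominators)
    return ordinary, weighted
```

With $\gamma = 1$ the interior weights vanish and both estimates reduce to the plain ordinary and weighted importance-sampling estimators; with $\gamma < 1$ the late, irrelevant ratio factors are progressively discounted away.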
*Per-decision Importance Sampling
We know that each term of the sum in the numerator of the off-policy estimators is itself a sum over rewards:

$$\rho_{t:T-1} G_t = \rho_{t:T-1}\left(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\right) = \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T \tag{5.11}$$

We can also see that each of these sub-terms is a product of a random reward and a random importance-sampling ratio; for example, the first sub-term can be written as:

$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})} R_{t+1}$$

Only the first factor and the last term (the reward $R_{t+1}$) are related. All the other factors refer to decisions made after the reward, and the expectation of each of them is $1$ because the actions are selected by the behavior policy $b$:

$$\mathbb{E}\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] = \sum_a b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_a \pi(a \mid S_k) = 1$$

We can then show that $\mathbb{E}\!\left[\rho_{t:T-1} R_{t+1}\right] = \mathbb{E}\!\left[\rho_{t:t} R_{t+1}\right]$. If we repeat the process for the remaining sub-terms, we can show that, for the $k$-th sub-term,

$$\mathbb{E}\!\left[\rho_{t:T-1} R_{t+k}\right] = \mathbb{E}\!\left[\rho_{t:t+k-1} R_{t+k}\right]$$

It follows then that the expectation of our original term (5.11) can be written as

$$\mathbb{E}\!\left[\rho_{t:T-1} G_t\right] = \mathbb{E}\!\left[\tilde{G}_t\right]$$

where

$$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

We call this idea per-decision importance sampling: each reward is corrected only by the ratio of the decisions made up to and including the one that produced it, since the expectation of all the later factors is $1$ under the behavior policy $b$.
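A minimal sketch of computing the per-decision return $\tilde{G}_t$ for a single logged episode (the episode format and the `target_prob` / `behavior_prob` helpers are the same assumptions as in the earlier sketches):

```python
def per_decision_return(episode, t, target_prob, behavior_prob, gamma):
    """Per-decision importance-sampled return G~_t (sketch).

    episode: list of (S_k, A_k, R_{k+1}) triples generated by the behavior policy.
    Each reward R_{k+1} is scaled only by rho_{t:k}, the ratio for the decisions
    made up to and including the one that produced it.
    """
    T = len(episode)
    g_tilde, rho = 0.0, 1.0
    for k in range(t, T):
        s_k, a_k, r_k1 = episode[k]
        rho *= target_prob(a_k, s_k) / behavior_prob(a_k, s_k)   # rho is now rho_{t:k}
        g_tilde += gamma ** (k - t) * rho * r_k1                  # gamma^(k-t) * rho_{t:k} * R_{k+1}
    return g_tilde
```

Averaging these $\tilde{G}_t$ over all visits to $s$, in place of $\rho_{t:T-1} G_t$, gives an ordinary importance-sampling style estimate with the same expectation but typically lower variance, since each reward carries only the ratio factors that can actually affect it.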
Hsieh, Michael. 2020. “Monte Carlo Simulation, Variation Reduction & Advanced Topics.” https://www.columbia.edu/~mh2078/MonteCarlo/MCS_Var_Red_Advanced.pdf.