
Bhavit Sharma

Monte Carlo Methods

Importance Sampling

We define the importance sampling ratio for a trajectory $\tau$ under behavior policy $b$ and target policy $\pi$ as:

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

Now, we want to show that $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t] = v_\pi(S_t)$, where the expectation is a summation over all possible trajectories.

So, we have:

$$
\begin{align*}
\mathbb{E}_b[\rho_{t:T-1} G_t] &= \sum_{\tau} \rho_{t:T-1}\, G_t\, P_{b}(\tau) \\
&= \sum_{\tau} \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\, G_t\, P_{b}(\tau)
\end{align*}
$$

We know that, conditioning on $S_t$, the probability of the remainder of the trajectory under $b$ is $P_{b}(\tau) = \prod_{k=t}^{T-1} b(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k)$. Substituting $P_{b}(\tau)$ in the above equation, the $b(A_k \mid S_k)$ factors cancel:

$$
\begin{align*}
\mathbb{E}_b[\rho_{t:T-1} G_t] &= \sum_{\tau} G_t \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k) \\
&= \sum_{\tau} G_t\, P_{\pi}(\tau) \\
&= \mathbb{E}_{\pi}[G_t \mid S_t] = v_{\pi}(S_t).
\end{align*}
$$
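As a sanity check, here is a minimal sketch (not from the original derivation) that verifies the identity numerically on a made-up one-step episodic problem: actions are sampled from the behavior policy $b$ and the returns are reweighted by $\rho$, and the average matches $v_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up one-step episodic problem (illustration only): a single state,
# two actions; action 0 yields reward 1, action 1 yields reward 0.
pi = np.array([0.9, 0.1])       # target policy
b  = np.array([0.5, 0.5])       # behavior policy
rewards = np.array([1.0, 0.0])

v_pi = float(pi @ rewards)      # true value under pi: sum_a pi(a) r(a)

# Off-policy estimate: sample actions from b, reweight returns by rho = pi/b.
n = 100_000
A = rng.choice(2, size=n, p=b)
estimate = np.mean((pi[A] / b[A]) * rewards[A])

print(f"v_pi = {v_pi:.3f}, ordinary IS estimate = {estimate:.3f}")
```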

For more on importance sampling, particularly in the context of variance reduction, see (Hsieh 2020).

Discounting-aware Importance Sampling

These are methods used to significantly reduce the variance of off-policy estimators.

Why is the variance of ordinary importance sampling high? Consider the case where $\gamma = 0$ and the first reward is nonzero. For concreteness, let's say that episodes last 100 steps. Then the return from time $0$ is just $R_1$, but the importance sampling ratio that scales it is a product of 100 factors: $\rho_{0:99} = \dfrac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)} \dfrac{\pi(A_1 \mid S_1)}{b(A_1 \mid S_1)} \cdots \dfrac{\pi(A_{99} \mid S_{99})}{b(A_{99} \mid S_{99})}$.

The return is scaled by all 100 factors, but only the first one, $\dfrac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)}$, actually matters: all the other factors correspond to decisions made after the reward $R_1$ is received and only add noise. This is where the variance comes from: $\rho_{0:99}$ can be extremely large or extremely small.
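The following sketch (with made-up policies and a single dummy state, so it only illustrates the scaling issue) compares the estimator that uses the full product $\rho_{0:99}$ with one that keeps only the first factor; both are unbiased in expectation, but the sample mean of the full-product version is swamped by its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: 100-step episodes, gamma = 0, one dummy state,
# two actions; only the first action determines the single reward R_1.
T = 100
pi = np.array([0.9, 0.1])       # target policy
b  = np.array([0.5, 0.5])       # behavior policy
rewards = np.array([1.0, 0.0])  # R_1 as a function of A_0

n_episodes = 50_000
actions = rng.choice(2, size=(n_episodes, T), p=b)   # every action drawn from b
ratios = pi[actions] / b[actions]                    # per-step pi/b factors

R1 = rewards[actions[:, 0]]
full = ratios.prod(axis=1) * R1      # scaled by rho_{0:99} (all 100 factors)
first = ratios[:, 0] * R1            # scaled by the one factor that matters

print("true v_pi(s_0):", float(pi @ rewards))
print("full-product estimate:", full.mean(), " sample variance:", full.var())
print("first-factor estimate:", first.mean(), " sample variance:", first.var())
# Both are unbiased in theory, but the full-product sample mean is unreliable
# at any feasible sample size because its variance is astronomically larger.
```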

Let's define the flat (undiscounted) partial return $\overline{G}_{t:h} \doteq \sum_{k=t+1}^{h} R_k$. We can then show that:

$$
\begin{align*}
G_t &\doteq R_{t+1} + \gamma\,R_{t+2} + \gamma^2\,R_{t+3} + \cdots + \gamma^{T-t-1}R_T\\
&= (1-\gamma)\,R_{t+1} + (1-\gamma)\gamma\bigl(R_{t+1} + R_{t+2}\bigr) + (1-\gamma)\gamma^2\bigl(R_{t+1} + R_{t+2} + R_{t+3}\bigr)\\
&\quad + \dots\\
&\quad + (1-\gamma)\gamma^{T-t-2}\bigl(R_{t+1} + R_{t+2} + \cdots + R_{T-1}\bigr) + \gamma^{T-t-1}\bigl(R_{t+1} + R_{t+2} + \cdots + R_T\bigr)\\
&= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\,\overline{G}_{t:h} \;+\;\gamma^{T-t-1}\,\overline{G}_{t:T}.
\end{align*}
$$
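A quick numeric check of this decomposition (a throwaway sketch with random rewards and $t = 0$; the index shift `R[k]` $= R_{t+1+k}$ is an implementation convention, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

gamma = 0.9
T = 8
R = rng.normal(size=T)          # R[k] plays the role of R_{t+1+k}, with t = 0

# Left-hand side: the ordinary discounted return G_t.
lhs = sum(gamma**k * R[k] for k in range(T))

# Flat partial returns: G_bar[h-1] = R_{t+1} + ... + R_{t+h}.
G_bar = np.cumsum(R)

# Right-hand side of the decomposition above.
rhs = (1 - gamma) * sum(gamma**h * G_bar[h] for h in range(T - 1)) \
      + gamma**(T - 1) * G_bar[T - 1]

print(lhs, rhs)                 # the two agree up to floating-point error
```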

Thus we arrive at the estimator below. Notice that each partial return $\overline{G}_{t:h}$ is weighted by the truncated ratio $\rho_{t:h-1}$ rather than the full $\rho_{t:T-1}$.

$$
V(s) \doteq \frac{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1}\,\overline{G}_{t:h} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\,\overline{G}_{t:T(t)}\Bigr]}{\bigl|\mathcal{T}(s)\bigr|}
\tag{5.9}
$$

and the weighted importance-sampling estimator is:

$$
V(s) \doteq \frac{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1}\,\overline{G}_{t:h} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\,\overline{G}_{t:T(t)}\Bigr]}{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\Bigr]}
\tag{5.10}
$$
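Here is a minimal implementation sketch of both estimators, under the simplifying assumption that every episode starts in the state $s$ of interest (so only $t = 0$ contributes) and with hypothetical `pi(a, s)` / `b(a, s)` probability callables:

```python
import numpy as np

def discounting_aware_estimates(episodes, pi, b, gamma):
    """Sketch of the discounting-aware ordinary (5.9) and weighted (5.10) estimators.

    Each episode is a list of (S_k, A_k, R_{k+1}) tuples. For simplicity this
    only evaluates the start state (t = 0 of every episode), i.e. it assumes
    every episode begins in the state s of interest. `pi(a, s)` and `b(a, s)`
    are hypothetical callables returning action probabilities.
    """
    num = 0.0      # numerator shared by (5.9) and (5.10)
    den_wis = 0.0  # denominator of the weighted estimator (5.10)
    for ep in episodes:
        T = len(ep)
        rhos = np.array([pi(a, s) / b(a, s) for (s, a, _) in ep])
        cum_rho = np.cumprod(rhos)                    # cum_rho[h-1] = rho_{0:h-1}
        flat = np.cumsum([r for (_, _, r) in ep])     # flat[h-1]   = G_bar_{0:h}

        contrib = gamma ** (T - 1) * cum_rho[-1] * flat[-1]   # terminal term
        weight = gamma ** (T - 1) * cum_rho[-1]
        for h in range(1, T):                                 # h = 1, ..., T-1
            contrib += (1 - gamma) * gamma ** (h - 1) * cum_rho[h - 1] * flat[h - 1]
            weight += (1 - gamma) * gamma ** (h - 1) * cum_rho[h - 1]

        num += contrib
        den_wis += weight

    ordinary = num / len(episodes)                    # (5.9): divide by |T(s)|
    weighted = num / den_wis if den_wis > 0 else 0.0  # (5.10)
    return ordinary, weighted
```

Note that with $\gamma = 1$ the $(1-\gamma)$ terms vanish and only the full-episode term survives, so both expressions collapse back to the ordinary and weighted importance-sampling estimators built on $\rho_{t:T-1} G_t$.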

Per-decision Importance Sampling

We know that

$$
\begin{align*}
\rho_{t:T-1} G_t &= \rho_{t:T-1}\bigl(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\bigr) \\
&= \rho_{t:T-1} R_{t+1} + \gamma\,\rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1}\,\rho_{t:T-1} R_T
\end{align*}
$$

We can also see that:

$$
\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})} R_{t+1}
$$

Only the first factor and the reward $R_{t+1}$ are correlated; all the other factors correspond to decisions made after the reward is received, and the expectation of each of them is $1$ because the actions are drawn from the behavior policy $b$:

$$
\begin{aligned}
\mathbb{E}\Bigl[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\Bigr] &= \sum_{a \in \mathbb{A}} b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} \\
&= \sum_{a \in \mathbb{A}} \pi(a \mid S_k) \\
&= 1 \quad \text{(we sum over all actions weighted by $b$, because $A_k$ is drawn from $b$).}
\end{aligned}
$$
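A two-line numerical confirmation (with an arbitrary made-up three-action $\pi$ and $b$; any $b$ with full support works):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up 3-action example; any pi and b with b(a|s) > 0 everywhere works.
pi = np.array([0.7, 0.2, 0.1])
b  = np.array([0.3, 0.3, 0.4])

A = rng.choice(3, size=200_000, p=b)   # actions sampled from the behavior policy b
print(np.mean(pi[A] / b[A]))           # approximately 1.0
```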

We can then show $\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}]$. If we repeat the process for the remaining subterms, we can show that $\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}]$.

It follows then that the expectation of our original term $\rho_{t:T-1} G_t$ can be written as

$$
\mathbb{E}\bigl[\rho_{t:T-1} G_t\bigr] = \mathbb{E}\bigl[\widetilde{G}_t\bigr],
$$

where

$$
\widetilde{G}_t = \rho_{t:t}\,R_{t+1} \;+\; \gamma\,\rho_{t:t+1}\,R_{t+2} \;+\; \gamma^2\,\rho_{t:t+2}\,R_{t+3} \;+\; \dots \;+\; \gamma^{\,T-t-1}\,\rho_{t:T-1}\,R_T.
$$

We call this idea per-decision importance sampling. Each reward $R_{t+k}$ is weighted only by the ratios for the decisions that precede it; the factors for later decisions have expectation $1$ under the behavior policy $b$, so dropping them leaves the expectation unchanged while reducing variance.
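A small sketch of computing $\widetilde{G}_t$ for one trajectory (the array layout, with `rewards[k]` standing for $R_{t+1+k}$ and `rhos[k]` for the ratio at step $t+k$, is my own convention):

```python
import numpy as np

def per_decision_return(rewards, rhos, gamma):
    """Per-decision importance-sampled return G_tilde_t for one trajectory.

    rewards[k] stands for R_{t+1+k} and rhos[k] for pi(A_{t+k}|S_{t+k}) / b(A_{t+k}|S_{t+k}),
    so each reward is weighted only by the ratios of the decisions made before it.
    """
    cum_rho = np.cumprod(rhos)                   # cum_rho[k] = rho_{t:t+k}
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * cum_rho * rewards))

# For comparison, ordinary IS weights every reward by the full ratio rho_{t:T-1}:
#   cum_rho[-1] * np.sum(discounts * rewards)
```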

Hsieh, Michael. 2020. “Monte Carlo Simulation, Variation Reduction & Advanced Topics.” https://www.columbia.edu/~mh2078/MonteCarlo/MCS_Var_Red_Advanced.pdf.
