
Bhavit Sharma

Monte Carlo Methods

Importance Sampling

We define the importance sampling ratio for a trajectory $\tau$ under behavior policy $b$ and target policy $\pi$ as:

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

Now, we want to show that $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t] = v_\pi(S_t)$, where the expectation is a summation over all possible trajectories.

So, we have:

$$
\begin{align*}
\mathbb{E}_b[\rho_{t:T-1} G_t] &= \sum_{\tau} \rho_{t:T-1}\, G_t\, P_{b}(\tau) \\
&= \sum_{\tau} \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\, G_t\, P_{b}(\tau)
\end{align*}
$$

We know that, conditioning on $S_t$, the probability of the remainder of the trajectory under $b$ is $P_{b}(\tau) = \prod_{k=t}^{T-1} b(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k)$. Substituting $P_{b}(\tau)$ in the above equation, the $b(A_k \mid S_k)$ factors cancel:

$$
\begin{align*}
\mathbb{E}_b[\rho_{t:T-1} G_t] &= \sum_{\tau} G_t \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k) \\
&= \sum_{\tau} G_t\, P_{\pi}(\tau) \\
&= \mathbb{E}_{\pi}[G_t \mid S_t] = v_{\pi}(S_t).
\end{align*}
$$
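As a sanity check, here is a minimal sketch (not from the original derivation) that verifies the identity numerically on a made-up one-step episodic problem: actions are sampled from the behavior policy $b$ and the returns are reweighted by $\rho$, and the average matches $v_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up one-step episodic problem (illustration only): a single state,
# two actions; action 0 yields reward 1, action 1 yields reward 0.
pi = np.array([0.9, 0.1])       # target policy
b  = np.array([0.5, 0.5])       # behavior policy
rewards = np.array([1.0, 0.0])

v_pi = float(pi @ rewards)      # true value under pi: sum_a pi(a) r(a)

# Off-policy estimate: sample actions from b, reweight returns by rho = pi/b.
n = 100_000
A = rng.choice(2, size=n, p=b)
estimate = np.mean((pi[A] / b[A]) * rewards[A])

print(f"v_pi = {v_pi:.3f}, ordinary IS estimate = {estimate:.3f}")
```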

For more on importance sampling, particularly in the context of variance reduction, see (Hsieh 2020).

Discounting-aware Importance Sampling

These are methods used to significantly reduce the variance of off-policy estimators.

Why is the variance of ordinary importance sampling high? Consider the case where $\gamma = 0$ and the first reward is nonzero. For concreteness, let's say that episodes last 100 steps. Then the return from time $0$ is just $R_1$, but the importance sampling ratio that scales it is a product of 100 factors: $\rho_{0:99} = \dfrac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)} \dfrac{\pi(A_1 \mid S_1)}{b(A_1 \mid S_1)} \cdots \dfrac{\pi(A_{99} \mid S_{99})}{b(A_{99} \mid S_{99})}$.

The return is scaled by all 100 factors, but only the first one, $\dfrac{\pi(A_0 \mid S_0)}{b(A_0 \mid S_0)}$, actually matters: all the other factors correspond to decisions made after the reward $R_1$ is received and only add noise. This is where the variance comes from: $\rho_{0:99}$ can be extremely large or extremely small.
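The following sketch (with made-up policies and a single dummy state, so it only illustrates the scaling issue) compares the estimator that uses the full product $\rho_{0:99}$ with one that keeps only the first factor; both are unbiased in expectation, but the sample mean of the full-product version is swamped by its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: 100-step episodes, gamma = 0, one dummy state,
# two actions; only the first action determines the single reward R_1.
T = 100
pi = np.array([0.9, 0.1])       # target policy
b  = np.array([0.5, 0.5])       # behavior policy
rewards = np.array([1.0, 0.0])  # R_1 as a function of A_0

n_episodes = 50_000
actions = rng.choice(2, size=(n_episodes, T), p=b)   # every action drawn from b
ratios = pi[actions] / b[actions]                    # per-step pi/b factors

R1 = rewards[actions[:, 0]]
full = ratios.prod(axis=1) * R1      # scaled by rho_{0:99} (all 100 factors)
first = ratios[:, 0] * R1            # scaled by the one factor that matters

print("true v_pi(s_0):", float(pi @ rewards))
print("full-product estimate:", full.mean(), " sample variance:", full.var())
print("first-factor estimate:", first.mean(), " sample variance:", first.var())
# Both are unbiased in theory, but the full-product sample mean is unreliable
# at any feasible sample size because its variance is astronomically larger.
```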

Let's define the flat (undiscounted) partial return $\overline{G}_{t:h} \doteq \sum_{k=t+1}^{h} R_k$. We can then show that:

$$
\begin{align*}
G_t &\doteq R_{t+1} + \gamma\,R_{t+2} + \gamma^2\,R_{t+3} + \cdots + \gamma^{T-t-1}R_T\\
&= (1-\gamma)\,R_{t+1} + (1-\gamma)\gamma\bigl(R_{t+1} + R_{t+2}\bigr) + (1-\gamma)\gamma^2\bigl(R_{t+1} + R_{t+2} + R_{t+3}\bigr)\\
&\quad + \dots\\
&\quad + (1-\gamma)\gamma^{T-t-2}\bigl(R_{t+1} + R_{t+2} + \cdots + R_{T-1}\bigr) + \gamma^{T-t-1}\bigl(R_{t+1} + R_{t+2} + \cdots + R_T\bigr)\\
&= (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{\,h-t-1}\,\overline{G}_{t:h} \;+\;\gamma^{T-t-1}\,\overline{G}_{t:T}.
\end{align*}
$$
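A quick numeric check of this decomposition (a throwaway sketch with random rewards and $t = 0$; the index shift `R[k]` $= R_{t+1+k}$ is an implementation convention, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)

gamma = 0.9
T = 8
R = rng.normal(size=T)          # R[k] plays the role of R_{t+1+k}, with t = 0

# Left-hand side: the ordinary discounted return G_t.
lhs = sum(gamma**k * R[k] for k in range(T))

# Flat partial returns: G_bar[h-1] = R_{t+1} + ... + R_{t+h}.
G_bar = np.cumsum(R)

# Right-hand side of the decomposition above.
rhs = (1 - gamma) * sum(gamma**h * G_bar[h] for h in range(T - 1)) \
      + gamma**(T - 1) * G_bar[T - 1]

print(lhs, rhs)                 # the two agree up to floating-point error
```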

Thus we arrive at the estimator below. Notice that each partial return $\overline{G}_{t:h}$ is weighted by the truncated ratio $\rho_{t:h-1}$ rather than the full $\rho_{t:T-1}$.

$$
V(s) \doteq \frac{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1}\,\overline{G}_{t:h} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\,\overline{G}_{t:T(t)}\Bigr]}{\bigl|\mathcal{T}(s)\bigr|}
\tag{5.9}
$$

and the weighted importance-sampling estimator is:

$$
V(s) \doteq \frac{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1}\,\overline{G}_{t:h} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\,\overline{G}_{t:T(t)}\Bigr]}{\displaystyle \sum_{t \in \mathcal{T}(s)} \Bigl[(1-\gamma)\!\sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1}\,\rho_{t:h-1} \;+\; \gamma^{T(t)-t-1}\,\rho_{t:T(t)-1}\Bigr]}
\tag{5.10}
$$
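Here is a minimal implementation sketch of both estimators, under the simplifying assumption that every episode starts in the state $s$ of interest (so only $t = 0$ contributes) and with hypothetical `pi(a, s)` / `b(a, s)` probability callables:

```python
import numpy as np

def discounting_aware_estimates(episodes, pi, b, gamma):
    """Sketch of the discounting-aware ordinary (5.9) and weighted (5.10) estimators.

    Each episode is a list of (S_k, A_k, R_{k+1}) tuples. For simplicity this
    only evaluates the start state (t = 0 of every episode), i.e. it assumes
    every episode begins in the state s of interest. `pi(a, s)` and `b(a, s)`
    are hypothetical callables returning action probabilities.
    """
    num = 0.0      # numerator shared by (5.9) and (5.10)
    den_wis = 0.0  # denominator of the weighted estimator (5.10)
    for ep in episodes:
        T = len(ep)
        rhos = np.array([pi(a, s) / b(a, s) for (s, a, _) in ep])
        cum_rho = np.cumprod(rhos)                    # cum_rho[h-1] = rho_{0:h-1}
        flat = np.cumsum([r for (_, _, r) in ep])     # flat[h-1]   = G_bar_{0:h}

        contrib = gamma ** (T - 1) * cum_rho[-1] * flat[-1]   # terminal term
        weight = gamma ** (T - 1) * cum_rho[-1]
        for h in range(1, T):                                 # h = 1, ..., T-1
            contrib += (1 - gamma) * gamma ** (h - 1) * cum_rho[h - 1] * flat[h - 1]
            weight += (1 - gamma) * gamma ** (h - 1) * cum_rho[h - 1]

        num += contrib
        den_wis += weight

    ordinary = num / len(episodes)                    # (5.9): divide by |T(s)|
    weighted = num / den_wis if den_wis > 0 else 0.0  # (5.10)
    return ordinary, weighted
```

Note that with $\gamma = 1$ the $(1-\gamma)$ terms vanish and only the full-episode term survives, so both expressions collapse back to the ordinary and weighted importance-sampling estimators built on $\rho_{t:T-1} G_t$.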

Per-decision Importance Sampling

We know that

$$
\begin{align*}
\rho_{t:T-1} G_t &= \rho_{t:T-1}\bigl(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\bigr) \\
&= \rho_{t:T-1} R_{t+1} + \gamma\,\rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1}\,\rho_{t:T-1} R_T
\end{align*}
$$

We can also see that:

$$
\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})} R_{t+1}
$$

Only the first factor and the reward $R_{t+1}$ are correlated; all the other factors correspond to decisions made after the reward is received, and the expectation of each of them is $1$ because the actions are drawn from the behavior policy $b$:

$$
\begin{aligned}
\mathbb{E}\Bigl[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\Bigr] &= \sum_{a \in \mathbb{A}} b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} \\
&= \sum_{a \in \mathbb{A}} \pi(a \mid S_k) \\
&= 1 \quad \text{(we sum over all actions weighted by $b$, because $A_k$ is drawn from $b$).}
\end{aligned}
$$
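A two-line numerical confirmation (with an arbitrary made-up three-action $\pi$ and $b$; any $b$ with full support works):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up 3-action example; any pi and b with b(a|s) > 0 everywhere works.
pi = np.array([0.7, 0.2, 0.1])
b  = np.array([0.3, 0.3, 0.4])

A = rng.choice(3, size=200_000, p=b)   # actions sampled from the behavior policy b
print(np.mean(pi[A] / b[A]))           # approximately 1.0
```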

We can then show $\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}]$. If we repeat the process for the remaining subterms, we can show that $\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}]$.

It follows then that the expectation of our original term $\rho_{t:T-1} G_t$ can be written as

$$
\mathbb{E}\bigl[\rho_{t:T-1} G_t\bigr] = \mathbb{E}\bigl[\widetilde{G}_t\bigr],
$$

where

$$
\widetilde{G}_t = \rho_{t:t}\,R_{t+1} \;+\; \gamma\,\rho_{t:t+1}\,R_{t+2} \;+\; \gamma^2\,\rho_{t:t+2}\,R_{t+3} \;+\; \dots \;+\; \gamma^{\,T-t-1}\,\rho_{t:T-1}\,R_T.
$$

We call this idea per-decision importance sampling. Each reward $R_{t+k}$ is weighted only by the ratios for the decisions that precede it; the factors for later decisions have expectation $1$ under the behavior policy $b$, so dropping them leaves the expectation unchanged while reducing variance.
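A small sketch of computing $\widetilde{G}_t$ for one trajectory (the array layout, with `rewards[k]` standing for $R_{t+1+k}$ and `rhos[k]` for the ratio at step $t+k$, is my own convention):

```python
import numpy as np

def per_decision_return(rewards, rhos, gamma):
    """Per-decision importance-sampled return G_tilde_t for one trajectory.

    rewards[k] stands for R_{t+1+k} and rhos[k] for pi(A_{t+k}|S_{t+k}) / b(A_{t+k}|S_{t+k}),
    so each reward is weighted only by the ratios of the decisions made before it.
    """
    cum_rho = np.cumprod(rhos)                   # cum_rho[k] = rho_{t:t+k}
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * cum_rho * rewards))

# For comparison, ordinary IS weights every reward by the full ratio rho_{t:T-1}:
#   cum_rho[-1] * np.sum(discounts * rewards)
```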

Hsieh, Michael. 2020. “Monte Carlo Simulation, Variation Reduction & Advanced Topics.” https://www.columbia.edu/~mh2078/MonteCarlo/MCS_Var_Red_Advanced.pdf.
