
Bhavit Sharma

Actor-Critic Algorithms

Summary

Actor-critic methods learn a value function and a policy simultaneously.

Improving the policy gradient

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \underbrace{\sum_{t'=t}^T r\bigl(s_{i,t'}, a_{i,t'}\bigr)}_{\text{``reward to go''}}$$

$$\hat{Q}_{i,t} : \quad \text{estimate of the expected reward if we take action } a_{i,t} \text{ in state } s_{i,t}$$
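
As a quick illustration (not part of the lecture), the "reward to go" term can be computed with a reverse cumulative sum over one sampled trajectory; `rewards` is a stand-in array of per-step rewards:

```python
import numpy as np

def reward_to_go(rewards):
    """Return q_hat where q_hat[t] = sum_{t' >= t} rewards[t']."""
    # Reverse cumulative sum: accumulate rewards from the end of the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards collected along one trajectory of length T = 4.
rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(reward_to_go(rewards))  # [4. 3. 3. 1.]
```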

Can we get a better estimate of $\hat{Q}_{i,t}$? If we had access to the true expected reward $Q_{i,t}$, the variance would be much lower.

$$Q_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t, a_t \bigr]: \quad \text{true expected reward to go}$$

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot Q_{i,t}$$

With baselines:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot \bigl( Q_{i,t} - b \bigr)$$

where $b$ is the baseline (e.g., the average reward).
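
A tiny NumPy sketch of this constant baseline, using made-up reward-to-go estimates $\hat{Q}_{i,t}$ for two trajectories (all values are illustrative):

```python
import numpy as np

# q_hat[i, t]: reward-to-go estimate at step t of trajectory i (shape [N, T]).
q_hat = np.array([[4.0, 3.0, 3.0, 1.0],
                  [2.0, 2.0, 1.0, 1.0]])

b = q_hat.mean()        # constant baseline: average reward-to-go over the batch
weights = q_hat - b     # (Q_hat - b) multiplies grad log pi(a_{i,t} | s_{i,t})
print(b, weights)
```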

The variance can be reduced even further by using $V(s_t)$ as the baseline, where $V(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[Q(s_t, a_t)\right]$. The quantity multiplying the gradient then becomes $Q(s_{i,t}, a_{i,t}) - V(s_{i,t})$, which is known as the advantage function.

Actor-critic methods don't necessarily produce unbiased gradient estimates, particularly if the advantage estimate is incorrect. Usually we're okay with this because the variance is much lower.

$$\begin{align*} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot A(s_{i,t}, a_{i,t}) \\ A(s_{i,t}, a_{i,t}) &= Q(s_{i,t}, a_{i,t}) - V(s_{i,t}) \end{align*}$$

State & State-Action Value Functions

$$Q^{\pi}(s_t, a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t, a_t \bigr]: \quad \text{total reward from taking } a_t \text{ in state } s_t$$

$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\bigl[Q^{\pi}(s_t, a_t)\bigr]: \quad \text{total reward from state } s_t$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t): \quad \text{advantage function}$$

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot A^{\pi}(s_{i,t}, a_{i,t})$$

The better the estimate of $A^{\pi}(s_{i,t}, a_{i,t})$, the lower the variance.

The unbiased but high-variance single-sample estimate of the policy gradient is:

$$\nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b \right)$$

The green box (the "fit a model / estimate the return" step of the generic RL algorithm) now involves fitting either $Q^{\pi}$, $V^{\pi}$, or $A^{\pi}$.

Now, we know that:

$$Q^{\pi}(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta} \bigl[r(s_{t'}, a_{t'}) \mid s_t, a_t\bigr]$$

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_\theta} \bigl[r(s_{t'}, a_{t'}) \mid s_t, a_t\bigr]$$

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \bigl[V^{\pi}(s_{t+1})\bigr]$$

$$Q^{\pi}(s_t, a_t) \approx r(s_t, a_t) + V^{\pi}(s_{t+1})$$

$$A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$$
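
A small NumPy sketch of this bootstrapped approximation for a single trajectory; `values` stands in for a fitted $\hat{V}^{\pi}$ evaluated at each visited state, and the value after the last step is assumed to be zero:

```python
import numpy as np

def advantage_estimate(rewards, values, last_value=0.0):
    """A_hat[t] = r_t + V(s_{t+1}) - V(s_t), with V after the final step = last_value."""
    next_values = np.append(values[1:], last_value)   # shift values by one step
    return rewards + next_values - values

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([3.5, 2.8, 2.9, 1.1])   # stand-in V_hat(s_t) predictions
print(advantage_estimate(rewards, values))
```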

This is how we approximate the advantage function. It also might be easier to learn the value function, since it only takes the state as input (unlike the Q function, which also needs the action).

Policy Evaluation

$$V^{\pi}(s_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t \bigr]: \quad \text{total reward from state } s_t$$

$$J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}\bigl[V^{\pi}(s_1)\bigr]: \quad \text{expected total reward from the start state}$$

How can we evaluate the policy? We can use Monte Carlo policy evaluation.

$$V^{\pi}(s_t) \approx \sum_{t'=t}^T r(s_{t'}, a_{t'}): \quad \text{total reward from state } s_t \text{ along a single trajectory}$$

$$V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}): \quad \text{total reward from state } s_t \text{, averaged over } N \text{ trajectories}$$

We can also fit a neural network that maps $s_t \in \mathbb{R}^n$ to $V^{\pi}(s_t) \in \mathbb{R}$.

The training data is $\{(s_{i,t}, V^{\pi}(s_{i,t}))\}$; using Monte Carlo estimates of $V^{\pi}$, this becomes:

$$\Bigl\{ \Bigl( s_{i,t},\; \underbrace{\sum_{t'=t}^{T} r\bigl(s_{i,t'}, a_{i,t'}\bigr)}_{y_{i,t}} \Bigr) \Bigr\}$$

$$\mathcal{L}(\phi) = \frac{1}{2} \sum_{i} \Bigl\| \hat{V}_{\phi}^{\pi}(s_i) - y_i \Bigr\|^2$$
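
A minimal PyTorch sketch of this supervised regression, with random stand-in states and Monte Carlo targets (the dimensions, network, and learning rate are arbitrary illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# Stand-in data: states s_{i,t} (dim 8) and Monte Carlo targets y_{i,t}.
states  = torch.randn(256, 8)
targets = torch.randn(256)          # y_{i,t} = sampled reward-to-go

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

for _ in range(100):                        # plain regression on (s, y) pairs
    pred = value_net(states).squeeze(-1)    # V_hat_phi(s_i)
    loss = 0.5 * ((pred - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```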

Can we do better?

$$\text{ideal target: } y_{i,t} = \sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}}\bigl[r(s_{t'}, a_{t'}) \mid s_{i,t}\bigr] \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_{\theta}}\bigl[r(s_{t'}, a_{t'}) \mid s_{i,t+1}\bigr]$$

$$\text{Monte Carlo target: } y_{i,t} = \sum_{t'=t}^{T} r\bigl(s_{i,t'}, a_{i,t'}\bigr)$$

So the ideal target will be:

$$y_{i,t} \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + V^{\pi}(s_{i,t+1}) \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})$$

Training data becomes:

$$\Bigl\{ \Bigl( s_{i,t},\; \underbrace{r\bigl(s_{i,t}, a_{i,t}\bigr) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})}_{y_{i,t}} \Bigr) \Bigr\}$$
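
As a small illustration (not from the notes), here is how those bootstrapped targets might be built in PyTorch, assuming an already-fitted critic like the stand-in `value_net` below; `rewards` and `next_states` are stand-in tensors of per-step rewards and successor states:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))  # stand-in critic

def bootstrap_targets(rewards, next_states, value_net):
    """y_{i,t} = r(s_{i,t}, a_{i,t}) + V_hat_phi(s_{i,t+1})."""
    with torch.no_grad():                      # targets are treated as constants
        next_values = value_net(next_states).squeeze(-1)
    return rewards + next_values

rewards = torch.rand(256)            # stand-in r(s_{i,t}, a_{i,t})
next_states = torch.randn(256, 8)    # stand-in s_{i,t+1}
targets = bootstrap_targets(rewards, next_states, value_net)
```

These targets would then be plugged into the same regression loss as before for the next round of critic fitting.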

From Evaluation to Actor-Critic Methods

Batch actor-critic algorithm (a code sketch follows the list):

  • Sample $\{s_i, a_i\}$ from $\pi_{\theta}(a \mid s)$ (e.g., by running the policy on the robot).
  • Fit $\hat{V}_{\phi}^{\pi}(s)$ to the sampled reward sums.
  • Evaluate $\hat{A}^{\pi}(s_i, a_i) = r(s_i, a_i) + \hat{V}_{\phi}^{\pi}(s_i') - \hat{V}_{\phi}^{\pi}(s_i)$.
  • Compute $\nabla_{\theta} J(\theta) \approx \sum_{i} \nabla_{\theta} \log \pi_{\theta}(a_i \mid s_i)\,\hat{A}^{\pi}(s_i, a_i)$.
  • Update $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$.
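
A runnable sketch of the batch algorithm above, written against a toy stand-in environment so it is self-contained; the environment, network sizes, learning rates, and batch of 5 trajectories are arbitrary illustrative choices (undiscounted, matching the list):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Toy stand-in environment (random observations, dummy reward) so the sketch
# runs on its own; in practice this is the real environment or robot.
class ToyEnv:
    def __init__(self, obs_dim=4, horizon=20):
        self.obs_dim, self.horizon = obs_dim, horizon
    def reset(self):
        self.t = 0
        return torch.randn(self.obs_dim)
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0          # dummy reward signal
        return torch.randn(self.obs_dim), reward, self.t >= self.horizon

def reward_to_go(rewards):
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
env = ToyEnv(obs_dim)

for iteration in range(50):
    # 1. Sample {s_i, a_i} by running the current policy.
    obs_l, act_l, rew_l, next_l, done_l, y_l = [], [], [], [], [], []
    for _ in range(5):                                # 5 trajectories per batch
        s, done, traj_rews = env.reset(), False, []
        while not done:
            a = Categorical(logits=policy(s)).sample()
            s_next, r, done = env.step(a.item())
            obs_l.append(s); act_l.append(a); next_l.append(s_next)
            done_l.append(float(done)); traj_rews.append(r)
            s = s_next
        rew_l.extend(traj_rews)
        y_l.extend(reward_to_go(traj_rews))           # Monte Carlo reward sums
    obs, acts, nexts = torch.stack(obs_l), torch.stack(act_l), torch.stack(next_l)
    rews, dones, y = torch.tensor(rew_l), torch.tensor(done_l), torch.tensor(y_l)

    # 2. Fit V_hat_phi(s) to the sampled reward sums.
    for _ in range(20):
        v_loss = 0.5 * ((critic(obs).squeeze(-1) - y) ** 2).mean()
        v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 3. Evaluate A_hat(s, a) = r(s, a) + V_hat(s') - V_hat(s), with V(s') = 0 at the end.
    with torch.no_grad():
        adv = rews + (1.0 - dones) * critic(nexts).squeeze(-1) - critic(obs).squeeze(-1)

    # 4.-5. Policy gradient step on sum_i log pi(a_i | s_i) * A_hat(s_i, a_i).
    logp = Categorical(logits=policy(obs)).log_prob(acts)
    pi_loss = -(logp * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```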

What if $T \rightarrow \infty$?

Simple trick: it is better to get rewards sooner rather than later, so use $\gamma$ to discount future rewards.

$$y_{i,t} = r\bigl(s_{i,t}, a_{i,t}\bigr) + \gamma \hat{V}_{\phi}^{\pi}(s_{i,t+1})$$

What about (Monte Carlo) policy gradients?

$$\textbf{option 1:} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-t}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2:} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \right) \left( \sum_{t'=1}^T \gamma^{t'-1}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2 (alternate form):} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-1}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2 (alternate form):} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \gamma^{t-1}\, \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-t}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

Option 2 highlights the importance of taking better actions sooner rather than later. In practice, we often use option 1 (I didn't fully understand why from the lecture; roughly, in a continuing task we want the policy to act well at every time step, not just the early ones, so the extra $\gamma^{t-1}$ weight on later gradient terms is dropped).
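
A small NumPy sketch contrasting the two weightings on a single made-up trajectory; `option1` is the causal discounted reward-to-go, while `option2` (last alternate form above) additionally scales the whole term at time $t$ by $\gamma^{t-1}$:

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma):
    """Option 1 weights: sum_{t' >= t} gamma^(t'-t) * r_{t'} for each t."""
    out, running = np.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards, gamma = np.array([1.0, 0.0, 2.0, 1.0]), 0.9
T = len(rewards)

option1 = discounted_reward_to_go(rewards, gamma)
# Option 2: the whole gradient term at time t is further scaled by gamma^(t-1),
# down-weighting contributions from later time steps.
option2 = gamma ** np.arange(T) * option1
print(option1, option2)
```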

Online Actor-Critic Algorithm:

  1. Take action $a \sim \pi_{\theta}(a \mid s)$, obtaining the transition $(s, a, s', r)$.
  2. Update $\hat{V}_{\phi}^{\pi}(s)$ using the target $r + \gamma \hat{V}_{\phi}^{\pi}(s')$.
  3. Evaluate $\hat{A}^{\pi}(s, a) = r(s, a) + \gamma \hat{V}_{\phi}^{\pi}(s') - \hat{V}_{\phi}^{\pi}(s)$.
  4. Compute $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(a \mid s)\,\hat{A}^{\pi}(s, a)$.
  5. Update $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$ (see the sketch after this list).
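
A minimal PyTorch sketch of steps 2-5 for a single transition $(s, a, s', r)$; the networks, sizes, and the made-up transition are illustrative stand-ins:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma, alpha = 4, 2, 0.99, 3e-4
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=alpha)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def online_update(s, a, s_next, r):
    # 2. Update V_hat_phi(s) towards the target r + gamma * V_hat_phi(s').
    with torch.no_grad():
        target = r + gamma * critic(s_next).squeeze(-1)
    v_loss = 0.5 * (critic(s).squeeze(-1) - target) ** 2
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 3. Evaluate A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s).
    with torch.no_grad():
        adv = r + gamma * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)

    # 4.-5. Single-sample policy gradient step.
    logp = Categorical(logits=policy(s)).log_prob(a)
    pi_opt.zero_grad(); (-logp * adv).backward(); pi_opt.step()

# Example with a made-up transition (step 1 would come from the environment).
s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
a = Categorical(logits=policy(s)).sample()
online_update(s, a, s_next, r=1.0)
```
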

Architecture of online actor-critic algorithms

  • A shared network design can be used to estimate both $\pi_{\theta}(a \mid s)$ and $\hat{V}_{\phi}^{\pi}(s)$ (one possible design is sketched below).
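
One possible shared-trunk design (sizes and activations are arbitrary): a common feature extractor feeds both a policy head and a value head.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One network body with two heads: pi_theta(a|s) logits and V_hat_phi(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)   # action logits
        self.value_head = nn.Linear(hidden, 1)            # state value

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)

net = SharedActorCritic(obs_dim=4, n_actions=2)
logits, value = net(torch.randn(4))
```

Sharing the trunk lets the actor and critic reuse features, at the cost of their gradients interfering with each other.
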

Synchronous Actor-Critic Algorithm

  1. Take action $a \sim \pi_{\theta}(a \mid s)$ in each of the parallel workers, obtaining one transition $(s, a, s', r)$ per worker.
  2. Update $\hat{V}_{\phi}^{\pi}(s)$ using the targets $r + \gamma \hat{V}_{\phi}^{\pi}(s')$, synchronizing the workers at each time step $t$ (see the batched sketch below).
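
A compact sketch of one synchronized update, where $K$ workers each contribute the transition they observed at the same time step and the shared parameters are updated once on the whole batch; all data and sizes here are stand-ins:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

K, obs_dim, n_actions, gamma = 8, 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s = torch.randn(K, obs_dim)                    # one state per worker at time t
a = Categorical(logits=policy(s)).sample()     # one action per worker
s_next, r = torch.randn(K, obs_dim), torch.rand(K)   # stand-in transitions

# Critic update on the synchronized batch of targets r + gamma * V_hat(s').
with torch.no_grad():
    target = r + gamma * critic(s_next).squeeze(-1)
v_loss = 0.5 * ((critic(s).squeeze(-1) - target) ** 2).mean()
v_opt.zero_grad(); v_loss.backward(); v_opt.step()

# Actor update with the batched advantage estimates.
with torch.no_grad():
    adv = r + gamma * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)
logp = Categorical(logits=policy(s)).log_prob(a)
pi_opt.zero_grad(); (-(logp * adv).mean()).backward(); pi_opt.step()
```
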

Asynchronous Actor-Critic Algorithm
