
Bhavit Sharma

Actor-Critic Algorithms

Summary

Actor-critic methods learn a value function and a policy simultaneously.

Improving the policy gradient

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \underbrace{\sum_{t'=t}^T r\bigl(s_{i,t'}, a_{i,t'}\bigr)}_{\text{``reward to go''}}$$

$$\hat{Q}_{i,t} : \quad \text{estimate of the expected reward if we take action } a_{i,t} \text{ in state } s_{i,t}$$
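
As a quick illustration (not part of the lecture), the "reward to go" term can be computed with a reverse cumulative sum over one sampled trajectory; `rewards` is a stand-in array of per-step rewards:

```python
import numpy as np

def reward_to_go(rewards):
    """Return q_hat where q_hat[t] = sum_{t' >= t} rewards[t']."""
    # Reverse cumulative sum: accumulate rewards from the end of the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

# Example: rewards collected along one trajectory of length T = 4.
rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(reward_to_go(rewards))  # [4. 3. 3. 1.]
```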

Can we get a better estimate of $\hat{Q}_{i,t}$? If we had access to the true expected reward $Q_{i,t}$, the variance would be much lower.

$$Q_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t, a_t \bigr]: \quad \text{true expected reward to go}$$

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot Q_{i,t}$$

With baselines:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot \bigl( Q_{i,t} - b \bigr)$$

where $b$ is the baseline (e.g., the average reward).
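
A tiny NumPy sketch of this constant baseline, using made-up reward-to-go estimates $\hat{Q}_{i,t}$ for two trajectories (all values are illustrative):

```python
import numpy as np

# q_hat[i, t]: reward-to-go estimate at step t of trajectory i (shape [N, T]).
q_hat = np.array([[4.0, 3.0, 3.0, 1.0],
                  [2.0, 2.0, 1.0, 1.0]])

b = q_hat.mean()        # constant baseline: average reward-to-go over the batch
weights = q_hat - b     # (Q_hat - b) multiplies grad log pi(a_{i,t} | s_{i,t})
print(b, weights)
```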

The variance can be reduced even further by using $V(s_t)$ as the baseline, where $V(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\left[Q(s_t, a_t)\right]$. The quantity multiplying the gradient then becomes $Q(s_{i,t}, a_{i,t}) - V(s_{i,t})$, which is known as the advantage function.

Actor-critic methods don't necessarily produce unbiased gradient estimates, particularly if the advantage estimate is incorrect. Usually we're okay with this because the variance is much lower.

$$\begin{align*} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot A(s_{i,t}, a_{i,t}) \\ A(s_{i,t}, a_{i,t}) &= Q(s_{i,t}, a_{i,t}) - V(s_{i,t}) \end{align*}$$

State & State-Action Value Functions

$$Q^{\pi}(s_t, a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t, a_t \bigr]: \quad \text{total reward from taking } a_t \text{ in state } s_t$$

$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(a_t \mid s_t)}\bigl[Q^{\pi}(s_t, a_t)\bigr]: \quad \text{total reward from state } s_t$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t): \quad \text{advantage function}$$

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot A^{\pi}(s_{i,t}, a_{i,t})$$

The better the estimate of $A^{\pi}(s_{i,t}, a_{i,t})$, the lower the variance.

The unbiased but high-variance single-sample estimate of the policy gradient is:

$$\nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \cdot \left( \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b \right)$$

The green box (the "fit a model / estimate the return" step of the generic RL algorithm) now involves fitting either $Q^{\pi}$, $V^{\pi}$, or $A^{\pi}$.

Now, we know that:

$$Q^{\pi}(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta} \bigl[r(s_{t'}, a_{t'}) \mid s_t, a_t\bigr]$$

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_\theta} \bigl[r(s_{t'}, a_{t'}) \mid s_t, a_t\bigr]$$

$$Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t)} \bigl[V^{\pi}(s_{t+1})\bigr]$$

$$Q^{\pi}(s_t, a_t) \approx r(s_t, a_t) + V^{\pi}(s_{t+1})$$

$$A^{\pi}(s_t, a_t) \approx r(s_t, a_t) + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$$
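
A small NumPy sketch of this bootstrapped approximation for a single trajectory; `values` stands in for a fitted $\hat{V}^{\pi}$ evaluated at each visited state, and the value after the last step is assumed to be zero:

```python
import numpy as np

def advantage_estimate(rewards, values, last_value=0.0):
    """A_hat[t] = r_t + V(s_{t+1}) - V(s_t), with V after the final step = last_value."""
    next_values = np.append(values[1:], last_value)   # shift values by one step
    return rewards + next_values - values

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values  = np.array([3.5, 2.8, 2.9, 1.1])   # stand-in V_hat(s_t) predictions
print(advantage_estimate(rewards, values))
```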

This is how we approximate the advantage function. It also might be easier to learn the value function, since it only takes the state as input (unlike the Q function, which also needs the action).

Policy Evaluation

$$V^{\pi}(s_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_{\theta}}\bigl[ r(s_{t'}, a_{t'}) \mid s_t \bigr]: \quad \text{total reward from state } s_t$$

$$J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}\bigl[V^{\pi}(s_1)\bigr]: \quad \text{expected total reward from the start state}$$

How can we evaluate the policy? We can use Monte Carlo policy evaluation.

$$V^{\pi}(s_t) \approx \sum_{t'=t}^T r(s_{t'}, a_{t'}): \quad \text{total reward from state } s_t \text{ along a single trajectory}$$

$$V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}): \quad \text{total reward from state } s_t \text{, averaged over } N \text{ trajectories}$$

We can also fit a neural network that maps $s_t \in \mathbb{R}^n$ to $V^{\pi}(s_t) \in \mathbb{R}$.

The training data is $\{(s_{i,t}, V^{\pi}(s_{i,t}))\}$; using Monte Carlo estimates of $V^{\pi}$, this becomes:

$$\Bigl\{ \Bigl( s_{i,t},\; \underbrace{\sum_{t'=t}^{T} r\bigl(s_{i,t'}, a_{i,t'}\bigr)}_{y_{i,t}} \Bigr) \Bigr\}$$

$$\mathcal{L}(\phi) = \frac{1}{2} \sum_{i} \Bigl\| \hat{V}_{\phi}^{\pi}(s_i) - y_i \Bigr\|^2$$
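
A minimal PyTorch sketch of this supervised regression, with random stand-in states and Monte Carlo targets (the dimensions, network, and learning rate are arbitrary illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# Stand-in data: states s_{i,t} (dim 8) and Monte Carlo targets y_{i,t}.
states  = torch.randn(256, 8)
targets = torch.randn(256)          # y_{i,t} = sampled reward-to-go

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

for _ in range(100):                        # plain regression on (s, y) pairs
    pred = value_net(states).squeeze(-1)    # V_hat_phi(s_i)
    loss = 0.5 * ((pred - targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```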

Can we do better?

$$\text{ideal target: } y_{i,t} = \sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}}\bigl[r(s_{t'}, a_{t'}) \mid s_{i,t}\bigr] \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_{\theta}}\bigl[r(s_{t'}, a_{t'}) \mid s_{i,t+1}\bigr]$$

$$\text{Monte Carlo target: } y_{i,t} = \sum_{t'=t}^{T} r\bigl(s_{i,t'}, a_{i,t'}\bigr)$$

So the ideal target will be:

$$y_{i,t} \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + V^{\pi}(s_{i,t+1}) \approx r\bigl(s_{i,t}, a_{i,t}\bigr) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})$$

Training data becomes:

$$\Bigl\{ \Bigl( s_{i,t},\; \underbrace{r\bigl(s_{i,t}, a_{i,t}\bigr) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})}_{y_{i,t}} \Bigr) \Bigr\}$$
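
As a small illustration (not from the notes), here is how those bootstrapped targets might be built in PyTorch, assuming an already-fitted critic like the stand-in `value_net` below; `rewards` and `next_states` are stand-in tensors of per-step rewards and successor states:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))  # stand-in critic

def bootstrap_targets(rewards, next_states, value_net):
    """y_{i,t} = r(s_{i,t}, a_{i,t}) + V_hat_phi(s_{i,t+1})."""
    with torch.no_grad():                      # targets are treated as constants
        next_values = value_net(next_states).squeeze(-1)
    return rewards + next_values

rewards = torch.rand(256)            # stand-in r(s_{i,t}, a_{i,t})
next_states = torch.randn(256, 8)    # stand-in s_{i,t+1}
targets = bootstrap_targets(rewards, next_states, value_net)
```

These targets would then be plugged into the same regression loss as before for the next round of critic fitting.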

From Evaluation to Actor-Critic Methods

Batch actor-critic algorithm (a code sketch follows the list):

  • Sample $\{s_i, a_i\}$ from $\pi_{\theta}(a \mid s)$ (e.g., by running the policy on the robot).
  • Fit $\hat{V}_{\phi}^{\pi}(s)$ to the sampled reward sums.
  • Evaluate $\hat{A}^{\pi}(s_i, a_i) = r(s_i, a_i) + \hat{V}_{\phi}^{\pi}(s_i') - \hat{V}_{\phi}^{\pi}(s_i)$.
  • Compute $\nabla_{\theta} J(\theta) \approx \sum_{i} \nabla_{\theta} \log \pi_{\theta}(a_i \mid s_i)\,\hat{A}^{\pi}(s_i, a_i)$.
  • Update $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$.
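
A runnable sketch of the batch algorithm above, written against a toy stand-in environment so it is self-contained; the environment, network sizes, learning rates, and batch of 5 trajectories are arbitrary illustrative choices (undiscounted, matching the list):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Toy stand-in environment (random observations, dummy reward) so the sketch
# runs on its own; in practice this is the real environment or robot.
class ToyEnv:
    def __init__(self, obs_dim=4, horizon=20):
        self.obs_dim, self.horizon = obs_dim, horizon
    def reset(self):
        self.t = 0
        return torch.randn(self.obs_dim)
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0          # dummy reward signal
        return torch.randn(self.obs_dim), reward, self.t >= self.horizon

def reward_to_go(rewards):
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

obs_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
env = ToyEnv(obs_dim)

for iteration in range(50):
    # 1. Sample {s_i, a_i} by running the current policy.
    obs_l, act_l, rew_l, next_l, done_l, y_l = [], [], [], [], [], []
    for _ in range(5):                                # 5 trajectories per batch
        s, done, traj_rews = env.reset(), False, []
        while not done:
            a = Categorical(logits=policy(s)).sample()
            s_next, r, done = env.step(a.item())
            obs_l.append(s); act_l.append(a); next_l.append(s_next)
            done_l.append(float(done)); traj_rews.append(r)
            s = s_next
        rew_l.extend(traj_rews)
        y_l.extend(reward_to_go(traj_rews))           # Monte Carlo reward sums
    obs, acts, nexts = torch.stack(obs_l), torch.stack(act_l), torch.stack(next_l)
    rews, dones, y = torch.tensor(rew_l), torch.tensor(done_l), torch.tensor(y_l)

    # 2. Fit V_hat_phi(s) to the sampled reward sums.
    for _ in range(20):
        v_loss = 0.5 * ((critic(obs).squeeze(-1) - y) ** 2).mean()
        v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 3. Evaluate A_hat(s, a) = r(s, a) + V_hat(s') - V_hat(s), with V(s') = 0 at the end.
    with torch.no_grad():
        adv = rews + (1.0 - dones) * critic(nexts).squeeze(-1) - critic(obs).squeeze(-1)

    # 4.-5. Policy gradient step on sum_i log pi(a_i | s_i) * A_hat(s_i, a_i).
    logp = Categorical(logits=policy(obs)).log_prob(acts)
    pi_loss = -(logp * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```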

What if $T \rightarrow \infty$?

Simple trick: it is better to get rewards sooner rather than later, so use $\gamma$ to discount future rewards.

$$y_{i,t} = r\bigl(s_{i,t}, a_{i,t}\bigr) + \gamma \hat{V}_{\phi}^{\pi}(s_{i,t+1})$$

What about (Monte Carlo) policy gradients?

$$\textbf{option 1:} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-t}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2:} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \right) \left( \sum_{t'=1}^T \gamma^{t'-1}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2 (alternate form):} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-1}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

$$\textbf{option 2 (alternate form):} \quad \nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \gamma^{t-1}\, \nabla_{\theta} \log \pi_{\theta}\bigl(a_{i,t} \mid s_{i,t}\bigr) \left( \sum_{t'=t}^T \gamma^{t'-t}\, r\bigl(s_{i,t'}, a_{i,t'}\bigr) \right)$$

Option 2 highlights the importance of taking better actions sooner rather than later. In practice, we often use option 1 (I didn't fully understand why from the lecture; roughly, in a continuing task we want the policy to act well at every time step, not just the early ones, so the extra $\gamma^{t-1}$ weight on later gradient terms is dropped).
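
A small NumPy sketch contrasting the two weightings on a single made-up trajectory; `option1` is the causal discounted reward-to-go, while `option2` (last alternate form above) additionally scales the whole term at time $t$ by $\gamma^{t-1}$:

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma):
    """Option 1 weights: sum_{t' >= t} gamma^(t'-t) * r_{t'} for each t."""
    out, running = np.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards, gamma = np.array([1.0, 0.0, 2.0, 1.0]), 0.9
T = len(rewards)

option1 = discounted_reward_to_go(rewards, gamma)
# Option 2: the whole gradient term at time t is further scaled by gamma^(t-1),
# down-weighting contributions from later time steps.
option2 = gamma ** np.arange(T) * option1
print(option1, option2)
```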

Online Actor-Critic Algorithm:

  1. Take action $a \sim \pi_{\theta}(a \mid s)$, obtaining the transition $(s, a, s', r)$.
  2. Update $\hat{V}_{\phi}^{\pi}(s)$ using the target $r + \gamma \hat{V}_{\phi}^{\pi}(s')$.
  3. Evaluate $\hat{A}^{\pi}(s, a) = r(s, a) + \gamma \hat{V}_{\phi}^{\pi}(s') - \hat{V}_{\phi}^{\pi}(s)$.
  4. Compute $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}(a \mid s)\,\hat{A}^{\pi}(s, a)$.
  5. Update $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$ (see the sketch after this list).
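
A minimal PyTorch sketch of steps 2-5 for a single transition $(s, a, s', r)$; the networks, sizes, and the made-up transition are illustrative stand-ins:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions, gamma, alpha = 4, 2, 0.99, 3e-4
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=alpha)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def online_update(s, a, s_next, r):
    # 2. Update V_hat_phi(s) towards the target r + gamma * V_hat_phi(s').
    with torch.no_grad():
        target = r + gamma * critic(s_next).squeeze(-1)
    v_loss = 0.5 * (critic(s).squeeze(-1) - target) ** 2
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # 3. Evaluate A_hat(s, a) = r + gamma * V_hat(s') - V_hat(s).
    with torch.no_grad():
        adv = r + gamma * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)

    # 4.-5. Single-sample policy gradient step.
    logp = Categorical(logits=policy(s)).log_prob(a)
    pi_opt.zero_grad(); (-logp * adv).backward(); pi_opt.step()

# Example with a made-up transition (step 1 would come from the environment).
s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
a = Categorical(logits=policy(s)).sample()
online_update(s, a, s_next, r=1.0)
```
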

Architecture of online actor-critic algorithms

  • A shared network design can be used to estimate both $\pi_{\theta}(a \mid s)$ and $\hat{V}_{\phi}^{\pi}(s)$ (one possible design is sketched below).
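
One possible shared-trunk design (sizes and activations are arbitrary): a common feature extractor feeds both a policy head and a value head.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One network body with two heads: pi_theta(a|s) logits and V_hat_phi(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)   # action logits
        self.value_head = nn.Linear(hidden, 1)            # state value

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)

net = SharedActorCritic(obs_dim=4, n_actions=2)
logits, value = net(torch.randn(4))
```

Sharing the trunk lets the actor and critic reuse features, at the cost of their gradients interfering with each other.
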

Synchronous Actor-Critic Algorithm

  1. Take action $a \sim \pi_{\theta}(a \mid s)$ in each of the parallel workers, obtaining one transition $(s, a, s', r)$ per worker.
  2. Update $\hat{V}_{\phi}^{\pi}(s)$ using the targets $r + \gamma \hat{V}_{\phi}^{\pi}(s')$, synchronizing the workers at each time step $t$ (see the batched sketch below).
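
A compact sketch of one synchronized update, where $K$ workers each contribute the transition they observed at the same time step and the shared parameters are updated once on the whole batch; all data and sizes here are stand-ins:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

K, obs_dim, n_actions, gamma = 8, 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s = torch.randn(K, obs_dim)                    # one state per worker at time t
a = Categorical(logits=policy(s)).sample()     # one action per worker
s_next, r = torch.randn(K, obs_dim), torch.rand(K)   # stand-in transitions

# Critic update on the synchronized batch of targets r + gamma * V_hat(s').
with torch.no_grad():
    target = r + gamma * critic(s_next).squeeze(-1)
v_loss = 0.5 * ((critic(s).squeeze(-1) - target) ** 2).mean()
v_opt.zero_grad(); v_loss.backward(); v_opt.step()

# Actor update with the batched advantage estimates.
with torch.no_grad():
    adv = r + gamma * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)
logp = Categorical(logits=policy(s)).log_prob(a)
pi_opt.zero_grad(); (-(logp * adv).mean()).backward(); pi_opt.step()
```
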

Asynchronous Actor-Critic Algorithm
