Actor-Critic Algorithms
Summary
Actor-critic methods learn a value function and a policy simultaneously.
Improving the policy gradient
Can we get a better estimate of the reward-to-go $\hat{Q}_{i,t}$? If we have access to the true expected reward-to-go $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$, the variance will be much lower, because the single-sample estimate is replaced by an expectation.
With baselines:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q^\pi(s_{i,t}, a_{i,t}) - b \right)$$
where $b = \frac{1}{N} \sum_i Q^\pi(s_{i,t}, a_{i,t})$ is the baseline (the average reward).
The variance can be reduced even further by using $V^\pi(s_t)$ as the baseline, where $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$. The weight on the gradient then becomes $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$, which is known as the advantage function.
Actor-critic methods don't necessarily produce unbiased estimates, particularly when the approximate advantage function is incorrect. Usually we're okay with this because the variance is much lower.
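As a side note (a standard derivation, not spelled out in these notes): a state-dependent baseline by itself leaves the gradient unbiased, because its contribution integrates to zero; the bias only appears once an approximate advantage is plugged in.

$$\mathbb{E}_{a_t \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \right] = b(s_t) \int \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, da_t = b(s_t)\, \nabla_\theta 1 = 0$$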
State & State-Action Value Functions
The better the estimate of the advantage, the lower the variance of the policy gradient.
The unbiased, but high-variance, single-sample estimate of the policy gradient is:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \right)$$
Doing better now involves fitting either $Q^\pi(s_t, a_t)$, $V^\pi(s_t)$, or $A^\pi(s_t, a_t)$.
Now, we know that:
$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[ V^\pi(s_{t+1}) \right] \approx r(s_t, a_t) + V^\pi(s_{t+1})$$
so that
$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t).$$
This is how we approximate the advantage function. It also might be easier to learn the value function, since it depends only on the state (the Q-function additionally needs the action as input).
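A minimal sketch of this approximation in code, assuming a fitted value function `v_hat(state) -> float` (the names are illustrative, not from the lecture):

```python
import numpy as np

def td_advantage(rewards, states, next_states, v_hat):
    """Approximate A^pi(s_t, a_t) ~ r_t + V(s_{t+1}) - V(s_t),
    where v_hat(state) -> float is the fitted value function."""
    v_s = np.array([v_hat(s) for s in states])
    v_next = np.array([v_hat(s) for s in next_states])
    return np.asarray(rewards, dtype=float) + v_next - v_s
```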
Policy Evaluation
How can we evaluate a policy? We can use Monte Carlo policy evaluation: $V^\pi(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$, the reward-to-go from a single rollout.
We can also fit a neural network $\hat{V}^\pi_\phi$ which maps from $s_t$ to its estimated value.
Training data is: $\left\{ \left( s_{i,t},\; y_{i,t} \right) \right\}$ with $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$, and we fit $\hat{V}^\pi_\phi$ by supervised regression: $\mathcal{L}(\phi) = \tfrac{1}{2} \sum_i \left\| \hat{V}^\pi_\phi(s_i) - y_i \right\|^2$.
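A small sketch of how the Monte Carlo regression targets could be built from one sampled trajectory (function name is illustrative):

```python
import numpy as np

def monte_carlo_targets(rewards):
    """Reward-to-go targets y_t = sum_{t' >= t} r_{t'} for a single trajectory."""
    y = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        y[t] = running
    return y  # regression pairs are (s_t, y_t)
```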
Can we do better?
So the ideal target would be $y_{i,t} = \mathbb{E}\left[ r(s_{i,t}, a_{i,t}) + V^\pi(s_{i,t+1}) \right] \approx r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$, where we bootstrap from the previous fitted value function.
Training data becomes: $\left\{ \left( s_{i,t},\; r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}) \right) \right\}$.
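The bootstrapped targets replace the Monte Carlo tail with the current value estimate; a sketch under the same assumptions (hypothetical `v_hat`, terminal states handled by a done flag):

```python
import numpy as np

def bootstrapped_targets(rewards, next_states, dones, v_hat):
    """y_t = r_t + V_phi(s_{t+1}), treating V_phi(terminal) as 0."""
    v_next = np.array([0.0 if done else v_hat(s)
                       for s, done in zip(next_states, dones)])
    return np.asarray(rewards, dtype=float) + v_next
```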
From Evaluation to Actor-Critic Methods
Batch actor-critic algorithm (see the code sketch after this list):
- Sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$ (e.g., by running the policy on the robot).
- Fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums.
- Evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \hat{V}^\pi_\phi(s'_i) - \hat{V}^\pi_\phi(s_i)$.
- Compute $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \, \hat{A}^\pi(s_i, a_i)$.
- Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
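A compact PyTorch-flavored sketch of the policy update in one iteration of this loop, assuming the batch has already been collected and the value network fitted; all names are placeholders, and `policy(states)` returning a `torch.distributions` object is my assumption (the discount is omitted here, since it is only introduced just below):

```python
import torch

def batch_actor_critic_update(policy, value_fn, policy_opt,
                              states, actions, rewards, next_states, dones):
    """One policy-gradient step of the batch actor-critic algorithm (no discount)."""
    with torch.no_grad():
        v_s = value_fn(states).squeeze(-1)
        v_next = value_fn(next_states).squeeze(-1) * (1.0 - dones)
        advantages = rewards + v_next - v_s              # A^pi ~ r + V(s') - V(s)

    log_probs = policy(states).log_prob(actions)         # assumes a distribution output
    loss = -(log_probs * advantages).mean()              # minimizing -J is ascent on J

    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()
```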
What if $T \to \infty$? $\hat{V}^\pi_\phi$ can become infinitely large.
Simple trick: it's better to get rewards sooner than later. Use a discount factor $\gamma \in [0, 1]$ (e.g., $\gamma = 0.99$) to discount future rewards, so the target becomes $y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$.
What about (Monte Carlo) policy gradients? There are two options for where to apply the discount.

Option 1 (discount only the reward-to-go):
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{i,t'}, a_{i,t'}) \right)$$

Option 2 (discount from the start, so the gradient terms themselves are weighted by $\gamma^{t-1}$):
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} \, \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t} \, r(s_{i,t'}, a_{i,t'}) \right)$$

Option 2 highlights the importance of taking better actions sooner rather than later. In practice, we often use option 1: we usually want the policy to do well at every time step, not just when measured from the first one, so the discount acts more like a variance-reduction trick than part of the true objective.
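To make the difference concrete, a small sketch of the per-time-step weights each option assigns to the log-probability gradients (illustrative code, not from the lecture):

```python
import numpy as np

def pg_weights_option1(rewards, gamma=0.99):
    """Option 1: weight at time t is the discounted reward-to-go from t onward."""
    w = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        w[t] = running
    return w

def pg_weights_option2(rewards, gamma=0.99):
    """Option 2: additionally multiply by gamma^t, so later steps matter less."""
    return pg_weights_option1(rewards, gamma) * gamma ** np.arange(len(rewards))
```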
Online Actor-Critic Algorithm (a single-step sketch follows the list):
- Take action $a \sim \pi_\theta(a \mid s)$, obtaining a transition $(s, a, s', r)$.
- Update $\hat{V}^\pi_\phi$ using the target $y = r + \gamma \hat{V}^\pi_\phi(s')$.
- Evaluate $\hat{A}^\pi(s, a) = r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)$.
- Compute $\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s) \, \hat{A}^\pi(s, a)$ and update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
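A sketch of one such online update on a single transition, with both the critic and actor steps; all object names (`policy`, `value_fn`, the optimizers) are placeholders, and `policy(s)` is assumed to return a `torch.distributions` object:

```python
import torch
import torch.nn.functional as F

def online_actor_critic_step(policy, value_fn, policy_opt, value_opt,
                             s, a, r, s_next, done, gamma=0.99):
    """Critic update toward r + gamma * V(s'), then an actor update on the advantage."""
    not_done = 1.0 - float(done)

    # critic: regress V(s) toward the bootstrapped target
    with torch.no_grad():
        target = r + gamma * value_fn(s_next) * not_done
    value_loss = F.mse_loss(value_fn(s), target)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # actor: advantage from the updated critic, then one policy-gradient step
    with torch.no_grad():
        advantage = r + gamma * value_fn(s_next) * not_done - value_fn(s)
    policy_loss = -policy(s).log_prob(a) * advantage
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```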
Architecture of online actor-critic algorithms:
- Shared network design for estimating $\pi_\theta(a \mid s)$ and $\hat{V}^\pi_\phi(s)$: one backbone with two output heads, rather than two separate networks (see the sketch below).
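One way the shared design could look in PyTorch; a sketch, not the lecture's exact architecture (the layer sizes and the categorical-policy assumption are mine):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """One shared backbone with two heads: action logits for pi(a|s) and a scalar V(s)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for a categorical policy
        self.value_head = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, obs):
        h = self.backbone(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value
```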
Synchronous Actor-Critic Algorithm
- Take action $a \sim \pi_\theta(a \mid s)$, obtaining $(s, a, s', r)$ for many agents (parallel workers).
- Update $\hat{V}^\pi_\phi$ using the targets $y = r + \gamma \hat{V}^\pi_\phi(s')$ from all workers, synchronizing the parameter update at each time step.
Asynchronous Actor-Critic Algorithm