Deep RL 5: Actor-Critic
Actor-critic algorithms build on the policy gradient framework discussed in the previous lecture, but augment it with learned value functions and Q-functions. The goal of this augmentation is still variance reduction, but approached from a slightly different angle.
Let's take a look at the original policy gradient and its Monte Carlo approximator:

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \vert s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right] \approx \frac{1}{N}\sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t})\right)\left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)$$

Each term in this approximator relies on a single sampled trajectory, so the estimate has high variance. Even with the use of causality and a baseline, which gives

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)$$

with

$$b = \frac{1}{N}\sum_{i=1}^N r(\tau_i),$$

the "reward to go" $$\hat{Q}_{i,t} = \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$$ is still a single-sample estimate with high variance. Actor-critic algorithms aim at better estimating the reward to go.
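As a concrete reference point, here is a minimal sketch (plain Python, with hypothetical variable names) of the Monte Carlo reward-to-go computation that the estimator above relies on:

```python
def rewards_to_go(rewards):
    """Monte Carlo reward to go: Q_hat[t] = sum of the rewards from step t to the end."""
    q_hat = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

# e.g. rewards_to_go([1.0, 0.0, 2.0]) -> [3.0, 2.0, 2.0]
```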
1 Fit the Value Function
We start by recalling the goal of reinforcement learning:

$$\theta^\star = \arg\max_\theta E_{\tau \sim \pi_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$

Now define the Q-function:

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^T E_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$$

The Q-function is exactly the expected reward to go from step $$t$$ if we take action $$a_t$$ in state $$s_t$$ and follow the policy afterwards.

How about the baseline $$b$$? A natural, state-dependent choice is the value function:

$$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \vert s_t)}\left[Q^\pi(s_t, a_t)\right]$$

The value function measures how good the state is (i.e. the value of the state): it is exactly the expected reward to go from state $$s_t$$, averaged over the actions drawn from the policy. In addition, we define the advantage

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

In fact, $$A^\pi(s_t, a_t)$$ measures how much better taking action $$a_t$$ is than the policy's average action at state $$s_t$$.
If we have the Q-function and value function, we can plug them into the original policy gradient and get the ideal estimator:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$

However, we do not have the true $$Q^\pi$$ and $$V^\pi$$. So we want to fit neural networks to approximate them. Note that

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \vert s_t, a_t)}\left[V^\pi(s_{t+1})\right]$$

And in practice we use a one-sample estimate to approximate the expectation over $$s_{t+1}$$:

$$Q^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1})$$

Therefore, we only need to fit the value function $$\hat{V}^\pi_\phi$$ with parameters $$\phi$$. So, we use training data

$$\left\{\left(s_{i,t},\, y_{i,t}\right)\right\}$$

where $$y_{i,t} = \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$$ is the Monte Carlo reward to go (or the bootstrapped target $$y_{i,t} = r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$), and fit $$\hat{V}^\pi_\phi$$ by minimizing the regression loss

$$L(\phi) = \frac{1}{2}\sum_{i,t} \left\lVert \hat{V}^\pi_\phi(s_{i,t}) - y_{i,t} \right\rVert^2$$
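This is ordinary supervised regression. Below is a minimal sketch in PyTorch, assuming a hypothetical critic network `value_net` mapping states to scalar values and precomputed targets (Monte Carlo or bootstrapped); the state dimension of 4 is an arbitrary placeholder:

```python
import torch
import torch.nn as nn

# Hypothetical critic: a small MLP representing V_phi(s); 4 is a placeholder state dimension.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(states, targets, epochs=50):
    """Supervised regression of V_phi onto the targets y_{i,t}."""
    states = torch.as_tensor(states, dtype=torch.float32)
    targets = torch.as_tensor(targets, dtype=torch.float32)
    for _ in range(epochs):
        value_opt.zero_grad()
        v_pred = value_net(states).squeeze(-1)
        loss = 0.5 * ((v_pred - targets) ** 2).mean()  # the regression loss L(phi) above
        loss.backward()
        value_opt.step()
```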
With the fitted value function $$\hat{V}^\pi_\phi$$ we can approximate the Q-function,

$$Q^\pi(s_t, a_t) \approx r(s_t, a_t) + \hat{V}^\pi_\phi(s_{t+1}),$$

and therefore the advantage:

$$\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$

And our actor-critic policy gradient is

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t})\, \hat{A}^\pi(s_{i,t}, a_{i,t})$$
The batch actor-critic algorithm is:
- run the current policy $$\pi_\theta$$ and collect trajectories $$\{\tau_i\}$$ with their rewards
- fit the value function $$\hat{V}^\pi_\phi$$ by minimizing the regression loss above
- calculate the advantage of each state-action pair: $$\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \hat{V}^\pi_\phi(s_i') - \hat{V}^\pi_\phi(s_i)$$
- calculate the actor-critic policy gradient $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_\theta \log \pi_\theta(a_i \vert s_i)\, \hat{A}^\pi(s_i, a_i)$$
- gradient update: $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
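A minimal sketch of one iteration of this loop in PyTorch (undiscounted, as written above; the discount factor comes in the next section). It assumes a hypothetical `policy_net` that outputs action logits, the `value_net` from the previous sketch already fitted for the current batch, and transitions collected by the current policy:

```python
import torch
from torch.distributions import Categorical

def batch_actor_critic_update(policy_net, policy_opt, value_net,
                              states, actions, rewards, next_states, dones):
    """One policy update with bootstrapped advantages; value_net is assumed already fitted."""
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)  # 1.0 at the last step of a trajectory

    with torch.no_grad():  # treat the critic as fixed when computing advantages
        v_s = value_net(states).squeeze(-1)
        v_next = value_net(next_states).squeeze(-1)
        advantages = rewards + (1.0 - dones) * v_next - v_s  # A_hat = r + V(s') - V(s)

    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    loss = -(log_probs * advantages).mean()  # negative of the policy gradient objective

    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```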
We call it batch in that for each policy update we collect a batch of trajectories. We can also update the policy (and the value function) using only one step of data, i.e. a single transition $$(s, a, s', r)$$; this is the online actor-critic algorithm introduced in section 3.
2 Discount Factor
Our previous discussion of policy gradient and actor-critic algorithms has all been within the finite-horizon or episodic learning scenario, where there is an ending time step $$T$$. What if the horizon is infinite, i.e. there is no ending time step?

Well, in that case the original algorithm can run into problems, because at the second step (fitting $$\hat{V}^\pi_\phi$$) the targets, being undiscounted sums of rewards, can become infinitely large in many cases.

To remedy that, we introduce the discount factor $$\gamma \in (0, 1)$$ and change the value-fitting target to

$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$$

Therefore, the policy gradient and actor-critic policy gradient are

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\right)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \vert s_{i,t}) \left(r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$

where in both cases the discount exponentially down-weights rewards that are far from the current time step. Usually we set $$\gamma$$ to a value close to 1, e.g. $$\gamma = 0.99$$.
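For intuition, suppose the reward is a constant $$1$$ at every step: the undiscounted infinite-horizon return diverges, while the discounted return is $$\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma} = 100$$ for $$\gamma = 0.99$$, so the regression targets for $$\hat{V}^\pi_\phi$$ stay bounded.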
3 Online Actor-Critic Algorithms
So far we've been discussing the batch actor-critic algorithm, where for each gradient update we need to run the policy to collect a batch of trajectories. In this section we introduce online actor-critic algorithms, which allow faster updates of the network weights and, with some additional techniques, can work better than the batch algorithm in some cases.
The simplest version of the online actor-critic algorithm is similar to online learning: instead of calculating the gradient from a batch of trajectories and rewards, it uses only one transition tuple $$(s, a, s', r)$$:
- run the policy $$\pi_\theta$$ for one time step and collect $$(s, a, s', r)$$
- update $$\hat{V}^\pi_\phi$$ using the target pair $$(s,\ r + \gamma \hat{V}^\pi_\phi(s'))$$
- evaluate the advantage $$\hat{A}^\pi(s, a) = r + \gamma \hat{V}^\pi_\phi(s') - \hat{V}^\pi_\phi(s)$$
- calculate the policy gradient $$\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \vert s)\, \hat{A}^\pi(s, a)$$
- gradient update: $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
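A minimal sketch of this one-step update in PyTorch, with the same hypothetical `policy_net` and `value_net` as in the earlier sketches (a single unbatched state is passed in):

```python
import torch
from torch.distributions import Categorical

def online_actor_critic_step(policy_net, policy_opt, value_net, value_opt,
                             s, a, r, s_next, done, gamma=0.99):
    """One actor-critic update from a single transition (s, a, s', r)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic update towards the bootstrapped target y = r + gamma * V(s').
    with torch.no_grad():
        target = r + gamma * (1.0 - float(done)) * value_net(s_next).squeeze(-1)
    v_loss = 0.5 * (value_net(s).squeeze(-1) - target) ** 2
    value_opt.zero_grad()
    v_loss.backward()
    value_opt.step()

    # Actor update with the one-sample advantage A_hat = y - V(s).
    with torch.no_grad():
        advantage = target - value_net(s).squeeze(-1)
    log_prob = Categorical(logits=policy_net(s)).log_prob(torch.as_tensor(a))
    pi_loss = -log_prob * advantage
    policy_opt.zero_grad()
    pi_loss.backward()
    policy_opt.step()
```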
However, this algorithm does not really work in most cases: the one-sample estimate has very high variance, and coupled with the policy gradient the variance becomes notoriously high. To deal with the high variance problem, we introduce the synchronized parallel actor-critic algorithm, which is basically several agents running the basic online actor-critic algorithm while using and updating a shared policy and value network.

This can be realized very easily by just changing the random seeds of the code.
Another variant, which has proved to work very well when we have a very large pool of workers, is the asynchronous parallel actor-critic algorithm:

Each worker (agent) sends its one-step transition data to the central learner, which updates the parameters (both $$\theta$$ and $$\phi$$); the worker then pulls the latest parameters and continues, without waiting for the other workers.
4 Variance/Bias Tradeoff in Estimating the Advantage
In this section we go back to the actor-critic gradient. What distinguishes it from the vanilla policy gradient is that it uses the fitted advantage $$\hat{A}^\pi(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$ instead of the Monte Carlo reward to go: the fitted advantage has much lower variance, but it is biased whenever $$\hat{V}^\pi_\phi$$ is imperfect, whereas the Monte Carlo estimate is unbiased but has high variance.
A question would be: can we bring the best from the two worlds and get an unbiased estimate of the advantage while keeping the variance low? Or further, can we develop a mechanism that allows us to trade off the variance and bias in estimating the advantage?
The answer is yes, and the rest of this section introduces three advantage estimators that give different variance-bias tradeoffs.
Recall that the original advantage estimator in the actor-critic algorithm is:

$$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$
4.1 Critic as baseline (state-dependent baseline)
The first advantage estimator uses the Monte Carlo reward to go with the critic as a state-dependent baseline:

$$\hat{A}^\pi(s_t, a_t) = \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$$

Compared to the actor-critic estimator $$\hat{A}^\pi_C$$ above, this estimator is unbiased, but it has higher variance because of the single-sample estimate of the reward to go.
We can actually show that any state-dependent baseline in the policy gradient leads to an unbiased gradient estimator, i.e. we want to prove

$$E_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \vert s_t)\, b(s_t)\right] = 0$$

Let's take one element from the summation:

$$E_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(a_t \vert s_t)\, b(s_t)\right] = E_{s_t}\left[b(s_t)\, E_{a_t \sim \pi_\theta(a_t \vert s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \vert s_t)\right]\right] = E_{s_t}\left[b(s_t) \int \pi_\theta(a_t \vert s_t)\, \nabla_\theta \log \pi_\theta(a_t \vert s_t)\, d a_t\right] = E_{s_t}\left[b(s_t)\, \nabla_\theta \int \pi_\theta(a_t \vert s_t)\, d a_t\right] = E_{s_t}\left[b(s_t)\, \nabla_\theta 1\right] = 0$$
4.2 State-action dependent baseline
To be updated; the material is in Gu et al. (2016).
4.3 Generalized Advantage Estimation (GAE)
Lastly, let's compare the advantage estimator introduced in section 4.1 (let's call it the Monte Carlo estimator $$\hat{A}^\pi_{MC}$$) with the original actor-critic estimator $$\hat{A}^\pi_C$$:

$$\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$$

$$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$$

Stare at these two estimators for a while and you might notice that the essential part that decides the variance-bias tradeoff is the estimation of the reward to go: $$\hat{A}^\pi_{MC}$$ sums actual rewards all the way to the end (unbiased, high variance), while $$\hat{A}^\pi_C$$ cuts the sum off after one step and bootstraps with $$\hat{V}^\pi_\phi$$ (biased, low variance). A natural middle ground is the $$n$$-step estimator, which sums actual rewards for $$n$$ steps before bootstrapping:

$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^n \hat{V}^\pi_\phi(s_{t+n}) - \hat{V}^\pi_\phi(s_t)$$
This also makes sense intuitively: the more distant a reward is from the current time step, the more variance it contributes, so cutting the sum off earlier lowers the variance. On the other hand, although the bootstrapped term $$\gamma^n \hat{V}^\pi_\phi(s_{t+n})$$ is biased, its contribution shrinks as $$n$$ grows, so a larger $$n$$ means less bias and more variance; choosing $$n$$ trades off the two.
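A minimal sketch of the $$n$$-step estimator for a single trajectory, in plain Python; `values` is assumed to hold $$\hat{V}^\pi_\phi$$ at every state of the trajectory plus one trailing entry for the state after the last reward (0 if terminal):

```python
def n_step_advantages(rewards, values, n, gamma=0.99):
    """A_hat_n[t] = sum_{k<n} gamma^k r[t+k] + gamma^n V[t+n] - V[t], truncated at episode end.

    `values` has len(rewards) + 1 entries; the last one is V of the state after the
    final reward (0 for a terminal state).
    """
    T = len(rewards)
    advantages = []
    for t in range(T):
        horizon = min(n, T - t)  # fewer than n steps remain near the end of the trajectory
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        ret += gamma ** horizon * values[t + horizon]
        advantages.append(ret - values[t])
    return advantages
```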
Rather than committing to a single $$n$$, generalized advantage estimation takes a weighted combination of all $$n$$-step estimators,

$$\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{n=1}^{\infty} w_n \hat{A}^\pi_n(s_t, a_t)$$

where the weights decay exponentially, $$w_n \propto \lambda^{n-1}$$ with $$\lambda \in [0, 1]$$.
It can be shown that this weighted combination simplifies to

$$\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma\lambda)^{t'-t}\, \delta_{t'}$$

where $$\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$$ is the one-step TD error at time $$t'$$. The parameter $$\lambda$$ controls the variance-bias tradeoff: $$\lambda = 0$$ recovers the one-step actor-critic estimator $$\hat{A}^\pi_C$$ (low variance, high bias), while $$\lambda = 1$$ recovers the Monte Carlo estimator $$\hat{A}^\pi_{MC}$$ (high variance, low bias).
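A minimal sketch of GAE computed through the TD-error form above (plain Python, same conventions for `values` as in the $$n$$-step sketch):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: A_hat[t] = sum_k (gamma*lam)^k * delta[t+k]."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```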