
Actor-critic algorithms build on the policy gradient framework that we discussed in the previous lecture, but augment it with learned value functions and Q-functions. The goal of this augmentation is still reducing variance, but from a slightly different angle.

Let’s take a look at the original policy gradient and its Monte Carlo approximator:

$$
\begin{align}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \tag{1}\\
&\approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\right)\left(\sum_{t=1}^{T} r_t^i\right) \tag{2}
\end{align}
$$

In equation 2, $N$ sample trajectories are used to approximate the expectation $\mathbb{E}_{\tau\sim p_\theta(\tau)}$, but there is still one quantity being approximated, namely the reward function $r(s_t^i, a_t^i)$, and unfortunately we are only using one sample $r_t^i$ to approximate it.

Even with the use of causality and a baseline, which gives

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(\sum_{t'=t}^{T} r_{t'}^i - b\right) \tag{3}
$$

with $b = \frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T} r_{t'}^i$, we are only improving the variance of the expectation over trajectories ($\mathbb{E}_{\tau\sim p_\theta(\tau)}$), not the estimate of the reward to go (or better-than-average reward to go).
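To make equation 3 concrete, here is a minimal NumPy sketch (not from the lecture) of computing the baselined reward to go from an $N \times T$ array of sampled rewards; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def reward_to_go_with_baseline(rewards):
    """Compute the per-timestep quantity (sum_{t'=t}^T r_{t'}^i - b) from equation 3.

    rewards: array of shape (N, T) holding r_t^i for N sampled trajectories.
    Returns an (N, T) array of baselined rewards to go.
    """
    # Reward to go: cumulative sum of rewards from t to T for each trajectory.
    rtg = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
    # Baseline b: average reward to go over the N trajectories, one value per t.
    b = rtg.mean(axis=0, keepdims=True)
    return rtg - b
```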

Actor-critic algorithms aim at better estimating the reward to go.

1 Fit the Value Function

We start by recalling the goal of reinforcement learning:

$$
\arg\max_\theta \; \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \tag{4}
$$

Now define the Q-function:

$$
\begin{align}
Q^\pi(s_t, a_t) &= \mathbb{E}_{p_\theta}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right] \tag{5}\\
&= r(s_t, a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t),\; a_{t+1}\sim\pi_\theta(a_{t+1}\mid s_{t+1})}\left[Q^\pi(s_{t+1}, a_{t+1})\right] \tag{6}
\end{align}
$$

The Q-function is exactly the expected reward to go from step $t$, given the state and action $(s_t, a_t)$.

How about the baseline $b$ in equation 3? We can also replace it with a lower variance estimate. To do that, we define the value function:

$$
\begin{align}
V^\pi(s_t) &= \mathbb{E}_{p_\theta}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t\right] \tag{7}\\
&= \mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left\{ r(s_t, a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)}\left[V^\pi(s_{t+1})\right] \right\} \tag{8}
\end{align}
$$

The value function measures how good a state is (i.e. the value of the state). It is exactly the expected reward to go from that state, averaged over the different actions.

In addition, we define the advantage

$$
A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) \tag{9}
$$

In fact, $\sum_{t'=t}^{T} r_{t'}^i - \frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T} r_{t'}^i$ in equation 3 looks very much like a one sample estimate of the advantage $A^\pi(s_t, a_t)$. Indeed, $\sum_{t'=t}^{T} r_{t'}^i$ is a one sample estimate of the Q-function $Q^\pi(s_t, a_t)$; however, $\frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T} r_{t'}^i$ is not an estimate of $V^\pi(s_t)$. (Actually, if the state space is continuous, we only ever have a one sample estimate of $V^\pi(s_t)$, which is again $\sum_{t'=t}^{T} r_{t'}^i$, the same as the one sample estimate of $Q^\pi(s_t, a_t)$; we only have one sample because we will never visit the exact same state again.) But $V^\pi(s_t)$ is intuitively better than $\frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T} r_{t'}^i$ even if we ignore variance, because the former is the expected reward to go from the state $s_t$, while the latter is an estimate of the expected reward to go at time step $t$ averaged over all possible states. Since we want to know how good the action is in the current state, rather than at the current time step, we prefer the former.

If we had the Q-function and the value function, we could plug them into the original policy gradient and get the ideal estimator:

$$
\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(Q^\pi(s_t^i, a_t^i) - V^\pi(s_t^i)\right)
$$

However, we do not have $Q^\pi(s_t, a_t)$ or $V^\pi(s_t)$, so we need to estimate them. Instead of using a Monte Carlo estimate, we use function approximation, which might lead to a biased estimate but gives an enormous variance reduction; in practice, the latter usually brings more benefit than the harm caused by the former.

So do we need to fit two neural networks to approximate $Q^\pi(s_t, a_t)$ and $V^\pi(s_t)$ separately? It’s actually not necessary if we notice the relationship between the two functions:

$$
Q^\pi(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)}\left[V^\pi(s_{t+1})\right] \tag{10}
$$

In practice, we use a one sample estimate to approximate the expectation $\mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)}$, which gives

$$
Q^\pi(s_t^i, a_t^i) \approx r_t^i + V^\pi(s_{t+1}^i) \tag{11}
$$

Therefore, we only need to fit $V^\pi$. For training data, since $V^\pi(s_t) = \mathbb{E}_{p_\theta}\left[\sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t\right]$, ideally for every state $s_t$ we would collect a bunch of different rollouts and rewards starting from that state, and then use the sample mean of the rewards to go as the estimate of $V^\pi(s_t)$. But this requires resetting the simulator to arbitrary states and is impossible in the real world. Therefore, we settle for the one sample estimate $\sum_{t'=t}^{T} r_{t'}$.

So, we use training data $\{(s_t^i, \sum_{t'=t}^{T} r_{t'}^i)\}_{i=1,t=1}^{N,T}$ to train a neural network $V^\pi_\phi$ in a supervised way. But we can further reduce the variance even though we only have a one sample estimate of the expected reward: we can again apply the function approximation idea and replace $\sum_{t'=t}^{T} r_{t'}^i$ with $r_t^i + V^\pi_{\phi'}(s_{t+1}^i)$ in the training data, where $V^\pi_{\phi'}$ is the previously fitted value function (i.e. $\phi'$ is one gradient step before $\phi$). $r_t^i + V^\pi_{\phi'}(s_{t+1}^i)$ is called the bootstrapped estimate of the value function. In summary, we fit $V^\pi_\phi$ to $V^\pi$ by minimizing

$$
\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left\| V^\pi_\phi(s_t^i) - y_t^i \right\|^2 \tag{12}
$$

where $y_t^i = \sum_{t'=t}^{T} r_{t'}^i$ or $y_t^i = r_t^i + V^\pi_{\phi'}(s_{t+1}^i)$, and the latter usually works better.
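As an illustration, here is a minimal PyTorch sketch (an assumption, not the lecture’s reference code) of the supervised fit in equation 12 with bootstrapped targets; `obs_dim`, the two-layer architecture, and the hyperparameters are placeholder choices:

```python
import torch
import torch.nn as nn

obs_dim = 4  # placeholder observation dimension (assumption)

# Illustrative value network V_phi; the architecture is arbitrary.
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
value_optim = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(states, rewards, next_states, num_steps=50):
    """Fit V_phi by regressing onto bootstrapped targets y = r + V_phi'(s') (equation 12)."""
    for _ in range(num_steps):
        with torch.no_grad():
            # The target uses the previous parameters phi' (held fixed inside no_grad).
            targets = rewards + value_net(next_states).squeeze(-1)
        values = value_net(states).squeeze(-1)
        loss = ((values - targets) ** 2).mean()
        value_optim.zero_grad()
        loss.backward()
        value_optim.step()
```

With the discount factor introduced in section 2, the bootstrap term would be multiplied by $\gamma$, and terminal states would get a zero bootstrap.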

With the fitted value function $V^\pi_\phi$, we can estimate the Q-function (Q-value) by

$$
\hat{Q}^\pi(s_t^i, a_t^i) \approx r_t^i + V^\pi_\phi(s_{t+1}^i)
$$

and therefore the advantage:

$$
\hat{A}^\pi(s_t^i, a_t^i) = r_t^i + V^\pi_\phi(s_{t+1}^i) - V^\pi_\phi(s_t^i)
$$

And our actor-critic policy gradient is

$$
\begin{align}
\nabla_\theta J(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(Q^\pi(s_t^i, a_t^i) - V^\pi(s_t^i)\right) \tag{13}\\
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(r(s_t^i, a_t^i) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t^i, a_t^i)}\left[V^\pi(s_{t+1})\right] - V^\pi(s_t^i)\right) \tag{14}\\
&\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(r_t^i + V^\pi_\phi(s_{t+1}^i) - V^\pi_\phi(s_t^i)\right) \tag{15}\\
&= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\,\hat{A}^\pi(s_t^i, a_t^i) \tag{16}
\end{align}
$$

The batch actor-critic algorithm is:

  1. run the current policy $\pi_\theta$ and collect trajectories $\{\tau^i\}_{i=1}^{N}$ with rewards $\{r_t^i\}_{i=1,t=1}^{N,T}$
  2. fit the value function $V^\pi_\phi$ by minimizing equation 12
  3. calculate the advantage of each state-action pair: $\hat{A}^\pi(s_t^i, a_t^i) = r_t^i + V^\pi_\phi(s_{t+1}^i) - V^\pi_\phi(s_t^i)$
  4. calculate the actor-critic policy gradient $\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\,\hat{A}^\pi(s_t^i, a_t^i)$
  5. gradient update: $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$

We call it batch because for each policy update we collect a batch of trajectories. We can also update the policy (and the value function) using only one step of data, i.e. $(s_t, a_t, r_t, s_{t+1})$, which leads to the online actor-critic algorithm introduced later.
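Continuing the previous sketch (so `value_net`, `fit_value_function`, and `obs_dim` are reused, and `n_actions` is another placeholder), one batch actor-critic update for a discrete-action policy might look like this; terminal-state masking is omitted for brevity:

```python
from torch.distributions import Categorical

n_actions = 2  # placeholder action count (assumption)

# Illustrative policy network pi_theta producing action logits.
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
policy_optim = torch.optim.Adam(policy_net.parameters(), lr=3e-4)

def batch_actor_critic_update(states, actions, rewards, next_states):
    """Steps 2-5 of the batch actor-critic algorithm on a flattened batch of N*T transitions."""
    # Step 2: fit the value function by minimizing equation 12.
    fit_value_function(states, rewards, next_states)

    # Step 3: advantage estimate r + V(s') - V(s) for every state-action pair.
    with torch.no_grad():
        advantages = rewards + value_net(next_states).squeeze(-1) - value_net(states).squeeze(-1)

    # Steps 4-5: actor-critic policy gradient, taken as a gradient ascent step on J(theta).
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()
```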

2 Discount Factor

Our previous discussion of policy gradient and actor-critic algorithms has all been within the finite horizon or episodic learning scenario, where there is an ending time step $T$ for the task. What about the infinite horizon scenario, i.e. $T = \infty$?

Well, in that case the original algorithm can run into problems: in the second step, $V^\pi_\phi$ can become infinitely large in many cases, and in the vanilla policy gradient method the sum of rewards can become infinitely large.

To remedy this, we introduce a discount factor $\gamma \in (0, 1)$ and define the discounted expected reward, value function, and Q-function as

$$
\begin{align}
J(\theta) &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t}\gamma^{t} r(s_t, a_t)\right] \tag{17}\\
V^\pi(s_t) &= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\sum_{t'=t}^{\infty}\gamma^{t'-t} r(s_{t'}, a_{t'}) \,\middle|\, s_t\right] \tag{18}\\
Q^\pi(s_t, a_t) &= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\sum_{t'=t}^{\infty}\gamma^{t'-t} r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right] \tag{19}\\
&= r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)}\left[V^\pi(s_{t+1})\right] \tag{20}
\end{align}
$$

Therefore, the policy gradient and actor-critic policy gradient become

$$
\begin{align}
\nabla_\theta J_{PG}(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{\infty}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(\sum_{t'=t}^{\infty}\gamma^{t'-t} r_{t'}^i - b\right) \tag{21}\\
\nabla_\theta J_{AC}(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{\infty}\nabla_\theta\log\pi_\theta(a_t^i\mid s_t^i)\left(r_t^i + \gamma V^\pi_\phi(s_{t+1}^i) - V^\pi_\phi(s_t^i)\right) \tag{22}
\end{align}
$$

where in $\nabla_\theta J_{PG}(\theta)$, $b$ is also a discounted baseline, e.g. $b = \frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{\infty}\gamma^{t'-t} r_{t'}^i$.

Usually we set $\gamma$ to something like 0.99. It can be proved that the discount factor prevents the expected reward from becoming infinite and also reduces the variance. Therefore, in most cases, no matter whether the horizon is finite or infinite, people use the discounted policy gradients, i.e. equations 21 and 22, rather than the original ones.
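For instance, the discounted reward to go $\sum_{t'=t}^{T}\gamma^{t'-t} r_{t'}$ in equation 21 can be computed for one trajectory with a backward recursion (a NumPy sketch; the names are illustrative):

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """Return the length-T array whose t-th entry is sum_{t'>=t} gamma^(t'-t) * r_t'."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```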

3 Online Actor-critic Algorithms

So far we’ve been discussing the batch actor-critic algorithm, in which for each gradient update we need to run the policy to collect a batch of trajectories. In this section, we introduce online actor-critic algorithms, which allow faster neural network weight updates and, with some techniques, can work better than batch actor-critic algorithms in some cases.

The simplest version of the online actor-critic algorithm is similar to online learning: instead of calculating the gradient using a batch of trajectories and rewards, it uses only one transition tuple $(s, a, s', r)$. The algorithm is the following (a code sketch follows the list):

  1. run policy $\pi_\theta$ for one time step and collect $(s, a, s', r)$
  2. gradient update of $V^\pi_\phi$ using the target $r + \gamma V^\pi_\phi(s')$ for state $s$
  3. evaluate $\hat{A}^\pi(s, a) = r + \gamma V^\pi_\phi(s') - V^\pi_\phi(s)$
  4. calculate the policy gradient $\nabla_\theta J(\theta) = \nabla_\theta\log\pi_\theta(a\mid s)\,\hat{A}^\pi(s, a)$
  5. gradient update: $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
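Continuing the PyTorch sketches above (same `policy_net`, `value_net`, and optimizers, all assumptions), one online update from a single transition might look like the following:

```python
def online_actor_critic_step(s, a, r, s_next, gamma=0.99):
    """One online actor-critic update from a single transition (s, a, s', r)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # Critic update: move V(s) toward the bootstrapped target r + gamma * V(s').
    with torch.no_grad():
        target = r + gamma * value_net(s_next).squeeze(-1)
    value_loss = (value_net(s).squeeze(-1) - target) ** 2
    value_optim.zero_grad()
    value_loss.backward()
    value_optim.step()

    # Advantage estimate from the updated critic.
    with torch.no_grad():
        advantage = (r + gamma * value_net(s_next) - value_net(s)).item()

    # Actor update from a single sample, which is why the variance is so high.
    log_prob = Categorical(logits=policy_net(s)).log_prob(torch.as_tensor(a))
    policy_loss = -log_prob * advantage
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()
```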

However, this algorithm does not really work in most cases, because a one sample estimate has very high variance; coupled with the policy gradient, the variance can be notoriously high. To deal with this, we introduce the synchronized parallel actor-critic algorithm, which is basically several agents running the basic online actor-critic algorithm while using and updating shared policy and value networks $\pi_\theta$ and $V^\pi_\phi$. See the following figure:

[Figure: synchronized parallel actor-critic]

This can be realized very easily by just changing the random seeds of the code for each worker.
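A minimal sketch of the synchronized variant (assuming a gym-style `env.reset()`/`env.step()` interface and a hypothetical `make_env(seed)` constructor; the shared `policy_net` is the one from the sketches above): each worker steps its own differently-seeded environment once, and the resulting transitions are combined into a single shared update.

```python
num_workers = 8  # placeholder worker count (assumption)
envs = [make_env(seed=k) for k in range(num_workers)]   # make_env is hypothetical
states = [env.reset() for env in envs]

def synchronized_collect():
    """Each worker takes one step with the shared policy; return the batch of transitions."""
    batch = []
    for k, env in enumerate(envs):
        s = torch.as_tensor(states[k], dtype=torch.float32)
        with torch.no_grad():
            a = Categorical(logits=policy_net(s)).sample().item()
        s_next, r, done, _ = env.step(a)   # gym-style 4-tuple (assumption)
        batch.append((states[k], a, r, s_next))
        states[k] = env.reset() if done else s_next
    # The batch of num_workers transitions then feeds one shared actor-critic update
    # (e.g. a batched version of online_actor_critic_step), which lowers the variance.
    return batch
```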

Another variant, which has been shown to work very well when we have a very large pool of workers, is the asynchronous parallel actor-critic algorithm:

[Figure: asynchronous parallel actor-critic]

Each worker (agent) sends its one step transition data to a central server to update the parameters (both $\theta$ and $\phi$), but does not wait for the updated weights to be deployed before it executes the next step in the environment. This is a little counterintuitive, because the worker might not be using the latest policy to decide actions. But it turns out that the policy network changes very little between nearby time steps, and since the algorithm runs asynchronously, it can be very efficient.

4 Variance/Bias Tradeoff in Estimating the Advantage

In this section we go back to the actor-critic gradient. What distinguishes it from the vanilla policy gradient is that it uses the advantage $\hat{A}^\pi(s_t, a_t)$ to replace the original one sample estimate of the expected discounted reward to go $\sum_{t'=t}^{\infty}\gamma^{t'-t} r_{t'}^i$. The advantage has smaller variance, but it can be biased since $V^\pi_\phi(s_t)$ can be an imperfect estimate of the value function.

A natural question is: can we bring the best of the two worlds and get an unbiased estimate of the advantage while keeping the variance low? Or further, can we develop a mechanism that allows us to trade off the variance and bias in estimating the advantage?

The answer is yes, and the rest of this section introduces three advantage estimators that give different variance/bias tradeoffs.

Recall that the original advantage estimator in the actor-critic algorithm is:

$$
\begin{align}
\hat{A}^\pi(s_t, a_t) &= \hat{Q}^\pi(s_t, a_t) - V^\pi_\phi(s_t) \tag{23}\\
&= r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t) \tag{24}
\end{align}
$$

4.1 Critic as Baseline (State-Dependent Baseline)

The first advantage estimator is

$$
\hat{A}^\pi(s_t, a_t) = \sum_{t'=t}^{T}\gamma^{t'-t} r_{t'} - V^\pi_\phi(s_t) \tag{25}
$$
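Reusing the `discounted_reward_to_go` helper from the sketch in section 2 and an array of fitted values (both assumptions), this estimator is straightforward to compute for one trajectory:

```python
def critic_baseline_advantage(rewards, values, gamma=0.99):
    """Equation 25: Monte Carlo discounted reward to go minus the learned baseline V_phi(s_t).

    rewards: length-T array of r_t; values: array of V_phi(s_t) for the same trajectory.
    """
    return discounted_reward_to_go(rewards, gamma) - values[:len(rewards)]
```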

Compared to equation 24, we replace the neural estimate of the discounted expected reward to go with the one sample estimate $\sum_{t'=t}^{T}\gamma^{t'-t} r_{t'}$ used in the policy gradient. This estimator gives lower variance than a policy gradient whose baseline is a constant (Greensmith et al. ’04), but the variance is still higher than that of the actor-critic gradient. But does this advantage estimator really lead to an unbiased gradient estimator?

We can actually show that any state-dependent baseline in the policy gradient leads to an unbiased gradient estimator. That is, we want to prove

$$
\begin{align}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\left(\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'}) - V^\pi(s_t)\right)\right] \tag{26}\\
&= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'})\right] \tag{27}
\end{align}
$$

Let’s take one element of the summation $\sum_{t=1}^{T}$ out:

$$
\begin{align}
\nabla_\theta J(\theta)_t &= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\left(\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'}) - V^\pi(s_t)\right)\right] \tag{28}\\
&= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'})\right] - \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\,V^\pi(s_t)\right] \tag{29}\\
&= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'})\right] - \mathbb{E}_{s_{1:t}, a_{1:t-1}}\left[V^\pi(s_t)\,\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\right]\right] \tag{30}\\
&= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'})\right] - \mathbb{E}_{s_{1:t}, a_{1:t-1}}\left[V^\pi(s_t)\cdot 0\right] \tag{31}\\
&= \mathbb{E}_{\tau_t\sim p_\theta(\tau_t)}\left[\nabla_\theta\log\pi_\theta(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t} r(s_{t'}, a_{t'})\right] \tag{32}
\end{align}
$$

4.2 State-Action Dependent Baseline

To be updated; the material is in Gu et al. ’16.

4.3 Generalized Advantage Estimation (GAE)

Lastly, let’s compare the advantage estimator introduced in section 4.1 (call it $\hat{A}^\pi_{MC}$) and the advantage estimator used in the vanilla actor-critic algorithm (call it $\hat{A}^\pi_{AC}$):

$$
\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{T}\gamma^{t'-t} r_{t'} - V^\pi_\phi(s_t), \qquad
\hat{A}^\pi_{AC}(s_t, a_t) = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t)
$$

$\hat{A}^\pi_{MC}$ has higher variance because of the one sample estimate $\sum_{t'=t}^{T}\gamma^{t'-t} r_{t'}$, while $\hat{A}^\pi_{AC}$ is biased because $r_t + \gamma V^\pi_\phi(s_{t+1})$ might be a biased estimator of $Q^\pi(s_t, a_t)$.

Stare at these two estimators for a while and you might notice that the essential part deciding the variance/bias tradeoff is the estimation of $Q^\pi$: the one sample estimate has high variance while the neural estimate can be biased. We can combine the two, using the one sample estimate for the first $l$ steps and the neural estimate for the rest, i.e.

$$
\hat{A}^\pi_{l}(s_t, a_t) = \sum_{t'=t}^{t+l-1}\gamma^{t'-t} r_{t'} + \gamma^{l} V^\pi_\phi(s_{t+l}) - V^\pi_\phi(s_t) \tag{33}
$$

This also makes sense intuitively, because the further we go from the current time step, the higher the variance will be. On the other hand, although $V^\pi_\phi$ can be biased, when multiplied by $\gamma^l$ its effect becomes small. Therefore, $l$ controls the variance/bias tradeoff of the advantage estimator. Letting $l = 1$, we recover $\hat{A}^\pi_{AC}$, which has the highest bias and lowest variance; letting $l = \infty$, we recover $\hat{A}^\pi_{MC}$, which is unbiased but has the highest variance.
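A sketch of the $l$-step estimator in equation 33 for one trajectory (NumPy; the `values` array of fitted state values is an assumed input, with one extra entry for the state after the final step):

```python
import numpy as np

def n_step_advantage(rewards, values, l, gamma=0.99):
    """Equation 33: l one-sample discounted rewards, then bootstrap with the critic.

    rewards: length-T array of r_t; values: length-(T+1) array of V_phi(s_t),
    where values[T] is the value of the state after the last step (0 if terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + l, T)
        # One-sample discounted rewards for the first l steps ...
        adv[t] = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # ... then the (possibly biased) neural estimate for the rest.
        adv[t] += gamma ** (horizon - t) * values[horizon] - values[t]
    return adv
```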

$\hat{A}^\pi_{GAE}$ is defined as an exponentially weighted sum of the $\hat{A}^\pi_{l}$’s:

$$
\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{l=1}^{\infty} w_l\, \hat{A}^\pi_{l}(s_t, a_t) \tag{34}
$$

where $w_l \propto \lambda^{l-1}$.

It can be shown that $\hat{A}^\pi_{GAE}$ can be equivalently written as

$$
\hat{A}^\pi_{GAE}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma\lambda)^{t'-t}\,\delta_{t'} \tag{35}
$$

where

$$
\delta_{t'} = r_{t'} + \gamma V^\pi_\phi(s_{t'+1}) - V^\pi_\phi(s_{t'})
$$
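Equation 35 is easy to compute with a backward recursion over the TD errors $\delta$ (a NumPy sketch for one finite trajectory; the `values` array is the same assumed input as above):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: A_t = sum_{t'>=t} (gamma*lam)^(t'-t) * delta_t'."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t from the equation above
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```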

5 Demo: DeepMind A3C