[Translated by] Jason
Kevin Gimpel is an assistant professor at TTIC. His research area is natural language processing; his recent interests include representation learning, structured prediction, robust and data-efficient NLP, and world modeling for NLP. According to Google Scholar, as of June 2021 Kevin has over 9,600 citations.
I had the honor of taking Kevin's class while studying at the University of Chicago. He teaches with great care, working through every assignment himself before handing it out. I have read this advice of his for PhD students many times and felt it deserved a wider audience. When I proposed translating it, he kindly agreed. The original is at: https://home.ttic.edu/~kgimpel/etc/phd-advice.pdf.
A PhD student's life is full of dangers.
Enemies spring out from every angle.
Some attacks come from the outside: low income, irrelevant coursework, anonymous reviewers.
But more of the attacks come from the student himself: self-doubt, anxiety, and insecurity are the main adversaries.
Take, for example, the famous scholar Professor Indiana Jones (the fictional professor from the Raiders of the Lost Ark films). The movies are full of external dangers, but Professor Jones always dispatches them cleverly; the fiercest struggles are the internal ones.
A PhD is a long journey. In the early years, your time mostly goes into training skills: mathematical maturity, coding ability, experiment management, breadth of vision. Later, you begin to make more scholarly contributions to your field. On these fronts, plenty of good advice already exists, e.g. [desJardin; Dredze and Wallach; Guo; Stearns, inter alia].
What often gets neglected is how to manage your own rich inner world. This article offers some advice, from the concrete to the abstract, in the hope of helping you become a happier PhD student. A happier student does better work; but more importantly, a happier student is a happier person.
PhD students often agonize over how to schedule their time. My advice: settle on a sustainable schedule and stick to it strictly.
For me, that means working five days a week, eight hours a day. Not five hours one day and eleven the next. Not six days a week at seven hours a day. Not nine hours in the office with two of them spent on things unrelated to work. Not skipping work whenever I don't feel like it and pulling all-nighters to catch up. I have tried all of the above.
Strictly keeping a sustainable schedule prevents you from working too little as well as from working too much. Either extreme will lower your happiness.
Don't work too much. Find a life outside of work. Develop friendships with your labmates (so that you can discuss work with friends), and find friends outside the lab (so that when you're with them you can't discuss work). If work is everything to you and its importance dominates your existence, the inevitable setbacks will feel unbearable.
Don't work too little. Research is risky and open-ended. High-risk ideas tend to be explored half-heartedly, or not at all. Because such an idea might not work, the student keeps putting off starting on it. That creates stress, which makes the student waste even more time and strains the relationship with the advisor. The student worries that exploring the risky idea would waste time, but it's the procrastination that is wasting it!
When you work in order to keep your schedule, you will start doing things you don't feel like doing. You will find it easy to spend an hour probing that high-risk project, because doing so serves the ultra-short-term goal of putting in your n hours today. When you must spend the hour working, instead of inventing excuses not to work or agonizing over what to do, just give it a try. Who knows? You might find something interesting. Which brings me to the next point.
Research is a pursuit of truth, and truth is often elusive. Things never go the way we expect. So at some point, stop theorizing and implement your idea.
A well-executed experiment produces a result, and that result is always, in some sense, a discovery. Either you obtained a result nobody has obtained before, or you successfully reproduced someone else's. Either way, pause and think: what does this result mean? Does it change what we believed? How should it update our view of the world?
The harder pill to swallow is that your result may not carry much information. Much of the time, you are trying to show that your shiny new idea beats a standard baseline. In that mindset, it is easy to get sloppy in the implementation or the experimental design. Which brings me to the next point.
Whether your research is going well or badly, whether progress is fast or slow, whether the results beat or miss expectations: you probably have a bug. Countless published papers contain bugs. They are often discovered long after publication; sometimes they don't affect the conclusions, sometimes they do. As researchers, humans are flexible and creative, but also prone to frequent small mistakes.
When a student implements the shiny new idea and gets the first batch of results, he is usually either elated or devastated. Waiting for and making sense of those initial results is exhausting. Worse, the initial results often don't even contain much information.
My advice is to have a plan for both cases. Whether the result is positive or negative, work to rule out bugs. Try variants of your idea: that forces you to repeat the experimental pipeline, which raises the odds of catching mistakes. Add more assertions and comments to your code. Every result counts as a discovery, but only if there is no bug.
Bugs come in many forms. For experimental computer scientists, a bug is an error in code; if your research is proving theorems, a bug can be an assumption that doesn't hold or a flawed step; if you are a biologist, a bug can be contamination from any source.
Going well, going badly, or somewhere in between: you probably have a bug. After you have found and fixed ten bugs, you may still have one more; after you have published ten wonderful papers on the same codebase, you may still have a bug.
Focus on the phenomena. Imagine a defensive, insecure, fiercely competitive meteorologist, forever anxious about the accuracy of his forecasts, downplaying the misses and trumpeting the hits, constantly comparing his predictions against his rivals'.
Now imagine another meteorologist, honest and humble, forthright about both the accuracy and the errors of his forecasts, and explaining why. He forgets himself and focuses on the beauty, complexity, and importance of the weather itself. His is the forecast you would rather watch.
PhD students (like everyone else) slip easily into being the insecure meteorologist: just as fallible as anyone, yet always trying to burnish their own image. Nobody cares whether you are smart. People watch the forecast because they care about the weather, not about you. (About the only people who care whether you are smart are those considering hiring you. But even then, recruiters prefer candidates who lose themselves in the richness of the research.)
If you dwell on your own brilliance and reputation, you will never be happy. Excessive self-consciousness breeds stress and smothers the ability to enjoy the beauty of scientific discovery. Humility opens a person to the wonder and richness of reality. As Chesterton put it:
“What follows might be misunderstood; but I should begin my sermon by telling people not to enjoy themselves. I should tell them to enjoy dances and theatres and joy-rides and champagne and oysters; to enjoy jazz and cocktails and night-clubs if they can enjoy nothing better; to enjoy bigamy and burglary and any crime in the calendar, in preference to this other alternative; but never to learn to enjoy themselves. Human beings are happy so long as they retain the power of wonder, of receptivity, and of gratitude…”
“When a man's self-consciousness comes to dominate over outward surprise and adventure, he becomes bitterly self-critical and feels that his hopes have collapsed. It is the mark of extreme hunger and despair.”
Promote excellent research. If it happens to be yours, fine. But it is not excellent because it is yours. Chesterton goes on: “A man's vanity makes himself the only standard by which things are judged, instead of letting the facts be the standard…”
Research happens in a community, and excessive self-promotion is out of place and harmful; it works against the healthy functioning of the community. Do you think Indiana Jones promotes himself? He's busy saving the world! If saving the world is your day job, you don't need self-promotion; other people will make movies about you.
Controversial questions will arise in your field. For instance: who contributed most to some invention? Is some result valid? Should some dataset or task be kept or abandoned? Should a new finding, metric, or methodology be embraced or criticized?
Sometimes a firm stance is necessary. In those cases, pick your side decisively.
But most questions are not worth taking sides on, because usually there is substantial evidence on both sides. Owing to our innate, irrational habit of settling on a belief before all the evidence is in, most researchers take sides anyway. Consciously or unconsciously, the choice is usually driven by subjective criteria: personality, intellect, style, or school of thought. Sometimes these subjective criteria outweigh any objective ones.
A scientist should be able to step back and view the question from a broader perspective. The simplest way is to refuse to take sides. You will be happier, because when you review new evidence on the question, you will feel no pressure to confirm your own bias. People will notice that you are seeking the truth, and they will respect you for it, even though standing in the middle can be uncomfortable (and lonely).
When institutions fund our research with taxpayers' money, they are funding objective scientists working together in a community to add to human knowledge. That money is not meant to fuel petty gossip, to supply "ammunition" for factional fights, or to convince people that you contributed more than somebody else. All of that is a misuse of public funds, and it will not make you happier.
Much of the stress PhD students experience comes from comparing themselves to others. A student tries to compete with her imagined rivals without being able to tell whether the competition even makes sense.
Hours worked is one of the most concrete forms of comparison. Some people boast about working long hours, some boast about working short hours, and sometimes the same person boasts about both!
Refuse to play this game; it benefits no one. Fretting about such things keeps your attention fixed on yourself. If you are forever comparing yourself with others, you cannot focus on the quality of your work, the depth of your understanding, or the breadth of your perspective.
Everyone is different, so comparing yourself with someone else is simply inappropriate. A comparison is only meaningful once all the variables are controlled, and between any two people there are far too many differences. Any comparison will therefore reflect irrelevant differences more than the qualities that matter for your research. Someone publishing more papers than you should not plunge you into despair; someone publishing fewer should not send you into ecstasy.
Let's be clear about one thing: you are not the best. And that's fine. You can't do everything, and you won't make the greatest discovery in human history. But you can do something. You have the chance to make real contributions to science that stand the test of time, contributions that nobody has made before and that, without you, might never be made.
Reality is strange. Every world we construct is comparatively simple, while reality always contains things we cannot understand, and we will never understand it completely. Research is an exploration of reality, a wonderful exploration, and it keeps us humble. True humility is striving to see things as they are, not as we wish them to be.
Once you shed your inner vanity and self-absorption, and enter real freedom through a life grounded in reality, you will be much happier.
I will close with two quotations:
In the day-to-day trenches of adult life, there is no such thing as atheism. There is no such thing as not worshipping. Everybody worships. The only choice we get is what to worship. And an outstanding reason for choosing some sort of god or spiritual-type thing to worship (be it JC or Allah, be it Yahweh or the Wiccan mother-goddess, or the Four Noble Truths, or some inviolable set of ethical principles) is that pretty much anything else you worship will eat you alive.
If you worship money and things, if they are where you tap real meaning in life, then you will never have enough, never feel you have enough. Worship your own body and beauty and allure, and you will always feel ugly; and when time and age start showing, you will die a million deaths before they finally mourn you. On one level, we all know this stuff already; it has been codified in myths, proverbs, epigrams, parables; it is the skeleton of every great story. The whole trick is keeping the truth up front in daily consciousness. Worship power, and you will end up feeling weak and afraid, and you will need ever more power over others to numb you to your own fear. Worship your intellect, being seen as smart, and you will end up feeling stupid, a fraud, always on the verge of being found out. The insidious thing about these forms of worship is not that they are evil; it is that they are unconscious. They are default settings. They are the kind of worship you just gradually slip into, day after day, getting more and more selective about what you see and how you measure value. And the so-called real world will not discourage you from operating on your default settings, because the so-called real world of men and money and power hums along quite nicely on the fuel of fear and contempt and frustration and craving and worship of self. Our present culture has harnessed these forces in ways that have yielded extraordinary wealth and comfort and personal freedom: the freedom to be lords of our tiny skull-sized kingdoms, alone at the center of all creation. This kind of freedom has much to recommend it. But of course there are all different kinds of freedom, and the kind that is most precious you will not hear much talked about in this busy world of winning and achieving and displaying. The really important kind of freedom involves attention, and awareness, and discipline, and being able truly to care about other people and to sacrifice for them, over and over, in myriad petty, unglamorous ways, every day. That is real freedom. You should not slip unconsciously into the default setting: the meaningless rat race, the constant gnawing sense of having had, and lost, some infinite thing.
—David Foster Wallace [Wallace]
Translating the final passage is beyond my ability, so to preserve the original meaning, I give it in the original English:
And from a Christian perspective: We must not think Pride is something God forbids because He is offended at it, or that Humility is something He demands as due to His own dignity—as if God Himself was proud…. He wants you to know Him: wants to give you Himself. And He and you are two things of such a kind that if you really get into any kind of touch with Him you will, in fact, be humble—delightedly humble, feeling the infinite relief of having for once got rid of all the silly nonsense about your own dignity which has made you restless and unhappy all your life. He is trying to make you humble in order to make this moment possible: trying to take off a lot of silly, ugly, fancy-dress in which we have all got ourselves up and are strutting about like the little idiots we are. I wish I had got a bit further with humility myself: if I had, I could probably tell you more about the relief, the comfort, of taking the fancy-dress off—getting rid of the false self, with all its ‘Look at me’ and ‘Aren’t I a good boy?’ and all its posing and posturing. To get even near it, even for a moment, is like a drink of cold water to a man in a desert.
—C. S. Lewis, Mere Christianity

References
[Chesterton] Chesterton, G. K. The Common Man.
[desJardin] desJardin, M. How to succeed in graduate school: A guide for students and advisors. https://www.eng.auburn.edu/~troppel/Advice_for_Grad_Students.pdf. Accessed: 2021-06-20.
[Dredze and Wallach] Dredze, M. and Wallach, H. M. How to be a successful PhD student (in computer science (in NLP/ML)). https://people.cs.umass.edu/~wallach/how_to_be_a_successful_phd_student.pdf. Accessed: 2021-06-20.
[Guo] Guo, P. J. Advice for new Ph.D. students. http://pgbovine.net/early-stage-PhD-advice.htm. Accessed: 2016-01-31.
[Lewis] Lewis, C. S. Mere Christianity.
[Stearns] Stearns, S. C. Some modest advice for graduate students. http://stearnslab.yale.edu/some-modest-advice-graduate-students. Accessed: 2021-06-20.
[Wallace] Wallace, D. F. This is Water. https://en.wikipedia.org/wiki/This_Is_Water. My translation partly referred to https://www.youtube.com/watch?v=nSYLeqWZwSw. Accessed: 2021-06-20.
On the other hand, if we have an explicit policy, we can make a decision at each time step based on the state at that time step, so there is no need to plan the whole action sequence in one go. This is closed-loop planning, and it is more desirable in the stochastic dynamics setting.
Suppose we have learned dynamics \(s_{t+1} = f(s_t, a_t)\) and reward function \(r(s_t, a_t)\), and want to learn the optimal policy \(a_t = \pi_{\theta}(s_t)\). (Here I use deterministic dynamics and a deterministic policy to make a point. The point also applies to stochastic dynamics, but the derivation is slightly more involved and will be introduced in the future. I also drop the parameter notation of the dynamics and reward function for simplicity.) As in policy gradient, our goal is:
\[\begin{align} \theta^* = \text{argmax}_{\theta} \mathbb{E}_{\tau\sim p(\tau)}\sum_t r(s_t, a_t) \end{align}\]Since we have dynamics and reward function, we can write the objective as
\[\begin{align} \mathbb{E}_{\tau\sim p(\tau)}\sum_t r(s_t, a_t) = \sum_t r(f(s_{t-1}, a_{t-1}), \pi_{\theta}(f(s_{t-1}, a_{t-1}))), \text{ where } s_{t-1} = f(s_{t-2}, a_{t-2}) \end{align}\]Very similar to shooting methods, the objective is defined recursively, which leads to high sensitivity to the first actions and to poor numerical stability. For shooting methods, if we define the process as LQR, we can use dynamic programming to solve it in a very stable fashion. Unfortunately, unlike LQR, since the parameters of the policy couple all time steps, we cannot solve this by dynamic programming (i.e. we can't compute the best policy parameters for the last time step, then solve for the policy parameters for the second-to-last time step, and so on).
What we can use is backpropagation:
If you are familiar with recurrent neural networks, you might recognize the kind of backprop shown above as the so-called Backpropagation Through Time, or BPTT, which is usually used on recurrent neural nets like LSTMs. BPTT famously suffers from vanishing or exploding gradients, because the Jacobians of all the time steps get multiplied together. This issue can only get worse in policy learning: in sequence deep learning, we can choose architectures like the LSTM that have good gradient behavior, but in model-based RL, the dynamics has to be fit to the data, so we have no control over its gradient behavior.
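The effect can be seen numerically with a toy linear dynamics (chosen purely for illustration, not from the lecture): the Jacobian of the final state with respect to the first state is a product of per-step Jacobians, so its norm grows or decays geometrically with the horizon.

```python
import numpy as np

# Toy linear dynamics s_{t+1} = A s_t: the Jacobian of s_T w.r.t. s_0 is
# A multiplied T times, so its norm scales like (largest singular value)^T.
def jacobian_norm_through_time(A, T):
    J = np.eye(A.shape[0])
    for _ in range(T):
        J = A @ J          # chain rule: one Jacobian factor per time step
    return np.linalg.norm(J, 2)

contracting = 0.5 * np.eye(2)   # singular values < 1 -> vanishing gradient
expanding   = 1.5 * np.eye(2)   # singular values > 1 -> exploding gradient

assert jacobian_norm_through_time(contracting, 20) < 1e-5
assert jacobian_norm_through_time(expanding, 20) > 1e3
```

With an LSTM we could pick `A`-like structures with well-behaved spectra; with a fitted dynamics model we get whatever the data gives us.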
In the next two sections, we will introduce two popular approaches to model-based RL. The first is a bit controversial: it does model-free optimization (policy gradient, actor-critic, Q-learning, etc.) and uses the model only to generate synthetic data. Despite looking backwards, this idea can work very well in practice. The second is to use simple local models and local policies, which can be solved with stable algorithms.
Reinforcement learning is about getting better by interacting with the world, and this trial-and-error process can be time consuming (sometimes even in a simulator). If we had a mathematical model that represents how the world works, we could effortlessly generate data (transitions) from it for model-free algorithms to improve with. Of course, it's impossible to have a comprehensive mathematical model of the world, or even of the particular environment we run our RL algorithms in. Nevertheless, a learned dynamics model is a representation of the environment, and we can use it to generate data.
The general idea is to use the learned dynamics to provide more training data for model-free algorithms, by generating model-based rollouts from real-world states.
The general algorithm is the following:
A few things need clarifying. The algorithm above is very general and explicitly covers both policy gradient and Q-learning, which affects what we actually do in steps 1, 5, and 6. If we use policy gradient, then in steps 1 and 5 we run the learned policy, and in step 6 we run a policy gradient update. If we use Q-learning, then in steps 1 and 5 we run the policy induced by the learned Q-function, e.g. the \(\epsilon\)-greedy policy, and in step 6 we update the Q-function by taking the gradient of the temporal difference error.
The model-based rollout length k is a very important hyperparameter. Since we rely entirely on \(f_{\phi}(s_t, a_t)\) during model-based rollouts, the discrepancy between \(f_{\phi}(s_t, a_t)\) and the ground-truth dynamics causes a distribution shift problem, i.e. the expectation in the objective we optimize is over a distribution that is very different from the true distribution. We've encountered this issue several times before (e.g. imitation learning, TRPO, etc.). We know that if there is a discrepancy between the fitted dynamics and the true dynamics, the error between the true objective and the objective we optimize grows linearly with the length of the rollout. Therefore, we don't want model-based rollouts to be too long; on the other hand, too short a rollout provides little learning signal, which is undesirable for policy or Q-function updates. So we need to choose an appropriate k for the algorithm.
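As a rough sketch of the rollout-generation step (all function names here are hypothetical stand-ins, not from the lecture), branching k-step synthetic rollouts off real states might look like:

```python
def model_rollouts(f_phi, policy, real_states, k):
    """Generate k-step synthetic transitions, each rollout branching
    from a real-world state.

    f_phi:  learned dynamics, s' = f_phi(s, a)   (hypothetical name)
    policy: current policy, a = policy(s)
    """
    synthetic = []
    for s in real_states:              # one short imagined rollout per real state
        for _ in range(k):
            a = policy(s)
            s_next = f_phi(s, a)       # model prediction, not the real env
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic

# toy check: a linear "learned" dynamics and a constant policy
f_phi = lambda s, a: 0.9 * s + a
policy = lambda s: 0.1
data = model_rollouts(f_phi, policy, real_states=[0.0, 1.0], k=3)
assert len(data) == 2 * 3              # |real states| * k transitions
```

Small k keeps the model error from compounding; large k gives the model-free learner more signal per real transition.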
Since every model-based rollout starts from a state sampled from real-world data, this algorithm can be intuitively understood as imagining different possibilities starting from real-world situations:
Here we give one instantiation of the general algorithm introduced above, which combines model-based RL with policy gradient methods. The algorithm is called Model-Based Policy Optimization, or MBPO (Janner et al. 19’):
For instantiations with Q-learning, see Gu et al. 16’ and Feinberg et al. 18’.
A local model is a model that is valid only in the neighborhood of one or a few trajectories. Previously, we learned (i)LQR, which assumes linear dynamics (it approximates the dynamics by a linear function). This may be too simple globally, but it can be a good assumption locally, i.e. for one or a few very close trajectories. Given such trajectories, we fit a linear dynamics model to them by linear regression at each time step, then run (i)LQR to get actions and execute those actions in the environment. This yields new trajectories, to which we again fit a linear dynamics model, run (i)LQR, execute the planned actions, and so on.
The procedure looks like the following:
Where the local linear dynamics is defined as
\[p(x_{t+1}\mid x_t, u_t) = \mathcal{N}(A_t x_t + B_t u_t + c_t, \Sigma)\]where \(A_t, B_t, c_t\) are fitted using the trajectories \(\{ \tau_i \}\). \(\Sigma\) can be tuned as a hyperparameter or also estimated from data.
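Fitting \(A_t, B_t, c_t\) at a given time step is just ordinary least squares over the trajectory data at that step. A minimal numpy sketch (the shapes and function name are my own assumptions):

```python
import numpy as np

def fit_local_linear_dynamics(X, U, X_next):
    """Least-squares fit of x' ≈ A x + B u + c at one time step.

    X:      (N, dx) states  x_t  across N trajectories
    U:      (N, du) actions u_t
    X_next: (N, dx) states  x_{t+1}
    """
    N = X.shape[0]
    # regressors [x, u, 1]; solve min ||Z W - X_next||^2
    Z = np.hstack([X, U, np.ones((N, 1))])
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    dx, du = X.shape[1], U.shape[1]
    A, B, c = W[:dx].T, W[dx:dx + du].T, W[-1]
    return A, B, c

# sanity check on noiseless data generated from known A, B, c
rng = np.random.default_rng(0)
A0, B0, c0 = np.array([[0.9]]), np.array([[0.5]]), np.array([0.1])
X = rng.normal(size=(50, 1)); U = rng.normal(size=(50, 1))
X_next = X @ A0.T + U @ B0.T + c0
A, B, c = fit_local_linear_dynamics(X, U, X_next)
assert np.allclose(A, A0) and np.allclose(B, B0) and np.allclose(c, c0)
```

In practice N is small (a few close trajectories), which is exactly why the fit is only trusted locally.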
The policy is defined as
\[p(u_t\mid x_t) = \mathcal{N}(K_t (x_t - \hat{x}_t) + k_t + \hat{u}_t, \Sigma_t)\]Note that this corresponds to iLQR, i.e. \(K_t, k_t\) are computed from the fitted dynamics, and \(\hat{x}_t, \hat{u}_t\) are the actual states and actions in the trajectories \(\{ \tau_i \}\). \(\Sigma_t\) is set to \(Q_{u_t, u_t}^{-1}\), which is an intermediate result of running iLQR. Intuitively, \(Q_{u_t, u_t}\) is the curvature (second derivative) of the cost-to-go with respect to the action. If it is small, the total cost does not depend very strongly on the action, so many different actions lead to similar cost; it is then a good idea to try out different actions, so we want the variance of \(p(u_t\mid x_t)\) to be high, and vice versa. Setting \(\Sigma_t\) to \(Q_{u_t, u_t}^{-1}\) gives us exactly this property.
One more thing to notice: since the fitted dynamics is only valid locally, if the actions we take lead to a very different state distribution, the subsequent planned actions might be very bad and lead to even worse results. Therefore, we need to make sure the new trajectory distribution stays close to the old one. This can be enforced by again using a KL divergence constraint:
\[D_{\text{KL}}(p_{\text{new}}(\tau) \,\|\, p_{\text{old}}(\tau))\]For details about how this is implemented, please see Levine and Abbeel 14’.
If we have a bunch of local policies, e.g. \(\{\pi_{\text{LQR}, i}\}_{i}\), derived from local models (e.g. LQR models), we can distill the knowledge of these local policies into a global policy by supervised learning.
The idea above can be viewed as a special case of a more general framework known as knowledge distillation (Hinton et al. 15’). Here we have a bunch of weak policies (the local policies), and we could ensemble them to get a strong policy; but rather than using the ensemble directly, we distill its knowledge into one global neural network policy \(\pi_{\theta}\). We train the neural network on the trajectories used for training the LQR parameters and policies, except that instead of training the policy to output the single actual action at each time step, we train it to predict the probability of each action given the state.
For the algorithm to work better, we want the LQR policies \(\{\pi_{\text{LQR}, i}\}_{i}\) to be close to the neural net policy \(\pi_{\theta}\). We use a KL divergence to enforce that, which can be implemented by modifying the cost function of the LQR.
The algorithm sketch is the following:
Here \(k\) indexes the step of the algorithm and \(i\) indexes the different LQR models, each instantiated by starting from a different initial state. Step 3 makes the local policies and the global policy close to each other in terms of KL divergence, and \(\lambda_{k+1, i}\) is the Lagrange multiplier. This is just a sketch of the algorithm; for details, please check out the original paper by Levine and Finn et al. 16’.
A similar approach can also be extended to the multitask transfer scenario:
Where the loss function for training the global policy is
\[\mathcal{L}^i = \sum_{a\in \mathcal{A}_{E_i}} \pi_{E_i}(a\mid s)\log \pi_{\theta}(a\mid s)\]For details, please see Parisotto et al. 16’.
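The objective above is (up to sign) a cross-entropy between the expert's action distribution and the student's. A minimal sketch, with hypothetical names and dense probability vectors assumed for illustration:

```python
import numpy as np

def distill_loss(pi_expert, pi_theta):
    """Cross-entropy -sum_a pi_E(a|s) log pi_theta(a|s), averaged over states.

    pi_expert, pi_theta: (N, |A|) arrays; each row is an action distribution.
    """
    # small epsilon guards against log(0)
    return -np.mean(np.sum(pi_expert * np.log(pi_theta + 1e-12), axis=1))

teacher = np.array([[0.9, 0.1]])
# by Gibbs' inequality, the loss is smallest when the student matches the teacher
assert distill_loss(teacher, teacher) < distill_loss(teacher, np.array([[0.5, 0.5]]))
```

Minimizing this loss is equivalent to maximizing \(\mathcal{L}^i\); the student matches the expert's soft action probabilities rather than its hard action choices.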
Again, most of the algorithms will be introduced in the context of deterministic dynamics, i.e. \(s_{t+1} = f(s_t, a_t)\), but almost all of these algorithms can just as well be applied in the stochastic dynamics setting, i.e. \(s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)\), and when the distinction is salient, I’ll make it explicit.
How do we learn a model? The most direct way is supervised learning. Similar to ideas used before, we run a random policy to collect transitions, and then fit a neural net to those transitions:
where in step 3 we can use CEM, MCTS, LQR, etc.
Does this work? Well, in some cases. For example, if we have a full physics model of the dynamics and only need to fit a few parameters, this method can work. But still, some care should be taken to design a good base policy.
In general, however, this doesn't work, and the reason is very similar to the one we encountered in imitation learning: distribution shift. The data used to learn the dynamics comes from the trajectory distribution induced by the random policy \(\pi_0\), but when we plan through the model, we can think of the algorithm as using another policy \(\pi_f\), and the trajectory distribution induced by this policy can be very different from the one induced by the base policy. The consequence is that, when we plan actions, we arrive at state-action pairs the dynamics model is very uncertain about, because it has never been trained on similar data! It will then make bad predictions of the following states, which in turn lead to bad actions; this compounds until we are planning on completely wrong states (the predictions diverge from reality). The intuitive plot is shown below:
How do we deal with this? The same way DAgger deals with distribution shift in imitation learning: we make sure the training data comes from the distribution induced by the current dynamics (current policy). This leads to the first practical model-based RL algorithm:
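A toy sketch of this loop (the environment, fitting routine, and planner below are all small stand-ins chosen for illustration; in practice `fit` trains a neural net and `plan` is CEM, MCTS, LQR, etc.):

```python
import numpy as np

rng = np.random.default_rng(0)

def model_based_rl(env_step, plan, fit, s0, n_iters=3, rollout_len=10):
    """Alternate between fitting dynamics on ALL data collected so far and
    executing planned actions, which adds on-distribution data (DAgger-style)."""
    data = []                                        # (s, a, s') transitions
    s = s0
    for _ in range(rollout_len):                     # 0. seed with a random policy
        a = rng.uniform(-1, 1)
        s_next = env_step(s, a)
        data.append((s, a, s_next)); s = s_next
    for _ in range(n_iters):
        f = fit(data)                                # 1. fit dynamics to data
        s = s0
        for _ in range(rollout_len):
            a = plan(f, s)                           # 2. plan through the model
            s_next = env_step(s, a)                  # 3. execute in the real env
            data.append((s, a, s_next)); s = s_next  # 4. aggregate the new data
    return fit(data)

# toy problem: linear env, least-squares "dynamics", greedy 1-step planner
env_step = lambda s, a: 0.8 * s + 0.4 * a
def fit(data):
    S, A, S_next = map(np.array, zip(*data))
    w, *_ = np.linalg.lstsq(np.stack([S, A], 1), S_next, rcond=None)
    return lambda s, a: w[0] * s + w[1] * a
def plan(f, s):   # pick the action whose predicted next state is closest to 0
    cand = np.linspace(-1, 1, 21)
    return cand[np.argmin([f(s, a) ** 2 for a in cand])]

f = model_based_rl(env_step, plan, fit, s0=1.0)
assert abs(f(1.0, 1.0) - 1.2) < 0.05   # recovers the true linear dynamics
```

The key difference from the naive version is step 4: executed plans feed back into the dataset, so the model is refit on the states the planner actually visits.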
However, even though the data is updated based on the learned dynamics, as long as we replan, we induce a new trajectory distribution that differs a little from the previous one. In other words, the distribution shift never fully disappears. Therefore, as we plan through \(f_{\theta}(s,a)\), the actual trajectory gradually deviates from the predicted trajectory, which leads to bad actions.
We can improve this algorithm by executing only the first planned action, observing the next state that this action leads to, and then replanning from that state, taking the first action of the new plan, and so on. In short: at each time step we take only the first planned action, observe the resulting state, and replan from there. Because the action at every time step is based on the actual state, this is more reliable than executing the whole planned action sequence in one go. The algorithm is:
This algorithm is called Model Predictive Control, or MPC. Replanning at every time step can drastically increase the computational load, so people sometimes choose to shorten the planning horizon. While this might decrease the quality of the actions, since we are constantly replanning, we can tolerate individual plans being less perfect.
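A minimal MPC sketch. The inner planner here is simple random sampling of action sequences, and the dynamics `f` stands in for the learned model; all names are hypothetical stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_random_shooting(f, r, s0, horizon, n_samples):
    """Sample action sequences, return the FIRST action of the best one."""
    best_a0, best_ret = None, -np.inf
    for _ in range(n_samples):
        A = rng.uniform(-1, 1, size=horizon)
        s, ret = s0, 0.0
        for a in A:
            ret += r(s, a)
            s = f(s, a)                 # roll forward through the model
        if ret > best_ret:
            best_ret, best_a0 = ret, A[0]
    return best_a0

def mpc(f, r, s0, T, horizon=5, n_samples=100):
    """Replan at every step; execute only the first planned action."""
    s, total = s0, 0.0
    for _ in range(T):
        a = plan_random_shooting(f, r, s, horizon, n_samples)
        total += r(s, a)
        s = f(s, a)                     # in reality: observe the true next state
    return total

# toy problem: drive the state toward 0
f = lambda s, a: s + a
r = lambda s, a: -s ** 2
assert mpc(f, r, s0=1.0, T=10) > -6.0   # far better than doing nothing (-10)
```

Note the short `horizon=5` versus the full `T=10`: the constant replanning is what lets a cheap, short-horizon planner still behave well.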
Since we plan actions relying on the fitted dynamics, it is crucial that the dynamics is a good representation of the world. When we use a high-capacity model like a neural network, we usually need to feed it a lot of data to get a good fit. But in model-based RL, we usually don't have a lot of data at the beginning; in fact, we may only have bad data (generated by running some random policy). A neural network fitted on it will overfit and fail to represent the good parts of the world. This leads the algorithm to take bad actions, which lead to bad states; the dynamics model is then trained only on bad trajectories, so its predictions on good states are unreliable, which again leads the algorithm to take bad actions…… This looks like a chicken-and-egg problem, but at its root, the issue is that planning on state predictions the model is not confident about leads to bad actions.
The solution is to quantify the uncertainty of the model, and to take this uncertainty into consideration when planning.
First of all, it's important to know that the uncertainty of a model is not the same thing as the probability the model assigns to some predicted state. Uncertainty is not about the setting where the dynamics are noisy, but about the setting where we don't know what the dynamics are.
The way to avoid taking risky actions in uncertain states is to plan based on the expected expected reward. Wait, what? Yes, this is not a typo: the first "expected" is with respect to the model uncertainty, and the second is with respect to the trajectory distribution. Mathematically, the objective is
\[\begin{align} &\int \int \sum_t r(s_t, a_t) p_{\theta}(\tau) p(\theta) \text{d}\tau \text{d}\theta \\ &= \int \left[\mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_t r(s_t, a_t)\right] p(\theta)\text{d}\theta \\ &= \mathbb{E}_{\theta\sim p(\theta)}\left[ \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_t r(s_t, a_t) \right] \end{align}\]Having an uncertainty-aware formulation, the next steps are:
In this subsection we discuss how to get \(p(\theta)\). First of all, we should make sure that the general direction is to learn \(p(\theta)\) from data, thus we should explicitly write it as \(p(\theta\mid \mathcal{D})\) where \(\mathcal{D}\) is the data.
The first approach is Bayesian Neural Networks, or BNN. To consider the problem from a Bayesian perspective, we can first rethink our original approach, i.e. what is it that we are estimating when doing supervised training in step 2 in MPC? (Here we write it slightly differently for illustration)
learn dynamics model \(f_{\theta^*}(s,a)\) to minimize \(\sum_{i}\left\| s'_i - f_{\theta}(s_i, a_i) \right\|^2\)
The \(\theta\) that we find is actually the maximum likelihood estimate, i.e.
\[\theta^* = \text{argmax}_{\theta}p(\mathcal{D}\mid \theta)\]Adopting the Bayesian approach, we instead want to estimate the posterior distribution
\[\begin{align*} p(\theta\mid \mathcal{D}) =\frac{p(\mathcal{D}\mid \theta)p(\theta)}{p(\mathcal{D})} \end{align*}\]However, this calculation is usually intractable. In the neural network setting, people usually resort to variational inference, which approximates the intractable true posterior \(p(\theta\mid \mathcal{D})\) with a tractable variational posterior \(p(\theta\mid\phi)\) by minimizing the Kullback-Leibler (KL) divergence between the two, where \(\phi\) is learned from data. We will introduce variational inference in future lectures; for now, we give a simple hand-wavy example.
We define the variational posterior to be a fully factorized Gaussian:
\[p(\theta \mid \phi) = \prod_j \mathcal{N}(\theta_j \mid \mu_j, \sigma^2_j)\]where \(\mu_j\) and \(\sigma^2_j\) are learned such that the variational posterior is close to the true posterior. Then we use \(p(\theta \mid \phi)\) as the distribution over dynamics and take actions accordingly.
The second approach, which is conceptually simpler and usually works better than BNNs, is bootstrap ensembles. The idea is to train many independent neural dynamics models and average them. Mathematically, we learn independent neural network parameters \(\theta_1, \theta_2, \cdots, \theta_m\), and the ensembled posterior is
\[p(\theta\mid \mathcal{D}) = \frac{1}{m} \sum_{j=1}^{m}\delta(\theta - \theta_j)\]where \(\delta\) is the delta function, and the probability of state \(s_{t+1}\) under the dynamics ensemble is the average of the probabilities under each independent neural dynamics model:
\[\begin{align} \int p(s_{t+1}\mid s_t, a_t, \theta)p(\theta\mid \mathcal{D}) \text{d}\theta = \frac{1}{m} \sum_{j=1}^m p(s_{t+1}\mid s_t, a_t, \theta_j) \end{align}\]But how do we get the \(m\) independent neural dynamics models? We use the bootstrap. The idea is to resample the dataset \(\mathcal{D}\) with replacement to get \(m\) datasets, and to train one dynamics model on each. The bootstrap was developed by the statistician Bradley Efron starting in 1979. It has a solid statistical foundation and has been applied in many areas; I encourage interested readers to check out the book by Efron and Tibshirani.
In practice, people find that for neural dynamics models it is not necessary to resample the data. Instead, they train the neural nets on the same dataset but with different random seeds; the use of SGD makes the nets sufficiently independent.
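A minimal sketch of bootstrap ensembles (here the "models" are linear least-squares fits rather than neural nets, purely to keep the example small; the names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ensemble(S, A, S_next, m, fit):
    """Train m dynamics models, each on a bootstrap resample of the data.

    fit: a training routine, model = fit(S, A, S_next); here least squares,
    in practice a neural net training run.
    """
    models = []
    N = S.shape[0]
    for _ in range(m):
        idx = rng.integers(0, N, size=N)   # sample N indices WITH replacement
        models.append(fit(S[idx], A[idx], S_next[idx]))
    return models

# toy 1-d example: each "model" is a linear map fitted by least squares
def fit(S, A, S_next):
    Z = np.stack([S, A], axis=1)
    w, *_ = np.linalg.lstsq(Z, S_next, rcond=None)
    return lambda s, a: w[0] * s + w[1] * a

S = rng.normal(size=200); A = rng.normal(size=200)
S_next = 0.9 * S + 0.5 * A + 0.01 * rng.normal(size=200)
ensemble = bootstrap_ensemble(S, A, S_next, m=5, fit=fit)

preds = [f(1.0, 1.0) for f in ensemble]
# the ensemble mean tracks the true next state (0.9 + 0.5 = 1.4);
# the spread across members is the uncertainty estimate
assert abs(np.mean(preds) - 1.4) < 0.1
```

With neural nets, the resampling step is typically replaced by different random seeds, as noted above.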
Having uncertainty-aware dynamics, i.e. a distribution over dynamics, it is natural to derive an uncertainty-aware MPC algorithm. Recall that in the MPC algorithm, we plan using the objective
\[J(a_1, \cdots, a_T) = \sum_{t=1}^{T}r(s_t, a_t), \text{ where } s_t = f_{\theta}(s_{t-1}, a_{t-1})\]Now the objective changes to
\[\begin{align}\label{un_obj} &J(a_1, \cdots, a_T) = \frac1m \sum_{j=1}^{m}\sum_{t=1}^{T}r(s_{t,j}, a_t)\\ &\text{ where } s_{t,j} = f_{\theta_j}(s_{t-1,j}, a_{t-1}) \text{ or } s_{t,j} \sim p(s_t\mid s_{t-1,j}, a_{t-1}, \theta_j)\\ &\text{ and } \theta_j \sim p(\theta\mid \mathcal{D}) \end{align}\]With this, we can write out the uncertainty-aware MPC algorithm:
You might notice that this algorithm does not seem to use the objective, i.e. equation \(\ref{un_obj}\); but at step 4 the algorithm is in fact planning based on equation \(\ref{un_obj}\), and since the reward relies on the ensemble dynamics, we simply say "plan through the ensemble dynamics to choose actions".
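For concreteness, evaluating a candidate action sequence under this objective means averaging the return over ensemble members. A sketch with toy deterministic models and hypothetical names:

```python
import numpy as np

def ensemble_return(models, reward, s0, actions):
    """J(a_1..a_T): average the trajectory return over the dynamics ensemble."""
    total = 0.0
    for f in models:                  # one model-based rollout per ensemble member
        s, ret = s0, 0.0
        for a in actions:
            ret += reward(s, a)
            s = f(s, a)
        total += ret
    return total / len(models)

# two "models" that disagree about how strongly actions move the state;
# the reward prefers staying near 0
models = [lambda s, a: s + a, lambda s, a: s + 2 * a]
reward = lambda s, a: -s ** 2
# a cautious action scores better ON AVERAGE than one that is perfect
# under the first model but overshoots badly under the second
assert ensemble_return(models, reward, 1.0, [-0.7, 0.0]) > \
       ensemble_return(models, reward, 1.0, [-1.0, 0.0])
```

This is the sense in which averaging over models discourages actions that only pay off if one particular model happens to be right.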
Previously we've been assuming that the state is observable, since we've been using transitions \(\{ (s_i, a_i, s'_i) \}\) for supervised learning of the dynamics (or of a distribution over dynamics). In some cases, especially when the observation is an image, directly treating it as the state for supervised dynamics learning can be troublesome, for these reasons:
We will now introduce the state-space model for POMDPs, which treats states as latent variables and models observations using distributions conditioned on states.
Let’s recall how dynamics is learned when we assume states are observable. We parameterize the dynamics using a neural net with parameter \(\theta\):
\[p(s_{1:T}) = \prod_{t=1}^{T}p_{\theta}(s_{t+1}\mid s_t, a_t)\]Note that we slightly abuse the notation for clarity, for example \(p_{\theta}(s_{1}\mid s_0, a_0) = p_{\theta}(s_1)\).
And we solve for \(\theta\) using maximum likelihood on the collected transitions \(\{ (s^i_{t+1}, s^i_t, a^i_t) \}_{i,t=1}^{N,T}\):
\[\max_{\theta}\frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)\]Now consider the state unobservable. We have:
\[p(s_{1:T}, o_{1:T}) = \prod_{t=1}^{T}p_{\theta}(s_{t+1}\mid s_t, a_t)p_{\phi}(o_t\mid s_t)\]where \(p_{\theta}(s_{t+1}\mid s_t, a_t)\) is the transition model and \(p_{\phi}(o_t\mid s_t)\) is the observation model. Similarly, we solve for \(\theta\) and \(\phi\) using maximum likelihood:
\[\begin{align} &\log p(o_{1:T}) \nonumber \\ &=\log \mathbb{E}_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1}\mid o_{1:t}, a_{1:t})}\prod_{t=1}^{T} p_{\theta}(s_{t+1}\mid s_t, a_t) p_{\phi}(o_{t}\mid s_t) \nonumber \\ &\geq \mathbb{E}_{(s_t, s_{t+1}) \sim p(s_t, s_{t+1}\mid o_{t}, a_{t})} \log \prod_{t=1}^{T} p_{\theta}(s_{t+1}\mid s_t, a_t) p_{\phi}(o_{t}\mid s_t) \nonumber \\ &\approx \frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)+ \log p_{\phi}(o^i_{t}\mid s^i_t) \label{latent_obj} \end{align}\]We maximize equation \(\ref{latent_obj}\), which is a lower bound of the log likelihood. Note that it uses a single-sample estimate of the expectation (over \((s_t, s_{t+1})\)); more samples can be used.
One issue is that by Bayes’ rule,
\[\begin{align} &p(s_t, s_{t+1}\mid o_{t}, a_{t}) \\ &= p_{\theta}(s_{t+1}\mid s_t, a_t) p(s_t\mid o_t) \\ &= p_{\theta}(s_{t+1}\mid s_t, a_t) \frac{ p_{\phi}(o_t\mid s_t)p(s_t) }{p(o_t)} \end{align}\]and \(p(s_t\mid o_t)\) is intractable. Thus we can learn another neural net \(q_{\psi}(s_t\mid o_t)\). A full treatment involves variational inference, which we will cover in future lectures. In this lecture, we simplify and model the posterior over the state as a delta function, i.e. \(q_{\psi}(s_t\mid o_t) = \delta(s_t = g_{\psi}(o_t))\), which is just \(s_t = g_{\psi}(o_t)\).
Plugging this into the objective, equation \(\ref{latent_obj}\), we have
\[\begin{equation}\label{real_obj} \frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(g_{\psi}(o^i_{t+1})\mid g_{\psi}(o^i_t), a^i_t)+ \log p_{\phi}(o^i_{t}\mid g_{\psi}(o^i_t)) \end{equation}\]We maximize this to find \(\theta, \phi\), and \(\psi\). In case you are wondering: assuming \(s_t\) can be deterministically derived from \(o_t\) does not imply that \(p_{\phi}(o_{t}\mid s_t)\) is also a delta function, because \(g_{\psi}(\cdot)\) can be a many-to-one function (several observations can map to the same state).
Lastly, if we want to plan using iLQR, or just plan better, we usually also want to model the reward. It can be modeled deterministically, \(r_t = r_{\xi}(s_t, a_t)\), or stochastically, \(r_t \sim p_{\xi}(r_t\mid s_t, a_t)\). With the observed transitions and rewards \(\{ (o^i_t, a^i_t, r^i_t) \}_{i,t=1}^{N,T}\), similarly to how we derived \(\ref{real_obj}\), we maximize the objective
\[\frac1N \sum_{i=1}^{N} \sum_{t=1}^{T} \log p_{\theta}(s^i_{t+1}\mid s^i_t, a^i_t)+ \log p_{\phi}(o^i_{t}\mid s^i_t) + \log p_{\xi}(r^i_t\mid s^i_t, a^i_t)\]Finally, I want to point out that it is sometimes difficult to build a compact state space for the observations, and directly modeling observations and predicting future observations can actually work better. That is, instead of modeling \(o_t = g_{\psi}(s_t)\), we model \(p(o_t \mid o_{t-1}, a_t)\) and plan actions accordingly. We will not cover this branch here; interested readers are encouraged to check out Finn et al. 17’ and Ebert et al. 17’, which both directly model observations and plan actions using MPC.
where
\[\begin{equation} p(\tau) = p(s_1)\prod_{t=1}^{T}p(s_{t+1}\mid s_t, a_t)\pi(a_t\mid s_t) \end{equation}\]In most methods that we’ve introduced so far, such as policy gradient, actor-critic, Q-learning, etc. the transition dynamics \(p(s_{t+1}\mid s_t, a_t)\) is assumed to be unknown. But in many cases, the dynamics is actually known to us, such as the game of Go (we know what the board will look like after we make a move), Atari games, car navigation, anything in simulated environments (although we may not want to utilize the dynamics in this case) etc.
Knowing the dynamics provides addition information, which in principle should improve the actions we take. In this lecture, we study how to plan actions to maximize the expected reward when the dynamics is known. We will mostly study deterministic dynamics, i.e. \(s_{t+1} = f(s_t, a_t)\). Although we will also generalize some methods to stochastic dynamics, i.e. \(s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)\).
If we know the deterministic dynamics, then given the first state \(s_1\), we can work out all the remaining states from the action sequence (and therefore the rewards). Open-loop planning aims at directly outputting an optimal action sequence without waiting for the trajectory to unfold.
Below we introduce two methods that completely ignore feedback control and optimize the objective as a black box; that is to say, these methods do not even utilize the known dynamics. For simplicity, let’s write the objective, i.e. the expected return, as \(J(\mathbf{A})\), where \(\mathbf{A} := a_1, a_2, \cdots, a_T\). The goal is to find the \(\mathbf{A^*}\) that maximizes this objective.
The first method is called random shooting, which can be explained in one line: randomly sample \(\mathbf{A_1}, \mathbf{A_2}, \cdots, \mathbf{A_N}\) from some distribution (e.g. uniform) and then choose the one that gives the highest \(J(\mathbf{A_i})\) as \(\mathbf{A^*}\).
Random shooting seems like a bad idea, but it actually works well on some problems with low action dimensionality and short horizons. It is also very easy to implement and parallelize.
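As a concrete illustration, here is a minimal random-shooting sketch in NumPy; the toy objective \(J\) and all the numbers are made up for the example:

```python
import numpy as np

def random_shooting(J, action_dim, horizon, n_samples=1000, low=-1.0, high=1.0, seed=0):
    """Sample N candidate action sequences uniformly and return the best one."""
    rng = np.random.default_rng(seed)
    # Each candidate A_i is a (horizon, action_dim) array of actions.
    candidates = rng.uniform(low, high, size=(n_samples, horizon, action_dim))
    returns = np.array([J(A) for A in candidates])
    return candidates[np.argmax(returns)]

# Toy objective: the return is highest when every action is close to 0.5.
J = lambda A: -np.sum((A - 0.5) ** 2)
best = random_shooting(J, action_dim=2, horizon=5)
```

Evaluating all candidates is embarrassingly parallel, which is where the method’s practicality comes from.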
However, this is still an overly simple method that completely relies on luck. One method that dramatically improves on random shooting while maintaining its benefits is the cross-entropy method, or CEM. Below is the CEM algorithm:
Initialize the action sequence distribution \(p(\mathbf{A})\)
sample \(\mathbf{A_1}, \mathbf{A_2}, \cdots, \mathbf{A_N}\) from \(p(\mathbf{A})\)
evaluate \(J(\mathbf{A_1}), J(\mathbf{A_2}), \cdots, J(\mathbf{A_N})\)
pick the elites \(\mathbf{A_{i_1}}, \mathbf{A_{i_2}}, \cdots, \mathbf{A_{i_M}}\) with the highest value, where \(M < N\)
refit \(p(\mathbf{A})\) to the elites. Go to 2.
Setting \(M = 10\%N\) is usually a good choice. The key to CEM is that the action distribution constantly changes based on the action evaluations. This helps the algorithm find, and concentrate probability mass on, regions where actions are more likely to give high value.
Similar to random shooting, CEM is easy to implement and parallelize, but it also has harsh dimensionality limits (action space dimension times the horizon). The exact limit obviously depends on the problem, but generally these methods cannot go beyond about \(60\) dimensions, e.g. an action dimension of \(5\) with a time horizon of \(12\).
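The steps above can be sketched as follows, using a Gaussian as the action sequence distribution; the toy objective and all hyperparameters are made up for illustration:

```python
import numpy as np

def cem(J, action_dim, horizon, n_samples=100, n_elites=10, n_iters=20, seed=0):
    """Cross-entropy method: sample, keep the elites, refit a Gaussian, repeat."""
    rng = np.random.default_rng(seed)
    dim = horizon * action_dim                 # optimize over the flattened sequence
    mean, std = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(n_samples, dim))
        returns = np.array([J(A.reshape(horizon, action_dim)) for A in samples])
        elites = samples[np.argsort(returns)[-n_elites:]]  # top M = 10% of N
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean.reshape(horizon, action_dim)

# Toy objective: the return is highest when every action is close to 0.5.
J = lambda A: -np.sum((A - 0.5) ** 2)
A_star = cem(J, action_dim=2, horizon=5)
```

Because the refit shrinks the standard deviation around the elites, the distribution concentrates on high-value regions over iterations.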
In this section we introduce the famous Monte Carlo Tree Search algorithm, or MCTS, which has been used in AlphaGo. MCTS is used in cases where the action space is discrete.
We formalize the problem of planning as a tree search, where the nodes are states and taking different actions branches the tree out to different nodes. Note that the transitions can be stochastic and the state space can be continuous; in fact, we don’t worry too much about the actual state but only focus on the time step of a state, i.e. \(s_t\) can represent different states at time step \(t\).
Starting from the initial state \(s_1\), a naive idea is to just try different actions at every state and collect the reward. After the tree is fully unfolded, pick the path that gives the biggest reward.
However, this is prohibitively expensive, as the computational complexity is \(O(\lvert\mathcal{A}\rvert^{T})\). MCTS is a heuristic method that can approximate the state-action value without exactly expanding the whole tree. The algorithm is the following:
Choose a leaf node \(s_l\) by applying TreePolicy recursively from \(s_1\)
Run DefaultPolicy(\(s_{l}\)) and evaluate the value of \(s_l\)
Update all values in the tree between \(s_1\) and \(s_l\). While within the computational budget, go back to step 1.
When the algorithm is done, we take the best action starting from \(s_1\).
Now let’s first explain in detail what each step means, and then we will show an example of how MCTS works.
Step 1. The TreePolicy is basically a node selection strategy. We start from \(s_1\) and recursively apply it to descend through the tree until we find a node that satisfies the strategy, and select that node. While there are many strategies, we only introduce the most popular one, namely Upper Confidence Bounds for Trees, or the UCT policy. UCT(\(s_t\)) works as follows: if \(s_t\) is not fully expanded, i.e. there are possible actions that we haven’t taken, then take one of those actions (if there are multiple, just randomly choose one); otherwise, choose the child node \(s_{t+1}\) with the best score Score\((s_{t+1})\), where Score\((s_{t+1})\) is defined as
\[\begin{equation}\label{score} \text{Score}(s_{t+1}) = \frac{Q(s_{t+1})}{N(s_{t+1})} + 2C \sqrt{\frac{2\ln N(s_t)}{N(s_{t+1})}} \end{equation}\]Where \(Q(s_{t+1})\) is the value of the node \(s_{t+1}\). Note that this is not the value function we’ve defined previously in this course, but an accumulated value: every time we evaluate the node or one of its descendants, we add the value to it. For example, for node \(s_{t+1}\), if we evaluate it to be \(10\) and later on in the algorithm we evaluate two of its descendants to be \(5\) and \(11\), then \(Q(s_{t+1}) = 10 + 5 + 11 = 26\). \(N(s_{t+1})\) is the number of times the node has been visited; in this example, \(N(s_{t+1})\) is \(3\).
Equation \(\ref{score}\) is very intuitive. The first term measures the empirical value of the node; the second term measures how often this node has been visited: if \(N(s_t)\) is big while \(N(s_{t+1})\) is small, that means a lot of visits to \(s_t\) have not passed down to \(s_{t+1}\) but to other descendants of \(s_t\), and this indicates that we might want to visit \(s_{t+1}\) more often.
Step 2. When we decide to take some action and go to node \(s_l\), we run DefaultPolicy from this state (until it terminates) and collect the reward (this is what we called evaluating the value of the node).
Step 3. We add the reward to the value \(Q\) of every node along the path we followed to reach node \(s_l\). We also update the \(N\) of each node along the path.
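The three steps above can be sketched on a toy problem; everything here (the walk-on-the-integers dynamics, the reward, the horizon, and the exploration constant) is made up for illustration, and the exploration bonus is a common UCT variant rather than exactly equation \(\ref{score}\):

```python
import math
import random

class Node:
    def __init__(self, state, depth, parent=None):
        self.state, self.depth, self.parent = state, depth, parent
        self.children = {}         # action -> child Node
        self.Q, self.N = 0.0, 0    # accumulated value and visit count

ACTIONS = (0, 1)
HORIZON = 4

def step(state, action):
    # Hypothetical deterministic dynamics: walk left (0) or right (1) on the integers.
    return state + (1 if action == 1 else -1)

def uct_score(parent, child, c=1.0):
    # Mean value plus an exploration bonus, a common UCT variant.
    return child.Q / child.N + c * math.sqrt(2 * math.log(parent.N) / child.N)

def default_policy(state, depth):
    # DefaultPolicy: random rollout to the horizon; the reward is the final position.
    while depth < HORIZON:
        state, depth = step(state, random.choice(ACTIONS)), depth + 1
    return float(state)

def mcts(root_state, budget=300, seed=0):
    random.seed(seed)
    root = Node(root_state, 0)
    for _ in range(budget):
        node = root
        # Step 1: TreePolicy descends via UCT until a not-fully-expanded node.
        while node.depth < HORIZON and len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: uct_score(node, ch))
        if node.depth < HORIZON:   # expand one untried action
            a = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[a] = Node(step(node.state, a), node.depth + 1, node)
            node = node.children[a]
        # Step 2: evaluate the new leaf with the default policy.
        value = default_policy(node.state, node.depth)
        # Step 3: back up the value and visit counts to the root.
        while node is not None:
            node.Q, node.N = node.Q + value, node.N + 1
            node = node.parent
    # Finally, take the most-visited action at the root.
    return max(root.children, key=lambda a: root.children[a].N)

best_action = mcts(root_state=0)
```

Since moving right always yields a higher final position, the search should concentrate its visits on action 1.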
Here we refer to the illustration by Prof. Sergey Levine, where the illustration starts at 16:50.
You might notice that the methods introduced in the previous two sections actually do not require a known dynamics. In this section, we will finally introduce methods that do require and utilize a known dynamics. Since the methods in this section are mostly studied in the optimal control community, we will follow their notation and denote the action as \(u_t\), the state as \(x_t\), the dynamics as \(x_{t+1} = f(x_t, u_t)\) or \(x_{t+1}\sim p(x_{t+1}\mid x_t, u_t)\), and the cost as \(c(x_t, u_t)\). This is the first time the term “cost” appears in this series, but it’s really just the opposite of reward: where reward measures how good a state-action pair is, cost measures how bad a state-action pair is. Note that, differently from the classic RL setting, in addition to the dynamics we also assume the cost function is known.
Similar to policy gradient methods, we aim at directly minimizing the sum of cost:
\[\begin{align} &\min_{u_1,\cdots, u_T}\sum_{t=1}^{T}c(x_t, u_t) \\ &\text{s.t.}\; x_{t+1} = f(x_t, u_t) \end{align}\]We can actually incorporate the constraint into the objective and make it an unconstrained optimization problem:
\[\begin{align}\label{obj} &\min_{u_1,\cdots, u_T}\, c(x_1, u_1) + c(f(x_1,u_1), u_2) + \cdots + c(f(f(\cdots)\cdots),u_T) \end{align}\]The Linear Quadratic Regulator, or LQR, further simplifies this problem by assuming linear dynamics and quadratic cost:
\[\begin{equation} \begin{aligned} & f(x_t, u_t) = F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t \\ & c(x_t, u_t) = \frac12 \begin{bmatrix} x_t \\ u_t \end{bmatrix}^T C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^T c_t \end{aligned} \end{equation}\]Note that \(F_t, f_t, C_t, c_t\) are all known quantities.
To solve LQR, the simplest method is to just take the derivatives of the objective w.r.t. the actions and set them to \(0\). But this is numerically very unstable, because actions at different time steps have very different sensitivities with respect to the cost: for example, the first action appears in every term of the objective and has a huge effect on the total cost, while the last action has a very small effect.
We introduce a stable iterative method to solve LQR. We start from the last action \(u_T\), since it doesn’t affect previous states and also has no effect on future states (there is no future state!). Treating all terms that are not affected by \(u_T\) as constant, we can write the cost as
\[\begin{equation} \begin{aligned} Q(x_T, u_T) = \text{const.} + \frac12 \begin{bmatrix} x_T \\ u_T \end{bmatrix}^T C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^T c_T \end{aligned} \end{equation}\]take the derivative
\[\begin{align*} &\nabla_{u_T}Q(x_T, u_T) = C_{u_T, x_T}x_T + C_{u_T, u_T}u_T + c_{u_T} = 0 \\ &\Rightarrow u_T = -C_{u_T, u_T}^{-1}(C_{u_T, x_T}x_T + c_{u_T}) \end{align*}\]where
\[\begin{equation} \begin{aligned} C_T = \begin{bmatrix} C_{x_T, x_T} & C_{x_T, u_T}\\ C_{u_T, x_T} & C_{u_T, u_T} \end{bmatrix} \quad c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix} \end{aligned} \end{equation}\]To better see the pattern (useful for later derivation), we denote
\[\begin{align*} &K_T = -C_{u_T, u_T}^{-1}C_{u_T, x_T} \\ &k_T = - C_{u_T, u_T}^{-1}c_{u_T} \end{align*}\]and write \(u_T\) as
\[\begin{equation}\label{xt} u_T = K_Tx_T + k_T \end{equation}\]This equation shows that the optimal \(u_T\) is a linear function of \(x_T\).
Our goal is to represent each \(u_t\) using \(x_t\); then, once we have the first state \(x_1\), we can get \(u_1\), then via the dynamics \(x_2\), then \(u_2\), and so on. This way, we can get all the actions (and states).
Now let’s try to represent the optimal \(u_{T-1}\) using \(x_{T-1}\). Note that \(u_{T-1}\) can only affect \(x_T\) and \(u_T\), and thus we can treat all terms that are not affected by \(u_{T-1}\) as constant and write the objective as
\[\begin{equation} \begin{aligned} &Q(x_{T-1}, u_{T-1}) \\ &= \text{const.}+ \frac12 \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T c_{T-1} \\ +& \frac12 \begin{bmatrix} x_T \\ K_Tx_T + k_T \end{bmatrix}^T C_T \begin{bmatrix} x_T \\ K_Tx_T + k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_Tx_T + k_T \end{bmatrix}^T c_T \\ &=\text{const.}+\frac12 \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T c_{T-1} + \frac12 x_T^TV_Tx_T + x^T_T v_T \end{aligned} \end{equation}\]Where \(V_T, v_T\) are terms that depend on \(C_T, c_T\) only. We can see that this is again a sum of linear and quadratic terms in \(x_{T-1}, u_{T-1}\).
We can take the derivative of it w.r.t. \(u_{T-1}\) and set it to \(0\). We will get:
\[\begin{equation}\label{xt-1} u_{T-1} = K_{T-1}x_{T-1} + k_{T-1} \end{equation}\]Where \(K_{T-1}\) and \(k_{T-1}\) are functions of \(F_{T-1}, f_{T-1}, C_{T-1}, c_{T-1}, V_T, v_T\). The expression is a bit hairy, but the important thing is that \(K_{T-1}\) and \(k_{T-1}\) are known quantities.
Therefore we have shown that we can always represent \(u_t\) as a linear function of \(x_t\).
The full algorithm first starts from time step \(T\) and goes backward to represent each \(u_t\) using \(x_t\), and then runs forward from time step \(1\) to get the state and action at every time step.
Concretely, the backward iteration (with \(V_{T+1} = 0, v_{T+1} = 0\)) is
\[\begin{align*} &\text{for } t = T \text{ down to } 1: \\ &\quad Q_t = C_t + F_t^T V_{t+1} F_t,\quad q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1} \\ &\quad K_t = -Q_{u_t, u_t}^{-1} Q_{u_t, x_t},\quad k_t = -Q_{u_t, u_t}^{-1} q_{u_t} \\ &\quad V_t = Q_{x_t, x_t} + Q_{x_t, u_t}K_t + K_t^T Q_{u_t, x_t} + K_t^T Q_{u_t, u_t}K_t \\ &\quad v_t = q_{x_t} + Q_{x_t, u_t}k_t + K_t^T q_{u_t} + K_t^T Q_{u_t, u_t}k_t \end{align*}\]
And the forward iteration is
\[\begin{align*} &\text{for } t = 1 \text{ to } T: \\ &\quad u_t = K_t x_t + k_t \\ &\quad x_{t+1} = F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t \end{align*}\]
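A minimal NumPy sketch of the backward pass (computing the \(K_t, k_t\) gains) and the forward rollout; the 1-d problem data at the bottom (\(x_{t+1} = x_t + u_t\), cost \(\frac12(x_t^2 + u_t^2)\)) is made up for illustration:

```python
import numpy as np

def lqr(F, f, C, c, T, x1, nx):
    """Backward pass for K_t, k_t, then a forward rollout of the linear dynamics."""
    V, v = np.zeros((nx, nx)), np.zeros(nx)      # value terms after the last step
    Ks, ks = [None] * T, [None] * T
    for t in reversed(range(T)):
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Qxu, Qux, Quu = Q[:nx, nx:], Q[nx:, :nx], Q[nx:, nx:]
        Qxx, qx, qu = Q[:nx, :nx], q[:nx], q[nx:]
        K, k = -np.linalg.solve(Quu, Qux), -np.linalg.solve(Quu, qu)
        V = Qxx + Qxu @ K + K.T @ Qux + K.T @ Quu @ K
        v = qx + Qxu @ k + K.T @ qu + K.T @ Quu @ k
        Ks[t], ks[t] = K, k
    xs, us, x = [], [], x1
    for t in range(T):
        u = Ks[t] @ x + ks[t]                    # u_t = K_t x_t + k_t
        xs.append(x)
        us.append(u)
        x = F[t] @ np.concatenate([x, u]) + f[t]
    return xs, us

# Made-up 1-d problem: x_{t+1} = x_t + u_t, cost (1/2)(x_t^2 + u_t^2).
T, nx = 5, 1
F = [np.array([[1.0, 1.0]])] * T
f = [np.zeros(1)] * T
C = [np.eye(2)] * T
c = [np.zeros(2)] * T
xs, us = lqr(F, f, C, c, T, np.array([10.0]), nx)
```

On this problem the controller should drive the state from \(10\) toward \(0\), and the last action is exactly zero since it no longer affects any cost through the dynamics.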
When the dynamics is stochastic, we want to minimize the expected cost
\[\begin{equation}\label{sto} \min_{u_1, \cdots, u_T}\mathbb{E}\sum_{t=1}^{T}c(x_t, u_t) \end{equation}\]Where the expectation is taken w.r.t dynamics \(p(x_{t+1}\mid x_t, u_t)\).
Here we briefly introduce applying LQR to a special case of stochastic dynamics, namely Gaussian linear dynamics
\[\begin{align*} p(x_{t+1}\mid x_t, u_t) = \mathcal{N}( F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t, \Sigma_t ) \end{align*}\]It turns out that if the cost is still quadratic in state and action, the objective in equation \(\ref{sto}\) can be solved analytically and we can apply the same iterative procedure and actually get the same solution \(u_t = K_t x_t + k_t\). Details are left to the readers.
Now we get rid of the assumption that the dynamics is linear and cost is quadratic.
We can use a first-order Taylor expansion to approximate the dynamics as
\[\begin{align} f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t}f(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix} \end{align}\]Use second order Taylor expansion to approximate cost as
\[\begin{align} c(x_t, u_t) \approx c(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t}c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix} \\ + \frac12 \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^T \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix} \end{align}\]Denote
\[\begin{align} \delta x_t = x_t - \hat{x}_t \\ \delta u_t = u_t - \hat{u}_t \\ f_t = f(\hat{x}_t, \hat{u}_t) \\ F_t = \nabla_{x_t, u_t}f(\hat{x}_t, \hat{u}_t) \\ c_t = \nabla_{x_t, u_t}c(\hat{x}_t, \hat{u}_t) \\ C_t = \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \end{align}\]No need to worry about the constant term \(c(\hat{x}_t, \hat{u}_t)\) in cost approximation, as it will disappear when we take the derivative, i.e. it will not affect the solution.
We can first randomly pick a sequence of actions as the \(\hat{u}_t\)’s and then get the states \(\hat{x}_t\) from the true dynamics. Then, run the backward and forward LQR algorithm on
\[\begin{equation} \begin{aligned} & f(\delta x_t, \delta u_t) = F_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + f_t \\ & c(\delta x_t, \delta u_t) = \frac12 \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^T C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^T c_t \end{aligned} \end{equation}\]which gives \(\delta x_t, \delta u_t\); adding these to \(\hat{x}_t, \hat{u}_t\) gives the new \(x_t\)’s and \(u_t\)’s, which we then denote as \(\hat{x}_t, \hat{u}_t\), and repeat the process. Putting it all in one place, the algorithm is the following:
Note that in the forward pass of LQR, we use the true dynamics, rather than the linear approximation, to get the states. When the \(\hat{x}_t, \hat{u}_t\)’s are very close to the \(x_t, u_t\)’s newly obtained from the current LQR forward iteration, we say the algorithm has converged.
This algorithm is very similar to Newton’s method; in fact, the only difference is that Newton’s method would approximate the dynamics using a second-order Taylor expansion.
Since we are using approximations, too big a step in the update may lead to worse results, due to the approximations being inaccurate far from the expansion point. To remedy this, when running the forward pass to get \(u_t\), we introduce a parameter \(\alpha\) and change the update rule to
\[\begin{equation} u_t = K_t(x_t - \hat{x}_t) + \alpha k_t + \hat{u}_t \end{equation}\]\(\alpha\) controls the step size of the update (how much \(u_t\) will deviate from \(\hat{u}_t\)). We can perform a search over \(\alpha\) until we see an improvement in the cost.
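A minimal sketch of this \(\alpha\)-search around one forward pass. Everything concrete here is hypothetical: the scalar problem, the nominal trajectory, and in particular the gains `Ks`, `ks`, which are hand-picked for illustration rather than produced by a real backward pass:

```python
def ilqr_forward(f, cost, Ks, ks, x_hat, u_hat, alpha):
    """Forward pass with step size alpha: u_t = K_t*(x_t - x_hat_t) + alpha*k_t + u_hat_t,
    rolling out the TRUE dynamics f and accumulating the cost."""
    T = len(u_hat)
    x, xs, us, total = x_hat[0], [], [], 0.0
    for t in range(T):
        u = Ks[t] * (x - x_hat[t]) + alpha * ks[t] + u_hat[t]
        total += cost(x, u)
        xs.append(x)
        us.append(u)
        x = f(x, u)                      # true (possibly nonlinear) dynamics
    return xs, us, total

def line_search(f, cost, Ks, ks, x_hat, u_hat, old_cost):
    """Backtrack on alpha until the rollout actually improves the cost."""
    for alpha in (1.0, 0.5, 0.25, 0.125, 0.0625):
        xs, us, new_cost = ilqr_forward(f, cost, Ks, ks, x_hat, u_hat, alpha)
        if new_cost < old_cost:
            return xs, us, new_cost, alpha
    return x_hat, u_hat, old_cost, 0.0   # no improvement: keep the old trajectory

# Hypothetical scalar problem: dynamics x' = x + u, cost x^2 + u^2.
f = lambda x, u: x + u
cost = lambda x, u: x * x + u * u
T = 5
u_hat = [0.0] * T                        # nominal actions: do nothing
x_hat = [10.0]
for t in range(T - 1):
    x_hat.append(f(x_hat[t], u_hat[t]))
Ks = [-0.6] * T                          # illustrative gains, not from a real backward pass
ks = [-6.0] * T
old_cost = sum(cost(x, u) for x, u in zip(x_hat, u_hat))
xs, us, new_cost, alpha = line_search(f, cost, Ks, ks, x_hat, u_hat, old_cost)
```

Starting from the full step and halving keeps the largest step that still improves the cost, which guards against the linear-quadratic approximation being trusted too far.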
In this section, we derive stable policy gradient methods by first framing them as policy iteration.
Let’s write down the difference between the expected return under the previous policy \(q := \pi_{\theta}\) and under the new (updated) policy \(\pi_{\theta'}\):
\[\begin{align} &J(\theta') - J(\theta)\\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{s_0 \sim p(s_0)}\left[ V^{q}(s_0) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ V^{q}(s_0) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ V^{q}(s_0) + \sum_{t=1}^{\infty}\gamma^{t}V^{q}(s_{t}) - \sum_{t=1}^{\infty}\gamma^{t}V^{q}(s_{t}) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^tr(s_t, a_t) \right] + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t\gamma^{t}(\gamma V^{q}(s_{t+1}) - V^{q}(s_{t})) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t(r(s_t, a_t) + \gamma V^{q}(s_{t+1}) - V^{q}(s_{t})) \right] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \end{align}\]Here we have proved an interesting equality:
\[\begin{equation}\label{diff} J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \end{equation}\]The difference in expected return equals the expected advantage of the previous policy \(q\) under the trajectory distribution of the new policy \(\pi_{\theta'}\).
Note that we haven’t done any policy-gradient-specific operation, so this equality is universal. We can use it to understand why policy iteration improves the expected return at every iteration, i.e. \(J(\theta') - J(\theta) \geq 0\): in policy iteration, the policy is deterministic and updated as \(\pi'(s) = \text{argmax}_a A^{\pi}(s_t, a_t)\). Therefore, when \(s_t, a_t\) come from the new policy \(\pi'\), we always have \(A^{\pi_{\theta}}(s_t, a_t) \geq 0\), and thus \(J(\theta') - J(\theta) \geq 0\).
Now let’s consider how to get this monotonic improvement in expected return in policy gradient methods. Well, it cannot be guaranteed theoretically, because we need to introduce some approximation in order to derive a policy gradient algorithm from equation \(\ref{diff}\). Nevertheless, the resulting method, TRPO, is the first stable RL algorithm in the sense that during training the return improves gradually (whereas another popular method at the time, DQN, is very unstable).
As a policy gradient method, TRPO aims at directly maximizing equation \(\ref{diff}\), but this cannot be done directly, because the trajectory distribution is under the new policy \(\pi_{\theta'}\), while the sample trajectories that we have can only come from the previous policy \(q\).
This might remind you of the importance sampling that we used for deriving the off-policy policy gradient. Yes, we will rewrite equation \(\ref{diff}\) using importance sampling:
\[\begin{align} &J(\theta') - J(\theta) \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \sum_t \gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \\ &= \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim \pi_{\theta'}} \gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\\ &= \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \label{diff_importance} \end{align}\]However, even though we don’t need to sample from \(p_{\theta'}(\tau)\) now, sampling from \(p_{\theta'}(s_t)\) is still impossible. A natural question is: can we just use \(p_{\theta}(s_t)\)? I.e., can we approximate the equation above by
\[\begin{align} &\approx \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\ &= \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \sum_t \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t) \right] \label{final} \end{align}\]Equation \(\ref{final}\) leads to almost the same gradient as the off-policy policy gradient, but with the reward \(r(s_t, a_t)\) replaced by the advantage \(A^{\pi_{\theta}}(s_t, a_t)\). And you might remember that back then we also used \(p_{\theta}(s_t)\) to approximate \(p_{\theta'}(s_t)\) and briefly mentioned that this approximation error “is bounded when the gap between \(q\) and \(\pi_{\theta'}\) is not too big”.
Now let’s try to quantitatively characterize the gap between \(q\) and \(\pi_{\theta'}\). The first quantitative gap was actually introduced in lecture 2, when we introduced the error bound on DAgger for imitation learning: we define \(\pi_{\theta'}\) as close to \(\pi_{\theta}\) if
\[\begin{equation}\label{cond1}\left| \pi_{\theta'}(a_t\mid s_t) - \pi_{\theta}(a_t\mid s_t)\right|< \epsilon, \forall s_t\end{equation}\]This will give
\[\begin{align*} &\left| p_{\theta'}(s_t) - p_{\theta}(s_t) \right|\\ &= \left| (1-\epsilon)^tp_{\theta}(s_t) + (1-(1-\epsilon)^t)p_{\text{mistake}}(s_t) - p_{\theta}(s_t) \right|\\ &= (1-(1-\epsilon)^t)\left| p_{\text{mistake}}(s_t) - p_{\theta}(s_t) \right|\\ &\leq 2(1-(1-\epsilon)^t)\\ &\leq 2\epsilon t \end{align*}\]This is very similar to the derivation we have for DAgger, and if there is anything that is unclear to you, please see lecture 2 section 3.2.
Now let’s reveal what \(\lvert p_{\theta'}(s_t) - p_{\theta}(s_t) \rvert \leq 2\epsilon t\) can bring us:
Since
\[\begin{align*} &\mathbb{E}_{p_{\theta'}(s_t)}\left[ f(s_t) \right]\\ &= \sum_{s_t}p_{\theta'}(s_t)f(s_t) \\ &\geq \sum_{s_t}p_{\theta}(s_t)f(s_t) - \sum_{s_t}\left|p_{\theta'}(s_t) - p_{\theta}(s_t)\right|\max_{s_t}f(s_t)\\ &\geq \sum_{s_t}p_{\theta}(s_t)f(s_t) - 2\epsilon t \max_{s_t}f(s_t) \end{align*}\]Therefore, we have
\[\begin{align} &\sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\ &\geq \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] - \sum_t 2\epsilon t C \\ &\geq \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] - \frac{4\epsilon\gamma}{(1-\gamma)}D_{\text{KL}}^{\text{max}}(\theta,\theta') \\ \end{align}\]Where \(C \propto O(Tr_{\text{max}})\) in finite horizon case or \(C \propto O(\frac{r_{\text{max}}}{1-\gamma})\) in infinite horizon case. This tells us two things: first, the approximate objective
\[\sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]is a lower bound of the original objective
\[\sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]and this is good, as maximizing the approximate objective maximizes a lower bound on the thing that we initially wanted to maximize. Second, the error of the approximation is bounded by \(\sum_t 2\epsilon t C\). While this error might seem big, because \(C\) is linear in time and the maximal reward, we can keep it very small by keeping the gap between the new and old policies very small.
But how do we impose this constraint (equation \(\ref{cond1}\)) in practice?
Well, it’s not a very convenient constraint to use in practice. Luckily, we have
\[\begin{equation}\label{cond2} \left| \pi_{\theta'}(a_t\mid s_t) - q(a_t\mid s_t)\right| < \sqrt{\frac12 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'})}, \forall s_t \end{equation}\]and the KL divergence has nice properties that make it much easier to approximate!
Now we have the Trust Region Policy Optimization setup:
\[\begin{align} &\theta' \leftarrow \text{argmax}_{\theta'}\, \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\\ & \text{subject to } D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) < \epsilon \end{align}\]For a small enough \(\epsilon\), this is guaranteed to improve \(J(\theta') - J(\theta)\).
How do we solve this constrained optimization problem?
In this section we introduce two ways for solving the TRPO — dual gradient ascent and natural policy gradient.
Dual gradient ascent augments the objective with a Lagrange multiplier to incorporate the constraint:
\[\begin{align} \mathcal{L}(\theta', \lambda) &= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right] \\ &- \lambda (D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) - \epsilon) \end{align}\]This can be maximized by running the following two steps iteratively: (1) \(\theta' \leftarrow \text{argmax}_{\theta'}\, \mathcal{L}(\theta', \lambda)\); (2) \(\lambda \leftarrow \lambda + \alpha\left(D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) - \epsilon\right)\), which increases \(\lambda\) when the constraint is violated.
Here the first step can be incomplete, i.e. we just need to run a few gradient updates and then go to step 2.
Natural policy gradient was introduced much earlier than TRPO, but it turns out to be a special case of TRPO.
To ease the notation, let’s denote the objective as
\[\bar{A}(\theta') := \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\gamma^t A^{\pi_{\theta}}(s_t, a_t)\right]\]The idea of natural policy gradient is to use linear approximation to the objective \(\bar{A}(\theta')\) and quadratic approximation to the constraint. This will lead to a very simple optimization problem that can be solved analytically by hand.
Using a first-order Taylor expansion on \(\bar{A}(\theta')\), we have
\[\begin{align*} &\bar{A}(\theta') \\ &\approx \bar{A}(\theta) + \nabla_{\theta'}\bar{A}(\theta)^T(\theta' - \theta)\\ &\propto \nabla_{\theta'}\bar{A}(\theta)^T(\theta' - \theta) \end{align*}\]Where we drop the terms that are constant in \(\theta'\).
As a side note, we have
\[\begin{align} &\nabla_{\theta'}\bar{A}(\theta) \\ &= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\gamma^t \nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t) A^{\pi_{\theta}}(s_t, a_t)\right] \\ &= \sum_t\mathbb{E}_{s_t\sim p_{\theta}(s_t)}\left[ \mathbb{E}_{a_t \sim q} \gamma^t \nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t) A^{\pi_{\theta}}(s_t, a_t)\right] \\ \end{align}\]which is actually the actor-critic policy gradient.
Then we expand the constraint to second order:
\[\begin{align*} D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) \approx \frac12 (\theta' - \theta)^T\nabla^2 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'})(\theta' - \theta) \end{align*}\]Where the constant and first-order terms can both be shown to be zero. We can approximate the constraint using samples:
\[\{(s_t, a_t, r_t)\}_{t=0}^{T}\] \[\begin{equation}\label{second} \frac12 (\theta' - \theta)^T \left[\frac1T \sum_{t=1}^{T} \frac{\partial^2}{\partial \theta_i \partial \theta_j} D_\text{KL}(\pi_{\theta}(\cdot\mid s_t) \lVert \pi_{\theta'}(\cdot\mid s_t))\right](\theta' - \theta) < \epsilon \end{equation}\]Where the KL term can usually be calculated analytically.
Also, we have
\[\begin{equation*} \nabla^2 D_\text{KL}(\pi_{\theta} \lVert \pi_{\theta'}) = \mathbb{E}_{s_t\sim p_{\theta}(s_t), a_t\sim \pi_{\theta}}\left[ \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t) \nabla_{\theta}\log\pi_{\theta}(a_t\mid s_t)^T \right] \end{equation*}\]where the right-hand side is the Fisher information matrix of \(\pi_{\theta}(a_t\mid s_t)\).
With this, we can also approximate the constraint by
\[\begin{equation}\label{fisher} \frac12(\theta' - \theta)^T \left[\frac1T \sum_{t=1}^{T} \frac{\partial}{\partial \theta_i}\log \pi_{\theta}(a_t\mid s_t) \frac{\partial}{\partial \theta_j}\log \pi_{\theta}(a_t\mid s_t)\right](\theta' - \theta)< \epsilon \end{equation}\]Which approximation should we use? Equation \(\ref{second}\) uses the fact that the KL divergence between policies can usually be calculated analytically, and therefore its MC estimator is more stable, but it requires taking second-order derivatives, which is not very compatible with automatic differentiation packages. Equation \(\ref{fisher}\) doesn’t require second-order derivatives, but it requires storing all the policy gradients along the trajectories; also, since it uses a single-sample estimate for each state, this approximation has larger variance.
Nevertheless, since the course uses the Fisher information matrix, we will follow it and express the constraint as
\[\begin{align*} \frac12 (\theta' - \theta)^T\mathbf{F}(\theta' - \theta)< \epsilon \end{align*}\]With the objective:
\[\max_{\theta'} \nabla_{\theta}\bar{A}(\theta)^T(\theta' - \theta)\]We can easily solve this constrained optimization by hand and arrive at
\[\theta' = \theta + \alpha \mathbf{F}^{-1}\nabla_{\theta}\bar{A}(\theta)\]Where
\[\alpha = \sqrt{\frac{2\epsilon}{\nabla_{\theta}\bar{A}(\theta)^T\mathbf{F}^{-1}\nabla_{\theta}\bar{A}(\theta)}}\]PPO was proposed to deal with the issues of TRPO while maintaining its advantages. The component that makes TRPO stable is the trust region (i.e. the constraint), but the constrained optimization problem it leads to is difficult to solve.
Essentially, PPO differs from TRPO in the way it formalizes the trust region in the optimization. Let
\[r_t(\theta') = \frac{\pi_{\theta'}(a_t\mid s_t)}{q(a_t\mid s_t)}\]To make sure the new and old policies are close, TRPO formalizes this as a constraint on the KL divergence; PPO directly incorporates it into the objective:
\[\begin{equation}\label{ppo_obj} \mathcal{L}^{\text{CLIP}} = \sum_t \mathbb{E}_{s_t,a_t \sim p_{\theta}(s_t, a_t)}\left[ \gamma^t \text{min}\left(r_t(\theta')A^{\theta}(s_t, a_t), \text{CLIP}(r_t(\theta'), 1-\epsilon, 1+\epsilon)A^{\theta}(s_t, a_t) \right) \right] \end{equation}\]The first term in the min is the original TRPO objective (without incorporating the constraint). The clipping removes the incentive for moving \(r_t(\theta')\) outside of the interval \([1 − \epsilon, 1 + \epsilon]\). (the paper shows empirically that setting \(\epsilon=0.2\) gives best results). Since we take the “minimum of the clipped and unclipped objective, the final objective is a lower bound on the unclipped objective.” “With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.” (quoted sentences are directly from the PPO paper by Schulman et al. 17’).
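A minimal sketch of the per-sample clipped term inside equation \(\ref{ppo_obj}\); the ratios and advantages below are made-up numbers:

```python
import numpy as np

def ppo_clip_term(ratio, advantage, eps=0.2):
    """The per-sample clipped surrogate term (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the min makes this a lower bound on the unclipped term.
    return np.minimum(unclipped, clipped)

# Positive advantage: pushing the ratio above 1 + eps gains nothing extra.
gain_far = ppo_clip_term(1.5, 2.0)
gain_edge = ppo_clip_term(1.2, 2.0)
# Negative advantage: a ratio below 1 - eps still makes the objective
# worse, so a change that hurts is NOT ignored.
loss_far = ppo_clip_term(0.5, -1.0)
```

This is exactly the asymmetry quoted from the paper: improvements from moving the ratio outside the interval are ignored, while deteriorations are kept.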
Link to the article by OpenAI
A little bit of terminology: Q-learning and Q-iteration mean the same thing; the crucial part is whether there is a “fitted” in front of them. When there is, it means the Q-function is approximated using some parametric function (e.g. a neural network).
To see the issues of online Q-iteration, let’s write out the algorithm:
1. take some action \(a\) and observe the transition \((s, a, s', r)\)
2. compute the target \(y = r + \gamma\max_{a'}Q_{\phi}(s', a')\)
3. update \(\phi \leftarrow \phi - \alpha \frac{dQ_{\phi}(s,a)}{d\phi}(Q_{\phi}(s,a) - y)\), and go back to 1.
The first issue with this algorithm is that transitions that are close to each other are highly correlated. This leads the Q-function to locally overfit to windows of transitions and fail to see the broader context needed to accurately fit the whole function.
The second issue is that the target of the Q-function changes every gradient step, while the gradient doesn’t account for that change. To explain: for the same transition \((s, a, r, s')\), when the current Q-function is \(Q_{\phi_1}\), the target is \(r + \gamma\max_{a'}Q_{\phi_1}(s', a')\); however, after one step of gradient update, when the Q-function is \(Q_{\phi_2}\), the target also changes to \(r + \gamma\max_{a'}Q_{\phi_2}(s', a')\). It is as if the Q-function is chasing its own tail.
We solve issue one in this section. Note that as pointed out in the previous lecture, different from policy gradient methods which view data as trajectories, value function methods (including Q-iteration) view data as transitions which are snippets of trajectories. This means that the completeness of data as whole trajectories doesn’t matter in terms of learning a good Q-function.
Following this idea, we introduce replay buffers, a concept that was introduced to RL in the nineties.
A replay buffer \(\mathcal{B}\) stores many transition tuples \((s,a,s',r)\), which are collected every time we run a policy during training (so the transitions don’t have to come from the same/latest policy). In Q-iteration, if the transitions are random samples from \(\mathcal{B}\), then we don’t have to worry about them being correlated. This gives the algorithm:
1. collect transitions \((s, a, s', r)\) using some policy and add them to \(\mathcal{B}\)
2. sample a mini-batch \(\{(s_i, a_i, s'_i, r_i)\}\) uniformly from \(\mathcal{B}\)
3. update \(\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_{\phi}(s_i,a_i)}{d\phi}(Q_{\phi}(s_i,a_i) - [r_i + \gamma\max_{a'}Q_{\phi}(s'_i, a')])\), repeating steps 2–3 \(K\) times before going back to 1.
Note that the data in the replay buffer still come from policies induced from the Q-iteration policy (the original policy, epsilon greedy, Boltzmann exploration, etc.). It is very common to just set \(K=1\), which makes the algorithm even more similar to the original online Q-iteration algorithm.
We can represent the algorithm using the following graph to make it more intuitive
Since we are constantly adding new and possibly more relevant transitions to the buffer, we evict old transitions to keep the total number of transitions in the buffer fixed.
In the rest of this lecture, we will always use replay buffers in any algorithms that we introduce.
We deal with the second issue in this section. Instead of always calculating the target using the latest Q-function (which results in the Q-function chasing its own tail), we use a target network (which also outputs Q-values) that is not too far from the latest Q-function, but is fixed for a considerable number of gradient steps.
Let’s see the Q-learning algorithm with both replay buffer and target network:
Note that the loop containing steps 2, 3, and 4 is just plain regression, as the target network \(Q_{\phi'}\) is fixed within the loop. In practice, we usually set \(K\) to be between \(1\) and \(4\), and set \(N\) to be something like \(10000\).
As a special case of the above algorithm, setting \(K = 1\) gives us the famous classic DQN algorithm (Mnih et al. 13’). We can switch steps 1 and 2, and the resulting algorithm also works.
You might feel a little uncomfortable with this algorithm, because right after we assign the target network parameters \(\phi'\) to be the current Q-function parameters \(\phi\), the lag between \(Q_{\phi'}\) and \(Q_{\phi}\) is small during the first few gradient steps, and as we update \(Q_{\phi}\) in step 4, the lag becomes larger. We might not want the lag to be constantly changing during training. To remedy this, we can use an exponentially decaying moving average to update the target network \(\phi'\) after every gradient update of \(\phi\) (or make \(N\) much smaller than \(10000\)):
\[\phi' \leftarrow \tau \phi' + (1-\tau)\phi\]where \(\tau\) can be some value that is very close to \(1\), such as \(0.999\).
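The moving-average update above can be sketched as a parameter-wise interpolation. Parameters are represented as plain dicts of floats purely for illustration; in a real implementation they would be network weight tensors:

```python
def polyak_update(target_params, online_params, tau=0.999):
    """phi' <- tau * phi' + (1 - tau) * phi, applied parameter-wise.

    With tau close to 1, the target network trails the online network
    smoothly instead of jumping every N steps.
    """
    return {name: tau * target_params[name] + (1 - tau) * online_params[name]
            for name in target_params}
```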
For simplicity, we will sometimes just say “update \(\phi'\)” or “target update” in the rest of this lecture, rather than specifying exactly how \(\phi'\) is updated.
This section is based on van Hasselt 10’ and van Hasselt et al. 15’.
Recall that by definition, we have the following relation between the value function and the Q-function in Q-learning:
\[\begin{equation} \label{value} V(s) = \max_{a}Q(s,a) \end{equation}\]Since we don’t know the true Q-function, we need to estimate it using Monte Carlo samples.
Let’s use a simple example to show how we end up using the wrong estimator and overestimating \(\max_{a}Q(s,a)\).
Suppose there are three different actions that we can take, \(a_1, a_2, a_3\). This means we need to estimate \(Q(s, a_1)\), \(Q(s,a_2)\), and \(Q(s, a_3)\) using their Monte Carlo samples and then take the max. The quantity we want is
\[\begin{equation}\label{maxexp} \max\{ \mathbb{E}Q(s,a_1), \mathbb{E}Q(s,a_2), \mathbb{E}Q(s,a_3) \} \end{equation}\]and we will use a one-sample estimate to estimate equation \(\ref{maxexp}\):
\[\begin{equation}\label{esti} \max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3) \}\end{equation}\]However, this is not an unbiased estimator of equation \(\ref{maxexp}\), but an unbiased estimate of
\[\begin{equation} \label{expmax} \mathbb{E} \{\max\{ Q(s,a_1), Q(s,a_2), Q(s,a_3) \}\} \end{equation}\]Since we have
\[\mathbb{E} \{\max\{ Q(s,a_1), Q(s,a_2), Q(s,a_3) \}\} \geq \max\{ \mathbb{E}Q(s,a_1), \mathbb{E}Q(s,a_2), \mathbb{E}Q(s,a_3) \}\]our estimator, equation \(\ref{esti}\), will overestimate the target in equation \(\ref{value}\).
To make it even more concrete, consider the case where for all three actions, the true Q-values are all zero, but our estimated Q-values are
\[Q_{\phi}(s, a_1) = -0.1, Q_{\phi}(s, a_2) = 0, Q_{\phi}(s, a_3) = 0.1\]Then \(\max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3) \} = 0.1\).
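This overestimation is easy to check numerically. The sketch below (assuming, for illustration, zero-mean Gaussian noise on the Q estimates) shows that the average of the maxes is strictly positive even though every true Q-value is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, noise_std = 10000, 3, 0.1

# True Q-values are all zero, so max_a E[Q(s, a)] = 0.
true_q = np.zeros(n_actions)

# A fitted Q_phi behaves like the true value plus zero-mean noise.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))

max_of_means = np.max(true_q)                      # the quantity we want: 0
mean_of_maxes = np.mean(np.max(noisy_q, axis=1))   # what the max estimator gives
```

Even with unbiased noise on each individual Q-value, `mean_of_maxes` comes out clearly above zero.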
To see why \(\ref{esti}\) overestimates from another angle: the function approximation \(Q_{\phi}\) we are using is a noisy estimate of \(Q\), and in equation \(\ref{esti}\) we use this \(Q_{\phi}\) both to estimate the Q-values and to select the best Q-value, i.e.
\[\begin{equation} \max \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3) \} = Q_{\phi}(s,\text{argmax}_{a_i}\, \{ Q_{\phi}(s,a_1), Q_{\phi}(s,a_2), Q_{\phi}(s,a_3) \}) \end{equation}\]Thus the noise in \(Q_{\phi}\) will get accumulated and lead to overestimation.
This leads to one solution to the problem — Double Q-learning, which uses two different Q-functions for estimation and selection separately:
\[\begin{align} &a^* = \text{argmax}_{a}Q_{\phi_{select}}(s,a) \\ &\max_{a}Q(s,a) \approx Q_{\phi_{eval}}(s, a^*) \end{align}\]And if \(Q_{\phi_{select}}\) and \(Q_{\phi_{eval}}\) are noisy in different ways, the overestimation problem will go away!
So, we need to learn two neural networks? Well, that’s one possible way, but we can actually just use the current network as \(Q_{\phi_{select}}\) and the target network as \(Q_{\phi_{eval}}\). I.e.
\[\begin{align} &a^* = \text{argmax}_{a}Q_{\phi}(s,a) \\ &\max_{a}Q(s,a) \approx Q_{\phi'}(s, a^*) \end{align}\]These two networks are actually correlated, but they are sufficiently far from each other (note that we assign the current network to the target network only every 10000 or so gradient steps) that in practice this method works really well.
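The double-Q target computation above can be sketched as follows; the function name is illustrative, and the two networks’ outputs at \(s'\) are represented as plain arrays:

```python
import numpy as np

def double_q_target(r, gamma, q_online_next, q_target_next):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    `q_online_next` and `q_target_next` hold each network's Q-values
    at s', one entry per action.
    """
    a_star = int(np.argmax(q_online_next))    # select with the current network
    return r + gamma * q_target_next[a_star]  # evaluate with the target network
```

Note how the action is chosen by one network but valued by the other, which is exactly what decorrelates selection noise from evaluation noise.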
This section is based on Munos et al. 16’.
In the actor-critic lecture, we talked about the bias-variance tradeoff between estimating the expected sum of rewards using \(\sum_t \gamma^{t}r_t\) and \(r_t + \gamma V_{\phi}(s_{t+1})\). The former is an unbiased one-sample estimate of the sum of rewards, which has high variance; the latter is the one-step reward plus future rewards estimated by a fitted value function, which can be biased but has lower variance. Based on this, we can trade off bias and variance by using
\[\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{t'} + \gamma^{N} V_{\phi}(s_{t+N})\]where a bigger \(N\) leads to smaller bias and higher variance.
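The N-step target above can be computed with a small helper; `n_step_target` and its arguments are illustrative names, with `bootstrap_value` standing in for the fitted \(V_{\phi}(s_{t+N})\) (or, in the Q-learning variant below, \(\max_{a}Q_{\phi}(s_{t+N},a)\)):

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """sum_{k=0}^{N-1} gamma^k * r_{t+k}  +  gamma^N * bootstrap.

    `rewards` holds the N rewards r_t ... r_{t+N-1}.
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_value
```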
Similarly, for Q-learning, we can estimate the target Q-value by
\[\begin{equation} \label{trade}y_t = \sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{t'} + \gamma^{N} \max_{a_{t+N}}Q_{\phi}(s_{t+N},a_{t+N})\end{equation}\]This seems fine at first glance, but recall that \(y_t\) is estimating the Q-value under the current policy (our objective is to minimize \(\sum_t\left\|Q_{\phi}(s_t, a_t) - y_t\right\|^2\)), so we need to make sure that the transitions \((s_{t'}, a_{t'},s_{t'+1})\) and rewards \(r_{t'}\) for \(t < t' \leq t+N-1\) come from running the current policy.
There are several ways to deal with this:
So far we’ve been assuming that \(\max_{a}Q_{\phi}(s,a)\) is a tractable and fast operation, because it appears in the inner loop of Q-learning algorithms. This is true for discrete action spaces, where we can just parameterize \(Q_{\phi}\) to take input \(s\) and output a vector of dimension \(\left\|\mathcal{A}\right\|\), where each entry of the vector is the Q-value for a specific action.
What if the action space is continuous?
We will briefly introduce three techniques that make Q-learning algorithms work in continuous action spaces by making the operation \(\max_{a}Q_{\phi}(s,a)\) fast.
The simplest solution is to just randomly sample a bunch of actions and choose the one that gives the best estimated Q-value as the action we take, with the corresponding value as the value of the state, i.e.
\[\max_{a}Q_{\phi}(s,a) \approx \max\{Q_{\phi}(s,a_1),Q_{\phi}(s,a_2),\cdots, Q_{\phi}(s,a_N)\}\]where \(a_i \sim \mathcal{A}\), \(\forall i=1:N\).
The advantage of this method is that it’s extremely simple and can be parallelized easily; the disadvantage is that it’s not very accurate, especially when the action space dimension is high.
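A sketch of this random-search max for a one-dimensional box-shaped action space; the function and argument names are illustrative:

```python
import numpy as np

def max_q_random_search(q_fn, low, high, n_samples, rng):
    """Approximate max_a Q(s, a) by evaluating Q at uniformly
    sampled actions and keeping the best one."""
    actions = rng.uniform(low, high, size=n_samples)
    values = q_fn(actions)            # Q(s, a) for the fixed state s
    best = int(np.argmax(values))
    return actions[best], values[best]
```

For a smooth Q-function the gap to the true maximum shrinks as `n_samples` grows, but the number of samples needed explodes with the action dimension.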
There are other, more sophisticated randomized search methods, such as the cross-entropy method (which we will introduce in detail in later lectures) and CMA-ES. However, these methods do not really work when the dimension of the action space is higher than about \(40\).
We can easily find the maximum of \(Q_{\phi}(s,a)\) if it is quadratic in \(a\). This leads to Normalized Advantage Functions, or NAFs (Gu et al. 16’), which parameterize the Q-function as
\[Q_{\phi}(s,a) = -\frac12 (a - \mu_{\phi}(s))^TP_{\phi}(s)(a - \mu_{\phi}(s)) + V_{\phi}(s)\]And the architecture is
where the network takes in the state \(s\) and outputs a vector \(\mu_{\phi}(s)\), a positive-definite square matrix \(P_{\phi}(s)\), and a scalar value \(V_{\phi}(s)\).
Using this parameterization, we have
\[\begin{align*} &\text{argmax}_a\,Q_{\phi}(s,a) = \mu_{\phi}(s)\\ &\max_a Q_{\phi}(s,a) = V_{\phi}(s) \end{align*}\]The disadvantage of this method is that representation power is sacrificed because of the restricted quadratic form.
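A quick numerical check of the NAF closed-form max; the concrete values of \(\mu_{\phi}(s)\), \(P_{\phi}(s)\), and \(V_{\phi}(s)\) below are hypothetical network outputs for a single state:

```python
import numpy as np

def naf_q(a, mu, P, V):
    """Q(s, a) = -1/2 (a - mu)^T P (a - mu) + V, for a fixed state s."""
    d = np.asarray(a) - mu
    return -0.5 * d @ P @ d + V

# Hypothetical network outputs for one state s.
mu = np.array([1.0, -2.0])                  # mu_phi(s)
P = np.array([[2.0, 0.0], [0.0, 1.0]])      # positive-definite P_phi(s)
V = 3.0                                     # V_phi(s)
```

Since \(P\) is positive definite, the quadratic term is non-positive and vanishes exactly at \(a = \mu\), so the maximum over actions is \(V\).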
Recall that in double Q-learning
\[\max_{a}Q_{\phi'}(s,a) = Q_{\phi'}(s, \text{argmax}_a Q_{\phi}(s,a))\]the max operation can be fast if we can learn an approximate maximizer that outputs \(\text{argmax}_a Q_{\phi}(s,a)\). This is the idea of Deep Deterministic Policy Gradient, or DDPG (Lillicrap et al. 15’).
We parameterize the maximizer as a neural network \(\mu_{\theta}(s)\); that is to say, we want to find \(\theta\) s.t.
\[\mu_{\theta}(s) = \text{argmax}_aQ_{\phi}(s,a)\]and therefore
\[\max_{a}Q_{\phi'}(s,a) = Q_{\phi'}(s, \mu_{\theta}(s))\]This can be solved by stochastic gradient ascent with gradient update
\[\theta \leftarrow \theta + \beta \frac{\partial Q_{\phi}(s,a)}{\partial \mu_{\theta}(s)}\frac{\partial \mu_{\theta}(s)}{\partial \theta}\]To avoid the maximizer chasing its own tail, similar to what happened to the Q-function in vanilla Q-learning, we use a target maximizer \(\theta'\) when assigning the target
\[y = r + \gamma Q_{\phi'}(s', \mu_{\theta'}(s'))\]and update \(\theta'\) based on the current \(\theta\) on a schedule during training.
The DDPG algorithm can be written as
Here are some tips for applying Q-learning methods
You are right! A parametric policy is not needed if we have a good understanding of how good a state or action is. In this lecture, we will introduce methods that utilize the value functions or Q-functions to make decisions. In addition to CS285, part of this tutorial is based on CS287 Advanced Robotics by Professor Pieter Abbeel.
Let’s assume the state space is discrete. Define
\(V^*_t(s)\): the expected sum of rewards accumulated starting from state \(s\), acting optimally for \(t\) steps
\(\pi^*_t(s)\): the optimal action when in state \(s\) and getting to act for \(t\) steps
Note that we usually denote the time index \(t\) as a subscript of the state and action, but here for clarity we put it as a subscript of \(V\) and \(\pi\).
The value iteration algorithm is the following:
This algorithm is very straightforward: at each time step and state, we just choose the action that gives the highest Q-value and assign that value to be the value of the state. The policy is deterministic and, because of the way we obtain it, it is better than any other policy for this state in terms of estimated Q-value. The update is called a value update or Bellman update/back-up.
Value iteration is guaranteed to converge, and at convergence we have found the optimal value function \(V^*\) for the discounted infinite horizon problem, which satisfies the Bellman equations:
\[\forall s \in \mathcal{S}, V^*(s) = \max_a r(s,a) + \gamma\mathbb{E}_{s'\sim p(s'\mid s, a)}V^*(s')\]which also tells us how to act, namely
\[\forall s \in \mathcal{S}, \pi^*(s) = \text{argmax}_a\, r(s,a) + \gamma\mathbb{E}_{s'\sim p(s'\mid s, a)}V^*(s')\]Note that the infinite horizon optimal policy is stationary, meaning that the optimal action at state \(s\) is the same for every time step (which means it’s efficient to store).
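The value update can be sketched in a few lines of tabular code; the MDP representation (reward matrix `R[s, a]` and transition tensor `P[s, a, s']`) is an illustrative choice, and we simply run a fixed number of back-ups rather than testing convergence:

```python
import numpy as np

def value_iteration(R, P, gamma, n_iters=500):
    """Tabular value iteration: V(s) <- max_a [r(s,a) + gamma * E_{s'} V(s')].

    R[s, a] is the reward and P[s, a, s'] the transition probability.
    Returns the converged values and the greedy (optimal) policy.
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * P @ V        # Bellman back-up, Q[s, a]
        V = Q.max(axis=1)            # value update
    return V, Q.argmax(axis=1)       # optimal values and greedy policy
```

Note that the greedy policy comes out of the same Q table, matching the argmax form of the Bellman equation above.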
Note that the Q-function is the only essential quantity, as the value function is obtained by maximizing it w.r.t. the action, and the policy is obtained by argmaxing it w.r.t. the action. As a very simple example, suppose that in our problem both the state and action space are discrete, each containing 4 different choices, like the following
At convergence, at each state, we can get the maximal value and the corresponding action like the following
This table is the only thing we need in order to make decisions.
Policy iteration is an iterative algorithm that extends value iteration. It guarantees that the policy improves at each iteration in terms of Q-value (note that this is different from the analogous statement for value iteration, which says the policy is better than any other policy for this state).
We first introduce policy evaluation. Suppose we now have a fixed policy \(\pi(s)\) and want to evaluate it. We can simply use the value update:
And this is guaranteed to converge to the stationary infinite horizon value function, i.e. for state \(s\) the value is the same for any time step.
After value update, we do policy improvement:
Policy iteration iteratively runs policy evaluation and policy improvement. Like value iteration, policy iteration is also guaranteed to converge, and at convergence, the current policy and value function are the optimal policy and value function.
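The evaluation/improvement loop can be sketched as follows, reusing the same illustrative `R[s, a]` / `P[s, a, s']` MDP representation; evaluation here simply iterates the value update a fixed number of times rather than solving the linear system exactly:

```python
import numpy as np

def policy_evaluation(pi, R, P, gamma, n_iters=500):
    """Iterate V(s) <- r(s, pi(s)) + gamma * E_{s'} V(s') to convergence."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iters):
        V = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                      for s in range(n_states)])
    return V

def policy_iteration(R, P, gamma):
    """Alternate policy evaluation and greedy policy improvement."""
    pi = np.zeros(R.shape[0], dtype=int)
    while True:
        V = policy_evaluation(pi, R, P, gamma)
        new_pi = (R + gamma * P @ V).argmax(axis=1)  # improvement step
        if np.array_equal(new_pi, pi):
            return V, pi                              # policy is stable
        pi = new_pi
```

The loop terminates once the greedy policy stops changing, which for a finite MDP happens after finitely many improvements.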
(TODO: add an intuitive comparison between VI and PI, as well as theoretical and empirical comparisons.)
One problem with vanilla value iteration and policy iteration is that the number of parameters is linear in the number of possible states; therefore, they are impractical for continuous state-space problems (or problems where states lie in a very fine-grained discrete space, e.g. RGB images).
If you went through the previous lecture, you will naturally think of using a neural net to fit the value function. This is called fitted value iteration:
The problem with this algorithm is the second step. First, it’s very likely that the agent has visited any given state at most once, which means we have only collected one action and corresponding reward for that state; therefore, given state \(s^i_t\), we cannot maximize \(r(s^i_t, a^i_t)\) over different actions, i.e.
\[y^i_t \leftarrow \max_{a^i_t} \left(r(s^i_t, a^i_t) + \gamma\mathbb{E}_{s^i_{t+1}\sim p(s^i_{t+1}\mid s^i_t, a^i_t)}V(s^i_{t+1})\right)\]should really be
\[y^i_t \leftarrow r^i_t + \max_{a^i_t}\gamma\mathbb{E}_{s^i_{t+1}\sim p(s^i_{t+1}\mid s^i_t, a^i_t)}V(s^i_{t+1})\]Second, even if we move the max operation as above, since we don’t know the transition dynamics \(p(s^i_{t+1}\mid s^i_t, a^i_t)\), it’s pretty difficult to estimate the expectation term, and therefore the maximization can be unreliable.
To get around the above two issues, we instead fit the Q-function. First of all, there is no max operation involved in estimating the target Q-value
\[\begin{equation}\label{q1}Q(s,a) = r(s, a) + \gamma\mathbb{E}_{s'\sim p(s'\mid s, a)}V(s')\end{equation}\]We still see \(V(s')\) here, but don’t worry: we always have \(V(s') = \max_{a'}Q(s', a')\) (or equivalently \(\pi(s') = \text{argmax}_{a'} \, Q(s', a')\), and recall that by definition \(V(s') =\mathbb{E}_{a'\sim\pi(a'\mid s')}Q(s',a')\)). Therefore, equation \(\ref{q1}\) is equivalent to
\[\begin{equation}\label{q2}Q(s,a) = r(s, a) + \gamma\mathbb{E}_{s'\sim p(s'\mid s, a)}\max_{a'}Q(s', a')\end{equation}\]Here we again don’t have the transition dynamics, so \(\mathbb{E}_{s'\sim p(s'\mid s, a)}\) needs to be estimated using samples. We can only use a one-sample estimate, i.e. the one in the sample trajectories that we have collected. This gives
\[\begin{equation}\label{q3}Q(s,a) \approx r(s, a) + \gamma\max_{a'}Q(s', a')\end{equation}\]Note that in the equation above, \(s\), \(a\), and \(s'\) are the data that we collected, while \(a'\) is the variable over which we maximize the Q-function.
Using a one-sample estimate here is just like what we did in the previous lecture when fitting the value function using the bootstrap estimate. The Q-function can be represented by a neural network that takes in \(s\) and outputs a value for each possible action (e.g. if there are \(10\) possible actions to take, then the output dimension is \(10\)); this makes it easy to do the max operation.
The above algorithm is called the fitted Q-iteration:
Or we can put it in the general RL algorithm framework:
Note that the blue box is degenerate, meaning that we don’t explicitly go through this part in the algorithm.
Combining steps 2 and 3 of the algorithm (or just the green box), we see that fitted Q-iteration is actually looking for \(\phi\) that minimizes
\[\begin{equation}\mathcal{L} = \mathbb{E}_{(s,a,s') \sim p_{\text{data}}}\left\| Q_{\phi}(s,a) - (r(s,a) + \gamma \max_{a'}Q_{\phi}(s',a'))\right\|^2\end{equation}\]where the difference inside the norm is also referred to as the temporal difference error.
When the algorithm converges to \(\mathcal{L}=0\), we have \(Q_{\phi}(s,a) = r(s,a) + \gamma \max_{a'}Q_{\phi}(s',a')\), \(\forall (s,a,s') \sim p_{\text{data}}\). We denote this Q-function as \(Q^*\); we have also found the optimal policy \(\pi^*\), where \(\pi^*(s) = \text{argmax}_{a}\,Q^*(s,a)\).
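The target and temporal difference loss above can be sketched for a batch of transitions; the function names are illustrative, and network outputs are represented as plain arrays:

```python
import numpy as np

def q_targets(rewards, next_q, gamma):
    """y = r + gamma * max_a' Q_phi(s', a') for a batch of transitions.

    `next_q[i]` holds Q_phi(s'_i, .) over all actions.
    """
    return np.asarray(rewards) + gamma * np.max(next_q, axis=1)

def td_loss(q_taken, targets):
    """Mean squared temporal-difference error over the batch."""
    return float(np.mean((np.asarray(q_taken) - np.asarray(targets)) ** 2))
```

In practice the targets are treated as constants during the gradient step, matching the semi-gradient discussed below.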
However, the convergence of the algorithm is only guaranteed in the tabular case. You may wonder: it seems that step 3 can be done using gradient descent, and then it’s just a regression problem, which should have many guarantees. But it’s actually not vanilla gradient descent for regression; most of the time, the gradient with respect to \(\phi\) is taken to be \(\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\partial Q_{\phi}(s^i_t, a^i_t)}{\partial \phi}\left( Q_{\phi}(s^i_t, a^i_t) - y^i_t\right)\)
Note that \(y^i_t\) is a function of \(Q_{\phi}\), but we do not differentiate through it. You can actually manage to differentiate through it and get a real regression-with-gradient-descent algorithm (called the residual algorithm; see this link for more), but in general the residual algorithm has some serious numerical issues and doesn’t work as well as vanilla Q-iteration.
The last thing to point out before we leave this section is that the Q-iteration algorithm is off-policy: the policy induced from Q-iteration is evolving all the time (the policy is updated when we do \(\max_{a}Q_{\phi}(s,a)\)), but fitting the Q-function requires only valid transition tuples \((s,a,s')\), which do not need to be generated by the newest policy. Essentially, for policy gradient and actor-critic algorithms, we view data as trajectories, while for value-based methods, we view data as transition tuples, which are more flexible and less tightly bound to any particular policy.
Similar to actor-critic algorithms, we also have an online version of the batch Q-iteration.
Here we ask the question: what policy should we use to collect data? Previously I mentioned that we do not have to use the latest deterministic policy derived from the Q-function, and any valid transition tuples will suffice. Actually, we want the training data to cover as much of the state-action space as possible. This is intuitive because we want the Q-function to cover more situations.
This is somewhat at odds with the policy we obtained from Q-iteration. If we always generate data using the latest Q-iteration policy, we might get stuck in some small subset of the state-action space, because the agent will likely always take the same action. To enable exploration, we modify the Q-iteration policy to make it probabilistic.
Here we introduce two simple (but effective) ways to do that; more sophisticated methods will be introduced in later lectures.
Epsilon greedy
\[\begin{equation} \pi(a\mid s)= \begin{cases} 1-\epsilon,& \text{if } a = \text{argmax}_{a}\,Q(s,a)\\ \epsilon/(\|\mathcal{A}\|-1), & \text{otherwise} \end{cases} \end{equation}\]where \(\epsilon\) is a small number between \(0\) and \(1\), and \(\|\mathcal{A}\|\) is the number of actions in the action space. This stochastic policy allows the possibility of acting differently from the best action according to the current Q-iteration policy. One possible disadvantage of epsilon greedy is that the probabilities of taking actions other than the best action are all the same. Imagine that at some point we already have an OK-ish Q-function, and for state \(s\) there are several actions that lead to high Q-values; with epsilon greedy, all good and bad actions have equal probability except for the one that gives the biggest Q-value. This issue leads to the next policy.
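A minimal sampler for the epsilon-greedy policy defined above (the function name and `rng` argument are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With prob. 1 - epsilon take the greedy action; otherwise pick one
    of the remaining actions uniformly at random."""
    best = int(np.argmax(q_values))
    if rng.random() >= epsilon:
        return best
    others = [a for a in range(len(q_values)) if a != best]
    return int(rng.choice(others))
```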
Boltzmann Exploration
\[\begin{equation} \pi(a\mid s) \propto \exp{(Q_{\phi}(s,a))} \end{equation}\]Under this policy, actions with similar Q-values will have similar probabilities of being selected.
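The Boltzmann probabilities can be computed as a softmax over Q-values; the temperature parameter is an illustrative addition (the text's version corresponds to `temperature=1.0`):

```python
import numpy as np

def boltzmann_probs(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s, a) / T); the max is subtracted
    before exponentiating for numerical stability."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()
```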
Let’s take a look at the original policy gradient and its Monte Carlo approximator:
\[\begin{align} \nabla_{\theta}J(\theta) &= \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right)\right] \label{orig} \\ &\approx \frac1N \sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t)\right) \left(\sum_{t=1}^T r^i_t\right) \label{approx} \\ \end{align}\]In equation \(\ref{approx}\), \(N\) sample trajectories are used to approximate the expectation \(\mathbb{E}_{\tau\sim p_{\theta}(\tau)}\), but there is still one quantity being approximated, namely the reward \(r(s^i_t, a^i_t)\), and unfortunately we are only using one sample \(r^i_t\) to approximate it.
Even with the use of causality and baseline, which gives
\[\begin{align}\label{var_red} \nabla_{\theta}J(\theta)&\approx\frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t) \left(\sum_{t'=t}^T r^i_{t'} - b\right) \end{align}\]with \(b = \frac1N \sum_{i=1}^N\sum_{t'=t}^{T}r^i_{t'}\), we are improving the variance of the expectation over trajectories (\(\mathbb{E}_{\tau\sim p_{\theta}(\tau)}\)) but not the estimate of the reward to go (or the better-than-average reward to go).
Actor-critic algorithms aim at better estimating the reward to go.
We start by recalling the goal of reinforcement learning:
\[\begin{equation}\label{goal2}\text{argmax}_{\theta}\, \mathbb{E}_{p_{\theta}(\tau)}\sum_{t=1}^T r(s_t, a_t)\end{equation}\]Now define the Q-function:
\[\begin{align} Q^{\pi}(s_t,a_t) &= \mathbb{E}_{p_{\theta}}\left[\sum_{t'=t}^{T}r(s_{t'},a_{t'}) \mid s_t, a_t \right] \\ &= r(s_t, a_t) + \mathbb{E}_{a_{t+1} \sim \pi_{\theta}(a_{t+1}\mid s_{t+1}),s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)} \left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \end{align}\]The Q-function is exactly the expected reward to go from step \(t\), given the state and action \((s_t, a_t)\).
How about the baseline \(b\) in \(\ref{var_red}\)? We can also replace it with a lower-variance estimate. To do that, we define the value function:
\[\begin{align} V^{\pi}(s_t) & = \mathbb{E}_{p_{\theta}}\left[\sum_{t'=t}^{T}r(s_{t'},a_{t'}) \mid s_t \right] \\ &= \mathbb{E}_{a_t\sim \pi_{\theta}(a_t\mid s_t)} \left\{ r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)}\left[ V^{\pi}(s_{t+1}) \right] \right\} \end{align}\]The value function measures how good the state is (i.e. the value of the state). It is exactly the expected reward to go from state \(s_t\), averaged over actions.
In addition, we define the advantage
\[\begin{equation} A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \end{equation}\]In fact, \(\sum_{t'=t}^T r^i_{t'} - \frac1N \sum_{i=1}^N\sum_{t'=t}^{T}r^i_{t'}\) in equation \(\ref{var_red}\) looks very much like a one-sample estimate of the advantage \(A^{\pi}(s_t, a_t)\). Indeed, \(\sum_{t'=t}^T r^i_{t'}\) is a one-sample estimate of the Q-function \(Q^{\pi}(s_t, a_t)\); however, \(\frac1N \sum_{i=1}^N\sum_{t'=t}^{T}r^i_{t'}\) is not an estimate of \(V^{\pi}(s_t)\). (Actually, if the state space is continuous, we will only ever have a one-sample estimate of \(V^{\pi}(s_t)\), namely \(\sum_{t'=t}^T r^i_{t'}\), which is the same as the one-sample estimate of \(Q^{\pi}(s_t, a_t)\); we only have one sample because we will never visit the state again.) But \(V^{\pi}(s_t)\) is intuitively better than \(\frac1N \sum_{i=1}^N\sum_{t'=t}^{T}r^i_{t'}\) even setting variance aside, because the former is the expected reward for state \(s_t\), while the latter is an estimate of the expected reward at time step \(t\) averaged over all possible states. Since we want to know how good the action is in the current state, rather than in the current time step, we prefer the former.
If we have Q-function and value function, we can plug them in the original policy gradient and get the most ideal estimator:
\[\nabla_{\theta}J(\theta) = \frac1N\sum_{i=1}^N\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a^i_t\mid s^i_t)\left(Q^{\pi}(s^i_t, a^i_t) - V^{\pi}(s^i_t)\right)\]However, we do not have \(Q^{\pi}(s_t, a_t)\) or \(V^{\pi}(s_t)\), so we want to estimate them. Instead of using a Monte Carlo estimate, we use function approximation, which might lead to a biased estimate but gives an enormous variance reduction; in practice, the variance reduction usually outweighs the bias it introduces.
So, do we want to fit two neural networks to approximate \(Q^{\pi}(s_t, a_t)\) and \(V^{\pi}(s_t)\) separately? Well, it’s actually not necessary if we notice the relationship between the two functions:
\[\begin{equation}\label{relation} Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)}V^{\pi}(s_{t+1})\end{equation}\]In practice we use a one-sample estimate to approximate \(\mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)}\), and then we have
\[\begin{equation} Q^{\pi}(s^i_t, a^i_t) \approx r^i_t + V^{\pi}(s^i_{t+1})\end{equation}\]Therefore, we only need to fit \(V^{\pi}\). For training data, since \(V^{\pi}(s_t)=\mathbb{E}_{p_{\theta}}\left[\sum_{t'=t}^{T}r(s_{t'},a_{t'}) \mid s_t \right]\), ideally for every state \(s_t\) we would want a bunch of different rollouts and rewards collected starting from that state, and then use the sample mean of the rewards as the estimate of \(V^{\pi}(s_t)\). But this requires resetting the simulator and is impossible in the real world. Therefore, we use the one-sample estimate \(\sum_{t'=t}^Tr_{t'}\).
So, we use training data \(\{ (s^i_t, \sum_{t'=t}^Tr^i_{t'}) \}_{i=1,t=1}^{N, T}\) to train a neural network \(V^{\pi}_{\phi}\) in a supervised way. But we can further reduce the variance even though we only have a one-sample estimate of the expected reward: we can again apply the function approximation idea and replace \(\sum_{t'=t}^Tr^i_{t'}\) with \(r^i_t + V^{\pi}_{\phi'}(s^i_{t+1})\) in the training data, where \(V^{\pi}_{\phi'}\) is the previously fitted value function (i.e. \(\phi'\) is one gradient step before \(\phi\)). \(r^i_t + V^{\pi}_{\phi'}(s^i_{t+1})\) is called the bootstrap estimate of the value function. In summary, we fit \(V^{\pi}_{\phi}\) to \(V^{\pi}\) by minimizing
\[\begin{equation}\label{value_obj} \frac{1}{NT}\sum_{i,t=1,1}^{N,T}\left\|V^{\pi}_{\phi}(s^i_t) - y^i_t\right\|^2 \end{equation}\]where \(y^i_t = \sum_{t'=t}^Tr^i_{t'}\) or \(r^i_t + V^{\pi}_{\phi'}(s^i_{t+1})\), and the latter usually works better.
With fitted value function \(V^{\pi}_{\phi}\), we can estimate the Q-function (Q-value) by
\[\hat{Q}^{\pi}(s^i_t, a^i_t) \approx r^i_t + V^{\pi}_{\phi}(s^i_{t+1})\]and therefore the advantage:
\[\hat{A}^{\pi}(s^i_t, a^i_t) = r^i_t + V^{\pi}_{\phi}(s^i_{t+1}) - V^{\pi}_{\phi}(s^i_t)\]And our actor-critic policy gradient is
\[\begin{align} \nabla_{\theta}J(\theta) &= \frac1N\sum_{i=1}^N\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a^i_t\mid s^i_t)\left(Q^{\pi}(s^i_t, a^i_t) - V^{\pi}(s^i_t)\right) \\ &= \frac1N\sum_{i=1}^N\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a^i_t\mid s^i_t)\left(r(s^i_t, a^i_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s^i_t, a^i_t)}V^{\pi}(s_{t+1}) - V^{\pi}(s^i_t)\right) \\ &\approx \frac1N\sum_{i=1}^N\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a^i_t\mid s^i_t)\left(r^i_t + V^{\pi}_{\phi}(s^i_{t+1}) - V^{\pi}_{\phi}(s^i_t)\right) \\ &=\frac1N\sum_{i=1}^N\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a^i_t\mid s^i_t)\hat{A}^{\pi}(s^i_t, a^i_t) \end{align}\]The batch actor-critic algorithm is:
We call it “batch” in that for each policy update we collect a batch of trajectories. We can also update the policy (and value function) using only one step of data, i.e. \((s_t, a_t, r_t, s_{t+1})\), which leads to the online actor-critic algorithm we will introduce later.
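The advantage computation used in the derivation above can be sketched as a small batched helper; the function name is illustrative, and `gamma=1.0` corresponds to the undiscounted finite-horizon case (the discounted variant simply uses a smaller `gamma`):

```python
import numpy as np

def td_advantages(rewards, values, next_values, gamma=1.0):
    """A_hat(s_t, a_t) = r_t + gamma * V_phi(s_{t+1}) - V_phi(s_t),
    computed elementwise over a batch of transitions."""
    rewards, values, next_values = map(np.asarray, (rewards, values, next_values))
    return rewards + gamma * next_values - values
```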
Our previous discussion on policy gradient and actor-critic algorithms are all within the finite horizon or episodic learning scenario, where there is an ending time step \(T\) for the task. What about the infinite horizon scenario i.e. \(T = \infty\)?
Well, in that case the original algorithm can run into problems, because in the second step \(V^{\pi}_{\phi}\) can grow infinitely large in many cases. Similarly, in the vanilla policy gradient method, the sum of rewards can grow infinitely large.
To remedy that, we introduce the discount factor \(\gamma \in (0,1)\) and define the discounted expected reward, value function, and Q-function to be
\[\begin{align} J(\theta) &= \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_t \gamma^t r(s_t, a_t)\\ V^{\pi}(s_t) &= \mathbb{E}_{\tau_t \sim p_{\theta}(\tau_t)}\left[ \sum_{t'=t} \gamma^{t'-t}r(s_{t'}, a_{t'}) \mid s_t \right]\\ Q^{\pi}(s_t, a_t) &= \mathbb{E}_{\tau_t \sim p_{\theta}(\tau_t)}\left[ \sum_{t'=t} \gamma^{t'-t}r(s_{t'}, a_{t'}) \mid s_t, a_t \right]\\ &= r(s_t, a_t) + \gamma\mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)}V^{\pi}(s_{t+1}) \end{align}\]Therefore, the policy gradient and actor-critic policy gradient are \(\begin{align} \nabla_{\theta}J_{\text{PG}}(\theta) &= \frac1N \sum_{i=1}^N\sum_{t=1}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t)\left(\sum_{t'=t}\gamma^{t'-t}r^i_{t'} - b\right) \label{dis_pg}\\ \nabla_{\theta}J_{\text{AC}}(\theta) &= \frac1N \sum_{i=1}^N\sum_{t=1}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t)\left(r^i_t + \gamma V^{\pi}_{\phi}(s^i_{t+1}) - V^{\pi}_{\phi}(s^i_t)\right) \label{dis_ac} \end{align}\)
where in \(J_{\text{PG}}(\theta)\), \(b\) is also a discounted baseline e.g. \(\frac1N\sum_{i=1}^N\sum_{t'=t}\gamma^{t'-t}r^i_{t'}\).
Usually we set \(\gamma\) to be something like \(0.99\). It can be shown that the discount factor prevents the expected reward from being infinite and reduces the variance. Therefore, in most cases, whether the horizon is finite or infinite, people use the discounted policy gradients, i.e. equations \(\ref{dis_pg} \text{ and } \ref{dis_ac}\), rather than the original ones.
So far we’ve been discussing the batch actor-critic algorithm, in which for each gradient update we need to run the policy to collect a batch of trajectories. In this section, we introduce online actor-critic algorithms, which allow faster updates of the neural network weights and, with some techniques, can work better than batch actor-critic algorithms in some cases.
The simplest version of online actor-critic algorithm is similar to online learning, where instead of calculating the gradient using a batch of trajectories and rewards, it only uses one transition tuple \((s, a, s', r)\). The algorithm is the following:
However, this algorithm does not really work in most cases, because a one-sample estimate has very high variance; coupled with the policy gradient, the variance can be notoriously high. To deal with the high variance, we introduce the synchronized parallel actor-critic algorithm, which is basically several agents running the basic online actor-critic algorithm while using and updating a shared policy and value network \(\pi_{\theta} \text{ and } V^{\pi}_{\phi}\). See the following figure:
This can be realized very easily by just changing the random seeds of the code.
Another variant, which has proven to work very well when we have a very large pool of workers, is the asynchronous parallel actor-critic algorithm:
Each worker (agent) sends its one-step transition data to the center to update the parameters (both \(\theta\) and \(\phi\)), but does not wait for the updated weights to be deployed before it executes the next step in the environment. This is a little counterintuitive, because the worker might not be using the latest policy to decide actions. But it turns out that the policy network will be very similar across nearby time steps, and since the algorithm runs asynchronously, it can be very efficient.
In this section we go back to the actor-critic gradient. What distinguishes it from the vanilla policy gradient is that it uses the advantage \(\hat{A}^{\pi}(s_t, a_t)\) to replace the original one-sample estimate of the expected discounted reward, \(\sum_{t'=t}\gamma^{t'-t}r^i_{t'}\). The advantage has smaller variance, but it can be biased since \(V^{\pi}_{\phi}(s_{t})\) can be an imperfect estimate of the value function.
A natural question is: can we bring the best of both worlds and get an unbiased estimate of the advantage while keeping the variance low? Or further, can we develop a mechanism that lets us trade off the variance and bias in estimating the advantage?
The answer is yes, and the rest of this section introduces three advantage estimators that give different variance-bias tradeoffs.
Recall that the original advantage estimator in actor-critic algorithm is:
\[\begin{align} \hat{A}^{\pi} &= \hat{Q}^{\pi}(s_t, a_t) - V^{\pi}_{\phi}(s_{t}) \\ &= r_t + \gamma V^{\pi}_{\phi}(s_{t+1}) - V^{\pi}_{\phi}(s_{t}) \label{orig_a} \end{align}\]The first advantage estimator is \(\begin{align} \hat{A}^{\pi} = \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'} - V^{\pi}_{\phi}(s_{t}) \end{align}\)
Compared to equation \(\ref{orig_a}\), we replace the neural estimate of the discounted expected reward-to-go with the one-sample estimate \(\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\) used in the policy gradient. This estimator gives lower variance than the policy gradient with a constant baseline Greensmith et al. 04' (but the variance is still higher than that of the actor-critic gradient). But does this advantage estimator really lead to an unbiased gradient estimator?
We can actually show that any state-dependent baseline in the policy gradient leads to an unbiased gradient estimator. That is, we want to prove
\[\begin{align} \nabla_{\theta}J(\theta) &= \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_{t=1}^T\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) - V^{\pi}(s_t)\right) \\ &= \mathbb{E}_{\tau\sim p_{\theta}(\tau)}\sum_{t=1}^T\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) \\ \end{align}\]let’s take one element from the summation \(\sum_{t=1}^T\) out:
\[\begin{align} \nabla_{\theta}J(\theta)_t &= \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) - V^{\pi}(s_t)\right) \\ &= \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) \right)- \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)V^{\pi}(s_t) \\ &= \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) \right) - \mathbb{E}_{s_{1:t}, a_{1:t-1}}V^{\pi}(s_t) \mathbb{E}_{a_t \sim \pi_{\theta}(a_t\mid s_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t) \\ &= \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) \right) - \mathbb{E}_{s_{1:t}, a_{1:t-1}}V^{\pi}(s_t) \cdot 0 \\ &= \mathbb{E}_{\tau_t\sim p_{\theta}(\tau_t)}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\left( \sum_{t'=t}^{T}\gamma^{t'-t}r(s_{t'}, a_{t'}) \right) \end{align}\]To be updated, material is in Gu et al. 16’
Lastly, let's compare the advantage estimator introduced in section 4.1 (call it \(A^{\pi}_{\text{MC}}\)) and the advantage estimator used in the vanilla actor-critic algorithm (call it \(A^{\pi}_{\text{C}}\)):
\[\begin{align*} A^{\pi}_{\text{MC}} &= \sum_{t'=t}^{T}\gamma^{t'-t}r_{t'} - V^{\pi}_{\phi}(s_{t}) \\ A^{\pi}_{\text{C}} &= r_t + \gamma V^{\pi}_{\phi}(s_{t+1}) - V^{\pi}_{\phi}(s_{t}) \end{align*}\]\(A^{\pi}_{\text{MC}}\) has higher variance because of the one-sample estimate \(\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}\), while \(A^{\pi}_{\text{C}}\) is biased because \(r_t + \gamma V^{\pi}_{\phi}(s_{t+1})\) might be a biased estimator of \(Q^{\pi}(s_t, a_t)\).
Stare at these two estimators for a while and you might notice that the essential part that decides the variance-bias tradeoff is the estimate of \(Q^{\pi}\): the one-sample estimate has high variance, while the neural estimate can be biased. We can combine the two and use the one-sample estimate for the first \(l\) steps and the neural estimate for the rest, i.e.
\[\begin{align} A^{\pi}_{l} &= \sum_{t' = t}^{t+l-1} \gamma^{t'-t} r_{t'} + \gamma^{l} V^{\pi}_{\phi}(s_{t+l}) - V^{\pi}_{\phi}(s_{t}) \end{align}\]This also makes sense intuitively, because the more distant a reward is from the current time step, the higher its variance will be. On the other hand, although \(V^{\pi}_{\phi}\) can be biased, when multiplied by \(\gamma^{l}\) the effect becomes small. Therefore, \(l\) controls the variance-bias tradeoff of the advantage estimate. Letting \(l=1\), we recover \(A^{\pi}_{\text{C}}\), which has the highest bias and lowest variance; letting \(l=\infty\), we recover \(A^{\pi}_{\text{MC}}\), which is unbiased but has the highest variance.
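The n-step estimator above can be sketched as follows (plain Python; `rewards` and `values` are per-timestep lists with `values[t]` standing in for the critic's \(V^{\pi}_{\phi}(s_t)\); the names are my own):

```python
def n_step_advantage(rewards, values, t, l, gamma=0.99):
    """A_l = sum_{t'=t}^{t+l-1} gamma^(t'-t) * r_{t'} + gamma^l * V(s_{t+l}) - V(s_t)."""
    n_step_return = sum(gamma**k * rewards[t + k] for k in range(l))
    return n_step_return + gamma**l * values[t + l] - values[t]
```

With `l=1` this reduces to the one-step actor-critic advantage \(r_t + \gamma V^{\pi}_{\phi}(s_{t+1}) - V^{\pi}_{\phi}(s_t)\).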
\(A^{\pi}_{GAE}\) is defined as an exponentially weighted sum of the \(A^{\pi}_{l}\)'s:
\[\begin{equation}\label{form1} A^{\pi}_{GAE} = \sum_{l=1}^{\infty}w_l A^{\pi}_l \end{equation}\]where \(w_l \propto \lambda^{l-1}\)
It can be shown that \(A^{\pi}_{GAE}\) can be equivalently written as
\[\begin{equation} A^{\pi}_{GAE} = \sum_{t'=t}^\infty (\gamma \lambda)^{t'-t} \delta_{t'} \end{equation}\]Where
\[\delta_{t'} = r_{t'} + \gamma V^{\pi}_{\phi}(s_{t'+1}) - V^{\pi}_{\phi}(s_{t'})\]Policy gradient methods are among the most straightforward methods in RL because they directly optimize the goal \(J(\theta)\) of RL by gradient descent (well, it's actually gradient ascent, since we are maximizing the objective, but this makes no difference because we can also add a minus sign in the code).
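Stepping back to GAE for a moment: the \(\delta\)-based form above admits a simple backward recursion, \(A^{GAE}_t = \delta_t + \gamma\lambda A^{GAE}_{t+1}\). A sketch (plain Python; `values` carries one bootstrap entry beyond the last reward, and the names are my own):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}, where
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` must have length len(rewards) + 1 (bootstrap value at the end)."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting \(\lambda = 0\) recovers the one-step advantage and \(\lambda = 1\) recovers the Monte Carlo advantage, matching the tradeoff described above.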
\[\begin{equation}\label{first}J(\theta) = \mathbb{E}_{p_{\theta}(\tau)}\sum_{t=1}^T r(s_t, a_t)\end{equation}\]where \(\tau = (s_1, a_1, \cdots, s_T, a_T)\) and \(p_{\theta}(\tau) = p(s_1)\prod_{t=1}^{T}p(s_{t+1}\mid s_t, a_t)\pi_{\theta}(a_t\mid s_t)\)
Now let’s derive the REINFORCE algorithm, which is the most basic policy gradient method. We simply take the derivative of \(J(\theta)\):
\[\begin{align} \nabla_{\theta}J(\theta) &= \nabla_{\theta}\mathbb{E}_{p_{\theta}(\tau)}\sum_{t=1}^T r(s_t, a_t) \nonumber \\ &= \nabla_{\theta}\int_{\tau} p_{\theta}(\tau)\left(\sum_{t=1}^T r(s_t, a_t)\right) \text{d}\tau \nonumber \\ &= \int_{\tau} \nabla_{\theta}p_{\theta}(\tau)\left(\sum_{t=1}^T r(s_t, a_t)\right) \text{d}\tau \label{equality1} \\ &= \int_{\tau} p_{\theta}(\tau)\nabla_{\theta}\log p_{\theta}(\tau)\left(\sum_{t=1}^T r(s_t, a_t)\right) \text{d}\tau \label{equality2} \\ &= \mathbb{E}_{p_{\theta}(\tau)}\nabla_{\theta}\log p_{\theta}(\tau)\left(\sum_{t=1}^T r(s_t, a_t)\right) \label{inter} \\ \end{align}\]To get from equation \(\ref{equality1}\) to equation \({\ref{equality2}}\), we used the equality \(\nabla_{\theta}\log p_{\theta}(\tau) = \frac{\nabla_{\theta}p_{\theta}(\tau)}{p_{\theta}(\tau)}\)
The result is ideal in that the derivative can still be written as an expectation, which means we can obtain a Monte Carlo estimate of it using samples. However, the term \(p_{\theta}(\tau)\) is not known, since we don't know the model \(p(s_{t+1}\mid s_t, a_t)\). This is actually not a problem if we note
\[\begin{align} \nabla_{\theta}\log p_{\theta}(\tau) &= \nabla_{\theta}\log p(s_1)\prod_{t=1}^{T}p(s_{t+1}\mid s_t, a_t)\pi_{\theta}(a_t\mid s_t) \label{orig_1} \\ &= \enclose{downdiagonalstrike}{\nabla_{\theta}\left[\log \left(p(s_1)\prod_{t=1}^{T} p(s_{t+1}\mid s_t, a_t) \right)\right]} + \nabla_{\theta}\sum_{t=1}^{T}\log \pi_{\theta}(a_t\mid s_t) \label{orig_2} \\ &= \sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t) \label{orig_3} \end{align}\]Plugging this term back into equation \(\ref{inter}\), we have
\[\begin{align*} \label{gradient} \nabla_{\theta}J(\theta) = \mathbb{E}_{p_{\theta}(\tau)} \left(\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a_t\mid s_t)\right)\left(\sum_{t=1}^T r(s_t, a_t)\right) \end{align*}\]In practice, after running the agent in environment or simulator, we have trajectories \(\{ \tau^i \}_{i=1}^{N}\) where \(\tau^i = (s^i_1, a^i_1, \cdots, s^i_T, a^i_T)\) and rewards \(\{ r^i_t \}_{i=1, t=1}^{N,T}\), then we can obtain Monte Carlo estimate of the gradient:
\[\begin{align} \label{gradient_estimate} \nabla_{\theta}J(\theta) = \frac1N \sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t)\right)\left(\sum_{t=1}^T r^i_t\right) \end{align}\]For a more rigorous notation, the left-hand side of equation \(\ref{gradient_estimate}\) should be \(\widehat{\nabla_{\theta}J(\theta)}\), because it's an estimator of the original quantity. However, since objectives and gradients are always estimated from samples in deep RL algorithms, we slightly abuse notation and do not use the 'hat' when a quantity is an estimator.
We can also get a Monte Carlo estimate of the expected reward of the current policy using the rewards \(\{ r^i_t \}_{i=1, t=1}^{N,T}\):
\[\begin{align} \label{obj_estimate} J(\theta) = \frac1N \sum_{i=1}^N \sum_{t=1}^T r^i_t \end{align}\]This means we can evaluate the policy easily when using policy gradient methods.
The following is the REINFORCE algorithm:
1. Sample trajectories \(\{\tau^i\}_{i=1}^N\) by running the current policy \(\pi_{\theta}\) in the environment.
2. Estimate the gradient with equation \(\ref{gradient_estimate}\).
3. Update the parameters: \(\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta)\), and repeat.
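A minimal runnable sketch of REINFORCE (plain Python, softmax policy over two actions in a one-step toy "environment" of my own making; this is an illustration of the update, not the full algorithm from the lecture):

```python
import math
import random

def softmax(theta):
    exps = [math.exp(x - max(theta)) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
theta = [0.0, 0.0]  # one logit per action
lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]  # sample action from pi_theta
    r = 1.0 if a == 0 else 0.0                    # action 0 is the better one
    # grad of log pi(a) w.r.t. theta for a softmax policy: one_hot(a) - probs
    for i in range(2):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += lr * grad_log_pi * r          # REINFORCE ascent step
```

After training, the policy concentrates almost all its probability on the rewarding action.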
Finally, I want to point out that the REINFORCE algorithm also works in POMDPs, where we don't know the state \(s_t\) but can only observe an observation \(o_t\), which means the policy is \(\pi_{\theta}(a_t\mid o_t)\). This is clear if we write out the trajectory distribution in a POMDP:
\[\begin{equation}\label{pomdp}p_{\theta}(\tau) = p(s_1)\prod_{t=1}^Tp(s_{t+1}\mid s_t, a_t)\pi_{\theta}(a_t\mid o_t)p(o_t\mid s_t)\end{equation}\]where \(\tau = (s_1, o_1, a_1, s_2, o_2, a_2,\cdots, s_T, o_T, a_T)\). Note that we do not attempt to learn the emission probability distribution \(p(o_t\mid s_t)\). If we plug equation \(\ref{pomdp}\) into the derivation of the original policy gradient, equation \(\ref{orig_1}\), all the terms except the one containing \(\theta\) will be zero when we take the derivative at equation \(\ref{orig_2}\). So the policy gradient for a POMDP is
\(\begin{equation} \nabla_{\theta}J(\theta) = \frac1N\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_{\theta}\log \pi_{\theta}(a_t^i\mid o_t^i)\right)\left(\sum_{t=1}^T r^i_t\right)\end{equation}\).
The REINFORCE algorithm is very straightforward and matches our intuition about the common deep learning paradigm (i.e. using gradient descent to optimize our goal). But it has two major issues: first, training is not very stable, because the gradient estimator has large variance; second, the algorithm is not very efficient, since new trajectories need to be collected after each gradient update. We will next introduce ways to get around these two issues. At the end of this lecture, we will show how policy gradient methods can be implemented in common automatic differentiation packages like TensorFlow and PyTorch.
We change the objective using a simple common-sense fact about causality: future states and actions cannot affect past rewards. Let's see how causality helps us derive a new objective and gradient with lower variance.
We first write down the original objective:
\[J(\theta) = \mathbb{E}_{p_{\theta}(\tau)}\sum_{t=1}^T r(s_t, a_t)\]Switch summation and expectation: \(\begin{equation}\label{orig_obj}J(\theta) = \sum_{t=1}^{T}\mathbb{E}_{p_{\theta}(\tau)}r(s_t, a_t)\end{equation}\)
Denote \(\tau_t = (s_1, a_1, \cdots, s_t, a_t)\). Since future states and actions cannot affect past rewards, optimizing future actions cannot improve past rewards. So let's take the future states and actions out of the objective:
\[\begin{equation}\label{inter1}J(\theta) = \sum_{t=1}^{T}\mathbb{E}_{p_{\theta}(\tau_t)}r(s_t, a_t)\end{equation}\]Going through very similar steps as before to differentiate \(J(\theta)\), we have
\[\begin{equation}\label{diff_J} \nabla_{\theta}J(\theta) = \sum_{t=1}^{T}\mathbb{E}_{p_{\theta}(\tau_t)}\nabla_{\theta}\log p_{\theta}(\tau_t) r(s_t, a_t)\end{equation}\]Still very similar to the previous steps: \(\begin{align} \nabla_{\theta}\log p_{\theta}(\tau_t) &= \nabla_{\theta}\log p(s_1)\pi_{\theta}(a_1\mid s_1)\prod_{t'=2}^{t}p(s_{t'}\mid s_{t'-1}, a_{t'-1})\pi_{\theta}(a_{t'}\mid s_{t'}) \nonumber \\ &= \enclose{downdiagonalstrike}{\nabla_{\theta}\left[\log \left(p(s_1)\prod_{t'=2}^{t} p(s_{t'}\mid s_{t'-1}, a_{t'-1}) \right)\right]} + \nabla_{\theta}\sum_{t'=1}^{t}\log \pi_{\theta}(a_{t'}\mid s_{t'}) \nonumber\\ &= \sum_{t'=1}^{t}\nabla_{\theta}\log \pi_{\theta}(a_{t'}\mid s_{t'}) \label{cancel} \end{align}\)
Plugging equation \(\ref{cancel}\) into \(\ref{diff_J}\), we have
\[\begin{align} \nabla_{\theta}J(\theta) = \sum_{t=1}^{T}\mathbb{E}_{p_{\theta}(\tau_t)}\sum_{t'=1}^{t}\nabla_{\theta}\log \pi_{\theta}(a_{t'}\mid s_{t'}) r(s_t, a_t)\end{align}\]and this can be approximated by sample trajectories: \(\begin{align} \nabla_{\theta}J(\theta) &= \sum_{t=1}^{T} \frac1N \sum_{i=1}^{N}\sum_{t'=1}^{t}\nabla_{\theta}\log \pi_{\theta}(a_{t'}^i\mid s_{t'}^i) r_t^i \label{1}\\ &= \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{t'=1}^{t}\nabla_{\theta}\log \pi_{\theta}(a_{t'}^i\mid s_{t'}^i) r_t^i \label{2}\\ &= \frac1N \sum_{i=1}^{N}\sum_{t'=1}^{T}\sum_{t=t'}^{T}\nabla_{\theta}\log \pi_{\theta}(a_{t'}^i\mid s_{t'}^i) r_t^i \label{3}\\ &= \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{t'=t}^{T}\nabla_{\theta}\log \pi_{\theta}(a_{t}^i\mid s_{t}^i) r_{t'}^i \label{4}\\ &= \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a_{t}^i\mid s_{t}^i) \sum_{t'=t}^{T}r_{t'}^i \label{5}\\ \end{align}\)
The above equations may seem scary, but they are just exchanges of summation order (equations \(\ref{1}\), \(\ref{2}\), \(\ref{3}\)) and a change of notation (equation \(\ref{4}\)). Now let's recall the gradient that does not incorporate causality:
\[\begin{equation}\label{orig_gradient}\frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a_{t}^i\mid s_{t}^i) \sum_{t'=1}^{T}r_{t'}^i\end{equation}\]Note that the gradient that incorporates causality (equation \(\ref{5}\)) is really the same as the gradient that doesn't (equation \(\ref{orig_gradient}\)), but with past rewards subtracted from the reward summation. This gives fewer terms in the summation, and therefore smaller variance.
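The "reward to go" sums \(\sum_{t'=t}^{T} r^i_{t'}\) can all be computed in a single backward pass (a plain-Python sketch; `rewards_to_go` is my own helper name):

```python
def rewards_to_go(rewards):
    """out[t] = sum of rewards[t:], the causality-respecting return at step t."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]
```

Each entry drops the rewards that came before step \(t\), which is exactly the subtraction of past rewards described above.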
We can also reduce the variance by introducing a baseline.
To simplify notation, let's use \(r(\tau)\) to denote \(\sum_{t=1}^{T}r(s_t, a_t)\). We want to use a Monte Carlo estimate to approximate the gradient
\[\begin{equation}\label{orig_grad}\nabla_{\theta}J(\theta) = \mathbb{E}_{p_{\theta}(\tau)}\nabla_{\theta}\log p_{\theta}(\tau)r(\tau)\end{equation}\]To introduce baseline, we simply change the gradient to be
\[\begin{equation}\label{base_grad}\nabla_{\theta}J(\theta) = \mathbb{E}_{p_{\theta}(\tau)}\nabla_{\theta}\log p_{\theta}(\tau)\left(r(\tau)-b\right)\end{equation}\]Where \(b\) is the baseline, which is not a function of \(\tau\).
You might ask: 1. Is the gradient with the baseline added the same as the original gradient? 2. Will the baseline lead to lower variance, and if so, which \(b\) gives the lowest variance?
The answer to the first question is yes. To see that, we just need a bit of calculation:
\[\begin{align*} \mathbb{E}_{p_{\theta}(\tau)}\nabla_{\theta}\log p_{\theta}(\tau)b &= b\mathbb{E}_{p_{\theta}(\tau)}\nabla_{\theta}\log p_{\theta}(\tau) \\ &= b\int_{\tau}p_{\theta}(\tau)\nabla_{\theta}\log p_{\theta}(\tau)\text{d}\tau \\ &= b\int_{\tau}p_{\theta}(\tau)\frac{\nabla_{\theta}p_{\theta}(\tau)}{p_{\theta}(\tau)} \text{d}\tau\\ &= b\nabla_{\theta}\int_{\tau}p_{\theta}(\tau)\text{d}\tau \\ &= 0 \end{align*}\]To answer the second question, we need to do a bit more calculation. Let's first calculate the variance!
With the variance formula \(\mathbb{V}(x) = \mathbb{E}x^2 - (\mathbb{E}x)^2\), we have
\[\begin{equation} \mathbb{V}\left[\nabla_{\theta}\log p_{\theta}(\tau)(r(\tau)-b)\right] = \mathbb{E}(\nabla_{\theta}\log p_{\theta}(\tau)(r(\tau)-b))^2 - (\mathbb{E}\nabla_{\theta}\log p_{\theta}(\tau)(r(\tau)-b))^2 \end{equation}\]Let's minimize this term. Note that it's a quadratic function of \(b\), so to minimize it w.r.t. \(b\) we just need to find the stationary point. Also note that the second term is just \((\mathbb{E}\nabla_{\theta}\log p_{\theta}(\tau)r(\tau))^2\), since the baseline term has zero expectation, so when we take the derivative w.r.t. \(b\) this term gives \(0\). The derivative is:
\[\begin{align*} \nabla_{b}\mathbb{V}\left[\nabla_{\theta}\log p_{\theta}(\tau)(r(\tau)-b)\right] &= \nabla_{b}\mathbb{E}\left[(\nabla_{\theta}\log p_{\theta}(\tau))^2(r(\tau)^2-2br(\tau) + b^2)\right] \\ &= -2\mathbb{E}\left[(\nabla_{\theta}\log p_{\theta}(\tau))^2r(\tau)\right] + 2b\mathbb{E}(\nabla_{\theta}\log p_{\theta}(\tau))^2 \end{align*}\]Setting it to \(0\), we have
\[\begin{equation} b^* = \frac{\mathbb{E}\left[(\nabla_{\theta}\log p_{\theta}(\tau))^2r(\tau)\right]}{\mathbb{E}(\nabla_{\theta}\log p_{\theta}(\tau))^2} \end{equation}\]\(b^*\) can be interpreted as the expected reward weighted by the squared gradient magnitude. When we set \(b\) to \(0\), we recover the original gradient. In RL, people usually just use the unweighted expected reward, i.e. \(b=\mathbb{E}r(\tau)\). So during training, the gradient with baseline is
\[\begin{equation} \nabla_{\theta}J(\theta) = \frac1N \sum_{i=1}^N\nabla_{\theta}\log p_{\theta}(\tau^i) (r(\tau^i) - b) \end{equation}\]where \(b = \frac1N \sum_{i=1}^Nr(\tau^i)\). Note that we don't actually know whether this baseline reduces variance, as it is probably not \(b^*\), but in practice people find that it stabilizes training.
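A sketch of this practical baseline (plain Python; `centered_returns` is a hypothetical helper name): subtract the batch-average return from each trajectory's return before weighting the gradient terms:

```python
def centered_returns(trajectory_returns):
    """Subtract the average-return baseline b = mean(r(tau^i)) from each return.

    The baseline is constant w.r.t. tau, so the gradient estimator stays
    unbiased; in practice centering the returns stabilizes training."""
    b = sum(trajectory_returns) / len(trajectory_returns)
    return [r - b for r in trajectory_returns]
```

The centered returns always sum to zero across the batch, so better-than-average trajectories get positive weight and worse-than-average ones get negative weight.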
Causality and baselines are the two most common techniques for reducing the variance of the policy gradient. There are other techniques as well; in the next lecture, we will introduce the actor-critic algorithm, which can be viewed as a low-variance variant of the policy gradient method.
In this section, we study how to make policy gradient methods more efficient; the idea is to turn the algorithm from on-policy into off-policy.
We will start from the original policy gradient and use the idea of importance sampling so that the expectation is no longer over the current policy. Specifically, we want the expectation to be over some old policy \(\pi_{\theta}\), i.e. \(\pi_{\theta}\) is many gradient steps before the current policy \(\pi_{\theta'}\).
\[\begin{align} \nabla_{\theta'}J(\theta') &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\sum_{t=1}^T\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{start}\\ &= \sum_{t=1}^{T}\mathbb{E}_{s_t, a_t \sim p_{\theta'}(s_t, a_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{switch} \\ &= \sum_{t=1}^{T}\int_{s_t, a_t}p_{\theta'}(s_t, a_t)\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \text{d}(s_t, a_t) \label{integral}\\ &= \sum_{t=1}^{T}\int_{s_t, a_t}p_{\theta}(s_t, a_t)\frac{p_{\theta'}(s_t, a_t)}{p_{\theta}(s_t, a_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \text{d}(s_t, a_t) \\ &= \sum_{t=1}^{T}\mathbb{E}_{s_t, a_t \sim p_{\theta}(s_t, a_t)}\frac{p_{\theta'}(s_t, a_t)}{p_{\theta}(s_t, a_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{importance_sampling} \\ &= \sum_{t=1}^{T}\mathbb{E}_{s_t, a_t \sim p_{\theta}(s_t, a_t)}\frac{p_{\theta'}(s_t)\pi_{\theta'}(a_t\mid s_t)}{p_{\theta}(s_t)\pi_{\theta}(a_t\mid s_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{marginal}\\ &\approx \sum_{t=1}^{T}\mathbb{E}_{s_t, a_t \sim p_{\theta}(s_t, a_t)}\enclose{downdiagonalstrike}{\frac{p_{\theta'}(s_t)}{p_{\theta}(s_t)}}\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{cross_out}\\ &= \sum_{t=1}^{T}\mathbb{E}_{s_t, a_t \sim p_{\theta}(s_t, a_t)}\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\nabla_{\theta'}\log \pi_{\theta'}(a_t\mid s_t) r(s_t, a_t) \label{res}\\ \end{align}\]From equation \(\ref{start}\) to \(\ref{switch}\), we switch the summation and expectation signs and also take the expectation over the marginal distribution of \((s_t, a_t)\) rather than the whole trajectory. From \(\ref{switch}\) to \(\ref{importance_sampling}\), we use importance sampling to change the underlying distribution of the expectation, where \(\theta\) is the parameter of some old policy (e.g.
the policy that is many gradient steps before the current policy). This step makes the gradient off-policy, as the samples come from the old policy rather than the current one. Note that the distributions \(p_{\theta}(s_t, a_t)\) and \(p_{\theta'}(s_t, a_t)\) are unknown if we don't know the transition model; therefore we write them as a product of the state marginal \(p_{\theta'}(s_t)\) and the policy \(\pi_{\theta'}(a_t\mid s_t)\) in \(\ref{marginal}\), and then cross out the state marginals in \(\ref{cross_out}\). Just crossing out this term leads to a systematic error in the gradient estimate, but we will see in a later lecture, when we introduce advanced policy gradient methods, that the error is bounded when the gap between \(\pi_{\theta}\) and \(\pi_{\theta'}\) is not too big.
Finally, we have equation \(\ref{res}\), which is the off-policy policy gradient. An intuitive interpretation is that the off-policy policy gradient is the on-policy policy gradient with each term weighted by \(\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_{\theta}(a_t\mid s_t)}\).
The Monte Carlo estimate of the off-policy policy gradient is
\[\begin{equation} \nabla_{\theta'}J(\theta') = \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\pi_{\theta'}(a^i_t\mid s^i_t)}{\pi_{\theta}(a^i_t\mid s^i_t)}\nabla_{\theta'}\log \pi_{\theta'}(a^i_t\mid s^i_t) r^i_t \end{equation}\]Where the trajectories are sampled from old policy \(p_{\theta}(\tau)\).
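The identity behind the importance weights can be checked exactly on a small discrete example (plain Python; the distributions, rewards, and names are my own invention, standing in for the old and current policies at a fixed state):

```python
def expectation(dist, f):
    """Exact E_{x ~ dist}[f(x)] for a discrete distribution {outcome: prob}."""
    return sum(p * f(x) for x, p in dist.items())

def is_expectation(target, behavior, f):
    """E_{x ~ target}[f] rewritten as E_{x ~ behavior}[(target/behavior) * f],
    the same reweighting trick used to make the policy gradient off-policy."""
    return sum(q * (target[x] / q) * f(x) for x, q in behavior.items())

pi_new = {0: 0.7, 1: 0.3}   # "current" policy pi_theta'
pi_old = {0: 0.5, 1: 0.5}   # "old" policy pi_theta that generated the data
reward = {0: 1.0, 1: 5.0}

lhs = expectation(pi_new, lambda x: reward[x])
rhs = is_expectation(pi_new, pi_old, lambda x: reward[x])
```

Both quantities agree exactly; with finitely many samples, only the reweighted form can be estimated from data collected under the old policy.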
We can also use the log derivative trick to write the off-policy policy gradient as
\[\begin{equation}\label{off_use} \nabla_{\theta'}J(\theta') = \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\nabla_{\theta'}\pi_{\theta'}(a^i_t\mid s^i_t)}{\pi_{\theta}(a^i_t\mid s^i_t)}r^i_t \end{equation}\]This will be useful when we implement the algorithm (see next section).
Policy gradient methods use gradient ascent to optimize the objective
\[J(\theta) = \frac1N \sum_{i=1}^{N}\sum_{t=1}^Tr^i_t\]Since we parameterize the policy \(\pi_{\theta}\) with a neural network, we use an automatic differentiation package like TensorFlow or PyTorch to calculate the gradient and update the weights. However, an autodiff package will by default not compute the policy gradient we derived but instead try to backprop through samples, which does not work because we don't know the trajectory distribution or the reward function. How do we specify the gradient we want the autodiff package to use when it updates the weights?
One solution is to derive the gradient manually and write it explicitly in our code. But this can be very tedious and essentially brings us back to the pre-autodiff era.
Luckily, we can use a pseudo-objective to trick the autodiff package into deriving the policy gradient during backprop and using it to update the weights. Recall that the policy gradient is
\[\begin{equation*} \nabla_{\theta}J(\theta) = \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_{\theta}\log \pi_{\theta}(a^i_t\mid s^i_t) r^i_t \end{equation*}\]The objective we provide to autodiff packages is
\[\begin{equation*} J(\theta) = \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\log \pi_{\theta}(a^i_t\mid s^i_t) r^i_t \end{equation*}\]The derivative of this is exactly the policy gradient.
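We can verify numerically that differentiating this pseudo-objective gives the policy gradient. Here is a finite-difference sketch (plain Python, softmax policy over two actions in a single state; all names and data are my own):

```python
import math

def log_pi(theta, a):
    """log pi_theta(a) for a softmax policy over two actions."""
    z = math.log(math.exp(theta[0]) + math.exp(theta[1]))
    return theta[a] - z

def pseudo_objective(theta, actions, rewards):
    """(1/N) * sum_i log pi_theta(a_i) * r_i -- the objective handed to autodiff."""
    n = len(actions)
    return sum(log_pi(theta, a) * r for a, r in zip(actions, rewards)) / n

def analytic_policy_gradient(theta, actions, rewards):
    """(1/N) * sum_i grad log pi_theta(a_i) * r_i, using
    grad log pi(a) = one_hot(a) - softmax(theta)."""
    z = math.exp(theta[0]) + math.exp(theta[1])
    probs = [math.exp(t) / z for t in theta]
    n = len(actions)
    grad = [0.0, 0.0]
    for a, r in zip(actions, rewards):
        for i in range(2):
            grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * r / n
    return grad

actions, rewards = [0, 1, 0], [1.0, 0.5, 2.0]
theta = [0.3, -0.2]
eps = 1e-6
# Central finite differences of the pseudo-objective in each coordinate.
numeric = [
    (pseudo_objective([theta[0] + eps, theta[1]], actions, rewards)
     - pseudo_objective([theta[0] - eps, theta[1]], actions, rewards)) / (2 * eps),
    (pseudo_objective([theta[0], theta[1] + eps], actions, rewards)
     - pseudo_objective([theta[0], theta[1] - eps], actions, rewards)) / (2 * eps),
]
```

The finite-difference gradient of the pseudo-objective matches the policy gradient formula, which is all the autodiff trick relies on.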
For the off-policy policy gradient, similarly, to get gradient \(\ref{off_use}\), the pseudo-objective we provide in the code is
\[\begin{equation*} J(\theta') = \frac1N \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{\pi_{\theta'}(a^i_t\mid s^i_t)}{\pi_{\theta}(a^i_t\mid s^i_t)}r^i_t \end{equation*}\]
Lastly, here are some recommended papers on policy gradient methods