Deep RL 10 Model-based Reinforcement Learning
The previous lecture was mainly about how to plan actions when the dynamics are known. In this lecture, we study how to learn the dynamics. We will also introduce how to incorporate planning into the model-learning process, thereby forming a complete decision-making algorithm.
Again, most of the algorithms will be introduced in the context of deterministic dynamics, i.e. $s_{t+1} = f(s_t, a_t)$.
1 Basic Model-based RL
How do we learn a model? The most direct way is supervised learning. Similar to the idea used before, we run a random policy to collect transitions and then fit a neural net to them:
1. run base policy $\pi_0(a_t \mid s_t)$ (e.g. a random policy) to collect $\mathcal{D} = \{(s, a, s')_i\}$
2. learn dynamics model $f_\phi(s, a)$ to minimize $\sum_i \| f_\phi(s_i, a_i) - s'_i \|^2$
3. plan through $f_\phi(s, a)$ to choose actions.
In step 3, we can use CEM, MCTS, LQR, etc.
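To make steps 1 and 2 concrete, here is a minimal sketch (in PyTorch, which the lecture does not prescribe) of fitting a deterministic dynamics model $f_\phi(s, a) \approx s'$ by plain regression on collected transitions; the network sizes, tensor names, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Deterministic dynamics model f_phi(s, a) -> predicted next state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def fit_dynamics(model, states, actions, next_states, epochs=200, lr=1e-3):
    """Step 2: minimize sum_i ||f_phi(s_i, a_i) - s'_i||^2 over the dataset D."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```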
Does this work? Well, in some cases. For example, if we have a full physics model of the dynamics and only need to fit a few parameters, this method can work. But still, some care should be taken to design a good base policy.
In general, however, this doesn't work, and the reason is very similar to the one we encountered in imitation learning: distribution shift.
The data we used to learn the dynamics comes from the trajectory distribution induced by the base (random) policy, i.e. states sampled from $p_{\pi_0}(s_t)$. But when we plan through the learned model, the resulting behavior visits states from a different distribution $p_{\pi_f}(s_t)$, on which the model has seen little or no data, so its predictions there are unreliable.
How do we deal with it? The same way DAgger deals with distribution shift in imitation learning: we just need to make sure that the training data comes from the distribution induced by the current policy (the one obtained by planning through the current model). This leads to the first practical model-based RL algorithm:
1. run base policy $\pi_0(a_t \mid s_t)$ (e.g. a random policy) to collect $\mathcal{D} = \{(s, a, s')_i\}$
2. learn dynamics model $f_\phi(s, a)$ to minimize $\sum_i \| f_\phi(s_i, a_i) - s'_i \|^2$
3. plan through $f_\phi(s, a)$ to choose actions.
4. execute those actions and add the resulting transitions $\{(s, a, s')_j\}$ to $\mathcal{D}$. Go to step 2.
However, even though the data is updated based on the learned dynamics, as long as we replan, each new plan will induce a trajectory distribution that is a little different from the previous one. In other words, the distribution shift will always exist. Moreover, as we plan through the learned dynamics and execute the whole plan open-loop, small model errors accumulate along the trajectory and push us into states the model has never seen.
We can improve this algorithm by executing only the first planned action, observing the next state that this action leads to, replanning from that state, taking the first action of the new plan, and so on. In short, at each step we take only the first planned action, observe the resulting state, and replan from there. Because at every time step the action is chosen based on the actual state, this is more reliable than executing the whole planned action sequence in one go. The algorithm is:
1. run base policy $\pi_0(a_t \mid s_t)$ (e.g. a random policy) to collect $\mathcal{D} = \{(s, a, s')_i\}$
2. learn dynamics model $f_\phi(s, a)$ to minimize $\sum_i \| f_\phi(s_i, a_i) - s'_i \|^2$
3. plan through $f_\phi(s, a)$ to choose actions.
4. execute the first planned action and add the resulting transition $(s, a, s')$ to $\mathcal{D}$. If we reach the predefined maximal number of planning steps, go to step 2; else, go to step 3.
This algorithm is called Model Predictive Control, or MPC. Replanning at each time step can drastically increase the computational load, so people sometimes choose to shorten the time horizon of the planned trajectory. While this might decrease the quality of each individual plan, since we are constantly replanning, we can tolerate individual plans being less perfect.
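Below is a minimal sketch of the MPC loop with random shooting as the planner: sample candidate action sequences, roll each out through the learned model, pick the sequence with the highest predicted return, and execute only its first action before replanning. The `reward_fn`, the gym-style 4-tuple `env.step` interface, and all hyperparameters are assumptions for illustration; CEM, MCTS, or LQR could replace the random-shooting planner.

```python
import torch

def plan_random_shooting(model, reward_fn, s0, horizon=15, n_candidates=1000, action_dim=2):
    """Score random action sequences under the learned model and return
    the first action of the best one."""
    actions = torch.rand(n_candidates, horizon, action_dim) * 2 - 1  # candidates in [-1, 1]
    s = s0.expand(n_candidates, -1).clone()        # replicate the current state
    total_return = torch.zeros(n_candidates)
    with torch.no_grad():
        for t in range(horizon):
            a = actions[:, t]
            total_return += reward_fn(s, a)        # reward of predicted states
            s = model(s, a)                        # predicted next states
    best = torch.argmax(total_return)
    return actions[best, 0]                        # execute only the first action

def mpc_episode(env, model, reward_fn, dataset, max_steps=200):
    """Step 4: replan at every time step and grow the dataset D."""
    obs = env.reset()
    for _ in range(max_steps):
        s = torch.as_tensor(obs, dtype=torch.float32)
        a = plan_random_shooting(model, reward_fn, s)
        next_obs, reward, done, info = env.step(a.numpy())
        dataset.append((obs, a.numpy(), next_obs))  # add the observed transition to D
        obs = next_obs
        if done:
            break
    return dataset
```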
2 Uncertainty-Aware Model-based RL
Since we plan actions relying on the fitted dynamics, whether or not the dynamics model is a good representation of the world is crucial. When we use a high-capacity model like a neural network, we usually need a lot of data to get a good fit. But in model-based RL we usually don't have much data at the beginning; in fact, we only have some bad data (generated by running a random policy). A neural network fit on this data will overfit it and will not represent the good parts of the world well. This leads the algorithm to take bad actions, which lead to bad states, which produce a neural net dynamics model trained only on bad trajectories, whose predictions on good states are unreliable, which again leads the algorithm to take bad actions, and so on. This seems to be a chicken-and-egg problem, but if you think about it, the root cause is that planning on top of unconfident state predictions can lead to bad actions.
The solution is to quantify the uncertainty of the model and to take this uncertainty into account during planning.
First of all, it's important to know that the uncertainty of a model is not the same thing as the probability the model assigns to its prediction of some state. Uncertainty here is not about the setting where the dynamics are noisy (statistical, or aleatoric, uncertainty), but about the setting where we don't know what the dynamics are (model, or epistemic, uncertainty).
The way to avoid taking risky actions in uncertain states is to plan based on the expected expected reward. Wait, what is that? Yes, this is not a typo: the first expectation is with respect to the model uncertainty, and the second is with respect to the trajectory distribution. Mathematically, the objective is
$$\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]\right].$$
Having an uncertainty-aware formulation, the next steps are:
- how to get the posterior over dynamics parameters $p(\theta \mid \mathcal{D})$
- how to actually plan actions to optimize this objective
2.1 Uncertainty-Aware Neural Networks
In this subsection we discuss how to get $p(\theta \mid \mathcal{D})$.
The first approach is Bayesian Neural Networks, or BNN. To consider the problem from a Bayesian perspective, we can first rethink our original approach, i.e. what is it that we are estimating when doing supervised training in step 2 in MPC? (Here we write it slightly differently for illustration)
learn dynamics model $p_\theta(s_{t+1} \mid s_t, a_t)$ to minimize the negative log-likelihood $-\sum_i \log p_\theta(s'_i \mid s_i, a_i)$
The answer is that we are computing a single point estimate of the parameters, the maximum likelihood estimate $\theta_{\mathrm{MLE}} = \arg\max_\theta \log p(\mathcal{D} \mid \theta)$, with no notion of how confident we are in that estimate.
Adopting the Bayesian approach, we instead want to estimate the posterior distribution $p(\theta \mid \mathcal{D})$ and predict by marginalizing over it: $p(s_{t+1} \mid s_t, a_t, \mathcal{D}) = \int p_\theta(s_{t+1} \mid s_t, a_t)\, p(\theta \mid \mathcal{D})\, d\theta$.
However, this calculation is usually intractable. In the neural network setting, people usually resort to variational inference, which approximates the intractable true posterior $p(\theta \mid \mathcal{D})$ with a tractable variational distribution $q_\phi(\theta)$.
We define the variational posterior to be a fully factorized Gaussian:
$$q_\phi(\theta) = \prod_i \mathcal{N}(\theta_i;\, \mu_i, \sigma_i^2),$$
where each weight $\theta_i$ has its own mean $\mu_i$ and variance $\sigma_i^2$. This is a crude approximation that ignores correlations between weights, but it only doubles the number of parameters and is straightforward to train.
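As a rough sketch of what the fully factorized Gaussian posterior looks like in code, here is a Bayesian linear layer in the style of Bayes by Backprop: every weight has its own mean and a softplus-parameterized standard deviation, and a fresh weight sample is drawn on each forward pass with the reparameterization trick. The parameterization choices are assumptions, and the KL term against the prior needed for the full variational objective is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with a fully factorized Gaussian posterior over its weights:
    q_phi(theta) = prod_i N(theta_i; mu_i, sigma_i^2)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(out_dim, in_dim))
        self.w_rho = nn.Parameter(torch.full((out_dim, in_dim), -3.0))  # sigma = softplus(rho)
        self.b_mu = nn.Parameter(torch.zeros(out_dim))
        self.b_rho = nn.Parameter(torch.full((out_dim,), -3.0))

    def forward(self, x):
        # Reparameterization trick: theta = mu + sigma * eps, eps ~ N(0, I).
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)
```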
The second approach, which is conceptually simpler and usually works better than BNNs, is bootstrap ensembles. The idea is to train many independent neural dynamics models and average them. Mathematically, we learn $N$ independent neural network parameter vectors $\theta_1, \ldots, \theta_N$ and approximate the posterior as
$$p(\theta \mid \mathcal{D}) \approx \frac{1}{N} \sum_{i=1}^N \delta(\theta - \theta_i),$$
where $\delta(\cdot)$ is the Dirac delta, so the predictive distribution becomes $\frac{1}{N} \sum_{i=1}^N p_{\theta_i}(s_{t+1} \mid s_t, a_t)$. But how do we get the independent models $\theta_i$? The classical bootstrap trains each model on its own dataset $\mathcal{D}_i$, obtained by sampling $|\mathcal{D}|$ transitions from $\mathcal{D}$ with replacement.
In practice, people find that for neural dynamics models it is not necessary to resample the data. Instead, the networks are trained on the same dataset but with different random seeds (and hence different initializations). The randomness of initialization and SGD makes the ensemble members sufficiently independent.
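A minimal sketch of this recipe, reusing the hypothetical `DynamicsModel` and `fit_dynamics` from the earlier sketch: the ensemble members share the dataset and differ only in their random seed, with an optional flag showing what classical bootstrap resampling would do instead.

```python
import torch

def train_ensemble(states, actions, next_states, n_models=5, bootstrap=False):
    """Train N dynamics models that differ only in random seed
    (and optionally in a resampled-with-replacement copy of the data)."""
    state_dim, action_dim = states.shape[-1], actions.shape[-1]
    models = []
    for i in range(n_models):
        torch.manual_seed(i)                                   # different seed per member
        if bootstrap:
            idx = torch.randint(len(states), (len(states),))   # sample |D| indices with replacement
            s, a, s_next = states[idx], actions[idx], next_states[idx]
        else:
            s, a, s_next = states, actions, next_states        # same data, different seed
        model = DynamicsModel(state_dim, action_dim)           # from the earlier sketch
        models.append(fit_dynamics(model, s, a, s_next))
    return models
```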
2.2 Plan with Uncertainty
Having uncertainty-aware dynamics, i.e. a distribution over dynamics models, it is very natural to derive an uncertainty-aware MPC algorithm. Recall that in the MPC algorithm we plan using the objective $J(a_1, \ldots, a_H) = \sum_{t=1}^H r(s_t, a_t)$, where $s_{t+1} = f(s_t, a_t)$ is the single learned dynamics model.
Now the objective has changed to $J(a_1, \ldots, a_H) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^H r(s_{t,i}, a_t)$, where $s_{t+1,i} = f_{\theta_i}(s_{t,i}, a_t)$ and $f_{\theta_i}$ is the $i$-th sampled dynamics model, i.e. the predicted return is averaged over the $N$ sampled models.
With this, we can write out the uncertainty-aware MPC algorithm:
1. run base policy $\pi_0(a_t \mid s_t)$ (e.g. a random policy) to collect $\mathcal{D} = \{(s, a, s')_i\}$
2. estimate the posterior distribution over dynamics parameters $p(\theta \mid \mathcal{D})$
3. sample $N$ dynamics models $\theta_1, \ldots, \theta_N$ from $p(\theta \mid \mathcal{D})$
4. plan through the ensemble of dynamics models to choose actions.
5. execute the first planned action and add the resulting transition $(s, a, s')$ to $\mathcal{D}$. If we reach the predefined maximal number of planning steps, go to step 2; else, go to step 3.
You might notice that this algorithm does not seem to use the uncertainty-aware objective explicitly. In fact it does: in step 4, each candidate action sequence is evaluated by rolling it out under every sampled dynamics model and averaging the resulting returns, which is exactly a Monte Carlo approximation of the expectation over model uncertainty.
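To make that connection explicit, here is how the candidate-scoring step of the earlier random-shooting planner could change: each candidate action sequence is rolled out under every ensemble member, and its score is the return averaged over the members, a Monte Carlo estimate of the expectation over $p(\theta \mid \mathcal{D})$. As before, `reward_fn` and the tensor shapes are illustrative assumptions.

```python
import torch

def score_candidates_with_ensemble(models, reward_fn, s0, actions):
    """actions: (n_candidates, horizon, action_dim) candidate sequences.
    Returns the return of each candidate averaged over the ensemble."""
    n_candidates, horizon, _ = actions.shape
    total = torch.zeros(n_candidates)
    with torch.no_grad():
        for model in models:                       # outer average over sampled theta_i
            s = s0.expand(n_candidates, -1).clone()
            ret = torch.zeros(n_candidates)
            for t in range(horizon):               # rollout under f_{theta_i}
                a = actions[:, t]
                ret += reward_fn(s, a)
                s = model(s, a)
            total += ret
    return total / len(models)

# The planner then takes the argmax over candidates and, as in plain MPC,
# executes only the first action of the best sequence before replanning.
```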
3 Model-Based RL with Images
Previously we've been assuming that the state is observable, because we've been using transitions $(s, a, s')$ as training data. What if we only have access to high-dimensional observations such as images? Learning a dynamics model directly in observation space raises several difficulties:
- High dimensionality. We are fitting $p(o_{t+1} \mid o_t, a_t)$; if $o_t$ is an image, its dimension (height × width × channels) can be very large, which makes accurate prediction very difficult.
- Redundancy. Many parts of the images stay unchanged during the whole process, which leads to redundancy in the data.
- Partial observability. There are things that a static image cannot directly represent, such as velocity and acceleration; you might derive these from a sequence of images, but that requires extra, potentially nontrivial effort and might not be accurate.
We will now introduce state-space models for POMDPs, which treat the states as latent variables and model the observations with distributions conditioned on the states.
Let's recall how the dynamics are learned when we assume states are observable. We parameterize the dynamics with a neural net with parameters $\phi$, i.e. $p_\phi(s_{t+1} \mid s_t, a_t)$. Note that we slightly abuse notation for clarity; for example, $s_{t,i}$ denotes the state at time step $t$ of the $i$-th trajectory. We then solve for $\phi$ by maximum likelihood:
$$\max_\phi \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \log p_\phi(s_{t+1,i} \mid s_{t,i}, a_{t,i}).$$
Now consider the case where the state is not observable. We have:
$$\max_\phi \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \mathbb{E}\!\left[\log p_\phi(s_{t+1,i} \mid s_{t,i}, a_{t,i}) + \log p_\phi(o_{t,i} \mid s_{t,i})\right],$$
where the expectation is taken over the posterior of the latent states, $(s_t, s_{t+1}) \sim p(s_t, s_{t+1} \mid o_{1:T}, a_{1:T})$. We maximize this objective with respect to $\phi$, which now parameterizes both the latent dynamics $p_\phi(s_{t+1} \mid s_t, a_t)$ and the observation model $p_\phi(o_t \mid s_t)$.
One issue is that, by Bayes' rule, the posterior $p(s_t, s_{t+1} \mid o_{1:T}, a_{1:T})$ involves intractable integrals over the latent states, so we cannot compute the expectation exactly. Instead, we learn an approximate posterior (an encoder) $q_\psi(s_t \mid o_{1:t}, a_{1:t})$, or, in the simplest case, a deterministic single-step encoder $s_t = g_\psi(o_t)$. Plugging the deterministic encoder into the objective gives
$$\max_{\phi, \psi} \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \log p_\phi\big(g_\psi(o_{t+1,i}) \mid g_\psi(o_{t,i}), a_{t,i}\big) + \log p_\phi\big(o_{t,i} \mid g_\psi(o_{t,i})\big).$$
We maximize this to find $\phi$ and $\psi$ jointly; everything is differentiable, so we can train with backpropagation.
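As a rough sketch of the simplest variant, a deterministic single-step encoder trained jointly with a (here also deterministic) latent dynamics model and an observation reconstruction term, one possible PyTorch implementation is below. Flattened observations, plain MLPs, squared-error losses (Gaussian likelihoods up to constants), and equal loss weights are all assumptions.

```python
import torch
import torch.nn as nn

class LatentSpaceModel(nn.Module):
    """Encoder s_t = g_psi(o_t), latent dynamics s_{t+1} = f_phi(s_t, a_t),
    and observation decoder reconstructing o_t from s_t."""
    def __init__(self, obs_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, obs_dim))

    def loss(self, o_t, a_t, o_next):
        s_t = self.encoder(o_t)
        s_next = self.encoder(o_next)
        # Dynamics term: the predicted next latent should match the encoded next observation.
        dyn_loss = ((self.dynamics(torch.cat([s_t, a_t], dim=-1)) - s_next) ** 2).mean()
        # Observation term: the latent must retain enough information to reconstruct o_t.
        rec_loss = ((self.decoder(s_t) - o_t) ** 2).mean()
        return dyn_loss + rec_loss
```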
Lastly, if we want to plan using iLQR (or just to plan better), we usually also need to model the cost (or reward) function. It can be modeled as a deterministic function of the latent state, e.g. $c_t = c_\phi(s_t, a_t)$, or as a distribution $p_\phi(c_t \mid s_t, a_t)$, and learned jointly with the dynamics and observation models.
Finally, I want to point out that sometimes it is difficult to build a compact state space for the observations, and directly modeling observations and predicting future observations can actually work better: instead of modeling a latent dynamics $p(s_{t+1} \mid s_t, a_t)$ together with an observation model $p(o_t \mid s_t)$, we directly model $p(o_{t+1} \mid o_t, a_t)$, as in video-prediction models.
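For completeness, a sketch of what such a direct observation-space model could look like: an action-conditioned next-frame predictor that maps $(o_t, a_t)$ straight to a predicted $o_{t+1}$ with no explicit latent state. The convolutional architecture and the way the action is injected are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Directly model o_{t+1} = f(o_t, a_t) in pixel space, with no latent state."""
    def __init__(self, channels=3, action_dim=4):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.action_proj = nn.Linear(action_dim, 64)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, stride=2, padding=1),
        )

    def forward(self, obs, action):
        h = self.enc(obs)                                # (B, 64, H/4, W/4) feature map
        a = self.action_proj(action)[:, :, None, None]   # broadcast the action over space
        return self.dec(h + a)                           # predicted next frame, same size as obs
```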