This brings us to two big families of algorithms,
the on-policy and the off-policy ones.
You might want to remember those terms.
So, let's recap how they
work, because the main intuition is what we just covered.
On-policy algorithms, like SARSA,
assume that their experience comes from the agent itself,
and they try to improve the agent's policy right inside this online loop.
So, the agent plays and then it improves and plays again.
And what you want is to arrive at an optimal strategy as quickly as possible.
Off-policy algorithms, like Q-learning,
work in a slightly more relaxed setting.
They don't assume that the sessions you obtain and train on
are generated by the same policy the agent is going to follow
when it is finally set loose on the actual problem.
For example, you may train your agent to,
well, to find a policy for a bipedal robot to walk forward.
In this case, for example,
you might use a 3D simulation in which you
train your Q-function with any exploration you want.
So, you can use this cheap virtual simulation in which you train your robot,
and in which you don't suffer any cost when your robot falls down.
So, you can set a large epsilon,
and you can train for a long time without any regret,
except maybe for CPU power.
And then you want your agent to learn
not how to behave optimally under
this large-epsilon exploration, but how to find the optimal policy,
the one that always picks the optimal action.
So, basically, it first trains with
this epsilon-greedy or maybe Boltzmann-based exploration,
and then the exploration goes away and it picks the optimal action all the time.
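To make the difference concrete, here is a minimal tabular sketch in Python; the sizes, hyperparameters, and the epsilon_greedy helper are all hypothetical, not code from this course. SARSA bootstraps from the action the agent will actually take next, while Q-learning bootstraps from the greedy action regardless of how the data was collected.

import numpy as np

# Hypothetical tabular setup; in practice these come from your environment.
n_states, n_actions = 10, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.3
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(Q, s, eps):
    # Behaviour policy: random action with probability eps, greedy otherwise.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the target uses a_next, the action the agent will really take.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]

def q_learning_update(s, a, r, s_next):
    # Off-policy: the target uses the greedy action in s_next,
    # no matter which policy produced the transition.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]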
Another possible setting for off-policy algorithms, where Q-learning is
again a good way to improve
your reinforcement learning agent, is when you are
training your agent on a policy which is different from its current policy.
For example, you're trying
to teach your self-driving car to drive in a particular situation,
and you pre-train it not on
its own actions but on the actions of an actual human driver.
In this case, you're probably saving yourself a lot of money, and maybe more than that.
The issue here is that at the beginning,
your agent is nowhere near an optimal policy,
so you're not ready to trust it with control of an actual car.
So you have actions that are not always optimal,
because humans don't have perfect reaction time,
but they are often close to optimal, and you want your agent
to learn how to behave optimally from this human,
somewhat imperfect, input.
This also gives you a relatively cheap start, and it's a setup where you can
simply improve the performance of your algorithm later on,
once it starts training on its own sessions.
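As a rough sketch of this idea in the same hypothetical Python setup as above: pre-training on demonstrations is just the usual off-policy update applied to someone else's transitions. The load_human_driving_log helper is made up for illustration, not a real API.

# Hypothetical log of human driving: a list of (s, a, r, s_next) tuples
# collected by the human driver, not by the agent's own policy.
human_transitions = load_human_driving_log()  # assumed helper, for illustration only

for s, a, r, s_next in human_transitions:
    # Off-policy: the action a came from the human, but the Q-learning target
    # still bootstraps from the agent's own greedy action in s_next.
    q_learning_update(s, a, r, s_next)

# Afterwards, the agent keeps training on its own sessions as usual.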
So, again, we have Q-learning, SARSA, and expected value SARSA.
We just covered how Q-learning and
SARSA relate to this on-policy versus off-policy dichotomy.
Now, the only thing remaining is how expected value SARSA fits in.
The question here is: can you train expected value SARSA in an off-policy scenario?
Can you train it on actions that are not
the actions taken under its current policy?
Can you adjust it, say, for
training on human data,
pre-training on it before training on its own sessions?
Well, yes, you can.
The point with expected value SARSA is that there is an expectation,
and you are free to set the probability distribution for that expectation any way you want.
If you set, well,
the expectation over something that resembles the human probabilities of taking actions,
if you fit the human policy and use its probabilities to pick actions,
you'll probably get an on-policy algorithm.
If you take an epsilon-greedy policy and set
epsilon to the small value that
you're actually going to use after you deploy the algorithm,
or even to zero, you'll get something that's a lot like Q-learning,
but it accounts for a different policy,
one that picks the optimal action with that specified probability.
And basically this is the most universal version.
You set an expectation,
and if you set the probability of the optimal action to one,
you'll get expected value SARSA exactly equal to Q-learning, and therefore off-policy.
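Here's a minimal sketch of that knob, reusing the hypothetical tabular setup from before: the expectation in the target is taken under an epsilon-greedy distribution whose epsilon you choose yourself, and setting it to zero collapses the update into Q-learning.

def epsilon_greedy_probs(Q, s, eps):
    # Probability of each action under an epsilon-greedy policy.
    probs = np.full(n_actions, eps / n_actions)
    probs[int(np.argmax(Q[s]))] += 1.0 - eps
    return probs

def expected_sarsa_update(s, a, r, s_next, target_eps):
    # The expectation is over the policy you plan to deploy,
    # not necessarily the one that generated the transition.
    probs = epsilon_greedy_probs(Q, s_next, target_eps)
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]

# With target_eps = 0 the expectation puts all mass on the greedy action,
# and this is exactly the Q-learning update.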
Okay, so, there's also this neat question of whether the crossentropy method,
the first reinforcement learning method we ever studied, is on-policy or off-policy.
This is slightly controversial, but I want you to take a try at this one.
Well, again, it's kind of strange, but the issue here is that
the crossentropy method technically requires you to sample sessions from your current policy.
You can, of course, modify it in some way to allow for different strategies.
But if you train it on a policy which is clearly sub-optimal,
if you always pick samples from that policy,
then there's no way you're going to improve beyond the,
well, the elite sessions selected from it.
So technically it's on-policy only.
Now let's see how we're going to exploit this on-policy,
off-policy distinction to get some benefits for our practical problems.
There's a very famous trick that fits off-policy methods like Q-learning;
in fact, you might have heard about it.
The name is experience replay,
and most people associate it with neural network
based methods, but it's kind of generic. It can be applied anywhere.
The idea here is that you can train your agent not
just on the immediate (state, action, reward, next state) tuples.
You can actually record its previous interactions,
and train iteratively on transitions sampled from this large pool.
So you're playing a game, say, again, you're trying
to make that bipedal robot go forward without falling,
and instead of making one update on every step,
what you do is record, say, the 10,000 previous steps,
and you sample a hundred random transitions from this pool,
this huge cylinder here.
And you make the usual Q-value update,
Q(s, a) ← α · (new Q-value) + (1 − α) · (old Q-value),
for the state, action, reward, and next state sampled from the pool.
Of course those samples, if you record the last 10,000 iterations,
are probably going to come from policies worse than your current one.
So, if you have just learned to walk upright,
then 10,000 iterations ago
you probably hadn't learned that yet,
and you won't get samples where you're walking upright.
But otherwise, this allows you to cheat your way
into a hundred times more frequent updates,
making, kind of,
hundreds of virtual updates per one real interaction.
This gives you a lot of profit when sampling,
when obtaining ⟨s, a, r, s'⟩, is actually very expensive.
For example, if you're using an actual physical car or robot to get
⟨s, a, r, s'⟩ by actually moving it in a physical environment.
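Here is a minimal sketch of this trick in the same hypothetical tabular setup as above; the buffer size and batch size just mirror the numbers used in the lecture, and the env.step interface is the classic Gym-style one, assumed here for illustration.

from collections import deque
import random

replay_buffer = deque(maxlen=10_000)  # keep roughly the last 10,000 transitions

def play_and_train_step(env, s, eps, batch_size=100):
    # One real interaction with the environment.
    a = epsilon_greedy(Q, s, eps)
    s_next, r, done, info = env.step(a)  # classic Gym-style step, assumed interface
    replay_buffer.append((s, a, r, s_next))

    # ...and many "virtual" updates from recorded transitions.
    if len(replay_buffer) >= batch_size:
        for s_b, a_b, r_b, s_next_b in random.sample(list(replay_buffer), batch_size):
            q_learning_update(s_b, a_b, r_b, s_next_b)

    return s_next, done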
This is actually a part of your bonus assignments,
so you'll have a more detailed description of it later.
Finally, this idea of experience replay is going to be very
popular among the neural network based deep reinforcement learning methods.
We'll study those methods next week. Until then.