All right. Let's start with where we stopped in
the last course and have a quick recap about Markov Decision Processes,
the Bellman equation, and their relation to reinforcement learning.
After we go over these topics to refresh our memories in this lesson, in the next lesson,
we will spend some time converting one of the most famous classical financial problems
into a Markov Decision Process problem that we
will use to test different reinforcement learning algorithms.
So, to recap, reinforcement learning deals with
an agent that interacts with the environment in the setting of
sequential decision making by choosing
optimal actions among many possible actions at each step of such process.
In our first course, we referred to such tasks of machine learning as action tasks.
The agent perceives the environment by having
information about the state S_t of the environment.
The environment may have some complex dynamics; therefore,
reinforcement learning tasks involve some planning and forecasting into the future.
Moreover, the actions A_t that the agent picks
at each step to optimize its longer-term goals
may themselves impact the state of the environment.
And this creates a feedback loop in which
the agent's current action A_t may change the next state of
the system, which in turn may have an impact on
what action the agent will need to pick at the next time step.
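To make this loop concrete, here is a minimal Python sketch of the interaction cycle. The Environment and Agent classes below are toy placeholders invented purely for illustration, not code from the course:

    import random

    class Environment:
        """A toy two-state environment, purely for illustration."""
        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            # The next state depends on both the current state and the action:
            # this dependence on the action is exactly the feedback loop.
            self.state = (self.state + action) % 2
            reward = 1.0 if self.state == 1 else 0.0
            return self.state, reward

    class Agent:
        def act(self, state):
            # A placeholder decision rule; a real agent would optimize this.
            return random.choice([0, 1])

    env, agent = Environment(), Agent()
    state = env.reset()
    for t in range(5):
        action = agent.act(state)          # A_t is chosen based on S_t ...
        state, reward = env.step(action)   # ... and S_{t+1} depends on A_t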
The presence of such a feedback loop is unique to reinforcement learning.
No feedback loops ever appear in supervised or unsupervised learning.
And this is because there is no question of optimizing actions in these settings,
as the action in these tasks is always the same.
For example, in unsupervised learning,
our task may be to cluster data.
And clearly, in this case,
the data does not care about how we or an agent looks at it.
So, there is no feedback loop.
We also talked about two possible settings for reinforcement learning.
Online reinforcement learning proceeds in real time.
In this setting, an agent directly interacts with
the environment and chooses its actions at every time step,
once it gets information about the new state of the environment.
A vacuum cleaning robot would be a good example of online reinforcement learning.
Another possible setting is called batch mode or off-line reinforcement learning.
In this case, the agent does not have on-demand access to the environment.
Instead, it only has access to some data that stores
a history of interaction of some other agent or a human with this environment.
This data should contain records of states of the environment,
actions taken, and the rewards received for each time step in the history.
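As a rough sketch of what such stored data might look like in code (the field names and numbers here are purely hypothetical):

    # One record per time step: state, action taken, reward received, next state.
    # The exact fields and their encoding are an assumption made for illustration.
    history = [
        {"t": 0, "state": 0, "action": 1, "reward": 0.0, "next_state": 1},
        {"t": 1, "state": 1, "action": 0, "reward": 1.0, "next_state": 1},
        {"t": 2, "state": 1, "action": 1, "reward": 0.5, "next_state": 0},
    ]
    # A batch-mode agent learns only from such stored records,
    # without ever querying the environment itself.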
Now, when it comes to various types of environments,
we talked about two possible approaches.
If the environment is completely observable,
its dynamics can be modeled as a Markov Process.
Markov processes are characterized by a short memory.
The future in these models depends not on the whole history,
but only on the current state.
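Written as a formula, the Markov property of the state dynamics says that

    P(S_{t+1} | S_t, S_{t-1}, ..., S_0) = P(S_{t+1} | S_t)

so conditioning on the whole history adds nothing beyond conditioning on the current state.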
The second possibility is a partially observable environment
where some variables that are important for the dynamics are not observable.
As we discussed in the last course, such situations
can be modeled using dynamic latent variable models,
for example, hidden Markov models.
In this course, we will be primarily concerned with fully observable systems.
So, we will stick to Markov processes for a while.
Now, as we outlined in the last course,
the proper mathematical formalism that incorporates the agent's actions into
some Markov dynamics for the environment is called
Markov Decision Processes, or MDPs for short.
Let's go over this framework once again.
Here, you see a diagram describing a Markov Decision Process.
The blue circles show the evolving state of the system S_t at discrete time steps.
These states are connected by arrows
that represent causality relations.
Only one arrow enters each blue circle
from the previous blue circle, which emphasizes the Markov property of the dynamics,
meaning that each next state depends only on
the previous state but not on the whole history of previous states.
The green circles denote actions A_t taken by the agent.
The upward-pointing arrows denote rewards
R_t received by the agent upon taking actions A_t.
Now, in mathematical terms,
a Markov Decision Process is characterized by the following elements.
First, we have a space of states S,
so that each observed state S_t belongs to this space.
The space S can be discrete or continuous.
Second, there are actions A_t that belong to a space of actions called A.
Next, an MDP needs to know transition probabilities P that define probabilities of
the next state S_{t+1} given
a previous state S_t and an action A_t taken in this state.
Further, we need a reward function R that gives
a reward received in a given state upon taking a given action.
So, it maps the cross product of the spaces S and A onto a real number.
And finally, an MDP needs to specify a discount factor gamma,
which is a number between zero and one.
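To make these elements concrete, here is a minimal Python sketch of a tiny MDP specification. The particular states, actions, probabilities, and rewards are invented purely for illustration:

    # A toy MDP with two states and two actions; all numbers are made up.
    states = [0, 1]        # state space S
    actions = [0, 1]       # action space A

    # Transition probabilities P(s' | s, a), stored as P[(s, a)][s'].
    P = {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.5, 1: 0.5},
        (1, 1): {0: 0.1, 1: 0.9},
    }

    # Reward function R(s, a): maps a state-action pair to a real number.
    R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}

    gamma = 0.9            # discount factor, a number between zero and one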
We need the discount factor gamma
to compute the total cumulative reward, given by the sum of all single-step rewards,
where each next term gets an extra power of gamma in the sum.
This means that the discount factor for an MDP plays a similar role to
a discount factor in finance as it reflects the time value of rewards.
This means that getting a larger reward now and a smaller reward later is
preferred to getting a smaller reward now and a larger reward later.
The discount factor just controls by how
much the first scenario is preferable to the second one.
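As a small sketch of this discounting, the total cumulative reward can be computed from a sequence of single-step rewards like this (the reward values are invented):

    gamma = 0.9
    rewards = [1.0, 0.5, 2.0, 0.0]   # single-step rewards R_1, R_2, R_3, R_4

    # Each next term gets an extra power of gamma:
    # G = R_1 + gamma * R_2 + gamma**2 * R_3 + ...
    total_reward = sum(gamma**k * r for k, r in enumerate(rewards))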
Now, the goal in a Markov Decision Process problem or in reinforcement learning,
is to maximize the expected total cumulative reward.
And this is achieved by a proper choice of a decision policy
that should prescribe how the agent should act in each possible state of the world.
But note that this task should be solved now, as we need to know the value function now.
We can only know the current state of the system, but not its future.
This means that we have to decide now how we are going to act in
all possible future scenarios for the environment, so that,
on average, the expected cumulative reward would be maximized.
But please note that we said on average.
Our decision policy may be good on average while having a high risk
of occasionally producing a big failure, that is, a very low value function.
This is why the standard approach of reinforcement learning that focuses on
the expected cumulative reward is sometimes called risk-neutral reinforcement learning.
It's risk-neutral because it does not look at the risk of a given policy.
Other versions of reinforcement learning, called risk-sensitive reinforcement learning,
look at some higher moments of the resulting distribution of cumulative rewards,
in addition to its mean value,
which is all that conventional risk-neutral reinforcement learning looks at.
And this might be helpful in certain situations.
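As a rough illustration of the difference (the penalty form and the numbers here are my own simplified sketch, not a definition from the course), a risk-sensitive objective might penalize the variance of cumulative rewards on top of their mean:

    import statistics

    # Simulated cumulative rewards from several episodes under some policy
    # (the numbers are invented).
    cumulative_rewards = [10.0, 12.0, 9.0, -20.0, 11.0]

    # Risk-neutral objective: just the mean cumulative reward.
    risk_neutral = statistics.mean(cumulative_rewards)

    # One simple risk-sensitive variant: mean minus a penalty on the variance.
    risk_aversion = 0.1
    risk_sensitive = (statistics.mean(cumulative_rewards)
                      - risk_aversion * statistics.variance(cumulative_rewards))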
So, I encourage you to take a mental note of the mere availability of such approaches.
But for the rest of this course,
we will be dealing with the standard formulation
of reinforcement learning that focuses on maximizing
the mean cumulative return or, in
other words, looks for action policies that are good only on average.