The action that we select on time-step t

is the greedy action with probability

one minus epsilon, or

a random action with probability epsilon.
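
As a rough sketch, not the course's own code, that selection rule might look like the following in Python with numpy. Note that np.argmax breaks ties by taking the first maximizer; one could also break ties at random.

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """Pick the greedy action with probability 1 - epsilon,
    and a uniformly random action with probability epsilon."""
    if rng.random() < epsilon:
        # Exploration step: choose among all actions uniformly at random.
        return int(rng.integers(len(q_estimates)))
    # Greedy step: choose an action with the highest estimated value.
    return int(np.argmax(q_estimates))
```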

We can demonstrate the effectiveness of

Epsilon-Greedy action selection on

the 10-armed Testbed from the textbook.

The 10-armed bandit problem has 10 actions

here shown along the X-axis.

On the Y-axis, we'll show the distribution of rewards.

Each reward is sampled from a normal distribution with

some mean q star of a and variance one.

Each q star of a is drawn from

a normal distribution with mean zero and variance one.

Each time we run the 10-armed Testbed, q star

will be redrawn from a normal distribution.
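
As a minimal sketch of that sampling setup, assuming numpy, something like this would do; the names q_star and sample_reward are illustrative, not from the course.

```python
import numpy as np

rng = np.random.default_rng()
k = 10  # number of actions in the testbed

# Each q*(a) is drawn from a normal distribution
# with mean zero and variance one.
q_star = rng.normal(loc=0.0, scale=1.0, size=k)

def sample_reward(action):
    # Each reward is sampled from a normal distribution with
    # mean q*(action) and variance one (standard deviation one).
    return rng.normal(loc=q_star[action], scale=1.0)
```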

Notice how much randomness

is involved in this experiment.

Q star is randomly sampled from a normal distribution.

The rewards are randomly sampled based on q star,

and the actions are randomly taken on exploration steps.

To fairly compare different algorithmic choices,

we will need to perform many independent runs.

Let's take a look at a single run of

an Epsilon-Greedy agent in the 10-armed Testbed.

For this agent, we set epsilon equal to 0.1.

The time-step is on the X-axis.

The Y-axis is the reward received on that time-step.

Notice how noisy this curve is with several sharp peaks.

We see a slight upward trend

in the center of these peaks.

But with this amount of noise,

it is hard to draw any conclusions.

If we run another agent with a different random seed,

we get another noisy curve,

and the same again for a third run.

For every time-step, we can take the average of

the three rewards we've seen on that time-step.

This will produce a single line that represents

the average reward received across these three runs.

For example, here's what happens when we

take 20 such runs and average the reward.

Notice that the spikes on

the curve are much less pronounced.

If we take 100 runs and average them together,

we see a much clearer shape to the curve.

There's a noticeable increase in reward

in the first 200 steps.

Doing this with 2,000 runs,

we get a clear picture of the performance of this method.
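
Here is a hedged sketch of how such an averaged curve could be produced, reusing the epsilon_greedy function sketched earlier and sample-average value estimates; the function name and defaults are assumptions for illustration.

```python
def run_testbed(epsilon, num_runs=2000, num_steps=1000, k=10, seed=0):
    """Average the per-step reward of an epsilon-greedy agent
    over many independent runs of the 10-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(num_steps)
    for _ in range(num_runs):
        q_star = rng.normal(size=k)  # q* is redrawn for every run
        q_est = np.zeros(k)          # sample-average action-value estimates
        counts = np.zeros(k)
        for t in range(num_steps):
            a = epsilon_greedy(q_est, epsilon, rng)
            r = rng.normal(q_star[a], 1.0)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample mean
            avg_reward[t] += r
    return avg_reward / num_runs
```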

What this result says is that

this way of behaving obtains

this much reward in

expectation across possible stochastic outcomes.

Throughout this specialization, we will

use the average performance over

many independent runs to make scientific comparisons.

Without using summary statistics like the average,

it will be difficult to make

fair comparisons between algorithms.

In this example, the random seed is

the only difference between

runs of this Epsilon-Greedy agent,

and yet the performance looks quite different.

Okay, let's go back to the experiment

on the 10-armed Testbed.

We are going to compare Epsilon-Greedy agents

with different epsilons.

Let's start by investigating

the average reward obtained by each algorithm.

The time-step is shown on the X-axis.

The average reward over 2,000 runs is on the Y-axis.

Here's the performance with epsilon equal to zero.

When epsilon is zero, the agent performs

no exploration steps, only greedy steps.

Here's the performance with epsilon equal to 0.01.

Notice that the average reward keeps improving over time.

This method explores one percent of the time,

and eventually converges to taking

the optimal action 99.1 percent of the time.

The agent with epsilon equal to 0.1 obtains

more reward earlier on average than the other methods,

but it plateaus after 300 steps.
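
With the run_testbed sketch from before, this comparison amounts to sweeping epsilon; here is a hypothetical driver loop, with matplotlib assumed for plotting.

```python
import matplotlib.pyplot as plt

for eps in (0.0, 0.01, 0.1):
    plt.plot(run_testbed(epsilon=eps), label=f"epsilon = {eps}")
plt.xlabel("Time-step")
plt.ylabel("Average reward over 2,000 runs")
plt.legend()
plt.show()
```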

In this lecture, we discussed the trade-off

between exploration and exploitation.

We also discussed Epsilon-Greedy,

which is a simple method for

balancing exploration and exploitation.

Later, we will discuss more sophisticated methods

for choosing between exploration and exploitation.