
So, that's the binomial distribution. Let's talk about the most famous, and probably the handiest, of all distributions: the so-called normal, or Gaussian, distribution. The term Gaussian comes from the great mathematician Gauss. And it's kind of interesting to note that Gauss didn't invent the normal distribution.

The invention of the normal distribution is kind of a debated topic. For example, Bernoulli had used something not unlike the Gaussian distribution as a probabilistic inequality, without formalizing it as a density. If you're interested in this, Stephen Stigler's book on the history of statistics has a nice summary of exactly where and when and with whom the Gaussian distribution came about. But it's clear that Gauss was instrumental in the early development and use of the Gaussian distribution.

So, a random variable is said to follow a normal, or Gaussian, distribution with parameters mu and sigma squared if its density looks like this: (2 pi sigma squared) to the minus one-half, times e to the minus (x minus mu) squared over (2 sigma squared). This density looks like a bell, and it's centered at mu. Sigma squared controls how flat or peaked it is. And it turns out that mu is exactly the mean of this distribution and sigma squared is exactly its variance. So, you only need two parameters, a shift parameter and a scale parameter, to characterize a normal distribution.
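Just to make the formula concrete, here is that density coded up directly (a quick sketch in Python; the course's own computing examples are in R), along with a check of the plug-in recipe discussed later in this lecture, that the nonstandard density equals phi((x minus mu)/sigma) divided by sigma:

```python
import math

def dnorm(x, mu=0.0, sigma2=1.0):
    # normal density: (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))
    return (2 * math.pi * sigma2) ** -0.5 * math.exp(-(x - mu) ** 2 / (2 * sigma2))

# density of the standard normal at 0 is 1/sqrt(2*pi), about 0.3989
print(round(dnorm(0.0), 4))

# location-scale check: dnorm(x, mu, sigma^2) equals dnorm((x - mu)/sigma)/sigma
mu, sigma = 3.0, 2.0
x = 4.5
print(abs(dnorm(x, mu, sigma ** 2) - dnorm((x - mu) / sigma) / sigma) < 1e-12)
```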

So, we might write x, then a little squiggle (a tilde), then N(mu, sigma squared), as shorthand for saying that a random variable follows a normal distribution with mean mu and variance sigma squared. And, in fact, one instance of the normal distribution is sort of the root instance from which all the other sorts are derived, and that's the one with mu equal to zero and sigma equal to one. We call that the standard normal distribution: it's centered at zero, its variance is one, and all other normal distributions are simple shifts and rescalings of the standard normal distribution. You could pick a different root, maybe mu equal to five and sigma equal to two, and you could still get every other normal distribution from that one by shifting and scaling appropriately, but it wouldn't be quite as convenient. This is the most convenient way to define a sort of root of the normal distribution.

The standard normal density is so common that we often reserve a Greek letter for it. The lower case phi is usually used for the standard normal density, and the upper case Phi is used for the standard normal distribution function. Standard normal random variables are often labeled with a z, and you sometimes even hear introductory statistics textbooks refer to them as z-variables or z-distributions, or something like that, because this notation has become so common. Here's the normal distribution. It looks like a bell; that's how it gets its name, the bell-shaped curve.

Here, I've drawn reference lines at one, two, and three standard deviations, negatives being below the mean and positives being above. Because this is a standard normal distribution, the mean is zero, so one represents one standard deviation away from the mean, two is two standard deviations away, and three is three standard deviations away. Instead of thinking of these numbers as just z values, if we think about them in the units of the original data, as one, two, and three standard deviations from the mean, then it doesn't matter whether we're talking about a standard normal or a nonstandard normal: they all follow the same rules. So, about 68 percent of the distribution is going to lie within one standard deviation, and about 95 percent is going to lie within two standard deviations, i.e., between minus two and plus two.

And almost all of the distribution, about 99 percent of it, is going to lie within three standard deviations. We can get from a nonstandard normal to a standard normal very easily: if x is normal with mean mu and variance sigma squared, then z equal to (x minus mu) over sigma is, in fact, standard normal. Now, given the information from this class, you can check immediately that z has the right mean and variance. If you take the expected value of z, you get the expected value of (x minus mu) divided by sigma.

You can pull the sigma out, and then you have the expected value of x minus mu, which is just zero, because that's the expected value of x minus the expected value of mu. Mu is not random, so its expected value is just mu, and mu is defined as the expected value of x, so the whole thing is zero. Then, the same thing with the variance. If you take the variance of z, you get the variance of (x minus mu) divided by sigma, right? If we pull the sigma out of the variance, it becomes a sigma squared, and we have the variance of x minus mu. And we learned a rule for variances: shifting a random variable by a constant, in this case subtracting off mu, doesn't change the variance at all. So, we get the variance of x divided by sigma squared. The variance of x is sigma squared, so we get sigma squared divided by sigma squared, which is one. So, at the bare minimum, we can check that z has mean zero and variance one.
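As a numerical illustration of that standardization (a quick sketch in Python, with made-up values of mu and sigma; the course uses R), you can simulate draws, standardize them, and check the mean and variance:

```python
import random
import statistics

random.seed(42)
mu, sigma = 5.0, 2.0
x = [random.gauss(mu, sigma) for _ in range(100_000)]

# standardize: subtract the mean, divide by the standard deviation
z = [(xi - mu) / sigma for xi in x]

# the standardized draws should have mean near 0 and variance near 1
print(round(statistics.mean(z), 2), round(statistics.variance(z), 2))
```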

By the way, there was nothing intrinsic to the normal distribution in that calculation, right? So, we've also just learned an interesting fact: take any random variable, subtract off its population mean, and divide by its standard deviation, and the result is a random variable with mean zero and variance one. In this case, in addition, if x happens to be normal, then z also happens to be normal. Similarly, we can take this equation, where z equals (x minus mu) over sigma, multiply by sigma, and then add mu. So, if we take a standard normal z, scale it by sigma, and then shift it by mu, we wind up with a nonstandard normal.

The top calculation starts with a nonstandard normal and converts it into a standard normal; the bottom equation starts with a standard normal and converts it into a nonstandard normal. Another interesting fact is that the nonstandard normal density can be obtained by just plugging into the standard normal density. So, if you take the standard normal density phi and, instead of plugging z into it, you plug in (x minus mu) over sigma, and then divide the whole thing by sigma, that is exactly the nonstandard normal density. This is a general way to generate densities, and kind of an interesting aside. Here, mu is a shift parameter.

All mu does is shift the distribution to the left or the right, just like whenever you subtract a constant from the argument of a mathematical function: it just moves the function left or right. And then, sigma is a scale factor. So, basically, whenever you take some density (it works for any density, but it makes the most sense with a density that has mean zero and variance one) and you create a new family by plugging in (x minus mu) over sigma and then dividing the density by sigma, you wind up with a new family of densities that have mean mu and variance sigma squared. This is an interesting way of taking a root density with mean zero and variance one and creating a whole family of densities with mean mu and variance sigma squared; these are usually called location-scale families. At any rate, here we're only interested in the normal distribution, and this formula right here is exactly how you can go from the standard normal density to a nonstandard normal density by plugging into its formula. Let's talk about some basic facts about the normal distribution that you should memorize. About 68%, 95%, and 99 percent of the normal density lies within one, two, and three standard deviations of the mean, respectively, and the density is symmetric about mu.

So, for example, take one standard deviation. About 34 percent, one-half of 68 percent, lies within one standard deviation above the mean, and about 34 percent lies within one standard deviation below the mean. So, each of these numbers splits equally above and below the mean. And then, there are certain quantiles of the normal distribution that are kind of standard to have memorized.

So, -1.28, -1.645, -1.96, and -2.33 are the 10th, 5th, 2.5th, and 1st percentiles of the standard normal distribution. And then again, by symmetry, we can just flip them around: if -1.28 is the 10th percentile, then 1.28 has to be the 90th percentile. So, by symmetry, 1.28, 1.645, 1.96, and 2.33 are the 90th, 95th, 97.5th, and 99th percentiles of the standard normal distribution.
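You can reproduce these memorized values with any normal quantile function; a quick sketch in Python (the course uses R's qnorm for the same thing):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, mean 0 and standard deviation 1
for p in (0.10, 0.05, 0.025, 0.01):
    # inv_cdf is the quantile function: the point with probability p below it
    print(p, round(z.inv_cdf(p), 3))
```

The printed values are approximately -1.282, -1.645, -1.960, and -2.326, which round to the -1.28, -1.645, -1.96, and -2.33 quoted above.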

One in specific that you really need to memorize is 1.96. The reason it's useful is that the probability of lying outside the range from -1.96 to +1.96, that is, below -1.96 or above +1.96, is five percent: 2.5 percent below and 2.5 percent above. So, the probability of lying between -1.96 and +1.96 is 95 percent. At any rate, it's used to do things like create confidence intervals and other entities that are very useful in statistics, and people have kind of stuck with 95 percent as a reasonable benchmark for confidence intervals. Five percent is a reasonable cutoff for a statistical test, and if you're doing a two-sided test, you need to account for both sides, so you use 1.96. The other fact is that 1.96 is close enough to two that we often just round up. So, a lot of times, for things like confidence intervals, you might hear people say, we'll just add and subtract two standard errors. They're getting that two from this 1.96 right here. So anyway, that one in specific you should memorize, but you should probably just memorize all of them.

Let's go through some simple examples; we'll go through two, and you should be able to do lots of these after that. So, let's take an example: what's the 95th percentile of a normal distribution with mean mu and variance sigma squared? Recall what we want to solve for when we want a percentile: we want the point x naught such that the probability that a random variable x from that distribution is less than or equal to x naught is 95 percent, or 0.95. It's kind of hard to work with nonstandard normals, so take the statement that the probability that x is less than or equal to x naught is 0.95, subtract mu from both sides of the inequality, and divide both sides by sigma. On the left-hand side of the inequality, (x minus mu) over sigma is just a z random variable, a standard normal random variable. So, the probability that x is less than or equal to x naught is the same as the probability that a standard normal is less than or equal to (x naught minus mu) over sigma, and we want that to be 0.95. Well, if you go back to my previous slide, the 95th percentile of the standard normal is 1.645. So, we just need the number (x naught minus mu) over sigma to be equal to 1.645 to make this equation work. So, let's set it equal to 1.645 and solve for x naught, and we get x naught equals mu plus sigma times 1.645.

Now, you could ask lots of questions with specific values of mu and sigma, but you'll wind up with the same exact calculation. Here we used 1.645 because we wanted the 95th percentile but, in general, x naught is going to be equal to mu plus sigma times z naught, where z naught is the appropriate standard normal quantile, and then you can get them very easily. The other thing I would mention is that you should be able to do these calculations, more than anything, so that you've internalized what quantiles of distributions are, how to go back and forth between standard and nonstandard normals, and the idea of location-scale densities and that sort of thing.

In reality and practice, it's pretty easy to get these quantiles because, for example, in R you would just type qnorm(0.95) and give it a mean and a standard deviation. Or, if you did qnorm(0.95) without a mean and a standard deviation, it'll return 1.645 and you can do the remainder of the calculation yourself; but even that's a little bit obnoxious, so you can just plug in a mu and a sigma. So, these calculations aren't so necessary from a practical point of view; even very rudimentary calculators will give you normal quantiles, including nonstandard normal quantiles.
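The same lookup in Python, with made-up values of mu and sigma just for illustration, comparing the hand calculation above against the direct quantile call (the analogue of R's qnorm):

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical values, just for illustration

# hand calculation from the lecture: x0 = mu + sigma * 1.645
x0_hand = mu + sigma * 1.645

# direct lookup, like qnorm(0.95, mean = mu, sd = sigma) in R
x0_direct = NormalDist(mu, sigma).inv_cdf(0.95)

print(round(x0_hand, 1), round(x0_direct, 1))
```

The two answers agree to rounding, since 1.645 is itself just a rounded quantile.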

The hope is that you'll understand the probability manipulations, what a quantile means, what the goals of these problems are, and how to go back and forth between the standard and the nonstandard normal. That's what we're going for here. I think everyone agrees that you can very easily just look these things up without bothering with any of these calculations. Let's go with another easy calculation.

What's the probability that a normal(mu, sigma squared) random variable is more than two standard deviations above the mean? In other words, we want the probability that x is greater than mu plus two sigma. Well, again, do the same trick: subtract off mu and divide by sigma on both sides, and we get that the answer is the probability that a standard normal is bigger than two. And that's about 2.5 percent.
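Checking that tail probability directly (a Python sketch; pnorm would do the same job in R):

```python
from statistics import NormalDist

# P(X > mu + 2*sigma) equals P(Z > 2) for a standard normal Z
p = 1.0 - NormalDist().cdf(2.0)
print(round(p, 4))
```

The exact value is about 0.0228, which is the "about 2.5 percent" quoted above.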

So, you can see the kind of rule here. If you want to know the probability that a random variable is bigger than any specific number, smaller than any specific number, or between any two numbers, take those numbers and convert them into standard deviations from the mean. That can, of course, be fractional; it could be 1.12 standard deviations from the mean, or whatever. The way you do that is by subtracting off mu and dividing by sigma, and then you reduce the problem to a standard normal calculation. So, suppose you wanted to know the probability that a random variable is bigger than, say, 3.1, just to pick a random, complicated-sounding number. Let's suppose you're talking about the height of a kid, and you want to know the probability of being taller than 3.1 feet. What you would need is the population mean mu and the standard deviation sigma: take 3.1, subtract off mu, and divide by sigma. Now you've converted that quantity, 3.1, which is in feet, to standard deviation units, and you can do the remainder of the calculation using the standard normal. So, I would hope that you could familiarize yourself with these calculations. I recognize that, in a sense, they're kind of ridiculous to do by hand because you can get them from the computer so quickly, and we'll give you the R code that you need to do these calculations very quickly on the computer. But I think it's actually worth doing them by hand, just to get used to working with densities and to what these calculations refer to.
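The height example, worked through in Python with hypothetical population values (the mu and sigma below are made up for illustration; the lecture doesn't specify them):

```python
from statistics import NormalDist

# hypothetical population values for kids' heights, in feet
mu, sigma = 2.8, 0.25

# convert 3.1 feet into standard deviation units
z = (3.1 - mu) / sigma

# then finish with a standard normal tail calculation
p = 1.0 - NormalDist().cdf(z)
print(round(z, 2), round(p, 3))
```

With these assumed values, 3.1 feet is 1.2 standard deviations above the mean, and the tail probability is about 11.5 percent.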

So, let me just catalog some properties of the normal distribution; a lot is known about it. I'll outline some of the simpler stuff, and some of the latter points we probably won't get to in this class, but I thought I'd at least state them. At any rate, the normal distribution is symmetric and peaked about its mean, which means that the population mean, the median, and the mode of the distribution are all equal, right at that peak.

A constant times a normally distributed random variable is also normally distributed. And you can tell me what happens to the mean and the variance: if x is a normal random variable, what distribution does a times x have, given that I'm telling you it's normal? What are the resulting mean and variance? It turns out that sums of normally distributed random variables are again normally distributed, and this is true regardless of the dependence structure of the data, provided the random variables are jointly normally distributed. It's important that they are jointly normal; they could be independent or not, but they need to be jointly normally distributed. Sums, or any linear function, of jointly normal random variables turn out to be normally distributed. And again, you can calculate the mean and the variance.
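As a numerical illustration of that mean-and-variance bookkeeping (my own sketch in Python, with made-up values of a, mu, and sigma): a times x should be normal with mean a times mu and variance a squared times sigma squared.

```python
import random
import statistics

random.seed(7)
a, mu, sigma = 3.0, 2.0, 1.5
x = [random.gauss(mu, sigma) for _ in range(100_000)]

# a*X is again normal, with mean a*mu = 6 and variance a^2 * sigma^2 = 20.25
ax = [a * xi for xi in x]
print(round(statistics.mean(ax), 1), round(statistics.variance(ax), 1))
```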

Sample means of normally distributed random variables are again normally distributed. Again, this is true regardless of whether they're jointly normal and possibly dependent, or simply a bunch of independent normal random variables; this is true of sample means.

However, let me jump to point seven. It also turns out that if you have independent, identically distributed observations, then properly normalized sample means will look like they follow a Gaussian distribution, not exactly but pretty much, regardless of the underlying distribution that the data comes from. Take as an example rolling a die: the distribution of a single die roll doesn't look Gaussian at all; it looks like a uniform distribution on the numbers one to six. Now, take a die, roll it ten times, take the average, and then repeat that process over and over again, and think about the distribution of this average of die rolls. Well, it turns out it'll look quite Gaussian; it'll look very normal. At any rate, that's the rule: sample means, properly normalized, with some conditions that we're probably going to gloss over, will limit to a normal distribution.
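That die-rolling experiment is easy to run (a quick sketch in Python; the course's examples use R):

```python
import random
import statistics

random.seed(1)

# average of ten die rolls, repeated many times
means = [statistics.mean(random.randint(1, 6) for _ in range(10))
         for _ in range(50_000)]

# the averages cluster around 3.5 in a roughly bell-shaped way;
# the theoretical standard deviation of a ten-roll average is about 0.54
print(round(statistics.mean(means), 2), round(statistics.stdev(means), 2))
```

If you histogram the list of averages, it looks very close to a normal density, even though a single roll is uniform.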

And that's how the normal distribution became the sort of Swiss army knife of distributions: pretty much anything you can relate back to a mean of independent things tends to look normal-ish in distribution. And mathematically, formally, if the observations are independent and identically distributed and you normalize the mean in the correct way, then in the limit you get exactly the standard normal distribution. That is an incredibly useful result, and a very historically important one, called the central limit theorem. So, let's see, back to point five.

If you take a standard normal and square it, you wind up with something that's called a chi-squared distribution; you might have heard of that before. And if you take a standard or a nonstandard normally distributed random variable and exponentiate it, take e^x where x is normal, then you wind up with something that's log-normal. Log-normal is a bit of a pain in terms of its name: log-normal means take the log of a log-normal and it becomes normal; it does not mean the log of a normal random variable. It's a little annoying, right? And you can't take the log of a normal random variable, by the way, because there's a nonzero probability that it's negative, and you can't take the log of a negative number. The name makes it sound like a log-normal is the log of a normal; it's not. Log-normal means: take the log of me and then I'm normal. Okay, let's talk about ML properties associated with normal random variables. If you have a bunch of IID normal(mu, sigma squared) random variables, and let's assume you know the variance, so let's set estimating the variance aside for the moment, then the likelihood associated with mu is written right here.

You just take the product of the likelihoods for each of the individual observations. For each observation, that's (2 pi sigma squared) to the minus one-half, times e to the minus (xi minus mu) squared over (2 sigma squared). If you move the product into the exponent, you get minus summation, i equals one to n, of (xi minus mu) squared over (2 sigma squared). Remember, we're assuming the variance is known, so the factor of (2 pi sigma squared) to the minus n over two that you would have gotten we can just throw out, because the likelihood doesn't care about factors of proportionality that don't depend on the parameter of interest, in this case mu. By the way, this little proportional-to symbol right here is what I mean when I've dropped out things that are not related to mu. I'll try to use that symbol only where it's contextually obvious which variable I'm considering important.

Okay, so let's just expand out this square, and you get minus summation xi squared over (2 sigma squared), plus mu summation xi over sigma squared, minus n mu squared over (2 sigma squared). Now, that first term, minus summation xi squared over (2 sigma squared), doesn't depend on mu either, so we can throw it out as well: e to that power times e to the latter two powers means the first part is just a multiplicative factor we can chuck. The other thing here is that it's a little annoying to write summation xi, so why don't we write that as n x bar? Because if you take x bar, the sample average, and multiply it by n, you get the sum. So, the likelihood works out to be e to the quantity mu n x bar over sigma squared minus n mu squared over (2 sigma squared).
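Written out, the manipulation just described is:

```latex
\mathcal{L}(\mu) \propto \prod_{i=1}^n e^{-(x_i-\mu)^2/(2\sigma^2)}
= \exp\!\Big(-\sum_{i=1}^n \frac{(x_i-\mu)^2}{2\sigma^2}\Big)
\propto \exp\!\Big(\frac{\mu \sum_i x_i}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}\Big)
= \exp\!\Big(\frac{\mu\, n\bar{x}}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}\Big)
```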

So, that's the likelihood. Let's ask ourselves: what's the ML estimate of mu when sigma squared is known? Well, as we almost always do, since the likelihood is kind of annoying to work with, why don't we work with the log likelihood? We take the log from the previous page, and we get mu n x bar over sigma squared minus n mu squared over (2 sigma squared). If you differentiate this with respect to mu, you wind up with an equation that is clearly solved by mu equal to x bar, and so what it tells us is that x bar is the ML estimate of mu. So, if your data is normally distributed, your estimate of the population mean is the sample mean.

That makes a lot of sense; we would hope the result would work out that way. But also notice that because this calculation didn't depend on sigma, x bar is also the ML estimate of mu when sigma is unknown, not just when sigma is known.
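You can see the same conclusion numerically with a crude grid search instead of calculus (my own sketch, with made-up mu and sigma): among candidate values of mu, the log likelihood is maximized at x bar.

```python
import random
import statistics

random.seed(3)
sigma2 = 4.0
x = [random.gauss(10.0, 2.0) for _ in range(1000)]
xbar = statistics.mean(x)

def loglik(mu):
    # normal log likelihood in mu, with additive constants dropped
    return sum(-(xi - mu) ** 2 / (2.0 * sigma2) for xi in x)

# scan a grid of candidate values around x bar; the maximizer is x bar itself
grid = [xbar + d / 100.0 for d in range(-200, 201)]
best = max(grid, key=loglik)
print(abs(best - xbar) < 1e-9)
```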

So, we know what our ML estimate of mu is. Let me just tell you what the ML estimate of sigma squared is: it works out to be summation (xi minus x bar) squared, over n. You might recognize this as the sample variance, except that instead of our standard trick of dividing by n minus one, we're now dividing by n. It's a little frustrating that there's this kind of mixed message: the maximum likelihood estimate of sigma squared is the so-called biased estimate of the variance, rather than the unbiased one where you divide by n minus one.
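On a small made-up data set, the two estimates look like this (a Python sketch; the two differ exactly by the factor (n minus one) over n):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(x)
xbar = sum(x) / n

# maximum likelihood estimate: divide by n
mle = sum((xi - xbar) ** 2 for xi in x) / n

# usual unbiased sample variance: divide by n - 1
unbiased = statistics.variance(x)

print(mle, unbiased)
```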

Now, notice that as n increases this is irrelevant: the factor that distinguishes the two estimates is (n minus one) over n, and that factor goes to one as n gets larger and larger. So, I've had several colleagues tell me that they would actually just prefer this maximum likelihood estimate. Their argument is something along the lines of: the n minus one estimate is unbiased, but this one has a lower variance. What they mean is that this biased version of the sample variance is only a function of random variables, so it itself is a random variable, and as a random variable it has a mean and a variance. The fact that its mean is not exactly sigma squared means that it's biased. But its variance is slightly smaller than the variance of the unbiased version of the sample variance. And so, this is an example of something that pops up all the time in statistics: you can trade bias versus variance. In this case, one variance estimate is slightly biased but will give you a lower variance.

The other is unbiased, but the variance estimate itself has a larger variance. It's very frequent in statistics that you have this kind of trade-off: as you increase the bias, you tend to decrease the variance, and vice versa. The other thing I wanted to mention is that here we've kind of separated out inference for mu and inference for sigma. If you wanted to do full likelihood inference, you have exactly a bivariate likelihood, a likelihood that depends on both mu and sigma. It's a little bit difficult to visualize, but it is just a surface: with mu on one axis, sigma on another axis, and the likelihood on the vertical axis, it would be a likelihood surface instead of a likelihood function. And it's a little hard to visualize these kinds of 3D-looking things. So, there are methods for getting rid of sigma and looking at just the likelihood associated with mu, and for getting rid of mu and looking at just the likelihood for sigma, and later on we'll discuss those methods. But for the time being, it's not terribly important.

What I would hope you remember is this: if you assume your data is normally distributed, then we gave you the likelihood for mu assuming sigma is known; we calculated that the ML estimate of mu is, in fact, x bar; and the ML estimate of sigma squared is pretty much the sample variance, off by a little bit from the standard sample variance, but pretty much the sample variance. And then, the ML estimate of sigma, not sigma squared but sigma itself, is just the square root of the ML estimate of sigma squared.

Well, that's the end of our whirlwind tour of probably the two most important distributions. There are some other ones that we'll cover later. Next lecture, we're going to travel to a place called Asymptopia. Everything's much nicer in Asymptopia, and I think you'll quite like it there.