0:05

So, how do we go about using these probability functions to characterize the extent of uncertainty and the distributions from which our random variables are drawn? Well, the first question we're going to ask ourselves is, what kind of data are we dealing with? Is it a context where only a small number of specific values can occur? Or are we dealing with a range of data, where any value in that range is possible? That's the case of the discrete versus the continuous.

0:38

Once we've identified the kind of data that we're dealing with, we're going to choose an appropriate distribution. And oftentimes, what we're trying to do is come up with an approximation. We're not trying to identify exactly the right distribution; we just want to be able to approximate the shape of the data that we're observing to a reasonable degree of accuracy.

1:15

And once we've estimated the parameters of that distribution, we want to go back and say, here's what I'm predicting, here's what I'm approximating using this probability distribution. Does it actually fit my data? Is it a reasonable approximation for the data that I'm working with?

1:32

All right, so let's talk a little bit about the characteristics of a normal distribution. This is the familiar bell curve. It's a symmetric distribution with a single mode in the center. The two parameters that we need are the mean and the standard deviation. And it's convenient to use, because we can use Excel, or other statistical tools, to calculate the average and the standard deviation of our data, and those are the parameters that are used to characterize that normal distribution.
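
As a rough illustration of estimating those two parameters outside of Excel, here is a minimal Python sketch using only the standard library. The data values are made up for this example and do not come from the lecture; `mean` and `stdev` play roughly the role of Excel's AVERAGE and STDEV.S.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical observations, invented purely for illustration.
data = [480, 510, 495, 530, 470, 505, 520, 490]

mu = mean(data)      # the average, comparable to Excel's AVERAGE
sigma = stdev(data)  # sample standard deviation, comparable to Excel's STDEV.S

# NormalDist.from_samples estimates the same two parameters in one step.
fitted = NormalDist.from_samples(data)

print(f"mean = {mu:.2f}, standard deviation = {sigma:.2f}")
print(fitted)
```

The resulting `fitted` object is just a normal distribution with that mean and standard deviation; whether it is a reasonable approximation of the data still has to be checked separately.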

Chances are, you've seen the normal distribution applied previously if you've done any work with linear regression. Whether it's financial analysis, forecasting work, or marketing mix modeling, you're often making the assumption that the data does come from a normal distribution.

2:18

Linear regression is based on assuming a normal distribution for the errors. If you've done work in statistics where you're doing sampling, that's based on the normal distribution as well. So, for example, if we're calculating confidence intervals, that's based on assumptions from the normal distribution. So, again, chances are you've seen it, even if you weren't aware of what the normal distribution was, what the parameters were, or what the exact equations are.

2:46

Just to give you a sense for what the normal distribution looks like, the blue and red curves on this plot are two different normal distributions. Both have a mean of 10; where they differ is in their standard deviation. The blue curve has a much tighter distribution, so a lower standard deviation, compared to the red. And so you see that it's much more clustered around its average, whereas the red has a lot more dispersion. For anyone who's interested, this is the mathematical equation to produce that normal distribution.
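
The equation on the slide is not captured in this transcript. Assuming it is the usual normal probability density, with mean mu and standard deviation sigma as the two parameters, it would read:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

A smaller sigma makes the curve taller and narrower around mu, which matches the tighter blue curve described above.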

3:23

Ultimately, what the software is doing in the background, and we'll take a look at this when we actually try to fit normal distributions to data, is looking for the values of mu and sigma that best fit the data.

3:41

Right, for the time being we're going to take the values as given, but that's the mathematical expression underlying this curve.

All right, there's a very nice property of the normal distribution that relates to dispersion. We can think of it as an empirical rule: 68%, 95%, and 99.7% of the distribution fall within 1, 2, and 3 standard deviations of the mean. That's how much of the data, or how much of the distribution, is contained in a particular range. So if we look at the range plus or minus 1 standard deviation, 68% of the distribution is contained in that range. Go out to 2 standard deviations, and 95% of the data is contained in that range. Go out to 3 standard deviations, and 99.7% of the data is contained within plus or minus 3 standard deviations. That doesn't mean you're never going to see an observation outside of that, but you've only got about a 0.3% chance of seeing one.
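
The 68-95-99.7 figures are rounded values, and they can be checked numerically. The sketch below uses Python's standard-library `NormalDist`; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

Because the rule is stated in standard deviations, the same shares apply to any normal distribution, whatever its mean and standard deviation.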

All right, this is a little bit of a cheat sheet for you when it comes to the normal distribution and Excel.

4:53

All right, so let's walk through a couple of the commands that we have available to us. Suppose we know the mean and the standard deviation of the normal distribution, and I want to know the probability of observing an outcome less than a particular value, so, this shaded region here. How likely am I to observe a value coming from the normal distribution less than k?

5:19

Well, that's where the command =NORM.DIST is going to be used. In terms of what you have to input into Excel, inside =NORM.DIST( ) you're going to specify the value of k, then the mean and the standard deviation. And if I'm interested in the mass that is less than or equal to the value of k, I'm going to include the argument TRUE. If I specify FALSE instead, what that gives me is just the height of this function at k, and that's not going to be of too much interest to us.
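
For readers working outside of Excel, the two modes of NORM.DIST can be sketched in Python. The wrapper `norm_dist` is a made-up name for this example, and the mean of 10 and standard deviation of 2 are assumed values, not numbers from the lecture.

```python
from statistics import NormalDist

def norm_dist(k, mean, sd, cumulative):
    """Rough analogue of Excel's NORM.DIST(k, mean, sd, cumulative)."""
    dist = NormalDist(mean, sd)
    # TRUE: probability of a value less than or equal to k (area under the curve).
    # FALSE: height of the density curve at k.
    return dist.cdf(k) if cumulative else dist.pdf(k)

# Assumed parameters for illustration: mean 10, standard deviation 2.
area_below = norm_dist(12, 10, 2, cumulative=True)
height_at_k = norm_dist(12, 10, 2, cumulative=False)

print(f"P(X <= 12) = {area_below:.4f}")
print(f"density at 12 = {height_at_k:.4f}")
```

The cumulative value is the shaded area to the left of k, which is the quantity used in the rest of this discussion.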

All right, so that's if I'm looking for the probability of being less than a particular value. Another way that we might look at it, though, is that I want to find that value. For example, I want to find the value of k such that, let's say, this region corresponds to 5%. Well, if that's the case, I'm going to use the norm inverse command, NORM.INV, where I'm going to specify that percentile, the 5% level, then the mean, mu, and the standard deviation, sigma. And that's going to return the value of k. So these two functions are essentially the inverse of each other. One of them says, you give me the value of k, I'll return that probability. The other says, you give me the probability, I'll return the value of k.
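
The inverse relationship can be sketched in Python, where `inv_cdf` plays roughly the role of NORM.INV and `cdf` the role of NORM.DIST with TRUE. The mean of 10 and standard deviation of 2 are assumed values chosen only for illustration.

```python
from statistics import NormalDist

# Assumed parameters for illustration: mean 10, standard deviation 2.
dist = NormalDist(mu=10, sigma=2)

# Give a probability, get back k (comparable to NORM.INV(0.05, 10, 2)).
k = dist.inv_cdf(0.05)

# Give k, get back the probability (comparable to NORM.DIST(k, 10, 2, TRUE)).
p = dist.cdf(k)

print(f"k = {k:.4f}")
print(f"P(X <= k) = {p:.4f}")
```

Feeding the output of one function into the other returns the starting value, which is what it means for the two to be inverses.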

6:46

If you're dealing with what's referred to as the standard normal distribution, with a mean of 0 and a standard deviation of 1, it's a little bit easier. We use the standard normal versions, NORM.S.DIST and NORM.S.INV, for the distribution command and the inverse command.

7:04

Another convenient feature, just to make you aware of it, is the STANDARDIZE command. If we have data that we want to standardize, to turn it into a standard normal distribution, what this calculation is going to do for us is subtract the mean and divide by the standard deviation. By standardizing the data, if you recall back to any statistics classes you might have taken, we get values that correspond to those z tables that we used to have to look up in the back of the statistics textbooks.
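
The standardizing step is a one-line calculation. In the Python sketch below, `standardize` is a made-up helper mirroring what the STANDARDIZE command is described as doing, and the value 12 with mean 10 and standard deviation 2 are assumed numbers for illustration.

```python
from statistics import NormalDist

def standardize(x, mean, sd):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / sd

# Assumed values for illustration: x = 12 from a normal with mean 10, sd 2.
z = standardize(12, 10, 2)

# The probability is the same whether we standardize first or not.
p_standard = NormalDist().cdf(z)     # standard normal, evaluated at z
p_general = NormalDist(10, 2).cdf(12)  # original distribution, evaluated at x

print(f"z = {z}")
print(f"{p_standard:.4f} vs {p_general:.4f}")
```

This is why a single z table was enough for any normal distribution: after standardizing, every problem becomes a standard normal problem.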

7:42

All right, so suppose we're trying to predict demand for a particular flight route, and it turns out that historically it's followed a normal distribution with a mean of 500 and a standard deviation of 100. An airline might want to know, if I allocate 600 seats, how likely is it that that's going to be enough to meet demand? In this case, we're going to use the 68-95-99.7 rule to come up with the answer.

All right, so let's think about what this involves. Demand of 600 is 100 units, or 1 standard deviation, more than the average. The chance of the average being enough is 50%, because the normal distribution is symmetric. So there's a 50% chance that demand is less than the mean of 500, and now we've got to figure out how likely it is that demand is between 500 and 600. Well, that's where the 68-95-99.7 rule comes into play. 68% of the distribution is contained within 1 standard deviation of the mean; that's a characteristic of the normal distribution. So minus 1 standard deviation to plus 1 standard deviation is 68%. What we want is from the mean up to 1 standard deviation above it, essentially half of that width. Well, half of the 68% is going to be 34%. So I've got a 50% chance that demand is less than 500, and I've got a 34% chance that demand is between 500 and 600. Combined, that tells me there's an 84% chance that demand is going to be met by offering 600 seats.

All right, so again, with the visual to work with, here's our center point; below it is 50% of the distribution. Between 400 and 600, we've got a total of 68%. So between 500 and 600, this is going to be 34% of the distribution.
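
The same answer can be checked directly, without the rule of thumb. This Python sketch uses the lecture's numbers (mean 500, standard deviation 100, 600 seats); in Excel the comparable formula would be =NORM.DIST(600, 500, 100, TRUE).

```python
from statistics import NormalDist

# Demand as described in the lecture: normal, mean 500, standard deviation 100.
demand = NormalDist(mu=500, sigma=100)
seats = 600

# Rule-of-thumb reasoning: 50% below the mean plus half of the 68%.
rule_of_thumb = 0.50 + 0.68 / 2

# Direct calculation of P(demand <= 600).
exact = demand.cdf(seats)

print(f"rule of thumb: {rule_of_thumb:.2f}")
print(f"direct calculation: {exact:.4f}")
```

The two agree to about two decimal places; the small gap comes from 68% being a rounded figure.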

10:12

Okay, another way that we can look at this, and hopefully this is something that we never have to do going forward, given that we've got Excel or other statistics programs available to us, is to look up the z values in a table. The z value of 1 asks how much of the distribution lies below the point 1 standard deviation above the mean, and it corresponds to that 84%. All right, so if I were to look at the table, the z score up to its first decimal, 1.0, is over here, the second decimal is over here, and that's where I'm looking up that 84% value from.
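
For anyone curious how such a table is laid out, here is a small Python sketch that reproduces one row of a typical z table, assuming the common layout where the row label gives z to one decimal and the columns add the second decimal. The helper name `z_table_row` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_table_row(row_label):
    """Cumulative probabilities for z = row_label + 0.00, 0.01, ..., 0.09."""
    return [std_normal.cdf(row_label + j / 100) for j in range(10)]

# The row for z = 1.0; the first entry is the value for z = 1.00.
row = z_table_row(1.0)
print(" ".join(f"{p:.4f}" for p in row))
```

The first entry in that row is the 84% figure read off the table in the lecture.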
