0:13

We'll think about the parameters of the distribution and the goodness of the fit.

For example, the Chi-Square test.

Welcome to session three of the third week of Modeling Risk and Realities.

I'm Senthil Veeraraghavan again.

I'm a faculty member in the Operations, Information and

Decisions Department at the Wharton School.

So far, we looked at data visualization in session one, and

in session two we looked at a variety of discrete and continuous distributions.

In this session we're going to focus on how well a distribution fits.

We will look at hypothesis testing and goodness of fit.

2:51

In such cases, we recommend computer software to evaluate these tests.

However, we will run the Chi-Square test for two common distributions,

the normal distribution and the uniform distribution, on our data sets.

What's a Chi-Square test?

The Chi-Square test tests the following null hypothesis

against an alternative hypothesis.

The null hypothesis is that the studied data comes from a random variable that

follows a specified distribution, such as a normal distribution or

a uniform distribution.

3:44

In this test, you can disprove that the data came from a specific distribution.

But you cannot prove that it came from that distribution.

You can disprove that it came from a normal distribution, but you cannot

categorically prove that it did come from a normal distribution.

5:05

Suppose you have ten buckets and you're trying to fit a normal distribution

that has two parameters, mean and standard deviation.

The degrees of freedom here is 10 − 2 − 1, which is 7.

For each Chi-Square test with some degrees of freedom, we can test the null

hypothesis at some confidence level, which could be set at 99% or 95%, etc.
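As a quick sketch, the degrees-of-freedom rule above can be written out as follows; the function name and layout are illustrative, not part of the lecture's spreadsheet.

```python
# Degrees of freedom for a chi-square goodness-of-fit test:
# number of bins, minus number of estimated parameters, minus one.

def chi_square_df(num_bins, num_params):
    """df = k - p - 1 for k histogram bins and p fitted parameters."""
    return num_bins - num_params - 1

# Normal distribution: two fitted parameters (mean and standard deviation).
df = chi_square_df(10, 2)
print(df)  # 7
```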

5:41

We will explore Chi-Square tests on our two datasets.

The Dataset1_histogram and Dataset2_histogram files.

We used Dataset1 in session one, and we generated the histogram.

The histogram gave us two curves: the pdf, the probability density function,

which we saw in week two, session two, shown in the blue bar chart;

and also the cumulative distribution function,

which gives us accumulated values, shown in the red curve.

Just visualizing the pdf, it looks pretty flat, like a uniform distribution.

Therefore, we run a Chi-Square test for the uniform distribution, and for

that test, we're going to use the min and max values from the data.

6:42

So, what uniform distribution are we going to use?

We're going to use a uniform distribution with the minimum value of 0.09

from the data set, and the maximum value of 99.87 that we saw from the data set.

So there are two parameters to the uniform distribution.

7:10

We have 7 degrees of freedom, because there are 10 bins and

2 parameters, and 10 − 2 − 1 = 7.

And in the Excel video, I show you how to run the Chi-Square test.

We have the Dataset1_histogram file now, and

we are going to look at how to run a Chi-Square test on our data.

Using a theoretical distribution,

a uniform distribution, with minimum of

0.09 and maximum of 99.87.

For that, we need first to generate the theoretical cdf.

8:00

If you recall the formula and the discussion we had in session two,

we can do this by taking the value we're interested in,

in this case 10, minus the minimum value,

which I want to fix as a cell reference.

We divide by the maximum value minus the minimum value,

and we need to fix every cell reference there except for the first term.

So we have this.

8:36

And let me calculate the theoretical CDF all the way through,

except that the maximum point is not 100.

We want to make sure that the maximum point is not 100, but 99.87.

That's our maximum value, so that gives us 1.

So we have the cumulative distribution function.

Let's write that in percentages so that it's easy to view.

From this,

we can also generate the theoretical probability of being in the bin.

9:17

So the bin probability is: for the first bin,

it's between zero and ten, so the CDF value says exactly that.

For the second bin, the CDF gives the cumulative probability of the first and

the second bin together, so to calculate the theoretical probability

of falling within the second bin, you just take the second bin's

cumulative value minus the first bin's.

So that gets you 10%.

And we can do this for all the calculations here, all the way through,

and we get the bin theoretical probabilities.

So, given this data set, any data point from that random variable's

distribution has a theoretical probability of 9.93% of falling in the lowest bin,

a probability of 10.02% of falling in the second bin, and so on.

So this distribution is almost uniform.

And the theoretical distribution is uniform,

but it is cut off at 0.09 and 99.87.

So let's actually compute the expected frequency for our roughly 250 points.

The 250 points are going to fall into these bins with these probabilities,

so 250 multiplied by each bin's theoretical probability

gives the number of points likely to be in that bin,

and so on for all the bins.

So theoretically speaking, you should get about 24.8 points in the first bin,

and about 25 points in each of the middle bins.

The first and the last bins are slightly smaller, because they're cut off

not at 0 and 100, but at 0.09 and 99.87, so they have slightly smaller probabilities.
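The whole uniform-fit calculation described above can be sketched in Python (the lecture uses Excel); the 0-to-100 bin edges are an assumption based on the histogram discussion, while 0.09, 99.87, and the 250 points come from the transcript.

```python
# Theoretical uniform CDF, per-bin probabilities, and expected counts
# for Dataset1. Bin edges 0, 10, ..., 100 are assumed.
lo, hi = 0.09, 99.87   # min and max observed in the data set
n = 250                # number of data points

def uniform_cdf(x):
    """CDF of Uniform(lo, hi), clipped to [0, 1] outside the support."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

edges = [10 * i for i in range(11)]                  # 0, 10, ..., 100
cdf = [uniform_cdf(x) for x in edges]
bin_prob = [cdf[i + 1] - cdf[i] for i in range(10)]  # adjacent differences
expected = [n * p for p in bin_prob]

# The first bin probability is about 9.93% (~24.8 expected points), the
# middle bins about 10.02% (~25.1 points), and the last bin about 9.89%
# (~24.7 points), because the support is cut off at 0.09 and 99.87.
```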

And so now we have the theoretical frequency

11:21

and the actual frequency in the data set.

Now we can run the chi-square test.

So I am going to write chi-square test here.

And there is a formula for the chi-square test: it's CHISQ.TEST.

Then you get to choose the actual frequency range and

the theoretical frequency range.

12:00

And when you close it, you get about 0.0127;

rounding to three decimals, it's 0.013.

And that's the chi-square value we're going to use, and

there are 7 degrees of freedom for this test.

So now we have run the chi-square test, trying to fit a uniform

distribution to our data.
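Excel's CHISQ.TEST compares the two frequency columns for you; the underlying statistic is the sum over bins of (observed − expected)² / expected. A minimal sketch, where the observed counts below are hypothetical numbers, not the lecture's actual data:

```python
# Chi-square statistic: sum of (observed - expected)^2 / expected over bins.

def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [23, 27, 24, 26, 25, 24, 26, 25, 26, 24]  # hypothetical counts
expected = [24.8] + [25.1] * 8 + [24.7]              # from the uniform fit

stat = chi_square_stat(observed, expected)
# A small statistic relative to the table entry for 7 degrees of freedom
# means we fail to reject the null hypothesis.
```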

Let's go ahead and look at the table and

see whether we are able to reject the null hypothesis.

We generate the chi-square values, and the chi-square test gives us a value of 0.013.

12:38

Now we can look up that value, for 7 degrees of freedom,

in the tables that I provided you.

For example, follow the web link and

you'll find the following: we fail to reject the null hypothesis.

That is, we fail to reject the hypothesis that the data comes from

a uniform distribution with a high degree of confidence.

13:00

Remember, we cannot prove for sure that the data comes from

a uniform distribution but we have failed to reject the null

hypothesis that the data comes from a uniform distribution.

Remember, the chi-square test is a one-sided test.

Now let's look at data set 2.

On data set 2, the figure gives us a histogram with pdf in the blue bars.

pdf is a probability density function.

And the CDF in the red curve.

The CDF is a cumulative distribution function.

And the visualization of the pdf tells us it looks like a normal distribution.

13:47

Hence, let's fit a normal distribution on this data set.

We run a chi-squared test for normal distribution using the average and

the standard deviation from the data.

In data set 2, we will look at the goodness

of fit of a normal distribution with the sample average of 47.2 and

the standard deviation of 15.78, which we calculated from the data set.

14:23

Again, the degrees of freedom is 7.

We run the chi-squared test as we see in the Excel video.

In the dataset histogram file, we have the histogram

that we generated in the first session of this week.

We have a histogram that looks like a bell curve.

It suggests we should check a normal distribution.

So we're going to test for the normal distribution in our data and

see whether the normal distribution is a good fit with our data set.

And the chi-squared test is a goodness-of-fit test.

15:03

To do that, as the first step you're going to derive the theoretical CDF of

the normal distribution; we saw the formulas in session two where we derived this.

So we'll just use NORM.DIST.

We pick a value x, and

we're going to pick the mean of the normal distribution as 47.20,

and the standard deviation as 15.78.

Let's fix those cell references by pressing F4, and

then the last option is whether to compute the cumulative value or the probability.

We want the cumulative, so enter 1, or write TRUE, or choose cumulative.

15:42

We get the value 0.001.

And we take it all the way to the last cell which

gets us to 0.9999589, which is pretty close to one.

But it's not exactly one, because the normal

distribution has a tail going to infinity.
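The NORM.DIST(x, mean, sd, TRUE) step can be mirrored in Python with the error function; this is a sketch using the transcript's sample mean and standard deviation.

```python
import math

# Cumulative normal, equivalent to Excel's NORM.DIST(x, 47.20, 15.78, TRUE).
mean, sd = 47.20, 15.78   # sample values from Dataset2

def norm_cdf(x):
    """P(X <= x) for X ~ Normal(mean, sd), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# The CDF is 0.5 at the mean, close to 0 far below it, and close to 1
# far above it, but never exactly 1 because of the infinite tail.
```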

Once we have the CDF, which gives us the cumulative value,

the sum of all these bars up to that point,

we need to calculate what falls into each bucket.

To find the bar that falls into each bin, we subtract two adjacent cumulative values.

So that's what we're going to do in the next column.

So we figure what is the probability of falling in each bin, theoretically.

16:36

The probability for the first bin is just the probability of falling

at or below the first bin's edge, so that's M4, 0.001.

The probability of falling in the second bin is the probability of being

below the second bin's edge but above the first,

so M5 minus M4.

And we do this for every value up to the last point and

we get the probability values.

To make better visual sense of this,

I'm going to convert this to percentages, and you can see that.

17:18

Pick a data point at random.

Where is it going to fall?

It has a 25% chance of falling in the middle bin and

an 18% chance of falling in the mid ranges,

whereas there is a 0.14% chance of falling in the lowest bin and a 0.29%

chance of falling in the highest bin, so it gives the shape of a bell curve.

17:47

That is the theoretically calculated probability.

And the theoretical bin frequency is calculated as follows.

We have 250 data points,

and each of the 250 data points has some probability of falling in each bin.

We just multiply 250 by that probability,

and we get 0.34 for the first bin; we can take this all the way to the last bin.

So, we should expect about 61 points

in the middle bin and very few towards the edges.
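Putting the normal-fit columns together in Python (a sketch: the 0-to-100 bin edges are assumed, and the first bin absorbs the left tail the way the spreadsheet's first CDF cell does):

```python
import math

# Theoretical bin probabilities and expected frequencies for Dataset2
# under Normal(47.20, 15.78), with 250 data points.
mean, sd, n = 47.20, 15.78, 250

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

edges = [10 * i for i in range(11)]   # bin upper edges 0, 10, ..., 100 (assumed)
cdf = [norm_cdf(x) for x in edges]

# First entry keeps everything below its edge; the rest are adjacent
# differences of the cumulative values, as in the spreadsheet column.
bin_prob = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
expected = [n * p for p in bin_prob]

# The middle bin (40-50) gets roughly a quarter of the probability,
# about 61-62 expected points, while the outer bins get well under one.
```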

So, let's compare the actual frequency with the theoretical frequency.

The theoretical frequency is not in whole numbers,

but that is fine for our chi-square test.

Now we can run our chi-square test.

And the way to run it is with the formula CHISQ.TEST.

Pick the actual range of frequencies.

Pick the theoretical range of frequencies, and

you have the chi-square value, 0.8851.

Now we have seven degrees of freedom in our chi-square test.

We'll see that soon in the PowerPoint presentation.

We take this value of the chi-square test and

we're going to check whether the normal distribution is a good fit.

19:30

Again, the precise value doesn't matter.

We look at the table link for our degrees of freedom, and we find that

we fail to reject the null hypothesis that the data came from a normal distribution.

20:43

If this maximal difference value is low, then the fit is very good.

That means we are comparing two columns in ascending order, and

if the gap between the two columns is never very high, then this is a good fit.

Typically, a maximal value of 0.03 or

0.04 or even lower is considered a very good fit.
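The maximal-difference check described here (comparing the empirical and theoretical CDF columns row by row, in the spirit of a Kolmogorov-Smirnov statistic) can be sketched as follows; both CDF columns below are hypothetical illustrations.

```python
# Maximal gap between an empirical and a theoretical CDF column.
# Values of about 0.03-0.04 or lower suggest a very good fit.
empirical   = [0.10, 0.21, 0.29, 0.41, 0.52, 0.60, 0.71, 0.79, 0.90, 1.00]
theoretical = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

max_gap = max(abs(e - t) for e, t in zip(empirical, theoretical))
print(round(max_gap, 2))  # 0.02 -> a very good fit by this rule of thumb
```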

21:09

Modeling Using Continuous Distributions.

As you can see, depending on the size and the nature of the data,

modeling reality using continuous distributions and

choosing the correct distribution that fits our data is a challenging task.

21:35

Hence, in real life, often simulation is used, and

that will be our focus in week four.

Anyway, congrats on completing week three, and best wishes for week four.

I'm Senthil Veeraraghavan, a faculty member in the Operations, Information and

Decisions Department.

And you can follow me @senthil_veer.

We've just completed week three of the course.