Correlation is finding a relationship between two or more sets of data.

It measures the strength of the relationship, such as strong, moderate,

or weak, and its direction (positive or negative).

For a refresher on correlation,

please review the correlation and regression material in

Six Sigma Tools for Improve and Control in the Yellow Belt series.

The procedure for calculating the correlation coefficient is as follows.

Number one, calculate the mean for all x values (or x

bar) and the mean for all y values (or y bar).

Number two, calculate the standard deviation for

all x values, s_sub_x, and the standard deviation for all y values, s_sub_y.

Number three, calculate x minus x bar and y minus y bar for each pair of x,

y, and then multiply these differences together.

Number four, get the sum by adding all of these products of differences together.

Number five, divide the sum by s_sub_x times s_sub_y.

And number six, divide the result of step

five by n minus 1, where n is the number of x, y pairs.

We'll be using this formula,

but we will approach the answer in sections within a table which we will develop.
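The six steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the lesson's own worksheet: the function name correlation_by_steps and the small data set are made up for the example, and the standard deviations are assumed to be sample standard deviations (n minus 1 in the denominator), which is what makes step six work out.

```python
import math
import statistics


def correlation_by_steps(xs, ys):
    """Correlation coefficient r, computed by following the six steps in order."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # Step 1: mean of the x values (x bar) and of the y values (y bar)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)

    # Step 2: sample standard deviations s_x and s_y
    s_x = statistics.stdev(xs)
    s_y = statistics.stdev(ys)

    # Step 3: (x - x bar) * (y - y bar) for each pair
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]

    # Step 4: add all of the products together
    total = sum(products)

    # Step 5: divide the sum by s_x times s_y
    scaled = total / (s_x * s_y)

    # Step 6: divide by n - 1
    return scaled / (n - 1)


# Made-up illustrative data, NOT the table from the lesson.
hours = [1, 2, 3, 4, 5]
loss = [2, 4, 5, 4, 5]
r = correlation_by_steps(hours, loss)
print(round(r, 4))
```

If every x value (or every y value) is identical, the standard deviation is zero and the division in step five fails, which matches the fact that r is undefined in that case.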

Let's do an example of calculating the correlation coefficient.

With x and y given in the table,

we can calculate x times y,

x squared and y squared.

We then total the columns as shown: sigma x,

sigma y, sigma x times y,

sigma x squared and sigma y squared.

We then place the results into the equation to yield r equals 0.972.

This result indicates a high positive correlation between the hours of

exercise, the independent variable,

and the weight loss experienced, the dependent variable.
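The table approach can be sketched the same way. The transcript does not spell out the equation the column totals go into, so the code below assumes it is the usual computational form, r = (n Σxy − Σx Σy) / sqrt((n Σx² − (Σx)²)(n Σy² − (Σy)²)). The function name correlation_from_sums and the data are made up for illustration and are not the lesson's exercise and weight loss table.

```python
import math


def correlation_from_sums(xs, ys):
    """Correlation coefficient r from the column totals of an x, y, xy, x^2, y^2 table."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # Column totals, one per column of the table
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Assumed computational formula for r (see the lead-in above)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Made-up illustrative data, NOT the table from the lesson.
hours = [1, 2, 3, 4, 5]
loss = [2, 4, 5, 4, 5]
r_table = correlation_from_sums(hours, loss)
print(round(r_table, 4))
```

This form gives the same r as the six-step procedure; it is simply easier to work by hand because it only needs the five column totals and n.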

We want to test for statistical significance using the t-test

to see if the high r value we just calculated is significant.

The steps are as follows. Number one, check the initial conditions for this t-test.

The conditions are that the response variable, y,

is considered normally distributed and independent, with equal standard deviations.

There are other tests for this, but they are beyond the scope of this lesson.

Number two, decide on the significance level.

The alpha could be .05, .10, .001, etc.

Most tests use an alpha of .05.

Number three, develop the hypotheses to be tested,

H_0 and H_1, which can be tested as a left-tailed,

right-tailed, or two-tailed test.

Number four, obtain the critical values from

the t-table in the appendix at the back of your textbook,

or look them up on the Internet.

Use n minus 2 degrees of freedom, with

plus or minus t_sub_alpha over two for a two-tailed test,

negative t_sub_alpha for a left-tailed test,

and positive t_sub_alpha for a right-tailed test.

Number five, the test statistic is given by the following formula: t equals r over

the square root of the quantity 1 minus r squared, divided by n minus 2.

The n minus 2 is called the degrees of freedom.

Number six, compare this test statistic with the critical value obtained in step four.

Reject the null hypothesis if the test statistic

is less than the critical value, negative t_sub_alpha, for a left-tailed test,

or greater than the critical value, positive t_sub_alpha, for a right-tailed test.

If not, do not reject the null hypothesis.

And number 7, state the conclusion in terms of the problem context.
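Steps five and six can be sketched as two small Python functions. The names t_statistic and reject_null are made up for this illustration, and t_crit is assumed to be the positive value read from the t-table (t_sub_alpha, or t_sub_alpha over two for a two-tailed test). The two-tailed rule, rejecting when the absolute value of t exceeds the critical value, is the standard one and is added here as an assumption, since the steps above only spell out the one-tailed cases.

```python
import math


def t_statistic(r, n):
    """Step 5: t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two (x, y) pairs")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    return r / math.sqrt((1 - r * r) / (n - 2))


def reject_null(t, t_crit, tail):
    """Step 6: compare the test statistic with the (positive) table value t_crit."""
    if tail == "left":
        return t < -t_crit       # left-tailed: reject if t is below -t_alpha
    if tail == "right":
        return t > t_crit        # right-tailed: reject if t is above +t_alpha
    if tail == "two":
        return abs(t) > t_crit   # two-tailed: t_crit is t_alpha/2 (assumed standard rule)
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Hypothetical numbers for illustration only: r = 0.5 from n = 27 pairs.
t = t_statistic(0.5, 27)
# 1.708 is the usual table value for alpha = 0.05, one-tailed, 25 degrees of freedom.
print(round(t, 3), reject_null(t, 1.708, "right"))
```

Passing the critical value in as a plain number keeps the sketch close to the lesson, where the value is looked up in a table rather than computed.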

Let's do an example. From our weight loss example, we found that r equals 0.972.

So let's test this for statistical significance.

Our null hypothesis, H_0, is that t is less than or equal to t_sub_c,

where t_sub_c is our critical value. In other words, we're saying that

exercise hours do not help weight loss or may even have a negative effect.

The alternative hypothesis, H_1, is that t is greater than t_sub_c.

In other words, we're saying that exercise hours may contribute to weight loss.

Okay, let's calculate the test statistic t.

From the calculations, we get t equals 8.273.

The critical t statistic, t_sub_c, equals 2.132 from our t-table.

It is based on our degrees of freedom, df equals n minus 2,

an alpha of 0.05,

and a one-tailed test, because the null hypothesis states less than or equal to.

So now we compare t equals 8.273 and t_sub_c equals 2.132.

Let's include our hypothesis statement again.

Comparing t equals 8.273 with t_sub_c equals 2.132,

we see that t is greater than t_sub_c.

So we reject the null hypothesis H_0 and conclude that there is indeed

a positive correlation between the hours spent on exercise and weight loss.
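The arithmetic of this worked example can be checked with a few lines of Python. The values r = 0.972 and t_c = 2.132 are from the lesson. The value n = 6 is an assumption: the transcript does not state n at this point, but six pairs is consistent with a critical value of 2.132 at 4 degrees of freedom and with the six-row range A2:A7 used in the Excel example.

```python
import math

r = 0.972    # correlation coefficient from the lesson
n = 6        # ASSUMPTION: six (x, y) pairs, so df = n - 2 = 4
t_c = 2.132  # critical value from the lesson (alpha = 0.05, one-tailed)

df = n - 2
t = r / math.sqrt((1 - r ** 2) / df)  # test statistic

# Right-tailed test: reject H_0 when t is greater than t_c
reject_h0 = t > t_c
print(f"t = {t:.3f}, t_c = {t_c}, reject H_0: {reject_h0}")
```

Under that assumption the sketch reproduces the t of about 8.273 quoted above, and the comparison with t_c leads to the same decision to reject H_0.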

You can also use Microsoft Excel to calculate the correlation coefficient r.

In a cell enter =CORREL(array1, array2)

where array1 is the data for x and array2 is the data for y.

Using our weight loss example, we enter =CORREL(A2:A7,B2:B7).

We find a result of 0.972717.

Here is the way it looks in Microsoft Excel.

The correlation chart was added to the right by highlighting

the two arrays and clicking on the scatter plot chart.

You can see that the dots are trending in a positive manner and that the 0.9727

correlation coefficient shows the dots clustering closely

around an imaginary line, if one were placed onto the chart.