In this video, we'll study analysis of variance.

Assume that we want to analyze

a categorical variable and see

how a numerical variable differs among its categories.

For example, consider the car dataset.

The question we may ask is how the different categories

of the make feature,

a categorical variable, impact the price.

The diagram shows the average price

of different vehicle makes.

We do see a trend of

increasing prices as we move right along the graph,

but which category in the make feature has the most and

which one has the least impact

on the car price prediction?

To analyze categorical variables

such as the make variable,

we can use a method such as the ANOVA method.

ANOVA is a statistical test that

stands for Analysis of Variance.

ANOVA can be used to test whether the means differ

between different groups of a categorical variable.

For the car dataset,

we can use ANOVA to see if there is any difference in

mean price for the different car

makes, such as Subaru and Honda.

The ANOVA test returns two values,

the F-test score and the p-value.

The F-test calculates the ratio of the variation between

the group means over

the variation within each of the sample groups.

The p-value shows whether

the obtained result is statistically significant.

Without going too deep into the details,

the F-test calculates the ratio

of the variation between the group

means over the variation

within each of the sample groups.
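
The ratio just described can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name one_way_f and the toy numbers are assumptions made for the example and are not part of the car dataset.

```python
from statistics import fmean


def one_way_f(*groups):
    """Return the one-way ANOVA F statistic for two or more samples."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]

    # Variation between the group means (weighted by group size).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation within each group, around that group's own mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    # Divide each sum of squares by its degrees of freedom, then take the ratio.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy values chosen only to keep the arithmetic easy to follow.
f_value = one_way_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f_value)
```

A larger gap between the group means raises the numerator, while a wider spread inside the groups raises the denominator.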

This diagram illustrates a case where

the F-test score would be small because as we can see,

the variation of the prices in each group of data is

much larger than the differences

between the average values of each group.

Looking at this diagram,

assume that group one is Honda and group two is Subaru,

both of which are categories of the make feature.

Since the F-score is small,

the correlation between price as

the target variable and the groupings is weak.

In the second diagram,

we see a case where the F-test score would be large.

The variation between the averages of the two groups is

much larger than the variation within the two groups.

Assume that group one is Jaguar and group two is Honda,

both of which are categories of the make feature.

Since the F-score is large,

the correlation is strong in this case.
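
The two cases can be reproduced with a small sketch using the f_oneway function from SciPy. The price lists below are made-up values chosen to mimic the two diagrams; they are not taken from the car dataset.

```python
from scipy import stats

# Hypothetical case 1: wide spread inside each group, similar averages.
overlap_1 = [6000, 9000, 12000, 7500, 10500]
overlap_2 = [6500, 9500, 11500, 8000, 11000]

# Hypothetical case 2: tight spread inside each group, very different averages.
separated_1 = [8000, 8200, 8100, 7900, 8300]
separated_2 = [34000, 35500, 36000, 34500, 35000]

f_small, p_large = stats.f_oneway(overlap_1, overlap_2)
f_large, p_small = stats.f_oneway(separated_1, separated_2)

print("overlapping groups: F =", f_small, "p =", p_large)
print("separated groups:   F =", f_large, "p =", p_small)
```

In the first case the F-score comes out well below one, and in the second it is in the thousands.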

Getting back to our example,

the bar chart shows

the average price for different

categories of the make feature.

As we can see from the bar chart,

we expect a small F-score between Hondas and

Subarus because there is

a small difference between the average prices.

On the other hand,

we can expect a large F-value between Hondas and

Jaguars because the difference

between the average prices is very large.

However, from this chart,

we do not know the exact variances.

So let's perform an ANOVA test

to see if our intuition is correct.

In the first line, we extract the make and price data.

Then, we'll group the data by different makes.

The ANOVA test can be performed in Python using

the f_oneway function

from the stats module of the SciPy package.

We pass in the price data of

the two car make groups that we want to compare,

and it calculates the ANOVA results.
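
The steps just described could look roughly like the sketch below. The small DataFrame is a hypothetical stand-in for the car dataset, and the column names make and price and the variable names are assumptions for illustration, so the numbers it prints will not match the results on the slide.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the car dataset (values invented for the example).
df = pd.DataFrame({
    "make": ["honda", "honda", "honda",
             "subaru", "subaru", "subaru",
             "jaguar", "jaguar", "jaguar"],
    "price": [7295, 8845, 10345,
              7603, 9233, 11259,
              32250, 35550, 36000],
})

# Extract the make and price data, then group the rows by make.
df_anova = df[["make", "price"]]
grouped = df_anova.groupby("make")

# Compare the price samples of two makes at a time.
f_hs, p_hs = stats.f_oneway(grouped.get_group("honda")["price"],
                            grouped.get_group("subaru")["price"])
f_hj, p_hj = stats.f_oneway(grouped.get_group("honda")["price"],
                            grouped.get_group("jaguar")["price"])

print("Honda vs Subaru: F =", f_hs, "p =", p_hs)
print("Honda vs Jaguar: F =", f_hj, "p =", p_hj)
```

f_oneway accepts two or more samples, so more makes could be passed in the same call to test them all at once.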

The results confirm what we guessed at first.

The prices of Hondas and

Subarus are not significantly different,

as the F-test score is less than one,

and the p-value is larger than 0.05.

We can do the same for Honda and Jaguar.

The prices of Hondas and Jaguars are

significantly different since the F-score is very large,

F equals 401, and the p-value is smaller than 0.05.

All in all, we can say that there's

a strong correlation between a categorical variable and

other variables if the ANOVA test gives us

a large F-test value and a small p-value.