So it's time for some more vegetables.

Let's get started.

So in the example we looked at,

the issue was not just about what we included in the regression,

but also what we did not include in the regression.

And this is a common issue with multiple regression.

So what happens is - let's go back to the example we looked at.

We had included feature and display and that does affect sales in a positive way.

What we did not do was include price.

So when we excluded price,

what happens is the effect of price is absorbed by feature and display.

Now this could happen if the retailer puts

something on feature and display when the price is low.

But price, as we know,

also has a negative relationship with units sold.

So price affects both the independent variable that we include - feature

and display - and also the dependent variable that we include - sales.

And that causes problems like in the example we saw:

the influence of feature and display was greater

than it really should be.
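
To see this absorption in numbers, here is a minimal sketch in Python on simulated data. Everything in it is an assumption made for illustration, not the data from the example we looked at: the variable names (price, promo, sales), the rule that items go on feature and display more often when the price is low, and the true effects of +20 units for promotion and -30 units per dollar of price.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y ~ X, with the intercept first."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5000

# Hypothetical prices, centered at 3 dollars.
price = rng.normal(3.0, 0.5, n)

# Assumption: feature and display is more likely when the price is low.
promo = (price + rng.normal(0, 0.5, n) < 3.0).astype(float)

# Assumed true effects: promo adds 20 units, each dollar of price removes 30.
sales = 200 + 20 * promo - 30 * price + rng.normal(0, 10, n)

# Regression 1 leaves price out; regression 2 includes it.
promo_only = ols(promo, sales)[1]
promo_with_price = ols(np.column_stack([promo, price]), sales)[1]

print(f"promo coefficient, price omitted:  {promo_only:.1f}")
print(f"promo coefficient, price included: {promo_with_price:.1f}")
```

When price is left out, the promo coefficient should come out well above the assumed true value of 20, because promo is also standing in for low prices. Once price is included, it should land close to 20.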

Now let's dig deeper into this with another example that I hope explains this even better.

Now let's shift a little bit away from marketing and think about just compensation.

We all want to get paid more, of course.

That's just human nature, including me.

But here's the problem;

there is some research which shows that tall people make more money.

Well, for someone like me,

who is challenged in the department of height,

that's a little issue.

So is it really true that tall people make more money?

Let's see where this research gets this information from.

Now, we looked at a survey of American adults in 1994.

This is a famous example that is replicated in a lot of books that I'm quoting here.

So they looked at data of earnings that people made,

how much money people made,

their height, and also their gender,

whether they were men or women.

Now let's look at what the regression says when you have this data from '94.

Now if you look at the left half here,

where we just looked at earnings and height,

what do we see here?

We see the effect of height.

P-value is significant, height has a high coefficient,

R-squared of about nine percent.

So if you just look at the first regression to the left here,

you would think that oh,

tall people make more money.

But what are we missing here?

We are missing gender.

We have basically mixed apples and oranges.

Right? So what happens when we include gender?

Well, in the regression we added a dummy variable called male,

which is equal to one if the observation is a man and zero otherwise.

What happens then? You see that the effect of height is reduced.

It goes from 1,262 to 443,

the P-value goes to only about 0.02,

but the effect of gender is very significant.

And R-Squared is almost the same.

So what happened here?

Basically, we were looking at a gender pay gap issue,

not that tall people are making more money.

It is true that women are, in general,

shorter than men and they also make less than men.

And that was the real reason for this big gap, not height.

But when we did not include gender,

what happens is you see that all the effect of

gender is absorbed by height and you go around thinking that hey,

tall people make more money.

And that is what we call omitted variable bias,

because we have omitted the variable male and included only height.
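
The same pattern can be reproduced on made-up data. The sketch below is not the 1994 survey: the sample is simulated, and the numbers in it (men about 5.5 inches taller on average, a true height effect of 500 per inch, a true gender gap of 10,000) are assumptions chosen only to show the mechanism.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y ~ X, with the intercept first."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 5000

# Dummy variable: 1 for a man, 0 for a woman (simulated, half and half).
male = rng.integers(0, 2, n).astype(float)

# Assumption: men are taller on average, so height and male are correlated.
height = 64 + 5.5 * male + rng.normal(0, 2.7, n)

# Assumed true model: a small height effect and a large gender gap.
earnings = 500 * height + 10_000 * male + rng.normal(0, 15_000, n)

# Left regression: earnings on height only.
height_only = ols(height, earnings)[1]

# Right regression: earnings on height and the male dummy.
height_full, male_full = ols(np.column_stack([height, male]), earnings)[1:]

print(f"height coefficient, male omitted:  {height_only:8.0f}")
print(f"height coefficient, male included: {height_full:8.0f}")
print(f"male coefficient:                  {male_full:8.0f}")
```

With male omitted, the height coefficient should come out far above the assumed 500, because height is taking credit for the gender gap. With male included, it should fall back toward 500.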

So another way to look at this omitted variable bias is by looking

at the correlation matrix, and you can see earnings

here on the y axis: the correlation of earnings with

height is 0.25 and the correlation of earnings with gender is 0.29.

But the interesting part is the correlation between height and gender: it is positive.

So there is a high correlation between gender and height,

which is driving all these results.
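
A correlation matrix like this takes one call to compute. The sketch below again uses simulated data with assumed numbers (it is not the survey data, so the correlations will not match 0.25 and 0.29), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated sample: all coefficients here are illustrative assumptions.
male = rng.integers(0, 2, n).astype(float)
height = 64 + 5.5 * male + rng.normal(0, 2.7, n)
earnings = 500 * height + 10_000 * male + rng.normal(0, 15_000, n)

names = ["earnings", "height", "male"]

# np.corrcoef treats each row as one variable and returns a 3x3 matrix.
corr = np.corrcoef(np.vstack([earnings, height, male]))

for name, row in zip(names, corr):
    print(f"{name:>8}", " ".join(f"{r:6.2f}" for r in row))
```

The entry to check is the one between the two explanatory variables, corr[1, 2] here. When it is large, leaving either variable out pushes its effect onto the other.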

And if we go back to our example of how we look at omitted variable bias,

we also call it the Z variable problem.

We did not include the Z variable male and we

attributed more credit than necessary to height.
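
The Z variable problem can be written down in a few lines. This is the standard textbook statement of omitted variable bias rather than something derived in this lecture, and the symbols are assumptions introduced here: y is earnings, x is height, and z is the male dummy.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x + \beta_2 z + \varepsilon
  && \text{(regression with } z \text{ included)} \\
z &= \delta_0 + \delta_1 x + \nu
  && \text{(omitted variable regressed on the included one)} \\
y &= \tilde\beta_0 + \tilde\beta_1 x + u,
  \qquad \tilde\beta_1 = \beta_1 + \beta_2\,\delta_1
  && \text{(regression with } z \text{ omitted)}
\end{align*}
```

The extra term, beta_2 times delta_1, is the credit that x takes for z. It is zero only if z has no effect on y or z is unrelated to x. With the coefficients quoted earlier, 1,262 without male and 443 with male, that extra term works out to 1,262 - 443 = 819.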

And this issue comes up a lot when you are looking at multiple regressions.

So we have to remember that it is not only

important to know what you include in the regression,

it is also important to know what you excluded from

the regression and what effect that might have on your inferences.

And also to know what relationship the excluded variables

have with the included variables in the regression and with the dependent variable.