Sampling variability and the central limit theorem
should not be new concepts to you anymore.
However, in this unit we're shifting the focus away
from numerical variables and focusing on categorical variables only.
So, in this video, we're going to start by
talking about the sampling distribution for a sample proportion,
because remember, when we're dealing with categorical variables, the parameter
of interest is no longer a mean but a proportion.
And we're also going to define the central limit theorem for
proportions, which is very similar to what we've seen before
but with a different measure of the standard error, as expected.
And we're going to walk through the conditions for
the central limit theorem to hold as well.
Let's revisit quickly what we mean by a sampling distribution.
Say you have a population of interest and you take a random sample from it.
And based on that random sample, you calculate a sample statistic.
If in that sample the variable of interest is a categorical
variable, the sample statistic is going to be a sample proportion.
Then we take another sample, and also calculate the sample proportion from that.
And then another one, and then another one.
And this goes on for a long time, because we
want to think about taking as many samples as we can.
The distributions of the observations within
each sample are called sample distributions.
However, when we look
at the distribution of the sample statistics,
this is what we call our sampling distribution.
And remember that these two are not the same thing at all.
In the sample distributions, the observations are individuals,
let's say people or cases, whatever it is that you're
sampling, whereas in a sampling distribution the observations are sample statistics.
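To make that distinction concrete, here is a minimal simulation sketch. It assumes a yes/no categorical variable, and the true proportion (0.2), the sample size (1000), and the number of repeated samples (500) are all hypothetical values chosen only for illustration. The function names are made up for this sketch as well.

```python
import random
import statistics

def draw_sample(p_true, n, rng):
    """One sample distribution: n individual yes/no observations (True = yes)."""
    return [rng.random() < p_true for _ in range(n)]

def sample_proportion(sample):
    """The sample statistic: the fraction of 'yes' observations in one sample."""
    return sum(sample) / len(sample)

def simulate_sampling_distribution(p_true=0.2, n=1000, num_samples=500, seed=0):
    """Sampling distribution: one sample proportion per repeated sample.

    p_true, n, and num_samples are hypothetical values, not from the lecture.
    """
    rng = random.Random(seed)
    return [sample_proportion(draw_sample(p_true, n, rng))
            for _ in range(num_samples)]

# Each entry of p_hats is a sample statistic, not an individual observation.
p_hats = simulate_sampling_distribution()
print("number of sample proportions:", len(p_hats))
print("mean of sample proportions:", round(statistics.mean(p_hats), 3))
print("spread of sample proportions:", round(statistics.stdev(p_hats), 4))
```

Notice that the list returned by `draw_sample` holds individuals, while `p_hats` holds one proportion per sample. That is the difference between a sample distribution and the sampling distribution.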
Let's give a slightly more concrete example. Say we want to estimate
the proportion of smokers in the world.
So our population is our world population, and capital N is going
to be our population size, so this is everybody in the world.
And our parameter of interest is p, the
proportion, the true proportion of smokers in the world.
If we actually had data from the entire population, we could calculate this
p as the number of smokers in the world divided by the total
population size.
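In symbols, that calculation could be written as follows, using the p and N just defined:

```latex
p = \frac{\text{number of smokers in the world}}{N}
```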
But we don't have data from every single person in the world, so
let's say that you're taking many samples from this population.
So the idea here is not necessarily
a realistic situation where you're doing data
analysis per se, but we're trying to
illustrate what we mean by a sampling distribution.
So you start with the first country on the roster, Afghanistan, and you sample 1000
people from Afghanistan.
And you ask each individual, are you a smoker or
not, and record a yes or a no for each person.
Then so on and so forth, you go to many countries.
Let's say you take another random sample of 1000 from
the U.S. again, asking each person are you a smoker or not?
And recording a yes or a no for them.