A conceptual and interpretive public health approach to some of the most commonly used methods from basic statistics.


From the course by Johns Hopkins University

Statistical Reasoning for Public Health 1: Estimation, Inference, & Interpretation

188 ratings

From the lesson

Module 3A: Sampling Variability and Confidence Intervals

Understanding sampling variability is the key to defining the uncertainty in any given estimate based on a sample or samples from a single study. In this module, sampling variability is explicitly defined and explored through simulations. The resulting patterns from these simulations will give rise to a mathematical result that is the underpinning of all statistical interval estimation and inference: the central limit theorem. This result will be used to create 95% confidence intervals for population means, proportions and rates from the results of a single random sample.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So thus far in the course, we have estimated quantities and

taken them as being our best estimate for some unknown truth.

For example, we used the sample mean to estimate the

mean of the larger population from which the sample was taken.

But we don't know the true population mean.

So I've consistently alluded to the idea that we would have to grapple with the

uncertainty that comes from using such sample

based estimates as our best guess for some

larger population quantity.

And today's the day we're going to get started with formalizing this.

In this course, we are espousing what is called a frequentist philosophy of

statistics, which boils down to the idea that any sample we get randomly

from a population is one of many possible random samples we could have

gotten just by chance because the process

we're using involves chance, the random sampling.

What we need is to get a handle on the potential variability and characteristics

of a sample across different samples from the same population.

For example, how does the sample mean based on a sample of

size 50 from some population vary across different samples of size 50?

This will help us understand the uncertainty in sample-based estimates

like sample means, proportions, and incidence rates.

So to do this we're going to need to establish and look at characterizing

what is called the sampling distribution

for our statistic or statistics of interest.

So in this first section we will work

to define the notion of a sampling distribution.

Okay, in this next set of lectures we're

going to talk about a very important idea in statistics.

Something that will help us take the estimates and associations

we've developed at the sample level, and relate

them to the unknown truth we're trying to estimate.

And we're going to be talking about this thing called the sampling distribution.

So to set it up, we're going to use this lecture

section simply to define the sampling distribution of a sample statistic.

So we want to contend now that we've laid out

ways to summarize information in single samples of continuous data,

binary data, and time-to-event data, and also how to compare samples of such data

types by looking at differences in means,

risk differences, relative risks, and incidence rate ratios.

We have discussed how to do this, and we have discussed

how these sample estimates are not necessarily the truth that we want to

get at, the population truth, but are the best we can do based on

the imperfect sample we have from our population or populations of interest.

So ultimately it is important to recognize the potential uncertainty in a sample-based

estimate of our quantity as it

relates to the unknown truth it is estimating.

So understanding sample-based estimates and how they vary

across random samples of the same size from the same population

will give us a framework for taking the estimate we have and coupling it with

some measure of uncertainty to ultimately

make a statement about the unknown truth.

In this set of lectures, starting with this one, we define the sampling distribution.

And through lecture sections B through D, we'll also characterize and

estimate the theoretical sampling distribution of a sample statistic.

For example, a sample mean, proportion, or incidence rate.

And ultimately, what this will allow us to do

is create an interval describing a plausible range of

values for the unknown truth that we can only

estimate using the results from a single random sample, or

from multiple random samples if we're making a comparison.

And this type of interval that we'll get

into detail about shortly is called a confidence interval.

So I've been talking about this idea of

uncertainty in sample-based estimates, and alluding to it.

But what do we really mean when

we talk about uncertainty in sample-based estimates?

It is also commonly referred to as sampling variability, and

we'll use that term throughout the course and

you'll hear it in other settings.

Let's just take an example, suppose I was studying something

about the one year olds in Nepal, and one of the

things I wanted to characterize about this population was the

distribution of the heights of one year old children in Nepal.

But certainly because of budget, time, and logistical limitations, there's no

way I could actually measure all one year olds at any

given time in Nepal to collect all these data.

So what I'm going to have to do is take a sample.

And suppose I'm doing a small study, and I

can only afford to recruit ten children and measure their heights.

And so I take a sample of ten children, and the mean height I get for these ten

children is 68 centimeters. Suppose somewhere else, in the same

region, a colleague of mine unbeknownst to me has the same idea.

And he or she also has a limited budget and

limited time, so takes a random sample of

size ten from this population and ends up with

ten different children than I did, just by chance,

and gets a sample mean estimate of 71 centimeters.

Meanwhile, in another part of the region, there's another

researcher that neither of the first two of us know.

And he's doing a study to estimate the

distribution of heights for one-year-old Nepali children.

And he gets a sample of ten children, and

when he takes their mean height, it's 66 centimeters.

And so on. Suppose this is happening all over Nepal.

Well, what we're seeing here, and this isn't how real

research is done, usually there'd be one researcher taking one sample.

But this is just to illustrate the principle of sampling variability.

You can see that these mean estimates

are not identical across the samples, nor would

we expect them to be, because we're taking random

samples of small size from a much larger population.

We don't expect necessarily to get the same children

in each of the samples.

So this illustrates the principle of sampling

variability in an estimate based on sample data.

Suppose we had done this again, or another group of

researchers did the same thing, but they took larger samples.

Well, we would still expect to see variability in their estimated means

across the samples.

But if we compared the variability in the means

based on samples of 50 children to the variability in the means

based on samples of ten children, the variability in

the means based on 50 would tend to be smaller.

And we'll demonstrate that in another section.
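A rough sketch of that comparison, using a made-up population rather than real data: the 10,000 simulated heights, the mean of 68 cm, and the standard deviation of 3 cm are all assumptions chosen for illustration, and the helper name `simulate_sample_means` is hypothetical. It repeatedly draws random samples of size ten and size 50 and compares how much the sample means vary.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (without replacement within each
    sample) and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 heights in cm (assumed mean 68, sd 3).
pop_rng = random.Random(42)
population = [pop_rng.gauss(68, 3) for _ in range(10_000)]

# 2,000 repeated studies at each sample size.
means_10 = simulate_sample_means(population, 10, 2000, seed=1)
means_50 = simulate_sample_means(population, 50, 2000, seed=2)

# Standard deviation of the sample means = spread across studies.
sd_10 = statistics.stdev(means_10)
sd_50 = statistics.stdev(means_50)

print(f"spread of means, samples of 10: {sd_10:.2f} cm")
print(f"spread of means, samples of 50: {sd_50:.2f} cm")
```

Under these assumptions the means based on 50 children should cluster more tightly around the population mean than the means based on ten children.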

Here's another example.

Let's talk about the Baltimore mayoral election.

Suppose there are two candidates, I'll just refer to them as A and B.

And we're interested in candidate A and his or her chance of winning the election.

We're working for her campaign.

Suppose we only have limited resources in the

beginning of the campaign because donations haven't come in.

And so we take a poll of ten persons

who are registered voters, and ask them whether they plan to vote for candidate A.

And the proportion we get who say yes is 60%, six out of ten.

That sounds good, that sounds good for candidate A.

And I could go to him or her and say look, you're polling favorably,

60% of Baltimore voters say they'd vote for you based on a random sample.

But when he or she found out that the sample

is based on ten people, they wouldn't be particularly excited.

Why not?

Well, maybe another pollster from the newspaper does

a study on ten randomly selected Baltimore residents.

And their results show that four of the ten in the sample would

vote for candidate A, for an estimated proportion of 40%, which would lose the election.

Now, you can expect these proportions, which are based on ten people at a time,

to have a fair amount of variation just because they're not very stable estimates.

One person changes their vote and the proportion goes up or down by 10 percentage points.

Each voter has a large influence

over the estimated sample proportion. Contrast that with what happens

if some donations then roll in and we're able to go out and poll 500 people.

Well, there may be variation in the estimates

based on different samples of size 500, but we'll

show that systematically they are less variable than the

results based on samples of size ten.

But theoretically, you know, maybe we do this, and we estimate that

46% of the people will vote for candidate A, which isn't so good.

But maybe somebody from the newspaper takes a sample

and estimates 49%, and so on and so forth.

There's still going to be variation in these estimates, just because

we don't have the same 500 voters in each sample.

And with this variation, we're now not measuring

variation in individual responses to the question,

but in the summaries of individual responses

across different samples of the same size.
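As a minimal sketch of the polling example, the code below simulates many polls of ten voters and many polls of 500 voters. The true support of 48% is an assumed value, not a figure from the lecture, and each respondent is modeled as an independent yes/no draw, which treats the voter population as effectively infinite. The function name `simulate_polls` is hypothetical.

```python
import random
import statistics


def simulate_polls(true_p, n_voters, n_polls, seed=0):
    """Return the estimated proportion voting for candidate A in each of
    n_polls simulated polls of n_voters respondents."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_polls):
        # Each respondent says "yes" with probability true_p (assumption).
        yes_votes = sum(rng.random() < true_p for _ in range(n_voters))
        proportions.append(yes_votes / n_voters)
    return proportions


TRUE_SUPPORT = 0.48  # assumed true proportion, for illustration only

polls_10 = simulate_polls(TRUE_SUPPORT, 10, 1000, seed=1)
polls_500 = simulate_polls(TRUE_SUPPORT, 500, 1000, seed=2)

# Spread of the estimated proportions across polls of the same size.
spread_10 = statistics.stdev(polls_10)
spread_500 = statistics.stdev(polls_500)

print(f"polls of 10:  min {min(polls_10):.2f}, max {max(polls_10):.2f}")
print(f"polls of 500: min {min(polls_500):.2f}, max {max(polls_500):.2f}")
```

Note that what varies here is the poll-level summary, the estimated proportion, not the individual yes/no answers. In a poll of ten each respondent moves the estimate by 1/10; in a poll of 500 each respondent moves it by 1/500.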

So how can we formalize this?

How can we actually formalize this definition of sampling variability?

Well, this quantity that I promised we would define,

the sampling distribution of a sample statistic, provides the answer.

The sampling distribution of a sample statistic is a theoretical distribution,

one that we'll never actually observe or be able to create by

brute force, that describes all possible values of a sample statistic

from random samples of the same size, taken from the same population.

So let's talk about the theoretical sampling

distribution of sample mean heights of random samples

of, say, 50 Nepali children who are 12 months old.

In reality, any researcher studying this population who

wanted to study 50 children would take one

random sample of size 50. But

what the theoretical sampling distribution represents

is the process of taking

all possible random samples of size 50 from this large population

of 12-month-old Nepalese children. Taking all possible samples

of size 50. So maybe sample one has 50 children.

And the mean height in this group is 69 centimeters.

Sample two has 50 children. And the mean height in

this group of 50 is 70.4 centimeters.

Sample three has 50 children, and the mean height in this group

is 67.6 centimeters, so on and so forth.

And suppose we were to exhaust all possible

random samples, of which there would be an astronomically large number,

and compute the sample means for

all possible unique random samples of size 50,

and then plot those mean values in a histogram.

This histogram

would show the distribution, not of individual heights of

children in any one of the samples, but it

would show the distribution of the summary measures, the

sample means, across the different samples of size 50.

And that would be our theoretical sampling distribution.

And we'll see by computer simulation in the next set

of lectures, what the resulting distributions tend to look like.

Obviously, we would never do this process in real life.

This is where we'd end up if we did.
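To make the idea of "all possible samples" concrete, here is a toy sketch where brute force is feasible: a hypothetical population of only five children and samples of size two. The five heights are invented for illustration. With a population this small we can enumerate every unique sample, compute each sample mean, and print a crude text histogram of the resulting sampling distribution.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical tiny population of five heights in cm (invented values).
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
sample_size = 2

# Every unique sample of size 2, and the mean of each one.
all_sample_means = [mean(s) for s in combinations(heights, sample_size)]

# Tally how often each sample mean occurs: this tally IS the sampling
# distribution of the sample mean for this toy population.
distribution = Counter(all_sample_means)

for value in sorted(distribution):
    print(f"{value:5.1f} cm  {'#' * distribution[value]}")

print(f"number of unique samples: {len(all_sample_means)}")
print(f"mean of the sample means: {mean(all_sample_means):.1f} cm")
print(f"population mean:          {mean(heights):.1f} cm")
```

The histogram shows the distribution of the summary measure across samples, not the distribution of individual heights. For a realistic population and a sample size of 50, the same enumeration would be far too large to carry out.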

Similarly, we can think about generating the theoretical sampling distribution of

a sample proportion of people voting for candidate A from random

samples of 100 Baltimore City residents.

And that would be if we enumerated all possible

random samples of 100 people, and surveyed 100 people

at a time in each of our samples as

to whether they'd vote for candidate A or not.

And we'd get differing estimates depending on the sample we are looking at.

And if we did this for all possible random samples

of 100 people from the population of Baltimore City residents,

and then generated a histogram of

those sample proportions across the

enormous number of unique random samples we could draw,

that distribution would be the theoretical sampling distribution of

the sample proportion of persons who would vote

for candidate A, based on samples of size 100.
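To get a feel for how many unique samples of 100 people are possible, the sketch below counts them with the binomial coefficient "N choose 100". The population size of 600,000 is an assumed round figure for illustration, not a number from the lecture.

```python
import math

# Assumed round figure for the number of city residents (illustrative only).
population_size = 600_000
sample_size = 100

# Number of distinct unordered samples of 100 people from the population.
n_possible_samples = math.comb(population_size, sample_size)

# Report the order of magnitude rather than the full integer.
n_digits = len(str(n_possible_samples))
print(f"number of unique samples has {n_digits} digits")
```

Under this assumption the count is vastly larger than anything that could be enumerated, which is why the sampling distribution has to be treated as a theoretical object.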

And again, in section C, we'll show how the sampling distribution

of sample proportions tends to look as a function of sample size.

So again, we're just getting started here. The sampling

distribution that I've been speaking of is a theoretical entity.

It can't be observed directly or exactly specified.

And in real life research, we're only ever going to take

one sample from each population under study.

We would never take thousands and thousands of samples of 100

people at a time to understand the variability in our sample statistic.

So, lecture sections B through E will serve to further demonstrate

and define sampling distributions, first by

detailing the results of some computer simulations,

to do a better job of drawing the pictures I tried to draw in the previous slide.

And with these simulations, which are just to

illustrate a point, we'll empirically show some consistent properties

of these sampling distributions, regardless of the sample statistic

whose behavior we're looking at, mean, proportion, incidence rate.

And then we'll unveil a mathematical property called the central

limit theorem that will allow us to generalize

the results we've seen in some specific examples.

In other words, this will allow us to

say, in advance, without any data, what the sampling

behavior of the statistic we're using to estimate some

quantity would look like across all possible random samples.

And we can take that knowledge and couple it with the results

from any single random sample from our

population to estimate the characteristics of this distribution.

And this will allow us to specify the sampling

distribution from a single sample of data and then ultimately

use that coupled with our estimated quantity to make

an interval statement about the truth we're trying to study.
