0:08

Welcome to the last of the lectures on cluster sampling, on saving money in surveys, in our unit three.

Where we're going to be looking at an issue concerning

a factor that drives all of our designs.

The size of the subsample.

We were just looking, for example in lecture four, at what happens to our projected standard errors at different levels of b, the subsample size.

Where does b come from?

What considerations ought to go into choosing a value of b?

The subsample size that's appropriate for our design.

We already know from that discussion that there is a part of this that will involve

costs.

We know that as we change b and if we have a fixed sample size or

a fixed budget, as we increase b, the number of clusters has to go down.

And our costs go down.

But that means that as we're increasing b, our design effects are going up, but

our costs are going down.

They're going in different directions.

1:10

And as b goes down, the design effects go down, and then our variances go down. But the costs go in the opposite direction: if we decrease b, we can increase the number of clusters, and our costs go up.

So there is a trade off, a balance between cost factors,

how many clusters to take, and variances based on b,

in our model, 1 + (b - 1)roh, for our variances.

So what we're going to do in this particular lecture is deal with

a cost model.

We're going to get a little more sophisticated about

specifying the cost structure of our process.

And a variance model, which we already know, though I'm going to express it in

a more elaborate, more detailed way than we've done up until now.

And we are going to take the two of those and

combine them to come up with an optimal subsample size.

One that gives us good properties.

The best property that we could imagine would be that it is going to give us

a sampling variance that is as small as possible among all choices of b,

all possible subsample sizes.

And then we're going to back out of this whole process, out of that cost model, the number of clusters that we've got.

And we've got a little demonstration to talk about and

some of the features there that go with that.

3:41

Now, we know as b goes up or down, the design effect goes up or down as well.

That's a direct effect.

And as a result so does the sampling variance.

Because as b goes up, the design effect goes up,

the sampling variance goes up in our cluster sampling case.

As b goes down, our design effect goes down,

our sampling variance goes down.

But we've also seen that as b goes up or down, a goes in the opposite direction.

If we have a fixed sample size, say 1,200, and we change our b from 40 to 20, then by changing b from 40 to 20 we have to increase the number of clusters, from 30 to 60, to maintain that sample size.

If we have a fixed budget, which is more often what we have.

A fixed budget, a fixed amount of money to do our survey.

As b goes up, the number of clusters has to go down,

as we adjust relative to the total amount of money that's available.

4:36

So there is a cost-error trade-off, that we've mentioned in cluster sample design, going on here between a and b.

B being focused on variances, design effects and its impact on variances.

A being focused on cost.

So we need to choose a value of b and a value of a such that we don't exceed the budget.

What combinations of a and b are available?

Is there one of those that's optimal?

Well, what would be the best?

If we have a bunch of a's and b's that we can choose in pairs, that meet our

cost constraint or our sample size constraint, then what else is there?

Well, is there some combination of those that meets our budget constraint,

meets our sample size constraint, but gives us smaller sampling variance?

As a matter of fact, is there one that gives us the smallest sampling

variance among all those combinations?

That would be optimum.

Minimize that sampling variance.

5:35

So there is, it turns out an optimal choice of a and b for

any particular problem.

But it's formulated in the context now of a cost structure and a variance structure.

We already know the variance structure, we don't know the cost model.

So here, let's look at the cost model that's shown in red on this line.

A cost model that has several components.

On the left-hand side of the equation here is the overall cost that's available for

data collection.

I say budget available.

It's the budget available to collect our interviews, to process the data and so on.

Now it isn't the actual total amount of money that we might receive for our study.

Because our total amount of money that we receive

often includes costs that don't depend on sample size.

So the cost of the sampling statistician to design the sample probably doesn't vary a lot by sample size; it's a fixed input.

The heat and light for our building probably aren't going to depend on sample size.

There's going to be a range of sample sizes within which that's fairly constant.

So we're going to take out, more or less, those constant costs and have available funds for our data collection.

We're going to take out overhead costs.

Secondly then on the right-hand side there are two components there.

An a times c of a.

6:57

And an a times b times c of b.

Now let's deal with the first one.

A times c of a.

A is the number of clusters that we're going to select.

C of a is the cost per cluster.

That cost component for clusters primarily consists of travel.

If we do indeed have widespread units that we've gotta go to.

And preparation costs for the sample.

Now the travel, by the way, includes not only the cost of transportation,

but also staff time while they're traveling.

And it can involve multiple visits to a cluster.

So, if we were sampling school children and sampling them in schools,

we're going to have to figure in multiple visits to the school for

such things as contacting the principal.

To see whether or not they're willing to allow us to come into the school and

do our data collection.

Possibly having to talk to the superintendent of the school district to make a decision about the process.

Going back to the principal to identify a list of the classrooms that are there,

and making a selection of classrooms.

Going to each of the classrooms and visiting with the teachers,

which might require a couple of visits.

Also going back and collecting data from the children, which might involve, because of illness and absences and other kinds of things, multiple visits to the children.

All that's buried in there.

And those costs per cluster can be considerably larger than the cost

per observation within a cluster.

That's what we're worried about.

That's a big component that can easily inflate our costs substantially.

But we've got to keep it within the constraint.

To the extent that we do more clusters,

we have less money to do things within clusters.

The second component, b times c sub b.

c sub b is the cost per observation within a cluster.

It's dominated by interviewing cost,

whether that happens to be asking questions or

providing a self administered form, or some other way of collecting the data.

And c sub b is multiplied by the number of observations,

b, that we're going to use in our selection.

And then that's multiplied across the clusters that we have.

All right, that's our cost model.
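As a sketch, the cost model just described can be written as a small function; the symbol names follow the lecture's notation:

```python
def data_collection_cost(a, b, c_a, c_b):
    """Budget used for data collection: C - C0 = a*c_a + a*b*c_b.

    a   : number of clusters selected
    b   : subsample size (observations per cluster)
    c_a : cost per cluster (travel, preparation, repeat visits)
    c_b : cost per observation within a cluster (interviewing)
    """
    return a * c_a + a * b * c_b
```

For instance, using the illustrative figures that come up later in the lecture, 41 clusters of 7 observations at c_a = $65 and c_b = $25 would use $9,840 of a $10,000 budget.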

It's not actually the way costs are recorded.

We don't actually record these.

But it's a style,

an approach to cost modeling that fits in nicely with the variance model.

Because the sampling variance that we've got involves a sampling variance that

has two components, a simple random sampling variance and a design effect.

Now you'll recognize, in this red formula here, in square brackets at the end of the right-hand side, our design effect.

1 + (b-1)roh.

So there our variance depends on b.

But the first term, which has the (1 - f)p(1 - p)/(n - 1), is written with ab in place of n, where ab is the number of clusters times the number of elements per cluster.

And so here now, we see our variance

model includes not only the subsample size but also the number of clusters.

We know that as a goes up that variance is going to go down.

If a goes down, we do fewer clusters, then that sampling variance could go up.

But it's more complicated than that,

because we also have b in that denominator and we have b in the design effect.

So we need some way of combining considerations for

this variance with the cost model.
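A minimal sketch of that variance model for a proportion, with n written as ab:

```python
def cluster_sampling_variance(p, a, b, roh, f=0.0):
    """Sampling variance of a proportion under cluster sampling:
    (1 - f) * p * (1 - p) / (a*b - 1), inflated by the design
    effect 1 + (b - 1) * roh. f is the sampling fraction,
    often close to zero."""
    n = a * b                                  # total sample size
    srs_var = (1 - f) * p * (1 - p) / (n - 1)  # simple random sampling variance
    deff = 1 + (b - 1) * roh                   # design effect
    return srs_var * deff
```

Holding n = ab fixed at 1,200, moving from b = 20 (with a = 60) to b = 40 (with a = 30) raises the design effect, and hence the variance, for any positive roh.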

Now one way we could do this is very mechanically.

We could choose alternative values of a and b.

We could calculate the variance that goes with each combination.

And calculate the cost model and build a spreadsheet where we have a column of costs and a column of variances that go with them, with the driving a and b levels for each row, each alternative.

Simulate what's going to happen.

And then we could watch and see as cost and

variance are going in different directions, is there a balance point where

we get to a point where we've got the smallest variance among all alternatives?

That would be fine to do.

As a matter of fact, many people do that kind of thing today because spreadsheets

are so easy to work with.
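As a sketch of that spreadsheet-style search, assuming for illustration the figures used later in this lecture (c sub a of $65, c sub b of $25, roh of 0.05, a $10,000 budget) and a proportion of 0.5:

```python
budget, c_a, c_b = 10000, 65, 25
p, roh = 0.5, 0.05

rows = []
for b in range(2, 31):                  # candidate subsample sizes
    a = budget / (c_a + b * c_b)        # clusters affordable at this b
    n = a * b                           # total sample size
    var = p * (1 - p) / (n - 1) * (1 + (b - 1) * roh)  # fpc ignored
    rows.append((b, a, var))

# pick the row with the smallest sampling variance
b_best, a_best, v_best = min(rows, key=lambda row: row[2])
print(b_best)   # → 7
```

Scanning the rows shows cost and variance pulling in opposite directions, with a flat minimum in between, which is exactly the balance point the lecture describes.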

But really there was an earlier day when spreadsheets weren't as available, or weren't available at all, and one simply solved for the balance between cost and error to find an optimum.

And there is an approach here that can take a fixed cost.

Let's start with a budget.

We have so much money.

And we're now going to take that fixed money and

see what's the smallest sampling variance we can get by varying a and b.

12:00

And the factors in that formula all operate, if we think about it, in a way that actually makes sense in terms of our design.

So, for example, consider c sub a: c sub a going up or down moves b up or down with it. As c sub a goes up, b goes up.

If our cost of getting to the cluster increases, perhaps because what we've got is a fuel shortage and the cost of fuel is going up, then what this says is that when we get there, we should do more observations. It's costing us more money to get there, so do more data collection once we're there.

c sub b operates in a very similar fashion. For c sub b, maybe our data collection cost involves administering a test,

a standardized instrument.

And that standardized instrument has some kind of a fee

associated with the per unit administration.

And suppose that fee has just gone up.

They've got more and more demand for this test.

They're increasing the cost.

So now all of a sudden, our instrument costs have gone up.

Our costs for

doing a single observation within the cluster on average have gone up.

And this says if that's the case, when you get there, don't do as many observations.

They're just more expensive for us and we should do fewer.

So c sub a in the numerator and c sub b in the denominator make sense.

They're telling us in a quantitative way here how we should come out

with respect to the number of observations per cluster.
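The optimum being described here is the classical square-root formula, with c sub a in the numerator, c sub b in the denominator, and the factor involving roh; a minimal sketch:

```python
import math

def optimal_subsample_size(c_a, c_b, roh):
    """Optimal observations per cluster:
    b_opt = sqrt((c_a / c_b) * (1 - roh) / roh)."""
    return math.sqrt((c_a / c_b) * (1 - roh) / roh)
```

Doubling c sub a raises b_opt (do more observations where getting to the cluster is expensive), while raising c sub b or roh lowers it, matching the reasoning in the lecture.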

13:31

But there's that last factor that depends on roh, and

this one's a little bit trickier to deal with because roh is in a functional form.

(1 - roh)/roh.

Let's try out some values of roh to see how it varies.

Let's try a smaller value of roh and a larger value of roh and

see what happens to that ratio.

So, for example, if roh were 0.01, a small amount of homogeneity, then (1 - roh)/roh is (1 - 0.01)/0.01, or 0.99 over 0.01, which is 99.

In contrast to when roh is 0.05,

now we end up with 0.95 divided by 0.05 or 19.

That is, as roh increases, that factor goes down.

If we go from a variable that has a small amount of homogeneity to a variable

that has a large amount of homogeneity,

what we're going to see is that the sub sample size goes down.

We should take fewer observations per cluster.

More homogeneity within a cluster, take fewer observations within.

Which is what we think we should do anyway.

Because with more homogeneity, each successive observation that we're getting

from the cluster is giving us less and less new information because it's

correlated with what's going on in earlier selections.

So that operates in the right direction.

Okay, so we've got a single quantitative statement about the value of b.

Wait a minute.

What about a?

How do we get back to a?

Well, now what we've done is derive the value of b.

We know our fixed budget; going back to our cost model, in our second bullet here, C minus C0 is that fixed budget that we've got.

We know c sub a and c sub b.

And now we also know b optimum.

That allows us to solve for a.

Because now c sub a plus b optimum times c sub b is the actual cost per cluster.

The amount of money that it's going to cost to get to the cluster, to select it,

to get everything organized, to get there and all the preparation plus the cost

of collecting the data within the cluster, taking b observations that cost c sub b.

And so if that cost is a certain amount, and we only have a certain budget,

if we take that budget and divide by that cost,

we're going to have the total number of clusters that we need.

For example, suppose that c sub a was $65, and c sub b was $25.

These are numbers that I have from a past study that I did where we

calculated these costs based on budget considerations.

And so the cost per cluster was about $65.

The cost per element was actually pretty high in that particular case, $25.

But nonetheless, we could use those if we knew a value of roh; for a single variable, suppose roh was 0.05.

Actually, in that particular study, we had a range of variables that we were using,

several variables that we were interested in.

And it turned out that the average value of roh was 0.05, and that's what we decided to use.

Multipurpose design, something we're going to talk about in our next unit.

And so we used an average value of roh in that particular case. And the combination of c sub a, c sub b, and roh at 0.05 led to a subsample size of 7.05.

Now that would say that what we should do then is take 41.38 clusters for

a budget of $10,000 to do our data collection.

41.38 doesn't work, we can't do fractions of a cluster.

But what we can do is round that to a number, in this case we rounded it down to

41, and I would go back and recalculate a b that fit my budget.

Because the b value there, that 7.05, yes it is the optimum.

But if I had to increase it a little bit because I've decreased the number of

clusters to meet my budget exactly, spend up every dollar that's available.

Then I'm still going to be close to the optimum because of that square root

function.

So if I have to increase it to 7.1, 7.15, that's not a bad outcome.

I'm very close to the optimum and

it turns out that that optimum's very flat in that region, around that 7.05.

So even if I go to 7.2 or down to 6.95, I'm still very close to the optimum.

And so I can meet my budget and

be close to the optimum with this particular kind of a design.
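As a sketch, the whole worked example can be computed end to end; the computed values come out slightly different from the figures quoted in the lecture (7.05 and 41.38), presumably because of rounding on the original slides:

```python
import math

budget, c_a, c_b, roh = 10000, 65, 25, 0.05

# Optimal subsample size: sqrt((c_a / c_b) * (1 - roh) / roh)
b_opt = math.sqrt((c_a / c_b) * (1 - roh) / roh)   # ≈ 7.03 (lecture quotes 7.05)

# Back out the number of clusters from the budget
a = budget / (c_a + b_opt * c_b)                   # ≈ 41.5 (lecture quotes 41.38)
a_final = math.floor(a)                            # round down to 41 whole clusters

# Re-solve for b so the 41 clusters spend the whole budget
b_final = (budget / a_final - c_a) / c_b           # ≈ 7.16, still near the flat optimum
```

Because the variance is very flat around the optimum, moving from about 7.03 to about 7.16 observations per cluster costs almost nothing in precision while spending the budget exactly.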

20:02

Well, that's all we have time to do about cluster sampling here.

Our next unit is about stratified sampling, what we labeled here "Being more efficient."

Taking auxiliary information that we've had in the frame available to us and

using it in the sample selection in a way to give us a sample that's more

representative.

That has better properties, that's more believable.

Something that's more acceptable to people by using that auxiliary information and

not ignoring it.

And using that to have an outcome that might be

more administratively convenient for us to do.

That's what we're going to look at in unit four,

when we talk about stratified sampling.

Please join us then as we move ahead with the lectures there,

and I look forward to seeing you in unit four.

Thank you.