Hello, and welcome again to our lectures on sampling people, records and networks. This is our fourth lecture in the first unit that we're dealing with, the first of six units. And in this particular case, we're going to be dealing with the topic of randomization. We talked about sampling and why one would sample at all in the last lecture, but here we're going to deal with how we draw the sample, and in particular the use of chance selections, and why we should even do it at all. It sounds scientific, I know, to use chance selections. It sounds like it's fair, it sounds like it's the right kind of thing, and that's not a bad reason to do probability sampling, as we'll label this. But it's a more complex procedure than informal methods of sampling, where we're just recruiting subjects. And so we're going to incur added expense to do this. What do we get with it? What do we get with probability sampling and how valuable is it? That's what we're going to look at in this lecture and the next couple of lectures. And then we're going to continue our discussion by actually doing some applications, although we'll do one here in this lecture as well.
We're going to discuss this issue in four parts here, and you'll see our display in the upper left now. We're going to first look at what random numbers look like, and then we're going to talk about how they might be used in selection, and then kind of a peculiar question, should I put it back? I'll tell you more about what I mean by that. And then an example of sampling people from a list.
Random numbers come in many different forms. Here are ten random numbers. These random numbers are all between zero and one. You can see that after the decimal they've got five digits. They can have more. These were generated by a system that gave us random numbers between zero and one, and any number between zero and one was equally likely.
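As a rough illustration of that kind of system, here is a minimal Python sketch that produces ten numbers between zero and one, each shown with five digits. The function name `uniform_numbers` and the seed value are assumptions made for this example, not the generator used for the slide, so the numbers it prints will not match the ones on the display.

```python
import random

def uniform_numbers(count, digits=5, seed=None):
    """Return `count` numbers drawn uniformly between 0 and 1, rounded to `digits` places."""
    rng = random.Random(seed)  # a seed makes the run repeatable; seed=None gives a fresh sequence
    return [round(rng.random(), digits) for _ in range(count)]

# Hypothetical seed, chosen only so the example can be rerun with the same result.
numbers = uniform_numbers(10, seed=2024)
for x in numbers:
    print(f"{x:.5f}")
```

Rerunning with the same seed gives the same ten numbers, which is a reminder that these are computer-generated values that only behave like chance draws.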
Now, if you want to specify them in terms of five digits, or one, or three, that's fine. But the idea is that any number within that range has the same chance of being generated as any other. Those ten random numbers there are from what's called the uniform distribution. Uniform, equal, uniform probability of selection for all those numbers, and there are ten of them. Now, you can't distinguish them from non-random numbers, frankly. It's not impossible for us to generate the same number twice. It's very unlikely, but that first number, 0.49018, could have been generated ten consecutive times and we would still have random numbers. So there's no limitation on these, there's no restriction on the random numbers. All we know is that every number from zero to one has an equal chance of being selected, and we've done a five-digit representation here.
Here's another set of random numbers. These are single-digit numbers, the digits 0, 1, 2, 3 through 9. There are 10 possibilities in every one of the 50 locations that we see. It's just a string, a string of 50 numbers. And the digits 0 through 9 are all represented in those 50 numbers. But in any given location, there could be any number from zero to nine. They're also from the uniform distribution, but it's a different way of thinking about it. The numbers from zero to nine are all equally likely at any given location. 50 different random numbers, each with an equal chance of being any one of those digits. And so we now have a sequence of random digits as well. Well, we had a sequence before, but here's one that's a little more concrete, 50 digits in sequence. In the same way, we could have had a sequence of digits that chooses between one and two, where one represents a coin toss coming up heads and two represents tails. All of these are possible ways of representing these things.
Now, they can be grouped in many different ways. These 50 digits can also be shown as that same 50 digits grouped in sets of 5.
They don't have to be strung out like that. They've been grouped in a way that's a little more convenient to read, listed in blocks of five. Oftentimes random digits are generated in just such a way, long, long strings of them, and then they are written out in blocks like this just to make it more convenient to see what the numbers are.
We can also use random numbers that aren't uniformly distributed. For example, these numbers here come from a generation system that tends to generate numbers that are closer to the middle, closer to the middle of the range from zero to one. They tend to be more frequent towards the middle. Well, that sounds biased. I mean, why would anybody do that? Well, it's because this comes from a distribution that occurs naturally. We see this kind of distribution arising in practice, and so we want to mimic what's going on in the real world by using numbers that follow that Normal distribution. By the way, the word Normal here, which I've used with a capital letter: think of it not as normal in the sense of what everything should be, but more as a standard, a standard distribution, because it does occur so often and it is one that is widely used in statistical practice. So it comes more from the French than it does from the English meaning of normal, kind of middle.
Here are 4,500 random digits. They're grouped into blocks of five. I know it's really hard to see, right? But nonetheless you get the idea, they're in those chunks of five. There are 50 rows and 90 columns, 18 blocks of 5 going across those columns, 4,500 digits. Why would someone display them this way? Well, it may be that sometimes we need five-digit numbers, or two-digit, or one-digit, depending on the problem, the application, and having them grouped this way makes it easier for us to see them, to read them, to use some sets of numbers for two digits and some for three. Random number tables like these appear in textbooks.
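To make that layout concrete, here is a minimal Python sketch that builds a table of the same shape: 50 rows, each holding 18 blocks of 5 digits, for 4,500 digits in all. The function name `random_digit_table` and the seed are assumptions for illustration, so the digits will not match any printed table.

```python
import random

def random_digit_table(rows=50, blocks_per_row=18, block_size=5, seed=None):
    """Build a table of random digits; each row is a list of digit blocks stored as strings."""
    rng = random.Random(seed)
    table = []
    for _ in range(rows):
        # Each digit 0-9 is equally likely at every position.
        blocks = [
            "".join(str(rng.randrange(10)) for _ in range(block_size))
            for _ in range(blocks_per_row)
        ]
        table.append(blocks)
    return table

table = random_digit_table(seed=1)  # hypothetical seed, only for repeatability
for row in table[:3]:
    print(" ".join(row))  # first three rows, blocks separated by spaces
```

Storing the blocks as strings rather than integers keeps leading zeros, which matters when a block such as 01923 is read as digits.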
My mentor, a man named Leslie Kish, had a table like this in his textbook. He made a trip to the People's Republic of China a long time ago to talk with them about doing chance selections in demographic studies, studies of population. And while he was there, to honor him as a sociologist and a statistician, one of the students went out and took the random numbers from the table in his textbook, and they printed them on a T-shirt. And he brought these back. He was so thrilled with this. So he brought them back and handed them out to students and colleagues. And my wife has one. It's quite a fashion statement, just digits, grouped in sets of five. Actually, some people think they look like zip codes, but that's neither here nor there. It's not necessarily good fashion, but it does represent a way of looking at these.
Here's another representation of these. This is from one of my favorites, a book published by the Rand Corporation. The Rand Corporation was a think tank during the war years, and even today they do a lot of work on defense contracts. Following World War II they published a book called A Million Random Digits. So now we're talking about a sequence of a million digits from zero to nine, going on and on, and they grouped them into blocks of five. They laid them out over 400 pages with 2,500 digits on a page, 50 rows and 50 columns on a page. And then they printed them. Now, why would anybody do something like this? It had a serious purpose. It also generated some fun things, but the serious purpose was that they were demonstrating that they had a random number generator that would generate numbers that looked random. Now, they weren't actually convinced that the generator was truly random, so they called it a pseudo-random number generator. Pseudo is the English pronunciation of the Greek meaning false. And they wanted to demonstrate that the random number generator was coming really close to a random process.
So they did it a million times and they counted how many zeros, how many ones, and so on. In effect, the only text in the book is the first page, a short table showing how many zeros, how many ones, how many twos, and so on. And the rest of it is the digits.
Now, some people have had some fun with this. There actually are book reviews of this Million Random Digits out there on Amazon. The Freakonmetrics blog quoted some of these, and I just wanted to share them with you. For example, someone reading this wrote: to whom would I write to report typographical errors? I noticed that the first 7 on the third line on page 48 should be a 3. The seven that's printed there now isn't random. Other than that, this is really an excellent book. Now, how did they know that? I'm not quite sure. I think they were just giving us a little poke at such an odd collection. There was another one that came in and said: such a terrific reference work, with so many terrific random digits. It's a shame they didn't sort them to make it easier to find the one you're looking for. Of course, if they sorted them, they wouldn't be in random order anymore. We'd have something like 100,000 zeros, then 100,000 ones, and so on, in order. Well, they're just poking fun at it. My favorite was this one: I took a class in Statistics in college. I used this book to help me select random phone numbers for a poll I was conducting for my class project. The most popular household cleanser in the greater Siouxland area is Bon Ami, by the way. One of those phone calls was answered by a woman who is now my wife. We've been happily married for ten years. Well, that's a kind of unusual use of random digits, to find a spouse, but at least it reflects something of how chance enters that process.
But back to our random digits. We can use these to make our selections now. This kind of thing could be used to select a sample by taking a frame. Remember what we said about frames a few lectures ago. Here's a frame. This is just a list.
This is a list of faculty. We don't have the names of the faculty. We have a sequence number. We have an ID number, an eight-digit ID number. We know what college, what division they're in. We know their sex, and their rank and salary information. Now, we're going to draw a sample from these. These actually came from the faculty at the University of Michigan. The whole list has 370; this is just the first 25. And the idea would be that we're going to grab some random numbers from the table and match them with the ID numbers of the faculty. For example, we could try to do it with the eight-digit ID number. Well, that would be a little bit complicated. We'd need eight-digit numbers. And then, 8 digits, 370 faculty, man, we're going to generate a lot of 8-digit numbers that don't match up at all. So that's probably a very inefficient way to do this. We would be better off using the sequence number, and the sequence number would allow us to generate three-digit numbers. From 001 up to 370, any random three-digit sequence that we might choose from one of those tables, matched up to these, would be a selection.
So here's another table of random numbers. I told you there are lots of them out there. But here's one that actually comes from another source than the ones that I've shown you so far. They're grouped in sets of five, in rows, and we need to start choosing from these the numbers that we match up to that list. Where do we start? It doesn't matter where we start. I suppose you could close your eyes and drop a pencil on the page. I know that's actually written about in some textbooks, but it's not necessary. Start in the upper left. Whatever you like to do. But do something that's systematic, so that you use these numbers and use them in a way that lets you keep track of what you've done. Let me illustrate. So for example, let's look at this first number here. A three-digit number is what we need.
Now, I know that's hard to see, but the first three digits are the first three digits of the first five-digit block. 579, we're going to match that up to our list. Except that we don't have a 579 on the list. It only goes, by sequence number, to 370. We'll cross it off. The next number is 341. I'm just going to go down the columns, those first three columns. I'm ignoring the last two digits in the block of five, and that's fine. 341 is on the list, and that becomes our first selection, so I've circled it. So there's my first sample selection. Continuing down, 019 is the next one. That's also on the list. Notice the leading 0. We need to be able to get 19s in there, or 1s or 2s, so we use sequences that involve 0s as leading digits as well. So 19 becomes one of our selections. We go to the next one, 253, and 253 is another one of our selections. And then 238 and 694, and here I've just done more than one at a time, I did a bunch of them. 238 is on the list; 694 is over 370, so it gets crossed off.
Now, you notice that I have gone down the column. Below the bottom of the slide this actually goes on and on to get our sample. And when I got to the bottom of the page, the bottom of the columns, and there were 50 rows, I still didn't have enough for my sample. So I went back to the top and took the next set of three, in this case the first three digits of the next block of five. That was just the way I went through the list. You could have chosen to use the last two digits of the first block of five and the first digit of the next block of five; it doesn't matter, just as long as you keep track of it and identify what you've done. And there's our sample. 341, 19, 253, all the way through 291. That's the random sample that we've selected for our purposes.
One question comes up. It would have been possible to have gotten the same number more than once. As a matter of fact, in the actual sequence that I used, one number did come up twice. I think it was number 238. Well, what do we do with that? Should we select it twice or not?
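The walk down the table just described can be sketched in a few lines of Python. This is only an illustration of the procedure, not the exact routine used for the slide: the function name `draws_from_table` is an assumption, and the trailing two digits of each block in the small example are placeholder zeros, since only the first three digits of each block are read out in the lecture. Repeated numbers are kept as drawn here, so the question of what to do with them is left open.

```python
def draws_from_table(table, frame_size, sample_size, width=3):
    """Read down each column of blocks, using the first `width` digits of each block.

    Numbers outside 1..frame_size are crossed off (skipped).
    Repeats are kept exactly as they come up.
    """
    accepted = []
    n_columns = len(table[0])
    for col in range(n_columns):      # finish one column of blocks, then move to the next
        for row in table:             # go down the rows of the current column
            value = int(row[col][:width])  # "019" becomes 19, leading zeros handled
            if 1 <= value <= frame_size:
                accepted.append(value)
                if len(accepted) == sample_size:
                    return accepted
    return accepted  # table ran out before the sample was complete

# One column of blocks; the last two digits of each block are placeholders.
table = [["57900"], ["34100"], ["01900"], ["25300"], ["23800"], ["69400"]]
print(draws_from_table(table, frame_size=370, sample_size=20))  # [341, 19, 253, 238]
```

With only six blocks the table runs out long before 20 selections, which is why a real table has many rows and columns to move on to.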
Now, this raises a question about putting it back, and there are two alternatives here, obviously. We keep it or we discard it, we drop it. If we keep it when it's drawn a second time, that's what's known as with replacement sampling, and we're going to go over this again several times. If we drop it, it's known as without replacement. With replacement means that once we've chosen it, we put it back in the list and make it eligible for selection again. Without replacement means we take it out of the list and make it ineligible for subsequent selection. And this has an impact on some of the things that we're going to be looking at. The preference is to drop it. It leads to better samples. But better samples not because of the appearance of the samples, although this is what some people might think, but because of the properties with respect to the quality of the data that we get.
Back to the sample. Here's the whole sample. Here are the 20. They're now numbered from 1 to 20 in terms of the order in which they were selected, and there you can see the randomness, I suppose, of the random sequence. We went through and we also put in there the incomes, just to illustrate, and these are in thousands of dollars. And we can see that the mean income of this sample was $78,600, a mean. That's the result, that's what we wanted. So we have an example of sampling people from a list and then calculating a result from that list.
What we're going to do next is look at what happens when we randomize. Well, we think we've just seen it. No, there are some consequences here that we need to understand statistically. And that's what we're going to look at in lesson five: what happens to the quality of our data when we randomize, and how we assess that quality.
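The two alternatives, and the mean calculated at the end, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are made up for the example, the seeds are arbitrary, and the income values are placeholder numbers generated only so that a mean can be computed. They are not the faculty data from the lecture, so the result will not be $78,600.

```python
import random
from statistics import mean

FRAME_SIZE = 370   # sequence numbers 1..370, as in the faculty list
SAMPLE_SIZE = 20

def draw_with_replacement(rng, frame_size, n):
    """Every draw is kept, so the same sequence number can appear more than once."""
    return [rng.randint(1, frame_size) for _ in range(n)]

def draw_without_replacement(rng, frame_size, n):
    """A number that comes up again is dropped and another draw is made."""
    if n > frame_size:
        raise ValueError("cannot select more distinct elements than the frame holds")
    selected, seen = [], set()
    while len(selected) < n:
        candidate = rng.randint(1, frame_size)
        if candidate in seen:
            continue  # already selected: discard the repeat
        seen.add(candidate)
        selected.append(candidate)
    return selected

rng = random.Random(5)  # hypothetical seed, only for repeatability

# Placeholder incomes in thousands of dollars, NOT the lecture's data.
income = {seq: round(random.Random(seq).uniform(40, 150), 1)
          for seq in range(1, FRAME_SIZE + 1)}

sample = draw_without_replacement(rng, FRAME_SIZE, SAMPLE_SIZE)
print(sample)
print(mean(income[seq] for seq in sample))  # sample mean, in thousands of dollars
```

Python's standard library also offers `random.sample` for without replacement draws and `random.choices` for with replacement draws; the loops above are written out only to mirror the keep-it-or-drop-it choice described in the lecture.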