To describe the distribution of a quantitative variable, you also need precise numerical descriptions of the center and spread. >> The mode is a kind of average. There are three kinds of average, and each one tells us something different, so we need to make sure we understand what each average means. >> When we use the term average, we usually mean one of three things: the mean average, the modal average, or the median average. It's very easy to understand the difference between these, especially if you've played darts before. After two lots of three darts, my six darts scored a two, three threes, a 12 and a 13. Now let's see if we can work out the mean, the median and the mode. First of all, the mean. We take the total of all six scores and divide by the number of observations, and that's the mean. If we want the modal score, we simply look for the most common score, the value with the greatest number of observations. If we want the median score, we write the scores down in ascending order and then look for the middle value. There's a slight problem here in that we have an even number of observations, so we take the two middle values and work out the mean average of those two. So for my not very good dart playing, the scores were 2, 3, 3, 3, 12, 13. The mean is 2+3+3+3+12+13 divided by 6, 36/6 = 6. The mode is 3. The median, since we have an even number of observations, is 3 + 3, the middle two observations, divided by 2, which equals 3. Notice that if the dart player had scored, say, 19 instead of 13, the mean increases to 7, but the mode and the median are unchanged. >> So let's briefly review numerical measures of center. Intuitively speaking, a numerical measure of center tells us what a typical value of a variable's distribution is. The three main numerical measures of the center of a distribution are the mode, the median, and the mean.
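As a quick check of the darts arithmetic, here is a minimal sketch using Python's standard `statistics` module. The variable names are just illustrative; the scores are the six quoted in the lecture.

```python
from statistics import mean, median, mode

# The six dart scores from the example
scores = [2, 3, 3, 3, 12, 13]

mean_score = mean(scores)      # total of the scores divided by how many there are
median_score = median(scores)  # even count, so the two middle values are averaged
mode_score = mode(scores)      # the most common score

print(mean_score, median_score, mode_score)

# Replace the 13 with a 19: the mean moves, the median and mode do not
scores_19 = [2, 3, 3, 3, 12, 19]
print(mean(scores_19), median(scores_19), mode(scores_19))
```

The second print illustrates the point made at the end of the example: a single extreme score pulls the mean upward while leaving the median and mode where they were.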
So far, when we looked at the shape of the distribution, we identified the mode as the value where the distribution has a peak. And we saw examples where distributions have one mode, that is a unimodal distribution, or two modes, a bimodal distribution. In other words, so far we identified the mode visually from the histogram. Looking at our histograms again, we can easily see the mode. It's the most commonly occurring value in the distribution. The median, that is the midpoint of the distribution, is the number such that half of the observations fall above it and half fall below. We find the median by ordering the data from the smallest to the largest. Then consider whether N, the number of observations, is even or odd. If N is odd, the median is the center observation in the ordered list. When the number of observations is even, the median is the mean, or average, of the two center observations. The mean, of course, can be calculated by adding up the values for all the observations and dividing by the number of observations. Our goal here is to describe the distribution. How would you describe these two distributions of exam scores? Both distributions are centered at 70. The mean of both distributions is approximately 70. But the distributions are really quite different. The first distribution has much larger variability in scores compared to the second one. In order to describe a distribution, we need to supplement the graphical display not only with a measure of center, but also with a measure of the variability or spread of the distribution. >> There are several ways to describe spread. A commonly used measure is standard deviation. The idea behind the standard deviation is to quantify the spread of the distribution by measuring how far the observations are from their mean. The standard deviation gives the average or typical distance between a data point and the mean.
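To make the exam score comparison concrete, here is a small sketch with two made-up sets of scores. The numbers are hypothetical, not the data behind the lecture's histograms; they are chosen only so that both sets are centered at 70 while one is far more spread out.

```python
from statistics import mean, stdev

# Hypothetical exam scores (not the lecture's data): same center, different spread
wide_scores = [40, 55, 70, 85, 100]
narrow_scores = [66, 68, 70, 72, 74]

print(mean(wide_scores), mean(narrow_scores))    # both centers are 70
print(stdev(wide_scores), stdev(narrow_scores))  # the first spread is much larger
```

A measure of center alone cannot tell these two sets apart; the standard deviation can.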
In order to better understand standard deviation, it would be useful to see an example of how it's calculated. In practice, of course, the software will be doing these calculations for us. >> Emergency medical services companies would like to estimate how many ambulance crews to keep on standby. Here is the number of ambulance calls in each hour of an eight hour period. To find the standard deviation of the number of hourly calls, first we would find the mean of our data. Next we would need to find the deviations from the mean, that is, the difference between each observation and the mean. Since our mean is 9, we would subtract 9 from each of our observations. As a third step, we would square each of these deviations. Next we average the squared deviations by adding them up and then dividing by N-1, that is, one less than the sample size. This average of the squared deviations is called the variance. The standard deviation of your variable is the square root of this variance. >> So why do we take the square root? Note that the variance, 16 in this example, is the average of the squared deviations and therefore has different units of measurement. In this case, 16 is measured in squared number of ambulance calls, which obviously cannot be interpreted. We therefore take the square root in order to compensate for the fact that we've squared all of our deviations, and also in order to go back to the original unit of measurement. Recall that the average number of emergency calls in an hour is 9. The interpretation of a standard deviation equal to 4 is that, on average, the actual number of emergency calls each hour is 4 away from 9. Another way of saying this is that there's an average of 9 ambulance calls in each hour, plus or minus 4. >> Since we're working with very large numbers of observations, hand calculations of standard deviation really aren't feasible.
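The steps just described can be sketched in a few lines of Python. The hourly counts themselves are not given in this transcript, so the list below is hypothetical data, chosen only so that the mean comes out to 9 and the variance to 16, matching the figures quoted in the lecture.

```python
import math

# Hypothetical hourly call counts for an eight hour period (the actual data
# are not in the transcript); picked to give mean 9 and variance 16.
calls = [3, 5, 7, 9, 9, 11, 13, 15]

n = len(calls)
mean_calls = sum(calls) / n                    # step 1: the mean
deviations = [x - mean_calls for x in calls]   # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]         # step 3: square each deviation
variance = sum(squared) / (n - 1)              # step 4: divide the total by N-1
std_dev = math.sqrt(variance)                  # step 5: square root of the variance

print(mean_calls, variance, std_dev)
```

Dividing by N-1 rather than N gives the sample variance, which is the version described in the lecture.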
Python will do all of these calculations for you, but it's important to know how to calculate standard deviations so you can make sense of your variability. For example, looking at a variable's distribution in two different samples, you should be able to tell which has greater variability, that is, a larger standard deviation. To calculate the standard deviation and generate other descriptive statistics for a quantitative variable, we often use the describe function from Python's pandas library. Here is the syntax for describing NUMCIGMO_EST as the quantitative variable. desc1, which is the name that I am giving to the object that will store these calculations, is set equal to NUMCIGMO_EST from our sub2 data frame, dot describe, followed by empty parentheses. I also title my output and ask Python to print the results. This provides a count, mean, standard deviation, minimum and maximum values, and the 25th, 50th and 75th percentile values. So you can see that describe is extremely useful in better understanding important characteristics of this cigarettes smoked per month variable. We now know that young adult smokers in our sample smoke on average 320 cigarettes a month. Given that the standard deviation is about 274, we can say that on average young adult smokers smoked 320 cigarettes per month, plus or minus 274 cigarettes. So as you can see, there's an extremely large range in terms of cigarettes smoked, and a lot of variability on this variable. Very similar code can be used to calculate many of these statistics individually or to generate additional descriptive statistics. Here's additional code for generating the mean, standard deviation, minimum and maximum, median, and mode of a quantitative variable. Note that the count for this variable is 1,697 rather than the size of our sample of young adult smokers, which was 1,706. This is because Python does not include those cases with missing or NaN data in these calculations. But what if we include a categorical variable when employing the describe function?
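The syntax read out above might look roughly like the sketch below. The on-screen code is not part of this transcript, so treat this as an approximation: the small `sub2` data frame here is a hypothetical stand-in with made-up values, and names such as `mean1` and `median1` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's sub2 data frame (values are made up);
# one case is missing so the effect on the count is visible.
sub2 = pd.DataFrame({"NUMCIGMO_EST": [30, 600, 150, float("nan"), 300]})

# Count, mean, standard deviation, min, max and quartiles in one call
desc1 = sub2["NUMCIGMO_EST"].describe()
print("describe number of cigarettes smoked per month")
print(desc1)

# The same kinds of statistics, requested one at a time
col = sub2["NUMCIGMO_EST"]
mean1 = col.mean()
std1 = col.std()
min1, max1 = col.min(), col.max()
median1 = col.median()
mode1 = col.mode()   # returns a Series, since there can be more than one mode
print(mean1, std1, min1, max1, median1)
print(mode1)
```

With five rows and one missing value, the count reported by describe is 4, which mirrors the gap between 1,697 and 1,706 noted above.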
Because we have previously defined TAB12MDX, our nicotine dependence variable, as categorical, adding describe syntax provides us with descriptive statistics appropriate for categorical data. That is, the count, the number of unique values, the top value (the most frequently occurring one), and the frequency of that top value. If you had failed to define this variable as categorical, Python would still generate descriptive statistics. However, many would not make any sense. If you'll recall, the nicotine dependence variable is represented with dummy codes. That is, yes is indicated with a 1 and no is indicated with a 0. As you can see here, we've got a standard deviation based on dummy codes of 1 and 0. Further, percentiles are listed representing yeses and nos rather than actual quantities. So again, it's very important to remember to use the appropriate descriptive statistics for both quantitative and categorical variables. For quantitative variables, it's best to examine histograms and then to supplement these with exact measures of shape, center, and spread. Categorical variables can often be described well with frequency distributions or with a bar chart.
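The contrast between the two kinds of describe output can be sketched as follows. This assumes the variable was made categorical with pandas' `astype("category")`, which is one common way to do it; the transcript does not show how it was actually defined, and the data below are hypothetical dummy codes.

```python
import pandas as pd

# Hypothetical dummy-coded nicotine dependence values (1 = yes, 0 = no)
sub2 = pd.DataFrame({"TAB12MDX": [0, 0, 1, 0, 1, 0]})

# Treated as categorical: count, unique, top (most frequent value), freq
desc_cat = sub2["TAB12MDX"].astype("category").describe()
print(desc_cat)

# Left as numeric: mean, std and percentiles computed on the 0/1 codes
desc_num = sub2["TAB12MDX"].describe()
print(desc_num)
```

The numeric version runs without complaint, which is exactly why it is easy to end up with a standard deviation and percentiles that are hard to interpret for a yes/no variable.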