We previously mentioned data reduction as our objective when we're trying to summarize a very large dataset in simpler terms. Now, with data visualization through things like frequency tables, bar charts, and histograms, among others, the goal is really to summarize the entire shape of the sample distribution visually. But that was looking at the whole distribution. Let's imagine now that we wish to capture one key feature, one key attribute, of those distributions in a single numerical value. There are different ways we can achieve this. Here, we're going to consider so-called measures of central tendency, sometimes called measures of location. We're going to consider the mean, the median, and the mode. All three have their relative advantages and disadvantages, but each is designed to give a representative summary of where the data is centered, what its main location is.
Now, we need to introduce some statistical notation. Up to this point, we've considered a selection of different variables, things like age, maybe height, maybe weight. What we'd like to do now, in order to introduce some formulae, is to use letters of the alphabet to represent different variables. We could use any letter we choose, but a conventional choice is the letter X. Of course, for now, we're just focusing on one variable at a time. We're doing what we call univariate statistics, uni meaning one. If we wanted to extend this to multivariate statistics and consider several different variables simultaneously, then it would be wise to use different letters to represent those different variables. But let's keep it simple and stick with X. When we have a sample dataset, we have a number of observations, the sample size. The conventional notation for the sample size is the lowercase letter n. So we have our variable X, for example income, age, height or weight, and n observations on it.
But now, if we wanted to denote the observations individually, we would attach a numerical subscript to our X to represent the different observations. For example, X1 would be the first observed value, X2 the second, and we would continue up to Xn, the nth, i.e. the last, value in our dataset. Now note, these simply represent the order in which the observations were collected, not necessarily from smallest to largest, because we may not observe the values in ascending order.
Now that we have our notation, how would we calculate the so-called sample mean? Well, for this, we need to introduce a special symbol, the summation operator. This is denoted by the Greek capital letter Sigma. Just think of it as a form of notational shorthand: the summation operator simply represents adding things up. Which things? The things which follow the summation operator. If, for example, we wanted notation to represent the sum of all of our observed values, we would write the sum of the Xi's. This i is our index of summation and, to make it clear how many values need to be added up, the summation operator carries the limits of this index, namely the minimum and maximum values for i. If we wanted to sum across all of the observed values in our dataset, underneath the summation operator we would write i = 1, indicating the starting point for the index, and above the summation operator we would put the limiting value for i. So, if we wanted to sum across all n observations, then n would appear at the top. The sum of the Xi's, where i goes from 1 to n, is simply notational shorthand which says: add up all of those observations from the first up to the last, i.e. the nth, in our sample of size n.
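Written out on the page, the summation just described would look something like the following. Lowercase x with a subscript is used for the observed values, which is a common convention but an assumption here, since the lecture only describes the notation in words.

```latex
\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n
```

The i = 1 underneath is the starting point of the index and the n on top is its limiting value.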
Now that you are armed with this, we are able to come up with our first formal formula of the course, namely for calculating the sample mean. I'm sure many of you covered this concept back at high school, and you may remember the definition of the sample mean in qualitative terms, i.e. add up all of the observations and divide by the number of observations. Well, as statisticians, we tend to dislike words. They can be quite cumbersome, and we can often represent the same words in a much simpler notational format. If we take the sum of the Xi's and divide it by n, this is how we calculate the sample mean. And because the sample mean is arguably our most useful descriptive or summary statistic, we attach to it its own special notation, namely an X with a bar above the top, which we read as X bar.
If we took a very simple dataset, let's say of three observations, 1, 4 and 7, and we wanted to calculate the sample mean, we would add up those observations and divide by the sample size of 3. Here the sum of the Xi's, with i going from 1, the first observation, up to 3, the third, is 1 plus 4 plus 7, which is 12, and dividing by n, which is equal to 3, gives us a value of 4. That would be our sample mean. Now, I admit it is a very simplistic example, but we see here data reduction in action. We've taken three values, the 1, the 4 and the 7, and we have reduced them to a single value, the mean, which is being used as a measure of central tendency.
So, in what sense is the mean a good measure of central tendency? I mentioned the median and mode, still to be formally introduced, which indicates that there is more than one measure of central tendency. So, what is so good about the mean?
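The calculation with 1, 4 and 7 can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` is made up for this example and is not something the course defines.

```python
def sample_mean(observations):
    """Sum of the observations divided by the sample size n."""
    n = len(observations)
    if n == 0:
        # The mean is undefined when there is nothing to average.
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(observations) / n


x = [1, 4, 7]            # the three observations from the example
x_bar = sample_mean(x)   # (1 + 4 + 7) / 3
print(x_bar)             # 4.0
```

Python's standard library also provides `statistics.mean`, which does the same job.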
Well, if we now consider deviations of the observations around the mean, a convenient property of X bar is that the sum of the deviations about the mean is equal to zero. In this sense, the sample mean X bar does provide a very good summary of the dataset, because the positive and negative deviations about the mean cancel each other out when we sum across all of the observations. With our 1, 4 and 7, for example, the deviations from the mean of 4 are minus 3, 0 and plus 3, which sum to zero. Now, that may not seem of particularly great interest to you at this stage but, trust me, this is a very useful result indeed.
So, we have our sample mean. Now, briefly, let's consider the others: the median and the mode. The median is the midpoint when we have arranged our observations in order from smallest to largest. Remember that notation of X1 for the first observation, X2 for the second, up to Xn for the nth. Well, suppose now we wish to rearrange our dataset in ascending order, such that we begin with the smallest observed value, which is not necessarily the first value we observed, and go up to the largest, the maximum value in our dataset. To distinguish the unordered observations from the ordered ones, we can introduce a slightly distinct form of notation for the observations arranged in ascending order. This gives rise to the so-called order statistics. My way of denoting these is to put small parentheses, small brackets, around the subscripts. So X(1), with the 1 in parentheses, would represent the smallest observed value, which we may also call the minimum value of X, up to X(n), with the n in parentheses, being the largest observed value, or the maximum value of X. The median is simply the midpoint of this ordered dataset.
Now, whether we have an odd or even number of observations will affect whether or not there is an explicit midpoint to our dataset. If we had an odd number of observations, one can locate an exact midpoint and there's an unambiguous value for the median, i.e. one of our observed values. If, on the other hand, we had an even number of observations, there's not an explicit midpoint that we observe. What we tend to do is take a simple average, and we know what the average is: add up the observations and divide by the number of observations. But here, we would simply take an average of the two values either side of this hypothesized midpoint. So, it's effectively a mean, but just of two values.
The mean and the median, why might we prefer one to the other? Well, the mean, although it's very widely used, has a slight disadvantage in that it is sensitive to any extreme observations in our dataset. An alternative name for an extreme observation is an outlier. There are different ways we can formally define an outlier, but for now just think of it as a very extreme observation relative to everything else we observed. Remember, X bar is the sum of the Xi's over n, and of course that numerator term, the sum of the data, is going to be affected by any single extreme observation which exists. The median, by contrast, depends only on the middle of the ordered data, so a single extreme value has little or no effect on it. So we have the mean and median as two examples of averages, the mean being sensitive to outliers, the median not.
Perhaps an interesting exercise for you is to look at the mainstream media and try to find some articles which look at income levels within a country, maybe trying to say something about the level of income inequality. See whether the reporter has opted to use the mean or the median as the average measure of income. I would argue that a good reporter would opt for the median over the mean because, if we consider more capitalist-based economies, you would think about the highest earners in society. There may not be very many of them, but they're going to be earning vast salaries and have very high incomes.
And, potentially, their incorporation in the mean as an average of income could be somewhat distorting. Because there are very few of them, they are not very representative of society as a whole, and so we often tend to use the median income figure in such reports.
I would just conclude with our third measure of central tendency, the mode. We're only going to pay brief lip service to this one because it really is of less importance relative to the mean and the median. But remember that builder-decorator analogy: different tools for different jobs. We've seen with data visualization that there are different types of data display which are appropriate depending on the levels of measurement as well as the number of variables present. Similarly, our mean, median and mode measures of central tendency each tend to work best depending on the level of measurement we face. The mean and the median, setting aside their sensitivity or otherwise to outliers, are particularly good when trying to summarize measurable variables. The mode is quite useful if we are trying to find the most frequently occurring level of a categorical variable, because the definition of the mode is the most frequently occurring value within our dataset. So, if you consider measurable variables, which could potentially take many different values within the dataset, the mode is really of limited use. But when dealing with categorical variables, for example, which is the most common nationality in a body of students, with nationality being a nominal categorical variable, the mode provides a useful summary, where we can see that one nationality occurs more frequently in the student body than the others.
So, I appreciate that in wider statistical analysis we can often be faced with a choice of different things to do. For example, should we use the mean? Should we use the median?
Should we use the mode? Often there may never be a single right answer to this; there will always be tradeoffs involved. There will be relative advantages and disadvantages to doing one thing or another. Perhaps a great skill I'd like you to take away from this entire course is to recognize that often there is no single right answer, but it is good for you to be able to evaluate the relative pros and cons, the relative merits and limitations, and to make a judgment call about what is the best thing to do. For example, if you're looking at income, should you summarize it with the mean or perhaps the median? Well, the choice is yours.
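To make the tradeoffs a little more concrete, here is a minimal Python sketch of all three measures. Everything in it is an assumption made for illustration: the function names are invented, and the income figures and nationalities are made-up toy data, not real statistics.

```python
from collections import Counter


def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Midpoint of the ordered data x(1) <= ... <= x(n)."""
    ordered = sorted(xs)              # the order statistics
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: an observed value
    # even n: average the two values either side of the midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2


def sample_mode(xs):
    """Most frequently occurring value (ties go to the first one seen)."""
    return Counter(xs).most_common(1)[0][0]


# Deviations about the mean sum to zero (the 1, 4, 7 example).
x = [1, 4, 7]
deviations = [xi - sample_mean(x) for xi in x]
print(sum(deviations))                # 0.0

# Hypothetical incomes, in thousands; then add one very high earner.
incomes = [20, 22, 25, 27, 30]
with_outlier = incomes + [1000]
print(sample_mean(incomes), sample_median(incomes))   # 24.8 25
print(sample_median(with_outlier))                    # 26.0
print(round(sample_mean(with_outlier), 1))            # 187.3

# Hypothetical nominal categorical variable: the mode is the natural summary.
nationalities = ["British", "French", "British", "Chinese", "British", "French"]
print(sample_mode(nationalities))     # British
```

With the single extreme income added, the median barely moves from 25 to 26, while the mean jumps from 24.8 to roughly 187, which is the sensitivity to outliers described earlier.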