In this part of the course, we've been talking about ways in which we can fail to make correct interpretations in our analysis. Some of these are related to how we think and how we make assumptions and inferences. But we can also get ourselves into trouble by choosing to show data and results in a way that either doesn't tell the whole story or, worse, tells the wrong story entirely. In this video, we're going to cover a few common ways in which we distort information when we show it in a certain way. The cases we'll talk about are just examples. You'll find that we can be incredibly creative at finding new ways to distort data. But these should give you a sense of what to watch out for in your own work and that of your colleagues.

The first one is what we like to call the Lie of Averages. We regularly use summary statistics such as averages, medians, and variances to describe data. In many circumstances that's an entirely appropriate thing to do, but that's not always the case. A great illustration of how summary statistics can lead us to some wrong conclusions is something called Anscombe's quartet. Consider these four sets of data; each set has eleven pairs of x and y values. It turns out that these four sets of data have almost identical statistical properties. The average of the x values is equal to 9. The average of the y values is equal to 7.50. The variance of the x values is equal to 11. The variance of the y values is between 4.122 and 4.127. The correlation between x and y is 0.816. And the equation of the linear regression line calculated on each data set is y = 3.00 + 0.500x. If all we had were these summary statistics, we'd probably feel pretty comfortable concluding that these four groups are basically the same, right? But what happens when we actually plot the data and look at it visually? Now things don't look so similar. In fact, it looks like there are four completely different patterns underlying the four data sets.
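If you'd like to check these numbers yourself, here's a short Python sketch using the published quartet values (only the standard library, nothing fancy):

```python
import statistics as st

# Anscombe's quartet: four small data sets with nearly identical summary
# statistics but very different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I-III
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = st.mean(xs), st.mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

for name, y in ys.items():
    x = x4 if name == "IV" else x123
    r = corr(x, y)
    slope = r * st.stdev(y) / st.stdev(x)        # least-squares slope
    intercept = st.mean(y) - slope * st.mean(x)
    print(f"{name:>3}: mean x={st.mean(x):.2f}  mean y={st.mean(y):.2f}  "
          f"var x={st.variance(x):.2f}  var y={st.variance(y):.3f}  "
          f"r={r:.3f}  fit: y={intercept:.2f}+{slope:.3f}x")
```

All four rows print essentially the same statistics, which is exactly the trap: nothing in this output hints that one set is a curve, one has a single extreme outlier, and so on.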
Armed with this information, we'd actually have a really hard time concluding that the groups are the same. The point here is that how data is distributed can make a huge difference in the conclusions we draw. This is particularly true in business. The data we examine in the business world often has very odd properties. We have outliers that are driven not by random events, but by real process problems. We see distributions in real data that have spikes, troughs, and gaps. And we often encounter bimodal or multimodal distributions that can be hard to describe adequately using summary statistics. Here's a real-world example using some actual purchase data. Specifically, this data shows the number of purchases in some period by price. This is an oddly shaped distribution. There are a number of peaks around common values like 39.95 and 995. There appear to be some gaps and some clusters of activity. Prices range from 0 to 479.95. How much insight would we gain simply by looking at summary statistics? Now, we don't mean to suggest that you shouldn't use averages and other statistics. Again, there are many cases where that's just fine. But it's usually not a bad idea to at least take a look at the distribution, to make sure you're not missing something. So that's what we mean by the Lie of Averages.

The second distortion we'll talk about is what I refer to as Chart Myopia. Let's say we're in a regional sales review meeting and the following data is presented. What do we see? Well, it looks like sales declined quite a bit between May and August of 2016. We should probably figure out what's going on and see if we can reverse the trend, right? Well, not so fast. What's missing? If you said scale, or the y-axis values, you're right. Let's add that and see how it changes the picture. Okay, so this adds some perspective. Sales haven't declined by half, but they've still gone down three months in a row.
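It's worth pausing to make the role of the y-axis concrete: how dramatic a change looks depends on what fraction of the plotted axis range it spans, not on the change itself. A small sketch with hypothetical sales numbers (the actual figures from the chart aren't given, so these are made up):

```python
# Hypothetical monthly sales (units), roughly flat, with a small dip.
sales = [1003, 1001, 998, 995]
drop = sales[0] - sales[-1]          # an 8-unit decline

def apparent_drop(drop, y_min, y_max):
    """Fraction of the chart's height that the decline occupies."""
    return drop / (y_max - y_min)

# On a tightly zoomed axis the dip fills over half the chart;
# on a wider axis the same dip is barely visible.
print(f"zoomed axis (990-1005):   {apparent_drop(drop, 990, 1005):.0%}")
print(f"wide axis   (0-1005):     {apparent_drop(drop, 0, 1005):.1%}")
```

The data never changes; only the denominator of that fraction does, which is why the same series can look alarming or boring depending on the axis.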
Maybe we should still figure out why that's happening, even if the decline is smaller. What do you think? Well, suppose we used the same scale, but looked at the data over a wider range of time, like this. When we look at the data as part of a longer series, we see that there's actually quite a bit of historical variation in sales. Now it's not so clear that there's a trend here at all. Maybe the decline happened purely by chance. And in any case, that shorter run of sales is more or less in line with the sales we see over a wider period. You can probably see where this is going, but let's take it a step further just to drive the point home. Let's look at the complete time series of data, but on a full y-axis scale that starts at zero. This chart tells an entirely different story. From this perspective, it looks like there's no variation in sales at all. In fact, the sales trend looks completely flat and remarkably consistent. If we had shown this view first in our meeting, we'd probably have just moved on to the next agenda item without a second thought. So the point of Chart Myopia is that if you zoom in enough on anything, you can make insignificant things look significant. Of course, the converse can also be true. As analysts, we should strive to show things in a way that puts them in perspective and shows them in the right context.

Let's move on to the third and last distortion we'll cover in this video. This one I'm going to call the Disembodied Numerator distortion. Suppose we're a healthcare system, and we're looking at the number of hospital-acquired infections that occurred at each of our hospitals over a period of time. This is considered a key measure of quality in the delivery of health care; the lower the number of infections, the better. We gather together the hospital presidents to review the data, and we start with this chart.
We immediately focus on Hospital B and their high number of infections relative to the other hospitals, especially Hospital A. We suggest that Hospital B study Hospital A to learn what they're doing to achieve better results. So, what's missing in this discussion? Well, while the number of infections is important, we need to put it in the context of how many patients were actually at the hospital. What we really should be asking is: out of how many? So let's add a little information to our chart, namely the number of patient admissions that happened over the same period of time. Now we see that while Hospital B may have had the highest number of infections, they also had the highest number of admissions. Conversely, while Hospital A may have had the lowest number of infections, they also had the lowest number of admissions. So it's not quite as clear who's really doing well and who's not. However, when we add the actual infection rate to the chart, we start to get some meaningful insights. When we look at the infection rate, we see that Hospitals B and C are actually performing best, while Hospital A is the worst performer with the highest infection rate, just the opposite of our original conclusion. This might seem like a pretty obvious example, but sometimes the impact of the Disembodied Numerator problem is not as straightforward. Here's a real example. In wireless companies, the numbers of customer acquisitions and customer defections are tracked very closely. In fact, it's not uncommon for companies to have daily reports which break down these numbers by region, customer type, or product type. And these reports are used to monitor and act on sales and retention tactics. A real company had such a report. And to avoid the Disembodied Numerator problem, they not only reported the absolute number of acquisitions and defections, but also the acquisition rate and the defection rate relative to the size of the respective group.
Again, by region, customer type, and product type. So periodically, leadership would observe a spike in defection rate and would quickly move to understand what happened and fix it. This all sounds very reasonable, except for one small nuance. The company in question had most customers on two-year contracts. This meant that if a customer cancelled before the two years were up, they'd pay a significant penalty. So when do you think customers tended to cancel? If you said right after their two years were up, you're right. In fact, if we were to look at monthly cancellation rates by customer tenure in months, we'd find something like this. Note the huge spike that happens right after the 24-month point. Now let's think about our daily sales and cancellation report. Suppose we recently had a really big promotion that brought in a larger-than-normal number of new customers. What would we expect to see in about two years? That's right, we'd expect to see a larger-than-normal number of cancellations. It turns out that while our company thought they had accounted for the Disembodied Numerator distortion, they really hadn't. We regularly discovered that the causes of spikes in cancellation rates were actually spikes in acquisitions that had happened two years earlier. And more importantly, this meant there was really no problem to solve. The company would have been better served looking at cancellations by tenure rather than cancellations by region, customer type, or product. Using the wrong denominator can be just as bad as having no denominator at all. The takeaway here is pretty simple: always ask, out of how many? And try to include rates whenever possible to present a clearer and more complete picture of what's really going on. As we said at the start of the video, we can be really creative in how we distort data. However, we hope that by calling your attention to a few common distortions, you'll be able to avoid some presentation pitfalls. Let's quickly review what we covered.
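The contract-expiration effect just described can be sketched as a toy cohort model. All numbers here are hypothetical, and to keep it short the model doesn't deplete cohorts as customers cancel:

```python
# Toy cohort model: a flat 1% of each monthly cohort cancels in any given
# month, except at tenure 25 (just after the 24-month contract ends),
# when 30% cancel. (Hypothetical rates; cohort depletion is ignored.)
acquisitions = [1000] * 36
acquisitions[0] = 5000        # a big promotion in month 0

def cancellations(month):
    """Total cancellations in a given calendar month across all cohorts."""
    total = 0.0
    for start_month, cohort_size in enumerate(acquisitions[: month + 1]):
        tenure = month - start_month
        rate = 0.30 if tenure == 25 else 0.01
        total += cohort_size * rate
    return total

# The "mystery" cancellation spike in month 25 is just the month-0
# promotion reaching the end of its contract, not a retention problem.
print(f"month 24: {cancellations(24):.0f} cancellations")
print(f"month 25: {cancellations(25):.0f} cancellations")
```

Sliced by region or product, this spike would look like a sudden retention failure; sliced by tenure, it's the expected echo of an old promotion.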
We learned about the Lie of Averages, or how using only summary statistics can misrepresent the underlying nuances of our data. We discussed Chart Myopia, or how our data can tell the wrong story if we focus our visualizations too narrowly. And we explored the Disembodied Numerator problem, where counts by themselves don't tell the whole story unless we ask, out of how many. We hope you enjoyed the video, and we'll see you next time.