So, for the final section of this module, we're going to work on understanding basic statistical tests. It's important to realize that large association sizes do not necessarily mean meaningful associations. Consider an example: I flip a coin three times and get three heads, but my office mate flips the same coin three times and gets one head. So, I'm three times more likely to get a head than my office mate; the relative risk is three. That's pretty big. It also suggests a difference of two heads per three coin flips, or about 67 more heads than my office mate over 100 coin flips. Is this strong evidence that I'm better at flipping heads than my office mate? I don't think so, right? I only got three heads in a row, and he got one head and two tails; that seems like it could happen by chance.

So, to help us make this call, whether or not this is strong evidence, in a formal, systematic way, we want to test the statistical significance. Perhaps the most common measure of statistical significance you'll see out there are p-values. p-values measure the strength of evidence against no association, the null hypothesis, and this is important: we're not saying how big the association is, and we're not saying what direction it is in. All we're saying is that this is the evidence we have against there being no association at all. Precisely, the p-value represents the probability of seeing an effect as large as observed, or greater, if we repeated the experiment a large number of times and there was no true association. That is, if the no-association hypothesis were true and we repeated the experiment over and over and over again, the p-value tells us how often we would see a result as extreme as the one we saw, or more extreme. Typically, values less than 0.05 are considered to be significant.
That is, we call a result significant when we would expect to see a result that large or larger only once in every 20 times we did the experiment. But it's important to note that this is not a yes-or-no decision. It's often presented as though 0.05 were a bright line: on one side of it you're significant, on the other side you're not. Really, different p-values represent different strengths of evidence, and at different times, or for different tasks, we might want different strengths of evidence.

So, one common way of calculating p-values is the Chi-square test, and this is particularly useful because it can be used to test tabular data and tell us if there is a significant relationship in that data. The Chi-square test tells us how different tabulated data is than is expected by chance. It looks at the difference between what we see in the individual cells and what the marginal totals would predict, and this will become clear in a little bit when I talk through an example. It's a good test to know about because it's implemented in Microsoft Excel and many other commonly used office spreadsheets, as well as all statistical software. It does give evidence that a significant association exists but, as I said before, not the direction of that association.

So, let's go through an example to understand this Chi-square test. The table on the right shows the expected values for sick men and women. It shows that out of 50 men, we expect 15 to be sick and 35 to be not sick, and likewise for the 50 women, 15 to be sick and 35 to be not sick. This is based on the fact that we have 50 men, 50 women, 30 sick people, and 70 not sick people. Now suppose we have a table where those inner cells have values close to these expected numbers: 14 sick men instead of 15, 36 not sick men instead of 35, 17 sick women instead of 15, and 33 not sick women instead of 35. With values this similar to the expectations, we get a p-value that's big, here 0.66, so this is not significant.
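As a sketch of how this might be checked in software (assuming Python with scipy is available; the transcript itself only mentions Excel and online calculators), the table just described can be tested in a few lines. Note that scipy applies Yates' continuity correction to 2x2 tables by default, which is why the p-value lands near the 0.66 quoted above:

```python
# Sketch: chi-square test on the observed 2x2 table of sick / not sick
# by sex described above (assumes Python with scipy installed).
from scipy.stats import chi2_contingency

observed = [[14, 36],   # men:   sick, not sick
            [17, 33]]   # women: sick, not sick

# correction=True (the default) applies Yates' continuity correction,
# which is standard practice for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(observed)

print(expected)       # expectations computed from the table's own margins
print(round(p, 2))    # a large p-value, nowhere near 0.05: not significant
```

The `expected` array scipy reports is built from the observed table's own marginal totals, which is the same logic as the expected-value table on the slide.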
Now, imagine those values were very, very different from what we expected based on those marginal totals, the 50 men and 50 women and 30 sick and 70 not sick people. Say we have five sick men and 45 not sick men, and 25 sick women and 25 not sick women. You can see there are very different rates of being sick in the men and women. For this test, the p-value is less than 0.0001, and we say that this is a very highly significant result.

So, let's subject some of John Snow's work to this Chi-square test. It's important to remember John Snow didn't have these statistical tools when he conducted his investigation, but if he had, he would have seen a very significant relationship. Here we're comparing people who lived in districts served by Lambeth and Southwark and Vauxhall, those who lived in districts served by Southwark and Vauxhall only, and those who lived in districts served by Southwark and Vauxhall and Kent. If we run the Chi-square test on this table, we get a p-value of 0.0002, suggesting there is strong evidence for a significant relationship between the profile of district water providers and cholera.

So, hopefully going through the Chi-square test in detail gives you a sense of what p-values mean. There are a lot of other common sources of p-values, used to compare different types of data. Perhaps the most common way to calculate a p-value other than the Chi-square test is the t-test, which tests the difference in means between two groups. There are one-tailed (one-sided) and two-tailed (two-sided) versions of the t-test; the two-tailed version is more conservative, and it's what most journals require. Analysis of variance, or ANOVA, is typically used to evaluate the results of a designed statistical experiment, so you'll find it a lot when you're looking at the laboratory literature and laboratory results.
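As a sketch of the t-test just mentioned (assuming Python with scipy, and entirely made-up measurement data), the `alternative` argument is what switches between the two-sided and one-sided versions:

```python
# Sketch: comparing the means of two groups with a t-test.
# The data here are invented purely for illustration.
from scipy.stats import ttest_ind

group_a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.7]
group_b = [4.2, 4.8, 4.5, 5.0, 4.1, 4.7, 4.4, 4.9]

# Two-sided (two-tailed) test: is there a difference in either direction?
# This is the more conservative version most journals require.
t_two, p_two = ttest_ind(group_a, group_b)

# One-sided (one-tailed) test: is group_a's mean specifically greater?
t_one, p_one = ttest_ind(group_a, group_b, alternative="greater")

# When the observed difference is in the hypothesized direction, the
# one-sided p-value is half the two-sided one, so it is easier to reach
# significance with it; that is exactly why it is less conservative.
print(p_two, p_one)
```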
More and more these days, regression models that look at complex relationships, or sometimes simple relationships, between given variables and observations produce p-values for the different effect estimates they create. The Mann-Whitney U test produces p-values for comparing groups without actually calculating their means, so it allows us to look at differences between groups, but it's a little more subtle than the t-test because it's not based on a comparison of means. Overall, there are as many ways to calculate p-values as there are ways to analyze data.

So, p-values have recently become controversial, with many people saying they shouldn't be used at all, but I want to reiterate that p-values are not inherently bad; they're just misused. Because of this misuse, some journals no longer allow them, and many people are discouraging their use altogether. But they're still useful tests, and we should keep a few pointers in mind to avoid misusing them.

First, there's no magical value that makes something significant. The 0.05 value you see all the time is just a rule of thumb; there is nothing necessarily special about it. Second, in most statistical tests, p-values are only comparing against an alternative of no association; that is, they don't compare against another hypothesis. Sometimes we see a significant p-value and take it to mean that this hypothesis is true compared to all the other hypotheses out there that would have some different association, but that's not the case. It's just saying that this hypothesis has more evidence behind it than there being no association at all. Third, just because something is significant doesn't mean it's meaningful. Suppose somebody came in with a study showing a significant result suggesting a one-in-one-million increased chance of dying over the entire course of your life if you drink coffee with cream every day. Well, this is a statistically significant result.
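The rank-based character of the Mann-Whitney U test described above can be sketched with a small example (assuming Python with scipy; the data, including the deliberate outlier, are made up):

```python
# Sketch: Mann-Whitney U test, which compares two groups by ranks
# rather than by their means, so it is robust to skew and outliers.
from scipy.stats import mannwhitneyu

# Made-up data: the single extreme value in group_b would distort a
# comparison of means, but barely affects a rank-based comparison.
group_a = [3.1, 2.8, 3.5, 3.0, 2.9, 3.3, 3.2]
group_b = [2.0, 2.2, 1.9, 2.4, 2.1, 2.3, 50.0]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p_value)  # a small p-value despite the wild outlier
```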
The association is there, but is a one-in-a-million increase in the chance of death big enough that we'd actually consider giving up our coffee with cream because of the association? Maybe not. Finally, and perhaps most importantly, if you do an experiment enough times, you'll eventually get a significant result. So, if you're reporting p-values, you have to decide how many times you're going to do the experiment or study, do it, and report your p-values. You can't keep repeating the experiment or study until you get a significant result, because just by the laws of random chance, you eventually will.

So, an alternative way to measure the strength of a statistical association is to calculate confidence intervals. Confidence intervals give a measure of the likely range of values some measure of association might take, based on our data. This improves over the p-value by giving us a range of supported values, not just a binary answer about significance. It doesn't just say, "Oh, there's probably an association versus no association." It says, "There's probably an association, and it lies within this range." Confidence intervals are most commonly reported as 95 percent confidence intervals, but just like with p-values, you could pick different levels of confidence based on your purpose and how confident you want to be.

The appropriate interpretation of confidence intervals is very subtle, and even very experienced epidemiologists and statisticians sometimes get it wrong. Formally, a confidence interval is an interval that, if it was calculated the same way in multiple experiments or studies, would cover the true value of the parameter 95 percent of the time. In practice, though, it can be thought of as giving the range of values that are not significantly different from our estimate at the 0.05 level, if our confidence interval is a 95 percent confidence interval.
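As a concrete sketch of constructing such an interval (assuming Python; the counts are hypothetical, and the method shown is the standard normal approximation, where 1.96 is the z-value that leaves 2.5 percent in each tail):

```python
# Sketch: 95% confidence interval for a risk difference between two
# groups, using the normal approximation. Counts are made up.
import math

def risk_difference_ci(sick_a, total_a, sick_b, total_b, z=1.96):
    """Return (risk difference, lower bound, upper bound)."""
    p_a = sick_a / total_a
    p_b = sick_b / total_b
    rd = p_a - p_b
    # Standard error of a difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    return rd, rd - z * se, rd + z * se

# Hypothetical example: 25 of 50 sick in one group, 5 of 50 in the other.
rd, lo, hi = risk_difference_ci(25, 50, 5, 50)
print(rd, lo, hi)
# If the whole interval excludes zero, the difference is significant at
# the 0.05 level, matching the p-value interpretation described above.
```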
In other words, it tells us the range of values we can't confidently exclude. So, here's an illustration to help you think a little more about the difference between p-values and confidence intervals. If we see a p-value of 0.05, our common threshold for statistical significance, for a risk difference, say, it is telling us that we are 95 percent confident that the true value lies somewhere in these shaded regions away from a risk difference of zero. That is, our p-value has tested for the true value being no difference, and we're 95 percent confident that that's not the true value. In contrast, our 95 percent confidence interval identifies the range of values where our data support the true parameter value likely being; if that range doesn't overlap zero, the result is significantly different from zero.

So, before we wrap up, let's bring everything we've done in this module together. Think about our specific hypothesis from John Snow's study: that people who get their water from the less contaminated part of the Thames upstream of London are less likely to die from cholera than those who get their water from the more contaminated part of the Thames downstream of London. Our estimates of relative risk and risk difference strongly suggest this hypothesis is true. The relative risk is 8.4 and the risk difference is 2.8 deaths per 100. But as I said, those alone don't tell us the strength of the evidence; they just tell us the size of the difference. So, we can use the confidence intervals we calculated before, which showed that this is a statistically significant and meaningful result. The 95 percent confidence interval on the relative risk of death for those who get their water from downstream of London is 6.8 to 10.3, so it's nowhere near one.
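A confidence interval for a relative risk is usually built on the log scale, since a ratio is bounded below by zero. Here is a sketch (Python, with hypothetical counts; this is the standard log-transform method, and not necessarily exactly how the 6.8 to 10.3 interval above was computed):

```python
# Sketch: 95% CI for a relative risk via the log-transform method.
# The counts below are hypothetical, for illustration only.
import math

def relative_risk_ci(a, n1, c, n2, z=1.96):
    """a sick of n1 in the exposed group, c sick of n2 unexposed."""
    rr = (a / n1) / (c / n2)
    # Standard error of log(RR).
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lo = math.exp(math.log(rr) - z * se_log)
    hi = math.exp(math.log(rr) + z * se_log)
    return rr, lo, hi

rr, lo, hi = relative_risk_ci(30, 1000, 10, 1000)
print(rr, lo, hi)
# An interval that stays well away from 1 (no difference in risk) is the
# ratio-scale analogue of a difference interval that excludes 0.
```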
So, there's almost definitely an effect there. And the 95 percent confidence interval on the risk difference is 2.6 to 3.0 deaths per 100 people, so we're 95 percent confident that the increased number of deaths per 100 among people who get their water from downstream of London is between 2.6 and 3.0.

To go over some of the key points from this section: large measures of association are not necessarily strong evidence of a true association, and some small measures of association are strong evidence of a true association. Statistical tests let us test associations against a null hypothesis of no association. Low p-values support that some association exists, but there's no magic number that "proves" the association exists. Confidence intervals provide more insight into the range of associations that are consistent with the evidence, but they can still be misused. Very importantly, statistical tests measure associations in the data, and the evidence gets stronger as there is more data. So, if there's a systematic bias, for instance if we selected the wrong comparison group, more data will just give us stronger evidence for the wrong conclusion.

As an exercise, use Excel or an online Chi-squared calculator to calculate the p-values for the results of me betting my friend, John, that I can flip more heads than him. Think about the situation where I flip four heads out of five tries and John flips three heads out of five tries, or one where I flip nine heads out of 10 tries and John flips four heads out of nine, or one where I flip 18 heads out of 20 tries and John flips 10 heads out of 20. Would any of these give you strong enough evidence of me flipping more heads to accuse me of cheating? Some online Chi-squared test calculators you might try are listed here.

To summarize this module: a precise hypothesis is the key to testing an epidemiologic idea. Testing precise hypotheses requires comparisons between appropriate groups.
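If you'd rather script the exercise above than use a spreadsheet, here is a sketch (assuming Python with scipy; scipy applies Yates' continuity correction to 2x2 tables by default, so an online calculator without that correction may report somewhat smaller p-values):

```python
# Sketch: chi-square p-values for the three coin-flip scenarios in the
# exercise. Each table is [[my heads, my tails], [John's heads, John's tails]].
from scipy.stats import chi2_contingency

scenarios = {
    "4/5 vs 3/5":     [[4, 1], [3, 2]],
    "9/10 vs 4/9":    [[9, 1], [4, 5]],
    "18/20 vs 10/20": [[18, 2], [10, 10]],
}

pvalues = {}
for label, table in scenarios.items():
    chi2, p, dof, expected = chi2_contingency(table)  # Yates-corrected
    pvalues[label] = p
    print(label, round(p, 3))

# Caveat: small expected cell counts (as in the first two tables) strain
# the chi-square approximation; an exact test such as
# scipy.stats.fisher_exact is often preferred in that situation.
```

Notice how the p-value shrinks as the same lopsided proportion is observed over more and more flips, which is the "stronger with more data" point from the key takeaways.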
Measures of relative risk and risk difference can help us quantify associations, and statistical tests help us determine whether those associations provide meaningful evidence for or against a hypothesis. So, this concludes our module on weighing evidence and identifying causes. Thank you for listening.