[MUSIC] In this lecture, we'll take a closer look at controlling your Type 2 error rate. If you design a study and you're going to go through all the effort to collect data, you want to make sure that the probability that you'll find an effect, if it's actually there, is high enough to make the data collection worthwhile. Now, remember that the Type 2 error is a situation where you're saying that there's nothing when there's actually something to be observed. So you're concluding that there's no support for your hypothesis, whereas if you had had enough statistical power, you would have said that there's actually a true effect. Let's take a look at a classic question asked by Tversky and Kahneman, who found that people don't really know what the probability is that a study will yield an informative answer. They asked researchers to think about the following situation. It is "known" that an effect exists in the population. So we're sampling from a situation where there is a true effect to be observed. Then, there's a pilot study where a difference was observed between two groups. You have some statistics: a sample size of 22, for example, a mean, and a standard deviation. And you have a second group with 23 people, also a mean - a little bit higher - and another standard deviation. Now, you've observed a statistically significant effect in this pilot study: the p-value is smaller than 0.05. The question that Tversky and Kahneman asked is: let's say that you set out to repeat this study in exactly the same way - the same sample size and everything. What is now the probability that you'll observe a statistically significant effect? Take a moment to think about it. How would you answer this question? What Tversky and Kahneman found is that most people say that there's a 95% probability that a replication study will again give a statistically significant result. There seems to be some confusion about what the p-value means. 
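The exact means and standard deviations of the pilot study aren't shown here, but we can sketch the logic with assumed numbers: suppose the pilot (n = 22 and 23) was just barely significant, so the observed Cohen's d is around 0.6. A quick normal-approximation power calculation (a simplification of the exact t-test) then gives the probability that an identical replication is significant:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical pilot numbers: n1 = 22, n2 = 23, observed d ~ 0.60
# (roughly what a just-significant two-sample t-test would show).
n1, n2, d = 22, 23, 0.60
alpha = 0.05

z = NormalDist()
z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value, ~1.96
delta = d / sqrt(1 / n1 + 1 / n2)    # expected value of the test statistic
power = 1 - z.cdf(z_crit - delta)    # P(replication is significant)

print(round(power, 2))               # close to 0.5, nowhere near 0.95
```

Under these assumptions, the replication has only about a coin-flip's chance of being significant - far from the 95% most respondents expect.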
Apparently, people take it as an indicator that it's now 95% probable that there is a true effect, and that the following study will also observe this effect. But the correct answer requires a power analysis. In this graph, we can see different curves that reflect the probability that we'll find a statistically significant effect if there is a true effect. In other words, the statistical power that we have is a function of the effect size and the sample size. The larger the sample size, the higher the probability that we'll find a statistically significant effect. And the larger the effect size, the more probable it is that we'll find this effect, even with smaller sample sizes. So taking a look at these power graphs before you design a study is a useful way to think about the number of people that you might need when you perform a data collection. Let's take one specific example. Say you want to examine an effect with a Cohen's d of 0.5, which is considered a medium effect. Actually, in psychology, if you throw all psychological research in one big pile, the average effect size is around 0.43. So this is a slightly optimistic estimate, but still possible. Now, if you want to achieve 95% power in the study that you want to design, you actually need to collect about 100 people in each condition of an independent t-test. So it's quite a substantial number of observations that you need to have a high probability of concluding that there is an effect when there is a true effect. The Type 2 error rate in this case is only 5%, but getting it this low requires a large number of observations. If you design underpowered studies, and you don't think about the number of people you need to have high statistical power, then you're designing studies that have low informational value. 
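The sample size quoted above can be approximated with a short calculation - a normal approximation to the independent t-test; the exact t-based answer is a few participants higher, around 105 per group:

```python
from math import ceil
from statistics import NormalDist

# A-priori sample size for an independent t-test (normal approximation).
d = 0.5        # expected Cohen's d (a medium effect)
alpha = 0.05   # two-sided Type 1 error rate
power = 0.95   # desired power, i.e. a Type 2 error rate of 5%

z = NormalDist()
z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96
z_beta = z.inv_cdf(power)            # ~1.64

n_per_group = ceil(2 * ((z_alpha + z_beta) / d) ** 2)
print(n_per_group)   # 104; the exact t-test answer is ~105 per group
```

Halving the effect size to d = 0.25 quadruples the required sample size in this formula, which is why power planning matters so much for small effects.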
If you do not find an effect, it's very difficult to say whether that's because there is no true effect, or because you didn't have enough observations to observe a statistically significant finding. If you design a study with high power, and thus with a low Type 2 error rate, you can consider it a severe test. It is a very good test of your hypothesis. If you really have a very high probability of finding an effect if it's there - so you have a very low Type 2 error rate - and you don't find an effect, well, that's something to make you wonder whether there's really an effect or not. So you see that this is a very informative result. If you have very low power, you don't learn a lot. So thinking about how to control your Type 2 errors is very important. Here, we have a p-value distribution with about 95% power, and you see that in this case you have a very severe test. There is a true effect in our simulation, and with 95% power, most of the time we will find a statistically significant result. It's very rare to find a higher p-value. Most of the time, if there is something there, we will observe it. So this is an example of a severe test: if we don't find a result, we should wonder whether there's actually something to be observed. So, how can you increase your power? There are several ways. Most attention is given to increasing your sample size - that's of course a lot of work, but it is a very good way to increase power. It's not the only way, though, so let's discuss some alternative options. First of all, you can decrease the measurement error. This is a very important way of increasing your statistical power, by reducing the variability in your data. You can think about an IQ test: instead of asking one question and basing the IQ score on this one item, an IQ test has many items. 
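A p-value distribution like the one described above can be simulated. This sketch simplifies by using a z-test with known standard deviation instead of a t-test, and draws the group means directly; with d = 0.5 and about 105 participants per group, roughly 95% of the simulated p-values fall below 0.05:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
z = NormalDist()
n, d, reps = 105, 0.5, 10_000   # per-group n, true effect size, simulations

significant = 0
for _ in range(reps):
    # Draw the two sample means directly (sd = 1, so SE of a mean is 1/sqrt(n)).
    m1 = random.gauss(d, 1 / sqrt(n))
    m2 = random.gauss(0.0, 1 / sqrt(n))
    z_stat = (m1 - m2) / sqrt(2 / n)
    p = 2 * (1 - z.cdf(abs(z_stat)))
    significant += p < 0.05

print(significant / reps)   # close to 0.95: most studies detect the true effect
```

Setting `d = 0.0` in the same sketch shows the contrasting null case, where p-values are uniform and only about 5% fall below 0.05.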
Using this approach reduces measurement error and makes it easier to find differences between groups if they are really there. The second approach that you can consider is using a within-subjects design. This increases the statistical power, especially when the correlation between the two measurements that you take is very high. In psychology, this is very often the case. Think about a reaction time study where you have people respond to certain images or stimuli that are either congruent or incongruent - maybe a Stroop test. We know that reaction times within an individual are very strongly correlated, so using a within-subjects design in this case increases the statistical power of observing an effect. A third way is increasing the variability in the response options. This might make it easier to find differences if there's enough variability in the data. You can think of the number of response options on a scale, for example. If you have a scale with only three or maybe five points, there's not a lot of wiggle room for people to vary on, even if there's a real difference. If you have a large enough scale - let's say seven or nine points - there's more possibility to have varying responses, and it's easier to find differences if they are really there. The final way to consider is using a one-sided test. If you have a directional prediction, and you are not interested in effects that go in the opposite direction, then this is a good approach to increase the statistical power without any additional costs. People are not using these one-sided tests as often as they should, but they are a very efficient way to run high-powered studies. This is a graph illustrating the difference between a one-sided test and a two-sided test when we're examining a Cohen's d of 0.5 in a t-test. You can see that with a one-sided test you need about 26 people, whereas with a two-sided test you would need about 34 people. 
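The figures above look consistent with a one-sample (or dependent) t-test at 80% power, so this sketch makes those assumptions; the normal approximation used here lands one or two participants below the exact t-based values quoted:

```python
from math import ceil
from statistics import NormalDist

def n_one_sample(d, alpha, power, two_sided=True):
    """Normal-approximation sample size for a one-sample t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / d) ** 2)

# d = 0.5, alpha = 0.05, 80% power (power level assumed for this sketch).
print(n_one_sample(0.5, 0.05, 0.80, two_sided=False))  # one-sided: 25
print(n_one_sample(0.5, 0.05, 0.80, two_sided=True))   # two-sided: 32
```

The one-sided test is cheaper because the entire 5% Type 1 error rate is placed in the predicted tail, lowering the critical value the effect has to clear.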
So the difference might seem small, but for smaller effects it can become quite substantial. And it's just free, if you have a directional prediction and you specify it in advance, before data collection. Now, there's some discussion in the literature about which type of error is more severe: the Type 1 error, where you say that there is something when there's actually nothing, or the Type 2 error, where you say that there's nothing when there's actually something. It depends a little bit on how we organize our science. As long as there are replications of results, Type 1 errors will be identified. If you try to replicate someone else's study, and the original result was a Type 1 error, it's very unlikely that you will also observe a Type 1 error - the probability of two in a row is substantially smaller. So as long as we replicate each other's work sufficiently, we will identify these Type 1 errors, and you might say, "Well, they won't impact the scientific literature too much". In this case, a Type 2 error might be much more severe. If somebody tries out a new idea and finds nothing - even though there is a true effect - other people might give up on this idea. So a Type 2 error might be more severe, depending on whether we do enough checking in the scientific literature to identify the Type 1 errors. As you see, it's very important to control your Type 2 errors as well. People are very upset about Type 1 errors in the literature nowadays, but you can see that it might be even more severe to miss out on a true effect because your study did not have a high enough informational value - you did not collect enough people to conclude that there is an effect when there's actually a true effect. So be sure that when you're designing your study, you evaluate the severity of the Type 1 error and the Type 2 error. In certain situations, it might be that a Type 2 error is much more severe, and it deserves your attention. [MUSIC]