Now we start the second section, on Statistics and Bioinformatics Applied in Evidence-Based Approaches. Within this section, we will deal mainly with two questions. The first is the diagnostic test problem and how it is applied to method assessment in evidence-based toxicology, and then we will have a look at systematic review software and what features it should offer. In the beginning, I said that method assessment is one of the pillars of evidence-based toxicology, and that it is very close to the so-called diagnostic test problem. So let's look a little deeper into that. The goal is to compare two classification, or diagnostic, tests with a binary outcome, which means yes/no, positive/negative. And you compare the two tests against each other. One of the tests, the so-called reference test, is normally the gold standard, which can be characterized as the best available and regulatorily accepted test under reasonable conditions, for example the OECD Test Guideline 414 for prenatal developmental toxicity testing. So you compare the test to be evaluated against the results from the gold standard, and you assume, and this is one of the very important assumptions, that the results from the gold standard test are correct. The results of such a study can be shown in a two-by-two cross table. In the first column you see the true positives, and in the second column the true negatives. These are compared to the test to be evaluated: in the first row, the positive results of the test we evaluate, and in the second row, its negative results. This is a little bit comparable to statistical tests. You have two correct cells, a and d, which are the true positives and the true negatives, and two wrong cells, b and c, which are the false positives and the false negatives. Now you can define several statistics based on this two-by-two table. First of all, the prevalence.
The prevalence is defined as the probability of a positive true condition among all classifications that you have. In this table it is a + c, the sum of the first column, the true positives, divided by the sum of all four cells. The next one is the sensitivity. The sensitivity is defined within the true positive group, which means we are now in the first column. It is the probability that if something is truly positive, it gets classified as positive by the test to be evaluated. The specificity is defined in the second column, which means the true negatives: it is the probability that if something is truly negative, it gets classified as negative by the test to be evaluated. And then you also have the so-called positive predictive value and negative predictive value. Now we are moving from the columns to the rows. The positive predictive value is defined in the first row: it is the probability that if something is classified as positive by the test to be evaluated, it is truly positive. For the negative predictive value, we are now in the second row: it is the probability that if something is classified as negative by the test to be evaluated, it is truly negative. And then there is the total accuracy, which is simply the probability that the test we are evaluating makes a right decision, meaning a true positive or a true negative. Sensitivity and specificity are not prevalence dependent, because they are defined within the groups of the true positives and true negatives. The negative predictive value and the positive predictive value, however, change with the prevalence and are therefore study dependent. You cannot simply transfer the estimated negative and positive predictive values from one study to a new study population, which may be in a different risk group and have a different prevalence. I will give you an example of that on the next slide.
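As a small illustration, the statistics defined above can be computed directly from the four cells of the two-by-two table. This is a sketch in Python; the function name and the cell labels a, b, c, d are my own, following the table layout described above (columns = true condition, rows = test result):

```python
# Hypothetical helper: the diagnostic statistics defined above, from a 2x2 table.
#   a = true positives, b = false positives, c = false negatives, d = true negatives
def diagnostic_metrics(a, b, c, d):
    total = a + b + c + d
    return {
        "prevalence":  (a + c) / total,  # truly positive cases among all cases
        "sensitivity": a / (a + c),      # defined within the first column
        "specificity": d / (b + d),      # defined within the second column
        "ppv":         a / (a + b),      # defined within the first row
        "npv":         d / (c + d),      # defined within the second row
        "accuracy":    (a + d) / total,  # all correct decisions
    }

# Counts matching the skin irritation example later in the lecture, scaled to
# 1,000 chemicals: 50% prevalence, sensitivity = specificity = 95%.
m = diagnostic_metrics(a=475, b=25, c=25, d=475)
```

At 50% prevalence these counts give sensitivity, specificity, PPV and NPV of 95% each, consistent with the example discussed below.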
So if the prevalence changes, which means a new study in a different population, updated negative and positive predictive values can be calculated following Bayes' theorem, given the new prevalence. The two formulas for how to do that are given there. On the right side you see a plot: on the x axis, the prevalence between zero and one, and three curves, namely accuracy, positive predictive value and negative predictive value. You can see that the positive predictive value of a test decreases with decreasing prevalence: the smaller the prevalence, the more the positive predictive value goes down. The opposite holds for the negative predictive value. Another word about sensitivity, specificity, negative predictive value and positive predictive value: these are point estimates, and looking only at point estimates can be highly misleading. It is very important that you also calculate confidence intervals to interpret the results. For sensitivity and specificity, binomial proportion confidence intervals are normally used, for example Wilson score and Clopper-Pearson intervals, and for the negative and positive predictive values you normally calculate logit confidence intervals. The references for that are given on the slide, and later I will also show you a free software implementation where you can calculate these confidence intervals. Maybe the last two slides were a little bit confusing, so now let's look at a specific example. Let us assume for a second that a new hypothetical test has been developed for detecting skin irritation potential of chemicals. Let's say the test was developed with 100 chemicals, and it turned out that the sensitivity of the test was 95% and the specificity was also 95%. And the dataset had a prevalence of 50%, which means there were 50% irritants in it: 50 irritants and 50 non-irritants.
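The Bayes' theorem update of the predictive values, and the Wilson score interval mentioned above, can be sketched as follows. These are hypothetical Python helpers of my own naming; the formulas are the standard ones for PPV/NPV given sensitivity, specificity and prevalence, and the standard Wilson binomial interval:

```python
from math import sqrt

def ppv_npv_from_prevalence(sens, spec, prev):
    """Update PPV and NPV for a new prevalence via Bayes' theorem."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))
    return ppv, npv

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 corresponds to roughly 95% coverage)."""
    phat = successes / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Sensitivity and specificity of 95%, but a prevalence of only 5%:
ppv, npv = ppv_npv_from_prevalence(0.95, 0.95, 0.05)  # ppv drops to about 0.5
```

Note how the same 95%/95% test yields a PPV of only about 50% once the prevalence drops to 5%, exactly the effect discussed in the example below.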
As you can see, and you can also calculate it yourself, this results in a positive predictive value of 95% and also a negative predictive value of 95%. This was considered acceptable by the people who developed the test and performed the study. However, a reviewer pointed out, after some searching in regulatory tox databases, that the real prevalence is not 50%; the real prevalence is about 5%. And he calculated the new negative and positive predictive values, as you can see in the next line, and the outcome was quite surprising. The negative predictive value increased to 99.7%, and the positive predictive value was reduced to 50%. That means that if the test classifies a chemical as an irritant, the probability that this is correct is just 50%, which essentially means you flip a coin. To illustrate that, look at the figure on the right side. We have 1,000 chemicals to be tested with the new test, and we know we have a prevalence of 5%, which means 5% of 1,000: 50 true irritants and 950 true non-irritants. Of the 50 true irritants on the left side, because of the sensitivity of 95% (remember, the sensitivity is calculated within the group of the true irritants), around 47 chemicals are classified positive, which means correctly positive, and around 3 test negative, which means false negatives. On the right side of the slide, within the group of 950 true non-irritants, this is where the specificity of 95% is applied: around 903 of the 950 true non-irritants get negative test results, which means they are correctly classified as negative, but around 47 get positive test results, which are false positives. This means that out of the roughly 94 positive test results among the 1,000 chemicals tested, we have around 47 correctly classified positive chemicals and around 47 false positives.
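The counting argument in this example can also be reproduced numerically. A short Python sketch, with variable names of my own choosing:

```python
# Reproducing the worked example from the lecture: 1,000 chemicals,
# prevalence 5%, sensitivity = specificity = 95%.
n, prev, sens, spec = 1000, 0.05, 0.95, 0.95

irritants     = n * prev        # 50 true irritants
non_irritants = n - irritants   # 950 true non-irritants

true_pos  = irritants * sens          # ~47.5 correctly flagged irritants
false_neg = irritants - true_pos      # ~2.5 missed irritants
true_neg  = non_irritants * spec      # ~902.5 correctly cleared chemicals
false_pos = non_irritants - true_neg  # ~47.5 wrongly flagged chemicals

ppv = true_pos / (true_pos + false_pos)  # about 0.5, a coin flip
npv = true_neg / (true_neg + false_neg)  # about 0.997
```

The expected counts match the figure (about 47 correct positives against about 47 false positives), which is exactly why the positive predictive value collapses to 50%.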
Which means the positive predictive value is 50%, which is clearly not acceptable for our test. Now, the raw outcome of a test is normally not a classification but a number measured on a continuous scale. For example, the hematocrit level in blood, which can theoretically be between 0 and 100%, is used to predict iron deficiency anemia, which is a yes/no classification. So you need a transformation from a continuous scale to an ordinal scale or classification. In toxicology, this is called a prediction model: to get from the raw measurements of your assay to a classification or prediction, you need a prediction model. To optimize your prediction model and to understand how good it is, you normally draw a receiver operating characteristic curve, or ROC curve. A ROC curve is a graphical plot that illustrates the performance of a diagnostic test by varying the discrimination threshold from the lowest possible value in the data set to the highest possible value in the data set. You can see an example on the right side. On the x axis you have one minus the specificity, and on the y axis the sensitivity; both values range from zero to one, that is, between 0 and 100%. The ROC curve has to start at the point (0, 0) and has to end at the point (1, 1). That means that at one extreme threshold you have a specificity of one and a sensitivity of zero, which is the lower left corner, and at the other extreme a sensitivity of one and a specificity of zero, which is the point in the upper right corner of the graph. You see a dashed line in the middle? That dashed line is a random guess, for which just the known prevalence in the data set is used. The farther the estimated ROC curve lies to the upper left of the random-guess line, the better the test is. The perfect test would be a single point in the upper left corner.
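An empirical ROC curve of this kind can be sketched in a few lines of Python. This is an illustrative implementation (the function name is mine), assuming a sample is called positive when its score is at or above the threshold:

```python
def roc_points(scores, labels):
    """Empirical ROC curve: sweep the discrimination threshold over every
    observed score. labels: 1 = truly positive, 0 = truly negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above the maximum: nothing is positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (1 - specificity, sensitivity)
    return points  # starts at (0, 0) and ends at (1, 1)

# Toy data: five assay readouts and their gold standard classifications.
pts = roc_points([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
```

In practice you would use a statistics package for this (ROC analysis is built into R, SPSS, SAS and STATA, as mentioned below), but the sketch shows what "varying the discrimination threshold" means concretely.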
Just to give you an example: point B, which is on the dashed line, is a random guess, but it is still better than point C, because point C is below the dashed line, so point C is even worse than a random guess. And point A, which lies above the dashed line, is better than both point B and point C. In the first section, I talked about the general steps to perform an evidence-based toxicology study. As you remember, in the second step the best available evidence needs to be retrieved systematically through a systematic literature search, and in the third step the evidence needs to be assessed and critically evaluated. This you normally don't do with an Excel table; you use specialized software for that, which I call systematic review management software. This software helps you, after your systematic literature search, with the screening of potential evidence and its critical and systematic evaluation. In general, the software should allow you to import and upload references, for example from PubMed or other reference management software, and it should include the full text of the documents, not just the abstracts. It should be able to create forms for the review assessments, which means the application of inclusion and exclusion rules, data extraction, and risk of bias assessments; we will talk about risk of bias in the next section. It should also provide logic checks and input validation, which means that if you enter a number or a letter, the software should check whether it is within an allowed range. It should be able to assign reviewers, including a randomization procedure. And it should be able to monitor progress: normally a meta-analysis or evidence-based approach has more than one person involved, so the software should help you see how far along the study is. It should also identify review and data conflicts.
Normally, more than one reviewer looks at each piece of evidence, to exclude any possible effect of the individual reviewer on the evaluation of the study results. So data conflicts should be identified, and the software should offer a way to deal with them. It should also provide some statistics about how the reviewers are doing in general; this is called kappa interrater reliability scoring, which essentially measures how well the results of the different reviewers agree. And of course, in the end, it should give you the possibility to export the data for further analysis in other statistics software, and it should allow you to make backups. Now I would like to look a little more generally at software packages which I have used in evidence-based approaches. Since you heard about the diagnostic test problem: there is a web page which provides a diagnostic test evaluation calculator, including the confidence intervals and a graphical display of the results, and the software is free. ROC curves and meta-regression models are implemented in almost all classical statistical software packages like R, SPSS, SAS or STATA, and on a web page of the Columbia University libraries you can find a very nice overview of software packages which can be used for preparing and maintaining systematic reviews. Last but not least, I would like to point out how important it is, in evidence-based approaches such as meta-analyses, that you consult a statistician before you start your work. And please provide examples of studies, results, and raw data. What is the current knowledge in the field? What are the possible gaps? What are possible gold standard tests? What could be possible study acceptance criteria, inclusion and exclusion criteria? What are possible controls, and are there any reference or guidance documents available?
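The kappa interrater reliability score mentioned above is typically Cohen's kappa, which corrects the observed agreement between two reviewers for the agreement expected by chance. A minimal Python sketch (the function name is mine):

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two reviewers' categorical ratings:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    observed = sum(1 for x, y in zip(rater1, rater2) if x == y) / n
    categories = set(rater1) | set(rater2)
    # Chance agreement: product of each reviewer's marginal rates per category.
    expected = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two reviewers screening four studies for inclusion/exclusion:
kappa = cohens_kappa(["in", "out", "in", "in"], ["in", "out", "in", "in"])
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance; systematic review software typically reports this automatically.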
And the statistician, he or she, will help you to ensure that an appropriate question is formulated, along with an appropriate study design and study protocol, which also includes sample size calculations. He or she will also help you with the practical procedures of the studies, which means how you keep your records, how the data should be collected, how the studies should be reported, and what the appropriate effect measures and statistics are that should be applied. And in general, the appropriate statistical procedures, and what the decision criteria could be to decide which levels of evidence are needed to answer the question.
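As a rough illustration of the sample size calculations mentioned here, a common normal-approximation formula estimates how many samples are needed to pin down a proportion, such as an expected sensitivity, to a given precision. This is a simplified sketch with a made-up function name, not a substitute for consulting a statistician:

```python
from math import ceil

def sample_size_for_proportion(p_expected, half_width, z=1.96):
    """Normal-approximation sample size n = z^2 * p(1-p) / d^2 needed to
    estimate a proportion to within +/- half_width at ~95% confidence."""
    return ceil(z**2 * p_expected * (1 - p_expected) / half_width**2)

# Worst case (p = 0.5), aiming for a +/- 5 percentage point interval:
n = sample_size_for_proportion(0.5, 0.05)  # the classic n = 385
```

Note that p = 0.5 gives the largest required sample; if a pilot study suggests, say, a sensitivity near 95%, the required n shrinks considerably.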