From the bivariate graphing examples that we've covered, we filled in the left side of our graphing decisions flow chart. Each example showed a situation in which our response variable was categorical. Let's talk now about the right side of our flow chart, when the response variable is quantitative. We'll now change our research question, using an example from the Gapminder data set. Here, we're interested in the association between the percent of the population living in urban settings within each country and the country's rate of internet use, that is, the percent of people with access to the World Wide Web. Below, you can see a full description of these variables from the Gapminder code book. For this research question, both the response and explanatory variables are quantitative, so a bar chart would not work here. The graph of choice is a scatter plot. A scatter plot, by definition, is a graph of plotted points that shows the relationship between two quantitative variables. In a scatter plot, each observation is plotted as a point according to its values on the explanatory and response variables.

This scatter plot shows a sample of 11 observations according to the relationship between height and weight. In the lower left-hand side of the graph, we see plotted individuals with relatively low height and weight. In the upper right-hand portion, we see individuals with relatively high height and weight. Returning to the Gapminder data set, let's examine the relationship between percent of the population living in urban settings and the rate of internet use.

>> Since we're using a different data set, we'll begin with a new program. It begins with the library import statements. Then we're going to load the Gapminder data set. But when I run this code, I get an error message that reads "ValueError: Unable to parse string." What this means is that when Python read in the Gapminder data set, it read the empty cells, which are missing values, as blanks instead of NaN.
This, in turn, causes the error message when Python encounters an empty cell as it tries to convert the variable to numeric. By default, the pandas read_csv function should convert empty cells to NaN when it reads in the data, so this shouldn't happen. But there is a simple fix if it does happen: you just need to add this line of code. data is the name of our data frame, and data.replace tells Python to replace the pattern for empty cells, given in parentheses, with NaN. And regex=True tells Python to treat that pattern as a regular expression, so that the replacement is made for every empty cell. Then you can rerun the code that converts the variable to numeric without any errors, and add describe statements in order to examine the central tendency and spread, or variability, of both urban rate and internet use rate.

We can see that for urban rate, the mean percent of the population living in urban settings is about 57%. The standard deviation is about 24%, suggesting that there is quite a bit of variability from country to country in terms of the proportion of the population living in urban settings. For internet use rate, on average about 35.6% of the population across these individual countries has access to the World Wide Web. Again, with a standard deviation of 27.8%, there seems to be quite a bit of variability from country to country.

But is there a relationship between these two variables? We can explore this question visually with a scatter plot. Python provides scatter plots through the seaborn package. This time, we use the regplot function from the seaborn package. We name the quantitative explanatory variable for the x axis, here urbanrate, and also the quantitative response variable for the y axis, internet use rate. We define the data frame, here called data, where the variables can be found. For this example, I will also ask Python to suppress the line of best fit with fit_reg equal to False, since the default is to add this line.
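The cleanup and summary steps described above can be sketched roughly as follows. This is a minimal illustration, not the course program: the tiny data frame holds made-up values rather than the real Gapminder file, the column name internetuserate is assumed, and the regular expression used to match empty cells is an assumption, since the transcript does not spell out the exact pattern.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few Gapminder rows (illustrative values only).
# Missing cells are stored as a single space, i.e. a blank string, not NaN.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "urbanrate": ["24.0", "46.7", " ", "88.9"],
    "internetuserate": ["3.5", " ", "44.9", "81.0"],  # assumed column name
})

# Replace cells that are empty or whitespace-only with NaN.
# The exact pattern is an assumption; the lecture does not show it.
data = data.replace(r"^\s*$", np.nan, regex=True)

# With the blanks gone, the conversion to numeric no longer raises an error.
data["urbanrate"] = pd.to_numeric(data["urbanrate"])
data["internetuserate"] = pd.to_numeric(data["internetuserate"])

# Central tendency and spread for both variables.
summary = data[["urbanrate", "internetuserate"]].describe()
print(summary)
```

The describe output reports the count, mean, standard deviation, and quartiles for each column, with NaN cells left out of the calculations.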
Again, with the xlabel function we are able to label the x axis, and with the ylabel function the y axis. Titles are created with the title function. To characterize the relationship that we see in a scatter plot, it can be helpful to also allow Python to draw a line of best fit through the observations, as a way of trying to determine how the dots line up. That is, do they seem to line up in a positive or negative direction, or with a positive or negative slope? An increasing slope, as we can see here between urban rate and internet use rate, indicates the relationship is positive. That is, higher values on one of the variables seem to be associated with higher values on the other, and lower values on one are associated with lower values on the other. The code is identical, but I drop the fit_reg equals False in order to display the line of best fit. Here, we see what looks like a positive relationship between urban rate and internet use rate.

Here's another example from the Gapminder data set, exploring the relationship between income per person in each country and each country's internet use rate. Again, if considering a linear pattern, the relationship seems to be positive. That is, higher income is associated with higher internet use, and lower income is associated with lower internet use. The strength of the relationship in a scatter plot is determined by how closely the data points follow the form.

>> In this scatter plot, the data points follow the linear pattern quite closely. This is an example of a very strong relationship. In this other scatter plot, the points also follow the linear pattern, but much less closely. Therefore, we can say that this is a weaker relationship. The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatter plot.
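The two scatter plot variants described earlier in this segment, one without and one with the line of best fit, might be sketched like this. The data values are made up for illustration, the column name internetuserate and the label and title strings are assumptions, and the off-screen backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for the Gapminder variables (illustration only).
data = pd.DataFrame({
    "urbanrate": [15, 25, 35, 45, 55, 65, 75, 85],
    "internetuserate": [3, 8, 12, 30, 28, 45, 60, 75],  # assumed column name
})

# Scatter plot only: fit_reg=False suppresses the line of best fit.
plt.figure()
ax_points = sns.regplot(x="urbanrate", y="internetuserate", data=data,
                        fit_reg=False)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Urban Rate and Internet Use Rate")

# Same call without fit_reg=False: regplot draws the line by default.
plt.figure()
ax_fit = sns.regplot(x="urbanrate", y="internetuserate", data=data)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Urban Rate and Internet Use Rate, with Line of Best Fit")
```

In an interactive session, calling plt.show() after each figure would display it.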
There are many possible forms. As we saw, a positive or increasing relationship means that an increase in one of the variables is associated with an increase in the other. A negative or decreasing relationship means that an increase in one of the variables is associated with a decrease in the other, as shown in this central scatter plot. Not all relationships can be classified as either positive or negative. Further, if you can't plausibly put a line through the dots, if the dots are just an amorphous cloud of specks on the graph, then there may be no relationship.

>> For various reasons, a scatter plot is sometimes limited in its ability to allow us to evaluate a relationship visually.

>> Here is a scatter plot for income per person by rate of HIV among 15 to 49 year olds. Since most countries have a low HIV rate per 100 people, the dots on the scatter plot seem to clump in the lower part of the graph. So to try to get a better sense of whether or not there is a relationship between these two variables, we could try to categorize or group the explanatory variable, income. We need to add the appropriate data management syntax to the program in order to create these categories, INCOMEGRP4. I also include a value counts statement so that we can examine the distribution of this new variable. After the program has been saved and run, we can see the distribution for income group. The four ordered groups we created show that there are 51 countries in the lowest income group. There are 51 countries in the next 25%, 50 in the next, and 51 countries in the highest 25% in terms of income.

With this new categorical explanatory variable, we're now ready to create the last type of bivariate graph, that is, the categorical-to-quantitative bar chart. The code we will use is identical to the code used for the categorical-to-categorical graph, but what will be plotted on the y axis is the mean hivrate.
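One way the grouping and bar chart steps above could look is sketched below. It assumes pandas qcut for the four-way split (the transcript only says "data management syntax"), an assumed column name incomeperperson, hypothetical group labels, and made-up values in place of the Gapminder data. seaborn's barplot is used here because it plots the mean of the y variable by default; the course program may use a different seaborn call.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values (illustration only); incomeperperson is an assumed name.
data = pd.DataFrame({
    "incomeperperson": [500, 900, 1500, 2500, 5000, 9000, 20000, 40000],
    "hivrate": [2.0, 1.5, 1.2, 0.8, 0.6, 0.4, 0.2, 0.1],
})

# Split income into four ordered groups of roughly equal size (quartiles).
# The labels are hypothetical.
data["INCOMEGRP4"] = pd.qcut(
    data["incomeperperson"], 4,
    labels=["1=lowest 25%", "2", "3", "4=highest 25%"],
)

# Distribution of the new categorical variable, in group order.
counts = data["INCOMEGRP4"].value_counts(sort=False)
print(counts)

# Mean HIV rate per income group: the quantity the bars will show.
group_means = data.groupby("INCOMEGRP4", observed=True)["hivrate"].mean()
print(group_means)

# Categorical-to-quantitative bar chart: bar height is the group mean.
ax = sns.barplot(x="INCOMEGRP4", y="hivrate", data=data)
plt.xlabel("Income group")
plt.ylabel("Mean HIV rate")
```

Printing the group means alongside the chart makes it easier to read off small differences that are hard to judge from bar heights alone.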
In this bar chart, we can see differences in HIV rate based on countries' income per person groups. The relationship seems to be linear, though, as you can also see from the y axis, differences between mean HIV rates for each income group are very small, that is, less than 2%. Also, what linear relationship we do see seems to be negative. That is, higher HIV rates are seen in lower income countries compared to higher income countries.

>> We've worked through each type of bivariate, or two-variable, graph, highlighting when and how it should be used to visualize a relationship. Now, let's just very briefly summarize. When visualizing a categorical-to-categorical relationship, we use a bar chart with explanatory categories on the x axis and the proportion of our response variable on the y axis. When visualizing a categorical-to-quantitative relationship, we use a bar chart with explanatory categories on the x axis and the mean of our response variable on the y axis. When visualizing a quantitative-to-quantitative relationship, we use a scatter plot, in which each observation is displayed according to the values of the explanatory and response variables.

>> Use these basic guidelines, as well as the graphing decisions flow chart, to visualize the relationships between your own variables of interest.
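As a compact restatement, the three summary guidelines can be written as a small lookup. The function name and the wording of the returned strings are hypothetical. A quantitative explanatory variable with a categorical response is not part of this summary, so the sketch raises an error for that pairing rather than guessing.

```python
def choose_bivariate_graph(explanatory: str, response: str) -> str:
    """Return the graph suggested by the three summary guidelines.

    Each argument is either "categorical" or "quantitative".
    """
    rules = {
        ("categorical", "categorical"):
            "bar chart: explanatory categories on x, proportion of response on y",
        ("categorical", "quantitative"):
            "bar chart: explanatory categories on x, mean of response on y",
        ("quantitative", "quantitative"):
            "scatter plot: explanatory on x, response on y",
    }
    key = (explanatory, response)
    if key not in rules:
        # Pairings outside the three summarized cases are not handled here.
        raise ValueError(f"no guideline summarized for {key}")
    return rules[key]


print(choose_bivariate_graph("quantitative", "quantitative"))
```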