From the two bivariate graphing examples that we've covered, we filled in the left side of our graphing decisions flow chart. Each example showed a situation where our response variable was categorical. Let's talk now about the right side of our flow chart, when the response variable is quantitative. We'll now change our research question using an example from the Gapminder data set. Here, we're interested in the association between the percent of the population living in urban settings within each country and the country's rate of internet use, that is, the percent of people with access to the World Wide Web. Below, you can see a full description of these variables from the Gapminder code book. For this research question, both the response and explanatory variables are quantitative. A bar chart would not work here. The graph of choice would be a scatterplot. A scatterplot, by definition, is a graph of plotted points that shows the relationship between two quantitative variables. In a scatterplot, each observation is plotted according to its values on the explanatory and response variables. This scatterplot shows a sample of 11 observations according to the relationship between height and weight. In the lower left-hand side of the graph, we see plotted individuals with relatively low height and weight. In the upper right-hand portion, we see individuals with relatively high height and weight. Returning to the Gapminder data set, let's examine the relationship between percent of the population living in urban settings and the rate of internet use. >> Since we're using a different data set, we'll begin with a new program. It begins with the library import statements, then we're going to load the Gapminder data set. Next, I will set the variables that I will be working with to numeric and add describe statements in order to examine the central tendency and spread, or variability, of both urban rate and internet use rate.
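As a rough sketch of the steps just described (import the libraries, load the data, set the variables to numeric, and describe them), something like the following could be used. The column names `urbanrate` and `internetuserate` and the handful of values are assumptions made up for illustration; in the actual program the data frame would come from loading the Gapminder file, so the numbers printed here will not match the ones discussed in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder data set (values are made up).
# The values are stored as text on purpose, to show why the numeric step matters.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "urbanrate": ["24.0", "46.7", "65.2", "88.9", "missing"],
    "internetuserate": ["3.5", "44.9", "12.5", "81.0", "65.8"],
})

# Set the variables to numeric; entries that cannot be parsed become NaN
data["urbanrate"] = pd.to_numeric(data["urbanrate"], errors="coerce")
data["internetuserate"] = pd.to_numeric(data["internetuserate"], errors="coerce")

# describe() reports count, mean, standard deviation, min, quartiles, and max
desc_urban = data["urbanrate"].describe()
desc_internet = data["internetuserate"].describe()

print(desc_urban)
print(desc_internet)
```

The mean gives a sense of central tendency, and the standard deviation gives a sense of spread. Note that the count excludes the entry that could not be converted to a number.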
We can see that for urban rate, the mean percent of the population living in urban settings is about 57%. The standard deviation is about 24%, suggesting that there is quite a bit of variability from country to country in terms of the population living in urban settings. For internet use rate, on average, about 35.6% of the population across these individual countries has access to the World Wide Web. Again, with a standard deviation of 27.8%, there seems to be quite a bit of variability from country to country. But is there a relationship between these two variables? We can explore this question visually with a scatterplot. Python provides scatterplots through the seaborn package. This time we use the regplot function from the seaborn package. We name the quantitative explanatory variable for the x-axis, here, urban rate, and also the quantitative response variable for the y-axis, internet use rate. We define the data frame, here called data, where the variables can be found. For this example, I will also ask Python to suppress the line of best fit with fit_reg=False, since the default is to add this line. Again, with the xlabel function we are able to label the x-axis, and with the ylabel function, the y-axis. Titles are created with the title function. To characterize the relationship that we see in a scatterplot, it can be helpful to also allow Python to draw a line of best fit through the observations, as a way of trying to determine how the dots line up. That is, do they seem to line up in a positive or a negative direction, or with a positive or negative slope? An increasing slope, as we can see here between urban rate and internet use rate, indicates the relationship is positive. That is, higher values on one of the variables seem to be associated with higher values on the other, and lower values on one are associated with lower values on the other.
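The two versions of the scatterplot described here, one with the line of best fit suppressed and one with it drawn, might look roughly like this. This is only a sketch: the column names `urbanrate` and `internetuserate`, the made-up values, and the label and title text are assumptions for illustration, not the actual Gapminder data or the exact program from the lecture.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn

# Hypothetical values standing in for the Gapminder variables
data = pd.DataFrame({
    "urbanrate": [18.0, 24.5, 37.3, 46.7, 58.1, 65.2, 77.4, 88.9],
    "internetuserate": [2.1, 9.0, 6.5, 28.3, 35.0, 44.9, 61.7, 81.0],
})

# Scatterplot only: fit_reg=False suppresses the line of best fit
fig1, ax_points = plt.subplots()
seaborn.regplot(x="urbanrate", y="internetuserate", fit_reg=False,
                data=data, ax=ax_points)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Urban Rate and Internet Use Rate")

# Same call without fit_reg=False: regplot adds the line of best fit by default
fig2, ax_fit = plt.subplots()
seaborn.regplot(x="urbanrate", y="internetuserate", data=data, ax=ax_fit)
plt.xlabel("Urban Rate")
plt.ylabel("Internet Use Rate")
plt.title("Urban Rate and Internet Use Rate, with Line of Best Fit")
```

Each plot is drawn on its own set of axes here so that the two versions do not end up on top of each other. In an interactive session, `plt.show()` would display them.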
The code is identical, but I drop the fit_reg=False in order to display the line of best fit. Here we see what looks like a positive relationship between urban rate and internet use rate. Here's another example from the Gapminder data set, exploring the relationship between income per person in each country and each country's internet use rate. Earlier in my program, I also need to remember to set income per person as numeric. Again, if considering a linear pattern, the relationship seems to be positive. That is, higher income is associated with higher internet use, and lower income is associated with lower internet use. The strength of the relationship in a scatterplot is determined by how closely the data points follow the form. >> In this scatterplot, the data points follow the linear pattern quite closely. This is an example of a very strong relationship. In this other scatterplot, the points also follow the linear pattern, but much less closely. Therefore, we can say that this is a weaker relationship. The form of the relationship is its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms. As we saw, a positive, or increasing, relationship means that an increase in one of the variables is associated with an increase in the other. A negative, or decreasing, relationship means that an increase in one of the variables is associated with a decrease in the other, as shown in this central scatterplot. Not all relationships can be classified as either positive or negative. Further, if you can't plausibly put a line through the dots, if the dots are just an amorphous cloud of specks on the graph, then there may be no relationship. >> For various reasons, a scatterplot is sometimes limited in its ability to allow us to evaluate a relationship visually. >> Here is a scatterplot for income per person by rate of HIV among 15 to 49 year olds.
Since most countries have a low HIV rate per 100 people, the dots on the scatterplot seem to clump in the lower part of the graph. So, to try to get a better sense of whether or not there is a relationship between these two variables, we could try to categorize, or group, the explanatory variable, income. We need to add the appropriate data management syntax to the program in order to create these categories in a new variable, INCOMEGRP4. I also include a value counts statement so that we can examine the distribution of this new variable. After the program has been saved and run, we can see the distribution for income group. The four ordered groups we created show that there are 51 countries in the lowest income group. There are 51 countries in the next 25%, 50 in the next, and 51 countries in the highest 25% in terms of income. With this new categorical explanatory variable, we're now ready to create the last type of bivariate graph, that is, the categorical to quantitative bar chart. The code we will use is identical to the code used for the categorical to categorical graph, but what will be plotted on the y-axis is the mean HIV rate. In this bar chart, we can see differences in HIV rate based on countries' income-per-person groups, and the relationship seems to be linear. Though, as you can also see from the y-axis, differences between mean HIV rates for each income group are very small, that is, less than 2%. Also, what linear relationship we do see seems to be negative. That is, higher HIV rates are seen in lower income countries compared to higher income countries. We've worked through each type of bivariate, or two-variable, graph, highlighting when and how each should be used to visualize a relationship. Now, let's just very briefly summarize. When visualizing a categorical to categorical relationship, we use a bar chart with explanatory categories on the x-axis and the proportion of our response variable on the y-axis.
When visualizing a categorical to quantitative relationship, we use a bar chart with explanatory categories on the x-axis and the mean of our response variable on the y-axis. When visualizing a quantitative to quantitative relationship, we use a scatterplot, in which each observation is displayed according to its values on the explanatory and response variables. Use these basic guidelines, as well as the graphing decisions flow chart, to visualize the relationships between your own variables of interest.
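For the categorical to quantitative case, the income grouping and bar chart described earlier could be sketched along these lines. Several things here are assumptions for illustration: the column names `incomeperperson` and `hivrate`, the made-up values, the group labels, and the choice of `pd.qcut` to form four equal-sized groups. The bar chart uses `seaborn.barplot`, which plots the mean of the y variable by default (with error bars, under its default settings); it is not necessarily the same call used in the lecture's program.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn

# Hypothetical values standing in for the Gapminder variables
data = pd.DataFrame({
    "incomeperperson": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0],
    "hivrate": [3.0, 2.0, 1.5, 1.1, 0.9, 0.5, 0.3, 0.1],
})

# Split income into four ordered groups with roughly equal numbers of countries
# (the label text is a made-up example)
group_labels = ["1=lowest 25%", "2=second 25%", "3=third 25%", "4=highest 25%"]
data["INCOMEGRP4"] = pd.qcut(data["incomeperperson"], 4, labels=group_labels)

# Examine the distribution of the new variable, in group order
counts = data["INCOMEGRP4"].value_counts(sort=False, dropna=True)
print(counts)

# Mean HIV rate within each income group (what the bars will show)
group_means = data.groupby("INCOMEGRP4", observed=True)["hivrate"].mean()
print(group_means)

# Bar chart: explanatory categories on the x-axis, mean of the response on the y-axis
fig, ax = plt.subplots()
seaborn.barplot(x="INCOMEGRP4", y="hivrate", data=data, ax=ax)
plt.xlabel("Income Group")
plt.ylabel("Mean HIV Rate")
```

Printing the group means alongside the chart makes it easier to judge how large the differences between bars really are, which matters when the y-axis covers a narrow range.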