The first question is, what type is the response variable? Is it categorical or quantitative? For our sample research question, the response or dependent variable is nicotine dependence, which is categorical. Next, we need to determine how many categories are in this response variable. Since nicotine dependence is coded one for yes or present and zero for no or absent, we have two categories in the response variable. The next question to ask is, what type is the explanatory variable? The explanatory or independent variable is number of cigarettes smoked per month. As we saw in the demonstration of histograms, this is a quantitative variable. It won't be visually meaningful to examine a bar chart with a quantitative explanatory variable on the x-axis when our response variable is categorical, so before we start to graph, it's important to bin our explanatory variable into categories. That is, in order to visualize the relationship that we're interested in, we need to add some data management that will allow us to construct a C to C, or categorical to categorical, bar chart. By default, the Pandas library often displays an abbreviated list of rows and columns from our DataFrame, and I know the list of values for NUMCIGMO_EST is fairly long. So I'm going to add set_option statements following the library import syntax that request the display of the maximum number of rows and columns. As you may have noticed, the default in Python limits this display to a subset of the DataFrame. Setting display.max_columns or display.max_rows to None removes that limit and allows all rows and columns to be displayed. Now, after viewing the output, we can use the various cut functions from Python's Pandas library to group individuals in various ways, for example into quartiles, that is, four groups of roughly equal size. However, in this case, it seems a better decision might be to create more meaningful smoking groups based on specific quantities.
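As a rough sketch of the display settings and the quartile option just described, the following uses a small synthetic DataFrame in place of the lecture's data. The name sub2 matches the DataFrame used later in the lecture, but the values are made up, and CIGQUARTILE is a hypothetical column name chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Lift the display limits so long frequency tables print in full.
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# Synthetic stand-in for the lecture's data; the values are made up.
rng = np.random.default_rng(0)
sub2 = pd.DataFrame({"NUMCIGMO_EST": rng.integers(1, 1201, size=200)})

# Frequency distribution of the quantitative explanatory variable.
counts = sub2["NUMCIGMO_EST"].value_counts().sort_index()
print(counts.head())

# Quartile split: qcut places roughly a quarter of the rows in each bin.
# CIGQUARTILE is a hypothetical name used only in this sketch.
sub2["CIGQUARTILE"] = pd.qcut(
    sub2["NUMCIGMO_EST"], 4, labels=["Q1", "Q2", "Q3", "Q4"]
)
print(sub2["CIGQUARTILE"].value_counts().sort_index())
```

Here qcut chooses the bin edges from the data so that the groups are about equal in size, which is why the edges it picks may not correspond to quantities that are meaningful for smoking.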
Cigarette packs contain 20 cigarettes each. We're going to create a new variable that estimates the number of packs that each individual smokes per month rather than the number of cigarettes. This could be a step closer to a categorical variable that's meaningful. The new variable is packs per month, and it is set equal to the number of cigarettes smoked per month divided by 20. Then I add this new variable, packs per month, to a groupby output statement so we can view its frequency distribution. Packs per month is still a quantitative variable, but now we can more easily create groups based on number of packs smoked in a month. After examining the frequency distribution, I decide to create groupings for those who've smoked from less than one up to five packs per month, six through 10 packs per month, 11 through 20, 21 through 30, and then more than 30 packs per month. To accomplish this, I add the following syntax to my program. When we add this new variable, pack category, to a describe statement and then run the program, we can examine some basic descriptive statistics. But don't forget to let Python know that the new variable, pack category, should be treated as categorical. We can also examine the frequency distribution for this new pack category variable representing packs of cigarettes smoked per month. Here we see the number of young adult smokers in each group. Back to our graphing decisions flow chart. Now that we've collapsed our quantitative explanatory variable, smoking, into categories, we're ready to make our C to C, or categorical to categorical, bar chart. When graphing the relationship between a categorical explanatory variable and a categorical response variable, we use the following code. This time, we use the catplot function from the seaborn package. We name the categorical explanatory variable for the x-axis, here pack category, and also the response variable for the y-axis, TAB12MDX.
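The packs-per-month and binning steps described above might be sketched as follows. The column names PACKSPERMONTH and PACKCATEGORY, the synthetic values, and the open upper edge of the last bin are assumptions made for illustration, not the lecture's actual syntax.

```python
import numpy as np
import pandas as pd

# Synthetic cigarettes-per-month values; the real data would come from sub2.
sub2 = pd.DataFrame(
    {"NUMCIGMO_EST": [10, 60, 100, 150, 200, 300, 400, 450, 600, 900, 1200]}
)

# A pack holds 20 cigarettes, so divide to estimate packs per month.
sub2["PACKSPERMONTH"] = sub2["NUMCIGMO_EST"] / 20

# Bin edges follow the groups named in the lecture: up to 5, 6-10, 11-20,
# 21-30, and more than 30 packs. Intervals are right-closed, e.g. (5, 10].
# Using np.inf as the top edge is an assumption of this sketch.
edges = [0, 5, 10, 20, 30, np.inf]
sub2["PACKCATEGORY"] = pd.cut(sub2["PACKSPERMONTH"], edges)

# Make the categorical treatment explicit (cut already returns a categorical).
sub2["PACKCATEGORY"] = sub2["PACKCATEGORY"].astype("category")

# Descriptive statistics and frequency distribution for the new variable.
print(sub2["PACKCATEGORY"].describe())
print(sub2["PACKCATEGORY"].value_counts().sort_index())
```

Unlike a quartile split, the edges here are fixed by hand, so the groups keep the same meaning regardless of how the sample happens to be distributed.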
We define the DataFrame, here sub2, where the variables can be found. Kind equals bar requests a bar chart, and ci equals None suppresses error bars, which you will learn more about in our statistical tools course. Again, with the xlabel function we are able to label the x-axis, and with the ylabel function, the y-axis. Note that for a bivariate graph where our response variable is categorical, we will actually need to convert this categorical response variable back to numeric. This is because the bivariate graph displays a mean on the y-axis, which translates into the proportion of individuals with nicotine dependence. Now how's that possible? Remember that our categorical response variable should not have more than two categories or levels. Those two categories should be coded 0 and 1, where 0 represents no or negative observations and 1 represents yes or positive observations. In this format, requesting the mean of our categorical response variable actually gives us the proportion of ones, or positive observations. However, as you saw from our work with the describe function, Python will not calculate a mean once a variable has been set to categorical. So I also need to add this syntax before requesting a bivariate graph, which transforms TAB12MDX into a numeric variable. When I save and run this, I can examine our categorical by categorical bar chart. Pack category, our explanatory variable, is on the x-axis. The rate or proportion of nicotine dependent individuals within each pack category is along the y-axis. You can see from the graph that among those smoking 1-5 packs per month, about 25 percent are nicotine dependent. Among those smoking 6-10 packs a month, 50 percent are nicotine dependent. Among those smoking 11-20 packs a month, 58 percent are nicotine dependent. Among those smoking 21-30 packs per month, almost 70 percent are nicotine dependent.
Among those smoking more than 30 packs a month, more than 70 percent are nicotine dependent, around 77 percent. We can also see that these rates form a pattern. That is, the more packs smoked per month, the higher the rate of nicotine dependence. In a graphical way, we're already seeing that there seems to be a relationship between smoking and nicotine dependence, as we hypothesized. Looking at our graphing decisions chart, we can see the steps we've taken to generate a bivariate graph with a categorical response variable that has two categories and a quantitative explanatory variable. We also discussed how to convert the quantitative explanatory variable to a categorical variable, a step which must be taken for the purposes of visualizing the relationship. If our explanatory variable had originally been categorical rather than quantitative, we could have skipped this step and just moved on to a categorical by categorical bar chart. What decisions need to be made if the response variable has more than two categories? In this case, we would need to collapse the response variable categories into two categories. To demonstrate this, we'll have to modify the research question. Let's modify the research question to look at the association between ethnicity and smoking stage. We'll create a response variable that categorizes young adult smokers into three groups: non-daily smokers, daily smokers, and those with nicotine dependence. These are the ethnic groups recorded in the NESARC codebook, along with the syntax that we can use to create a three-category smoking stage variable. I am naming a new variable here, smoke group, and creating a temporary variable called row, where I ask Python to return specific values, here 1, 2, or 3, under certain conditions: 1 if TAB12MDX equals 1, 2 if USFREQM0 equals 30, and 3 for a third group, which includes the rest of our young adult smokers who are neither nicotine dependent nor daily smokers. Notice that I have used elif, known as else-if, and also else.
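A minimal sketch of that three-category recode, assuming a row-wise function applied to the DataFrame, could look like the following. The function name smoke_group, the column name SMOKEGRP, and the six synthetic rows are hypothetical. The input column names follow the transcript's spelling.

```python
import pandas as pd

# Synthetic rows: nicotine dependence flag and days smoked per month.
sub2 = pd.DataFrame({
    "TAB12MDX": [1, 1, 0, 0, 0, 0],
    "USFREQM0": [30, 4, 30, 30, 14, 1],
})

def smoke_group(row):
    """Return 1, 2, or 3 for one row of the DataFrame."""
    if row["TAB12MDX"] == 1:        # nicotine dependent
        return 1
    elif row["USFREQM0"] == 30:     # daily smoker, not dependent
        return 2
    else:                           # everyone else: non-daily smokers
        return 3

# axis=1 hands the function one row at a time.
sub2["SMOKEGRP"] = sub2.apply(smoke_group, axis=1)

# Counts and proportions per group, as a univariate bar chart would show.
print(sub2["SMOKEGRP"].value_counts().sort_index())
print(sub2["SMOKEGRP"].value_counts(normalize=True).sort_index())
```

The order of the branches matters: a nicotine dependent person who also smokes daily is caught by the first condition and never reaches the second.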
Using "if" in the first row draws on the whole sample. Else-if, or elif, draws on what remains of the sample after those with nicotine dependence have been assigned a one, and else followed by a colon literally puts everyone else in category 3, after the nicotine dependent and daily smokers have already been assigned values of one and two. This sample can be described with these three smoking categories. This univariate bar chart shows that about 50 percent of the young adults sampled are nicotine dependent, about 30 percent are daily smokers without nicotine dependence, and almost 17 percent are non-daily smokers. However, to examine a relationship between this variable as the response variable and another variable, we need to collapse it to only two categories. To do this, we need to make some decisions. Here are two perfectly reasonable decisions that we could make. We could examine the association between ethnicity and daily versus non-daily smokers, or we could examine the association between ethnicity and nicotine dependent versus non-nicotine dependent individuals, thereby collapsing across these categories in some way. In either case, some data management needs to be added to the program. To collapse the response variable into daily versus non-daily smokers, we can use this syntax. If USFREQM0 equals 30, that is, if the individual smokes 30 days a month, then return a one for the new variable daily. Else-if, written as elif, USFREQM0 is not equal to 30, that is, the individual does not smoke 30 days per month, then return a zero for daily. Now we can again graph the relationship between our categorical explanatory variable, ethnic group, and this two-level categorical response variable, daily smoking. Remember that our categorical response variable should not have more than two categories or levels, and those two categories should be coded as one and zero, with zero representing no or negative and one representing yes or positive.
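The daily versus non-daily recode might be sketched like this, again on synthetic rows. The function name daily and the column name DAILY are assumptions for illustration. The input column name follows the transcript's spelling.

```python
import pandas as pd

# Synthetic days-smoked-per-month values.
sub2 = pd.DataFrame({"USFREQM0": [30, 30, 14, 1, 30, 4]})

def daily(row):
    """Return 1 for people who smoke 30 days a month, otherwise 0."""
    if row["USFREQM0"] == 30:
        return 1
    elif row["USFREQM0"] != 30:
        return 0

sub2["DAILY"] = sub2.apply(daily, axis=1)

# With 0/1 coding, the mean is the proportion of ones (daily smokers).
print(sub2["DAILY"].tolist())   # [1, 1, 0, 0, 1, 0]
print(sub2["DAILY"].mean())     # 0.5
```

Three of the six synthetic rows are daily smokers, so the mean of 0.5 is exactly the proportion coded one.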
In this format, requesting the mean of our categorical response variable actually gives us the proportion of ones, or positive observations. Notice that the dummy codes for our ethnicity variable are not ordered on the x-axis. Because they are dummy codes, the graph is somewhat difficult to interpret. You can rename categorical variable values for graphing, first by changing the variable format to categorical if you haven't already done so, and second by giving the variable ETHRACE2A new value labels with cat.rename_categories. Here's our new graph. As you can see, the rate of daily smoking is somewhat lower among Native American and Hispanic young adults who have smoked in the past year compared to each of the other groups. Because our response variable was categorical with more than two categories, we needed to collapse it into only two categories, and because our explanatory variable, ethnicity, was categorical, we created a categorical by categorical bar chart. Had our explanatory variable been quantitative, we would have needed to bin or collapse that variable into categories before creating the categorical by categorical bar chart.
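To make the relabeling and the final chart concrete, here is a hedged sketch on synthetic data. The five label strings, the axis labels, and the helper name plot_daily_by_ethnicity are assumptions for illustration. The transcript does not list the codebook labels, so the mapping of codes 1 through 5 shown here should be checked against the NESARC codebook. The plotting call mirrors the catplot arguments described earlier (kind bar, ci None), and newer seaborn releases prefer errorbar=None in place of ci=None.

```python
import pandas as pd

# Synthetic rows: ethnicity dummy codes 1-5 and a 0/1 daily-smoking flag.
sub2 = pd.DataFrame({
    "ETHRACE2A": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "DAILY":     [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})

# Step 1: make the explanatory variable categorical.
sub2["ETHRACE2A"] = sub2["ETHRACE2A"].astype("category")

# Step 2: replace the dummy codes with readable labels. The labels are
# assumed here and are applied in the order of the sorted codes 1..5.
sub2["ETHRACE2A"] = sub2["ETHRACE2A"].cat.rename_categories(
    ["White", "Black", "NatAm", "Asian", "Hispanic"]
)

# The bar heights are group means of the 0/1 response, i.e. proportions.
rates = sub2.groupby("ETHRACE2A", observed=False)["DAILY"].mean()
print(rates)

def plot_daily_by_ethnicity(df):
    """Draw the categorical by categorical bar chart (needs seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn

    # ci=None suppresses error bars, as in the lecture.
    seaborn.catplot(x="ETHRACE2A", y="DAILY", data=df, kind="bar", ci=None)
    plt.xlabel("Ethnic group")
    plt.ylabel("Proportion of daily smokers")
    plt.show()
```

Calling plot_daily_by_ethnicity(sub2) would draw one bar per ethnic group, with heights equal to the values printed in rates.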