[MUSIC] In public health research, many of the variables whose impact we want to examine are binary or categorical. For example, we often want to see if there are differences by gender, which is a binary variable, and we may want to adjust the model for categorical variables such as location or study site. This lecture is in two parts: in the first part I'm going to show you how to include and interpret binary predictor variables in your regression model, and in the second part I'll extend this to categorical variables.
You will see quite a few binary variables in the COPD data set, and the one we're going to examine now is gender, to assess whether there are differences in walking distance between male and female COPD patients. In order to include gender in the model, you'll need to ensure that each category is assigned a numerical value. Often we label binary variables as zero and one, and zero will be the reference category. We use zero and one because they make the regression coefficients easy to interpret, and this can also help when interpreting interactions between variables. I'll tell you more about interactions later in this course.
Once gender has been assigned numerical codes, it can be included in the regression model. If you run a regression model in R with walking distance as the outcome and gender as a predictor, you'll see the following output. You can see in the results table that the regression coefficient for gender is 30.51 and the intercept term is 379.7. So what do these values mean? To help work this out, it can be useful to write out the regression equation. The model we fitted is: walking distance is equal to a constant value of 379.7 plus 30.5 times gender. In order to interpret these regression coefficients, you'll need to check how gender has been coded in the data set.
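As a rough illustration of the coding step, here is a minimal Python sketch (the lecture itself uses R, so this is only an analogue). The walking distances are invented for illustration and are not from the COPD data set, and the helper name `fit_simple_ols` is hypothetical. It assigns zero and one to the two categories and fits a one-predictor least-squares line.

```python
# Hypothetical toy data: these distances are invented for illustration,
# not taken from the COPD data set.
genders = ["female", "male", "female", "male", "female", "male"]
distances = [380.0, 420.0, 360.0, 400.0, 400.0, 410.0]

# Assign a numeric code to each category; 0 marks the reference category.
codes = {"female": 0, "male": 1}
x = [codes[g] for g in genders]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_ols(x, distances)
print(f"intercept = {intercept:.1f}, gender coefficient = {slope:.1f}")
```

In this toy data the mean distance is 380 for the group coded zero and 410 for the group coded one, and the fitted intercept and gender coefficient come out as 380.0 and 30.0.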
Just to recap, the regression coefficient for gender represents the expected change in our outcome, walking distance, for a one-unit increase in our predictor. In the data set you'll see that females are coded as zero and males are coded as one. So a one-unit increase in gender represents a change from female to male, and another way to look at this is that female is the reference category. Looking at the regression equation again, now that we know gender equals 1 for males, you can see that 30.5 is the additional predicted mean distance that males can walk compared to females.
So what do you think the constant value of 379.7 in the equation represents? You can use the model to calculate the predicted mean walking distance for males: if we substitute the value 1 for males and work this through, you can see the mean distance for males is estimated to be 410.1 meters. Substituting the value zero for gender, for females, you can see that the last term of the model disappears, leaving the constant value only. So the constant value is the predicted mean walking distance for females.
But there's nothing special about how gender has been coded, and you could just as easily have labelled the values the other way around. So my question to you is: if we had labelled males as zero and females as one and refitted the model, would this change the regression coefficients? If you alter the coding of gender, there are two changes that will occur. First, the gender coefficient now has male as the reference category, and therefore represents a change from male to female. The magnitude of the regression coefficient for gender stays the same, because the mean difference between the two sexes stays the same, but the sign of the coefficient changes from positive to negative. The second change is the value of the constant: this now represents the mean walking distance in males, so it increases to 410.1 meters. So that's how we include and interpret binary predictor variables.
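To see the effect of swapping the codes, here is a small Python sketch on invented data (not the COPD data set, and not the lecture's R output). The helper `fit_line` is a hypothetical name for an ordinary least-squares fit with one predictor. It fits the same outcome twice, once with female as the reference category and once with male.

```python
# Hypothetical toy data, invented for illustration only.
distances = [380.0, 420.0, 360.0, 400.0, 400.0, 410.0]
female_ref = [0, 1, 0, 1, 0, 1]            # female = 0, male = 1
male_ref = [1 - g for g in female_ref]     # male = 0, female = 1


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


int_f, slope_f = fit_line(female_ref, distances)
int_m, slope_m = fit_line(male_ref, distances)

# Same magnitude, opposite sign; the constant moves to the other group's mean.
print(f"female as reference: constant = {int_f:.1f}, coefficient = {slope_f:.1f}")
print(f"male as reference:   constant = {int_m:.1f}, coefficient = {slope_m:.1f}")
```

With these invented numbers the coefficient goes from 30.0 to -30.0, and the constant goes from 380.0 (the mean of the group coded zero in the first fit) to 410.0 (the mean of the group coded zero in the second fit), which mirrors the two changes described in the lecture.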
Next, you'll look at incorporating categorical variables. [MUSIC]