A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Okay everyone, welcome back to Lecture 7, Section C. And here we're going to look at some multiple logistic regression examples from public health and medical literature.

So, hopefully, after viewing this lecture section, you'll have some more exposure to interpreting the results from simple and multiple logistic regression models as presented in published journal articles.

So the first article we're going to look at came from the Annals, or Archives, of Internal Medicine. It's entitled Discrepancy Between Consensus Recommendations and Actual Community Use of Adjuvant Chemotherapy in Women with Breast Cancers. The abstract goes on to state their purpose. They say that although the efficacy of chemotherapy in prolonging survival for women with breast cancer has been well documented, limited population-based information is available on the use of chemotherapy. So what they wanted to do is examine the relationship between age and chemotherapy use.

I'll let you read the details of this if you're interested. But they go on to say in their measurement section that they used logistic regression analysis to generate the odds and probabilities of receiving chemotherapy. So they employed logistic regression, and we'll see how in a bit. And then they go on to say that in their study, overall, 29% of the women received chemotherapy.

And they note that across all tumor stages, the use of chemotherapy decreases substantially with increasing age. They said overall 66% of the women younger than 45 years of age received chemotherapy, compared with 44% of women between 50 and 54 years of age, 31% of women between 55 and 59, and 18% of women between 60 and 64 years of age. So they just presented the unadjusted proportions. Then they go on to say the decreasing pattern of chemotherapy use with age continued after adjustment for prognostic factors. And that comes from the results of their multiple logistic regression, which we'll examine in detail now.

So to describe how they fit these logistic regression models, they say: we used multivariable logistic regression analysis to generate the odds ratios of receiving chemotherapy in women with breast cancer and to determine the effect of age on chemotherapy use. So in this sense they're using multivariable to mean potentially multiple predictors in the model. In this model, they say, we adjusted for race, categorized into three groups (white, black, or other); tumor stage in three categories; node status and hormone receptor status; whether the patient had received surgery and radiation therapy, categorized as breast conserving surgery without radiation, breast conserving surgery with radiation, or mastectomy; and adjuvant hormone therapy use, yes or no. And they go on to say that, in addition to the odds ratios, they used the results of the logistic regression to generate the probabilities of receiving chemotherapy from the parameters of the logistic regression model for women of different ages, holding other factors constant. In other words, they tried to predict adjusted probability estimates of receiving chemotherapy by age, standardizing the age groups by these other factors.
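The prediction step described here can be sketched in a few lines. This is only an illustration of the idea, not the authors' code: the intercept, the slopes, and the single adjustment covariate below are hypothetical values chosen for the example, and the names `inverse_logit` and `adjusted_probability` are made up.

```python
import math

def inverse_logit(log_odds):
    """Convert a value on the log odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients (NOT from the article): an intercept, one slope
# per age group relative to the under-45 reference, and one adjustment covariate.
INTERCEPT = 0.4
AGE_SLOPES = {"<45": 0.0, "50-54": -0.9, "60-64": -2.1}
NODE_POSITIVE_SLOPE = 1.5

def adjusted_probability(age_group, node_positive):
    """Predicted probability for an age group at a fixed covariate value."""
    log_odds = (INTERCEPT
                + AGE_SLOPES[age_group]
                + NODE_POSITIVE_SLOPE * node_positive)
    return inverse_logit(log_odds)

# Hold node status at the same value for every age group, so the only
# thing that differs between the predictions is age.
adjusted = {age: adjusted_probability(age, node_positive=0) for age in AGE_SLOPES}
for age, p in adjusted.items():
    print(f"{age}: {p:.2f}")
```

Because every age group is evaluated at the same covariate value, the differences between the printed probabilities reflect age alone, which is the sense in which such estimates are standardized.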

So here's one of the results tables they present, Table 4, where they show the results of this multivariable logistic regression for the odds or probability of receiving chemotherapy in Stage I, Stage II, or Stage IIIA breast cancer from 1991 through 1997. The sample was women in New Mexico in that time period. They first show, for each of these age groups, split out into five-year intervals for the most part except on the ends, the unadjusted proportions: the p hats, if you will, the proportions of women who received chemotherapy in each of those age groups. And then they converted these into odds ratios.

So each of these odds ratios compares the odds of receiving chemotherapy for a given age group to the less than 45 year old group, the reference group. I like this because they show that the probabilities are decreasing, and this gives us a sense of the magnitude of the decrease on the risk difference scale. And then we can see what this means on the odds ratio scale, and that's shifting downward as well. They also give confidence intervals for each of these odds ratios. And we can see that, on the whole, the results are statistically different from the reference group across all these age groups. And a lot of the confidence intervals for the age groups greater than 45 do not overlap each other either.
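To see how the two scales relate, here is a small sketch that takes the unadjusted proportions quoted from the abstract and expresses each comparison both as a risk difference and as an odds ratio. These are crude odds ratios computed from the quoted proportions; the adjusted values in the article's table will differ. The function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1.0 - p)

def odds_ratio(p_group, p_reference):
    """Odds of the outcome in a group relative to the reference group."""
    return odds(p_group) / odds(p_reference)

# Unadjusted proportions receiving chemotherapy, as quoted from the abstract.
p_reference = 0.66  # women younger than 45
proportions = {"50-54": 0.44, "55-59": 0.31, "60-64": 0.18}

for age, p in proportions.items():
    print(f"{age}: risk difference {p - p_reference:+.2f}, "
          f"crude odds ratio {odds_ratio(p, p_reference):.2f}")
```

Both measures move in the same direction as age increases, but the odds ratio alone would not tell a reader that the reference group started at 66%.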

These odds ratios are adjusted, as they go on to say, for race, tumor stage, lymph node status, hormone receptor status, and other treatments received. That's what we just talked about from the methods section. Then they went on to actually produce estimated adjusted probabilities. Whereas these proportions over here are the raw, crude proportions of women receiving chemotherapy in each of the age groups, these are adjusted; these are standardized. These are the estimated proportions of women in the given age group receiving chemotherapy, were the women comparable on all the factors they adjusted for in the logistic regression model. And you can see that some of the adjusted estimates attenuate after adjusting for the differing characteristic levels. But they still show this ordering: the youngest women were most likely to receive chemotherapy, and it decreases as a function of age.

Using the computer, they were able to get confidence intervals for these predicted probabilities as well. That requires a little more computation than we can do by hand. We can estimate these predicted probabilities given a regression model, but getting the standard error, and hence the confidence interval, requires the computer; it can't easily be done by hand.

So I like this presentation. It shows the raw proportions, because this is a binary outcome, and proportions help us understand the magnitude of the utilization here by age. Then it shows the relative comparison on the odds ratio scale, adjusted for other characteristics that may differ between the age groups. And then it presents those in terms of adjusted probabilities. So that, again, reminds us of the order of magnitude that these probabilities of utilizing chemotherapy are on. If we just had the odds ratios, we could see the relative comparison of the odds, not the direct comparison of the proportions, and we wouldn't necessarily know what the starting measure was.

All right. So let's look at another example, this one from Health Affairs: Neighborhood Socioeconomic Conditions, Built Environments, and Childhood Obesity. Here's the abstract. They say: we examine the impact of neighborhood socioeconomic conditions and built environments, which is a measure of the neighborhood stability characteristics, on obesity and overweight prevalence among U.S. children and adolescents using the 2007 National Survey of Children's Health. The odds of a child's being obese or overweight were 20-60% higher among children in neighborhoods with the most unfavorable conditions, such as unsafe surroundings, poor housing, and no access to sidewalks, parks, and recreation centers, than among children not facing the same conditions. The effects were much greater for females and younger children. So they actually talk about effect modification here. For example, girls age 10 to 11 were two to four times more likely than their counterparts from more favorable neighborhoods to be overweight or obese. They say their findings can contribute to policy decisions aimed at reducing health inequalities and promoting obesity prevention efforts, such as community-based physical activity and healthy diet initiatives.

So, here's how they present the results. This survey was what's called a probability-based survey, where certain subgroups were oversampled relative to their actual proportion in the population. So the survey respondent pool is not representative, at face value, of the entire population. But it was designed so that the researchers know how it differs from the overall population of interest in terms of overrepresented subgroups, and then the results can be weighted back to that original population distribution. So when they talk about weighted results here, we haven't shown how to do this, but it's an extension of the methods we've learned in the class.

I'm just going to show you a snippet of the table, and I'm going to focus on neighborhood safety, because that's one of the characteristics they talked about in the abstract. What they give to start is an odds ratio, in this case of a child being obese, for children from unsafe neighborhoods compared to safe neighborhoods, by some index they used in the article, adjusted for age and sex. You can see that the estimated adjusted odds ratio is 1.61: among children of the same age and sex, children from unsafe neighborhoods have 61% higher odds of being obese than children from safe neighborhoods. They don't actually put confidence limits on this, and I am curious as to why. So we don't know whether this is statistically significant or not, but it is certainly an estimated increase in this study of 61%. And that's the kind of number they refer to in the abstract.

What's interesting, though, is that in the next layer, when they adjust for other things above and beyond age and sex, this overall elevated estimate reduces to 1.05. So it appears that the neighborhood safety association seen after adjusting for age and sex is being explained by other characteristics, above and beyond age and sex, which they adjusted for in this second regression model. So let's see what they did adjust for to make sense of this.

I am going to zoom in on the footnotes, which were extensive. I'll let you read through these; they go on to define obesity and overweight, which we didn't show. And then they go on to say, for the column we first looked at, the age and sex adjusted one, that it was adjusted by logistic regression for age and sex only. And the second column, which said covariate, implying covariate adjusted, was adjusted for age and sex, but additionally for race, ethnicity, household composition, metropolitan or non-metropolitan residence, household poverty or education levels, TV viewing time, recreational computer use, and physical activity. Now, I'm only showing you a portion of these results because they were too much to fit on one slide. But the gist was similar for the other indices: where the associations looked large between less desirable and more desirable neighborhoods, a lot of that was attenuated after adjusting for some of these other characteristics.
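The arithmetic behind "61% higher odds" and the attenuation between the two columns can be sketched as follows. The two odds ratios are the ones quoted in the lecture; the attenuation measure on the log odds scale is just one rough way to summarize the change, added here for illustration and not something the article reports.

```python
import math

def percent_change_in_odds(odds_ratio):
    """Percent difference in odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

# Odds ratios for obesity, unsafe versus safe neighborhoods, as quoted in the lecture.
age_sex_adjusted = 1.61
covariate_adjusted = 1.05

print(f"age/sex adjusted: {percent_change_in_odds(age_sex_adjusted):.0f}% higher odds")
print(f"covariate adjusted: {percent_change_in_odds(covariate_adjusted):.0f}% higher odds")

# Rough summary (an illustrative choice): the share of the log odds ratio
# that disappears after the additional covariates are added.
attenuation = 1.0 - math.log(covariate_adjusted) / math.log(age_sex_adjusted)
print(f"share of the log odds ratio attenuated: {attenuation:.0%}")
```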

So let's go back and remind you what was in that table. The first column we were looking at was the odds ratio of, for example, being obese for children in unsafe neighborhoods compared to safe neighborhoods, adjusted only for age and sex. And this is the one adjusted for all those aforementioned other predictors above and beyond age and sex.

So for the next article, in order to be brief and fit the title into the heading, I used a lot of acronyms: HIV, HBV, HCV in IVDUs. So this is HIV, this is hepatitis B virus, this is hepatitis C virus, and IVDU stands for intravenous drug users. If you look at the objectives from the abstract, it says: we examined HIV, HBV, and HCV seroprevalence in an interim analysis, and the potential risk factors associated with these infections, among injection drug users residing in non-urban communities of southwestern Connecticut.

And they go on in their methods to say: we recruited and interviewed active adult IVDUs about injection-associated risk, and conducted serological tests for HIV, HBV, and HCV. Regression analyses were performed to identify risk factors for infection with any of these viruses and for coinfection with more than one of them. They go on to say in the results that among the 446, 51.6% carried at least one infection. So over half the sample had at least one of these three.

And 16.3% were coinfected, that is, had two or more. They go on to say infection risk was associated with longer duration of injection use, overdose, substance abuse treatment, depression, and involvement with the criminal justice system. And coinfection was associated with longer injection drug use, lower education, overdose, and criminal justice involvement. Multivariate models identified drug use duration, substance abuse treatment, and criminal justice involvement as the most significant predictors of infection.

In other words, these are the things that remained significant in a multiple logistic regression model when other factors were considered. And they go on to say injection drug use duration and education were the most significant predictors of coinfection.

Let's look at their methods section. It's a little blurry, I apologize, but I'll read this to you, and I'll just talk about some of the statistics. They talk about how they conducted the face-to-face interviews to get these data, and they used two different software packages, SPSS and SAS. And they say descriptive statistics were generated to characterize the study sample, and three sets of analyses were conducted corresponding to the study questions. First, we determined the individual prevalence for each of the three viruses for all participants for whom serological data were available. So in other words, they computed p hats, if you will, for the sample. Second, in this group, we determined the prevalence and risk factors for being infected, i.e., seropositive for one or more of the three viruses. Third, among those who tested positive for one or more viruses, we determined the prevalence and risk factors for being coinfected, positive for at least one additional virus.

They go on to say: for each outcome we initially conducted bivariate analyses to determine significant associations. That is to say, they looked at unadjusted associations between the infection or coinfection and each potential predictor on its own. Where proportions of continuous or categorical variables were compared, t-test and chi-squared statistics were given. And odds ratios were computed using, and I think this may be a mistake, analysis of variance, because that's not used to compute odds ratios, and Mantel-Haenszel methods, unadjusted and adjusted for age, respectively. Mantel-Haenszel is a method for adjusting that gives very similar results to logistic regression.
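For readers who have not seen it, the Mantel-Haenszel adjusted odds ratio pools stratum-specific 2x2 tables (here, one table per age stratum) into a single estimate. The sketch below implements the standard pooled estimator; the counts are hypothetical and the function name is illustrative, since the lecture does not show the article's stratified data.

```python
def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio across strata.

    Each stratum is a tuple (a, b, c, d):
      a = exposed and infected,    b = exposed and not infected,
      c = unexposed and infected,  d = unexposed and not infected.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Hypothetical counts for two age strata (NOT from the article).
strata_by_age = [
    (10, 20, 5, 40),   # younger participants
    (30, 25, 10, 20),  # older participants
]

# Crude odds ratio from the collapsed table, ignoring age.
a, b, c, d = (sum(column) for column in zip(*strata_by_age))
crude = (a * d) / (b * c)
adjusted = mantel_haenszel_odds_ratio(strata_by_age)
print(f"crude OR {crude:.2f}, age-adjusted (Mantel-Haenszel) OR {adjusted:.2f}")
```

With a single stratum the estimator reduces to the ordinary odds ratio ad/bc, which is a quick way to check the function.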

They could have done the same sort of adjustment with logistic regression. But in what they actually present in their tables, and we'll see them, they first present the unadjusted association between infection and each of the predictors. Then they present each association again, adjusted only for age. They go on to say that factors associated with positive serologies in the bivariate analyses at an alpha level of less than 0.1, before adjustment for age, were then included for consideration in a multivariate logistic regression model that was constructed using a backwards elimination method with a significance level of alpha less than 0.05. So what they did was take all candidate predictors that had a p-value of less than 0.1 in the unadjusted analyses and put them in a large model. Then they started removing the ones that were not statistically significant at the 0.05 level. And their final model included all predictors that stayed significant in the multivariate model.
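The two-stage selection just described (screen on unadjusted p-values at 0.1, then backward elimination at 0.05) can be sketched generically. This is an illustration of the procedure, not the authors' SPSS or SAS code. The names `screen`, `backward_elimination`, and `toy_fit_p_values`, the predictor names, and all p-values are hypothetical; a real `fit_p_values` would refit the logistic regression each time a predictor is dropped.

```python
def screen(bivariate_p_values, alpha=0.10):
    """Keep predictors whose unadjusted p-value is below alpha."""
    return [name for name, p in bivariate_p_values.items() if p < alpha]

def backward_elimination(candidates, fit_p_values, alpha=0.05):
    """Drop the least significant predictor until all remaining ones have p < alpha.

    fit_p_values(predictors) must refit the model on the given predictors
    and return a dict mapping each predictor to its p-value.
    """
    kept = list(candidates)
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # everything left is significant
        kept.remove(worst)
    return kept

def toy_fit_p_values(predictors):
    """Hypothetical stand-in for refitting a model (fixed, made-up p-values)."""
    made_up = {"injection_years": 0.001, "treatment": 0.03,
               "income": 0.40, "employment": 0.20}
    return {name: made_up[name] for name in predictors}

# Hypothetical unadjusted p-values for the screening stage.
bivariate = {"injection_years": 0.001, "treatment": 0.01,
             "income": 0.08, "employment": 0.04, "tattoos": 0.50}

candidates = screen(bivariate)
final_model = backward_elimination(candidates, toy_fit_p_values)
print(candidates)
print(final_model)
```

Passing the fitting step in as a function keeps the selection loop independent of any particular statistics package.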

So the table they present is very large. I'm going to show some highlights, and there's also one interesting thing about it. Here are the unadjusted associations, the crude ones. These are adjusted only for age; each of the associations is adjusted for age. And then these are the results from the multivariate model, where everything is adjusted for everything else that stayed in the model. One thing you'll notice, though, is that they're talking about age, but they don't actually report any odds ratios associated with age, interestingly enough. I want to come back to that. It threw me, actually, when I first looked at this table, until I went back and read the methods section more carefully.

But let's just look at some examples of the comparisons they make. The outcome here is the risk of being infected with at least one virus, so these are the results of the logistic regression. Take employment. Unadjusted, unemployment is associated with 60% higher odds of being infected with at least one virus, and this was statistically significant: the confidence interval does not include one.

After adjusting for age there's still this positive association, but it's no longer statistically significant. Those who were unemployed had an estimated 36% greater odds of being infected than those who weren't, but it was no longer statistically significant after adjustment for age.

They also look at the results for monthly income, and they categorized this into four different groups based on U.S. dollars: those who make less than $500, $500 to $999, $1,000 to $1,999, and greater than $2,000, and they used the greater than $2,000 group as the reference. We can see that while the three lower income groups had estimated odds higher than the reference to varying degrees, it wasn't a dose response pattern. For example, there was only a 9% higher estimated odds for the lowest income group compared to the highest, versus a 107% increase for the next group, $500 to $999, relative to the reference. But all of these confidence intervals included the null value of one. And in fact, the overall test for whether there were any differences in the odds or risk of being infected across any of the income groups was not statistically significant.

The estimates remain about the same, changing slightly after adjustment for age, but still were not statistically significant. And so on and so forth. This is, like I say, only part of one of the tables.

So they then go on to show the multivariate results, and this is kind of anticlimactic, because there is nothing here that remains statistically significant in the multivariate model except for the duration of injection drug use.

This was modeled as a function of time, in years. And what they found, and this is the odds ratio adjusted for the other things that remained significant, is that greater time in injection drug use was associated with higher odds of infection.

And this was statistically significant after adjusting for the other things that were included in the multivariable model, which none of these earlier things were.

So I'm going to bring up another snippet so we can look at some other things here. You can see there are other factors they considered, like history of an overdose, the proportion of tattoos done in noncommercial venues, and whether they got substance use treatment or not. So let's hone in on substance use treatment. Unadjusted, interestingly enough, having had any substance use treatment was statistically significantly associated with higher odds of infection compared to those who hadn't had any. That could be connected to things like the duration and intensity of drug use, and that may be what explains that increase. It stays statistically significant after adjusting for age. And interestingly enough, it stays statistically significantly higher in the multivariate model, although the estimates attenuate a little bit, indicating that some of that increase may have been explained by the other factors. Well, let's look at it. It was 2.76 unadjusted. After adjusting for age alone, it was still sizeably larger than the odds for the reference group, but by a slightly smaller amount, 2.33. And then it went down to 2.24, not much more of a shift, after adjusting for the other things in this multivariate model. To get the whole scope of what's in this multivariate model, you'd want to look at the article and see this table, which crosses two pages, and I can't put it all on here. This is just an attempt to show you some of the highlights.

So for some of the continuous measures, like duration of drug use, they ultimately put an odds ratio in the multivariate model. But for age, they never put any odds ratio. I couldn't figure out why, and I was a little confused by how they presented it. And then I went back and read the fine print. You usually have to read an article two or three times and look at the tables to really get what's going on. So, they go on to repeat what we had just said about substance abuse: in this multivariate model, participants who had a history of being enrolled in substance abuse treatment were more than twice as likely to be infected. Now, that's a little bit of a stretch, because this is not a relative risk; it's an odds ratio. They had more than twice the odds of being infected. And then they go on to say each additional year of injection drug use conferred a significant positive risk of infection, b equals 1.098. When I first read it, I read it as if I understood it. But I want to give you a heads up: the b here is a slope per year of injection drug use, not an odds ratio.

So if we wanted to get the odds ratio, we'd need to exponentiate that. That's a little confusing and a bit of a non sequitur: why did they present this as a slope when the other things were presented as ratios? They also said additional arrests and time spent incarcerated were similarly associated with additional risk, and they give a slope and a confidence interval for the slope. So it seems that they didn't want to put in odds ratios for continuous predictors and left those out of parts of the table, or what they put in was in fact the slope from logistic regression. So they're mixing metrics in that table, if I understand this correctly. That's a little confusing, and I don't know what their apprehension was regarding putting the actual odds ratio in.
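Exponentiating a logistic regression slope is a one-line computation. The sketch below applies it to the value quoted in the lecture, under the lecture's reading that b = 1.098 is a slope on the log odds scale per year of injection drug use; if the article actually reported an odds ratio there, this conversion would not apply. The function name is illustrative.

```python
import math

def odds_ratio_for_difference(slope, difference=1.0):
    """Odds ratio for a given difference in a continuous predictor."""
    return math.exp(slope * difference)

# Value quoted in the lecture, read as a log odds slope per year (an assumption).
reported_b = 1.098

odds_ratio_per_year = odds_ratio_for_difference(reported_b)
print(f"odds ratio per additional year: {odds_ratio_per_year:.2f}")

# The same slope implies a multiplicative effect over longer spans.
print(f"odds ratio per two additional years: "
      f"{odds_ratio_for_difference(reported_b, 2.0):.2f}")
```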

So it's just worth noting that sometimes you have to read the fine print, or even the print right in the main text, to understand exactly what's being shown in the table.

Let's look at a case-control example. We haven't done much with case-control studies in this class, other than to indicate that we could use odds ratios to summarize outcome-exposure relationships even when direct estimates of risk could not be legitimately computed because of the way the study was sampled. So this is a case-control study that will show that logistic regression can be used to estimate odds ratios. This came from the Lancet, and it was called Hazardous Alcohol Drinking and Premature Mortality in Russia: A Population Based Case-Control Study. The summary says the reasons for low life expectancy in Russian men and large fluctuations in mortality are unknown. We investigated the contribution of alcohol, and hazardous drinking in particular, to male mortality in a typical Russian city.

So cases were all deaths in men aged 25 to 54 years living in this city, the name of which I will spare you my inability to pronounce, occurring between October 20th, 2003 and October 3rd, 2005. Controls were selected at random from the city population and were frequency matched to deaths by age. So they did a little preemptive strike to minimize confounding by age: they matched a fixed number of controls, by age range, to each age range of the cases. And they did interviews with proxy informants living in the same household as the cases, because the cases were deceased, to ascertain the alcohol usage history for the cases. They also surveyed the controls. And they say: we ascertained frequency and usual amount of beer, wine, and spirits consumed, and frequency of consumption of manufactured ethanol-based liquids not intended to be drunk, that is, non-beverage alcohol (things like mouthwash or cleaning fluids, for example), and other markers of problem drinking.

Complete information on the markers of problem drinking, frequency of alcohol consumption, education, and smoking was available for 1,468 cases and 1,496 controls. And they go on to say in their findings that over 51% of the cases were classified as problem drinkers or drank non-beverage alcohol, compared with 13% of the controls. The mortality odds ratio for these men, compared with those who either abstained or were non-problematic beverage drinkers, was 6.0, with a 95% confidence interval from 5 to 7.3, after adjustment for smoking and education. They go on to report some of the other adjusted mortality odds ratios, and they can do this and report odds ratios even though it's a case-control study. They couldn't use the results of the analysis to estimate the proportion, or risk, of death for each of these groups, but they could estimate the odds ratios, unadjusted and adjusted. What they go on to say in their article about this is that logistic regression was used to estimate the strength of association of factors with mortality, with all analyses done in Stata. In all models, age was included in six five-year categories. Education, smoking, and marital status were treated as potential confounders and, where appropriate, were introduced into models as categorical variables. So that's something we're familiar with now, putting categorical variables into a regression model.
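As a rough check on the scale of that result, the crude odds ratio can be computed directly from the two exposure proportions quoted in the findings. This is a sketch using the rounded percentages from the lecture (51% and 13%), so it is a crude, unadjusted figure and is not expected to match the adjusted estimate of 6.0. The function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a proportion p."""
    return p / (1.0 - p)

def exposure_odds_ratio(p_exposed_cases, p_exposed_controls):
    """Odds of exposure among cases relative to odds of exposure among controls."""
    return odds(p_exposed_cases) / odds(p_exposed_controls)

# Rounded proportions of problem or non-beverage drinkers quoted in the lecture.
crude_or = exposure_odds_ratio(0.51, 0.13)
print(f"crude odds ratio: {crude_or:.2f}")
```

The calculation never uses the proportion of men who died, which is fixed by the sampling design in a case-control study. That is why the odds ratio is estimable here while the risk of death in each group is not.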

So this first model gives the odds ratios. This one looks at the mean volume of ethanol consumed per week from beverages, from actual alcohol intended to be drunk.

And they do it by bottle: greater than or equal to four bottles, two to four bottles, one to two bottles, half to one bottle, or less than half a bottle. The reference group for these comparisons was the smallest consumption group, less than half a bottle. Then they also include beverage non-drinkers, people who didn't drink alcoholic beverages but may have drunk alcohol from non-beverage sources. So you can see, if we look at this first model, that those who drank greater than or equal to four bottles per week had 6.8 times the odds of mortality compared to the reference group, and it was statistically significant. This model only adjusted for age.

And they went on to describe the other models. Model 2 adjusted for age and the other variable in the table, which is the frequency of non-beverage alcohol drinking. Model 3 adjusted for all variables in Model 2 plus smoking and education. And then Model 4 adjusted for all variables in Model 3 plus marital status. So what we see here is this association, essentially a dose response, because higher consumption is associated with higher mortality, though not always statistically significantly so for each of these categories compared to the reference group. This stays after adjustment, but the estimate attenuates a fair amount, especially for the greater than four bottles group relative to the reference group, once we start adjusting for the frequency of non-beverage alcohol drinking.

The odds ratios are much larger for non-beverage alcohol, comparing those who drank it daily with those who never or almost never did, which was the reference group. The relative odds adjusting only for age was 30.5, roughly a 2,950% increase over the reference group, and it was statistically significant. It attenuated a bit after the different layers of further adjustment, but still remained above 20. So that was a pretty notable risk factor for mortality, as measured through the odds ratio. And as there was with regular alcohol consumption, there was a dose response: decreasing consumption of non-beverage alcohol was associated with decreasing mortality. But the ratios by which the comparisons are made to the reference group are sizeably larger than they were for the beverage alcohol consumption groups.

So hopefully this is giving you some insight. I just wanted to share one more example with you. I am not going to name this article, but some authors and some journals still report results like this, despite the fact that it's a public service for authors to put the results in terms of odds ratios, and really to include information about the intercept as well, so that readers could compute predicted probabilities if they were interested. Or the authors could put in some information, like we saw in that first article, regarding predicted probabilities.

But this is a logistic regression table from an unnamed article, and what they report here are the slopes. So look at sex. We don't even know what the reference is; we don't know if this compares males to females or females to males. They give the slope out to eight decimal places, and it's statistically significant. They give a standard error, so we could get a confidence interval for the slope, and we could convert these things to odds ratios and confidence intervals. We know how to do it. But really we shouldn't have to do that to understand the results of an article. Similarly, for all these other things, they don't necessarily define what they mean. Age: we don't know what the unit is. Maybe it's years. Stage: well, it looks like they might be treating it as continuous, but the table doesn't explain it. And so on.

So anyway, this is just an example of how not to present results. You are capable of making at least some sense of it. I mean, there's not enough information here to tell the full story; we don't even know how sex is coded. But we could convert these results to adjusted odds ratios and confidence intervals by doing a little math. It shouldn't be incumbent on us to take out our calculators in order to get the results in a form that has a reasonably consistent meaning for most researchers.
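The "little math" mentioned here, converting a reported slope and standard error into an odds ratio with a 95% confidence interval, looks like this. The slope and standard error below are hypothetical, since the unnamed article's numbers are not given in the lecture, and the function name is illustrative.

```python
import math

def odds_ratio_with_ci(slope, standard_error, z=1.96):
    """Odds ratio and approximate 95% CI from a logistic regression slope.

    The interval is built on the log odds scale and then exponentiated.
    """
    estimate = math.exp(slope)
    lower = math.exp(slope - z * standard_error)
    upper = math.exp(slope + z * standard_error)
    return estimate, lower, upper

# Hypothetical slope and standard error (NOT from the unnamed article).
estimate, lower, upper = odds_ratio_with_ci(slope=0.5, standard_error=0.2)
print(f"odds ratio {estimate:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")

# The result is statistically significant at the 0.05 level when
# the interval excludes the null value of one.
print("interval excludes 1:", lower > 1.0 or upper < 1.0)
```

Note that this recovers the size of the association but not its direction in words: without knowing how a predictor such as sex is coded, the reader still cannot say which group has the higher odds.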
