A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 1B: More Simple Regression Methods

In this module, more detail is given regarding Cox regression, and its similarities to and differences from the other two regression models from module 1A. The basic structure of the model is detailed, as well as its assumptions, and multiple examples are presented.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Hi, and welcome back.

In this section, we'll talk briefly about accounting for

the uncertainty in our slope estimates from Cox regression.

I don't think we'll see anything that will surprise you here.

And then, we'll also talk briefly about translating Cox regression results into

predicted survival curves.

So upon completion of this lecture section hopefully you'll be able to

create 95% confidence intervals for the slopes from simple Cox regression and

convert these to 95% confidence intervals for

the hazard or incidence rate ratio for the predictor in the model.

You'll be able to estimate P values for testing the null hypothesis that the slope

or log hazard ratio is zero and hence the exponentiated slope or hazard

ratio is one indicating no association between our predictor and the outcome.

And then, you'll also, hopefully,

appreciate that the results from Cox regression models can be translated into

estimated survival curves for each of the groups defined by our predictor.

So, a randomized clinical trial; we're very familiar with this example.

It enrolled 312 patients with primary biliary cirrhosis. We've looked at this before;

this is data from the Mayo Clinic.

And in the last section we took bilirubin as a continuous measure and used Cox

regression to estimate the association between mortality and the bilirubin

level at enrollment in the study measured in milligrams per deciliter.

And we found the following result.

The log hazard of mortality at a given time was equal to

the log hazard of mortality at baseline, that is,

the log hazard of mortality in our referent group, plus

0.15 times the bilirubin level, in milligrams per

deciliter, of the group we're estimating it for.

And so we saw this slope,

beta-1, of 0.15 was the estimated difference in

the log hazard of mortality at any given time in the follow-up period for

two groups who differed by 1 mg/dL in bilirubin levels.

And if we exponentiated this to get the estimated hazard ratio, it turned out to

be about 1.16. So if we compare two groups whose bilirubin levels

differed by 1 milligram per deciliter, the group with the higher level had 16 percent

higher estimated mortality over the follow-up period

compared to the group with a 1 milligram per deciliter lower level of bilirubin.
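To make the step from slope to hazard ratio concrete, here is a minimal Python sketch. It assumes only the slope of 0.15 quoted in the lecture; the variable names are illustrative and not from the course.

```python
import math

# Slope (log hazard ratio per 1 mg/dL of bilirubin) quoted in the lecture.
beta_hat = 0.15

# Exponentiating the slope gives the estimated hazard ratio.
hazard_ratio = math.exp(beta_hat)

# Percent higher hazard for the group with 1 mg/dL more bilirubin.
percent_increase = (hazard_ratio - 1) * 100

print(f"hazard ratio = {hazard_ratio:.2f}")
print(f"percent increase = {percent_increase:.0f}%")
```

The printed ratio rounds to the 1.16 discussed above.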

So how do we get these estimates?

Well, of course, we go to the computer and there must be

some algorithm that always yields the same results for the same data set, and

the algorithm is really similar to what we saw with logistic regression.

For Cox regression, it's actually called partial maximum likelihood,

because there's a part of the model that uses a different method of estimation,

the part that estimates the log of lambda-naught of t,

the baseline log hazard as a function of time.

And then the part that estimates the slope uses the maximum likelihood approach.

So, this part that estimates the slope, under the assumption of

proportional hazards, takes the value of beta-1 that

makes our observed data and outcomes most likely among all choices.

So this is similar to the approach used for logistic regression.

And this approach also estimates the standard error for

the slope estimate, and this has to be done by computer.

There's no easy closed form equation that we can solve by hand, nor

would we want to with a large data set.
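For intuition about what the computer is doing for the slope, here is a rough Python sketch of the partial likelihood idea for a single predictor. Several things here are assumptions rather than anything from the lecture: the eight-subject data set is made up (it is not the Mayo Clinic data), there are no tied event times, and Newton-Raphson is just one common way to find the maximizing slope. Real software handles ties and many other details.

```python
import math

# HYPOTHETICAL toy data (not the Mayo Clinic data): follow-up time,
# event indicator (1 = died, 0 = censored), and one predictor x.
times = [2.0, 3.0, 5.0, 7.0, 8.0, 11.0, 12.0, 15.0]
events = [1, 1, 0, 1, 1, 0, 1, 0]
x = [3.1, 0.6, 2.4, 2.2, 0.9, 0.7, 1.4, 0.5]

def score_and_information(beta):
    """Derivative of the log partial likelihood, and minus its second derivative."""
    score, info = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(w_j * x[j] for w_j, j in zip(w, risk))
        s2 = sum(w_j * x[j] ** 2 for w_j, j in zip(w, risk))
        mean_x = s1 / s0                 # weighted mean of x in the risk set
        score += x[i] - mean_x
        info += s2 / s0 - mean_x ** 2    # weighted variance of x in the risk set
    return score, info

def fit_cox_one_predictor(start=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson search for the slope that maximizes the partial likelihood."""
    beta = start
    for _ in range(max_iter):
        score, info = score_and_information(beta)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(beta)
    return beta, 1.0 / math.sqrt(info)   # slope and its estimated standard error

beta_hat, se_hat = fit_cox_one_predictor()
print(f"slope = {beta_hat:.3f}, standard error = {se_hat:.3f}")
```

Notice that the baseline hazard never appears in this calculation; only the ordering of the event times and who is still at risk matter, which is the "partial" part.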

So here are the results from the Cox regression with mortality and bilirubin.

We estimated that the log hazard of mortality at any given time t was

equal to some log baseline hazard at that time

plus the slope of 0.15 times the bilirubin level of the group we were looking at,

where x1 measures bilirubin in milligrams per deciliter.

If I actually use the computer, I get an estimated slope,

as we saw, of 0.15, and the estimated standard error for this slope is 0.013.

So we can easily, using the same old approach of taking our estimate plus or

minus two estimated standard errors,

estimate a 95% confidence interval for the true population-level slope, or

the true population-level log hazard ratio of mortality, for

two groups who differ by one milligram per deciliter in bilirubin.

And the population here is people with primary biliary cirrhosis.

So this is our 95 percent confidence interval for the population-level slope, 0.124 to 0.176.

If we exponentiate the endpoints, we get a 95 percent confidence interval

for the hazard ratio of mortality associated with bilirubin.

So putting all our results together, our estimated hazard ratio was 1.16 and

the confidence interval, 1.13 to 1.19.

So we estimate a 16% increase in mortality for a one-unit difference in bilirubin.

But after accounting for the sampling variability,

the true association could be an increase between 13 and 19 percent.

Notice that this confidence interval does not include the null value for

ratios of one, and the previous confidence interval for

the log hazard ratio did not include the null value of zero.
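The confidence interval arithmetic can be sketched in a few lines of Python. This assumes only the slope (0.15) and standard error (0.013) quoted in the lecture, and uses the lecture's "plus or minus two standard errors" rule rather than 1.96; the variable names are illustrative.

```python
import math

beta_hat = 0.15   # estimated slope (log hazard ratio) from the lecture
se_hat = 0.013    # estimated standard error from the lecture

# 95% CI on the log hazard ratio scale: estimate +/- 2 standard errors.
log_lo = beta_hat - 2 * se_hat
log_hi = beta_hat + 2 * se_hat

# Exponentiate the endpoints to move to the hazard ratio scale.
hr_lo, hr_hi = math.exp(log_lo), math.exp(log_hi)

print(f"slope CI: ({log_lo:.3f}, {log_hi:.3f})")
print(f"hazard ratio CI: ({hr_lo:.2f}, {hr_hi:.2f})")
```

The interval on the slope scale excludes 0 and the interval on the ratio scale excludes 1, which is the same conclusion stated two ways.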

How could we get a p value for this?

Well, what is our null hypothesis here?

It would be that this slope, or

this true population-level log hazard ratio, is 0.

Another way to state this is that the exponentiated slope,

which is the population-level hazard ratio, is one.

There in fact is no association between mortality and bilirubin.

So what we do is we start by assuming the null, and then we compute

a distance measure of how far our result is

from the null in terms of standard errors;

this is just like every other distance measure we've computed.

I'm going to call it a z here because that's what you see in textbooks, and

that just means we always use the normal table when getting a p value. But

if you do this out, we get a result that looks like this: 11.5.

So we have a result, an estimate, that is 11.5 standard errors above what we'd expect

to have gotten just by chance when the null is true.

So we'd expect on average our estimates would equal the null of 0,

but there's some variation around that. But even after accounting for

that variation, we have something way far above, 11.5 standard errors

above, what we'd expect.

So, if we compute the p value, it's the probability of being as far

or farther from 0 when the null hypothesis is true;

this p value is way less than 0.001.

The normal table, in fact, doesn't go that high.

And Stata reports it as 0.0000, so that's some very small number.
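As a small sketch of that distance measure and its p value, the following Python uses only the slope and standard error quoted in the lecture. The two-sided p value comes from the standard normal distribution via `math.erfc`; the variable names are illustrative.

```python
import math

beta_hat = 0.15   # estimated slope from the lecture
se_hat = 0.013    # estimated standard error from the lecture

# Distance from the null value of 0, in standard errors.
z = (beta_hat - 0) / se_hat

# Two-sided p value under the standard normal: P(|Z| >= |z|).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.1f}")
print(f"p value = {p_value:.3g}")
```

The p value is far too small to show up in four decimal places, which is why software prints it as 0.0000.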

Suppose we wanted to get a confidence interval for

the estimated hazard ratio for persons with bilirubin levels of

3.5 milligrams per deciliter versus 0.8 milligrams per deciliter.

We already computed this difference on the log hazard scale.

We took the difference in bilirubin levels between the two groups, which was 2.7,

and we multiplied it by the slope, the

difference in the log hazard of mortality per one-unit difference in bilirubin.

And so the cumulative difference in the log hazard of mortality for

a 2.7-unit difference in bilirubin is 0.405.

Well, to get the confidence interval for this, all we're going to do is

take the confidence interval endpoints for the original slope and

multiply each of them by this multiple of 2.7.

So whatever we do to the estimate to get our measure for a unit difference that

is not one on the x scale, we also do to the endpoints of the confidence interval.

And so if we do this, we get the endpoints of the confidence interval.

These go from 0.335 to 0.475.

If we exponentiate them,

we get a confidence interval that goes from 1.40 to 1.61.

So you may recall the exponentiated estimate was 1.5.

So we estimated that the group with bilirubin levels of 3.5 milligrams per

deciliter at study enrollment had a 50 percent greater risk of

mortality than the group with 0.8 milligrams per deciliter, and

this 50 percent elevated risk was across the entire follow-up period. But

after accounting for sampling variation,

this increased risk could be on the order of 40% to 61%.
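The same scaling can be checked with a short Python sketch. It assumes only the numbers quoted in the lecture (slope 0.15, standard error 0.013, bilirubin levels 3.5 and 0.8 mg/dL); the variable names are illustrative.

```python
import math

beta_hat = 0.15   # slope from the lecture
se_hat = 0.013    # standard error from the lecture

# Difference in bilirubin between the two groups being compared (mg/dL).
diff = 3.5 - 0.8

# Scale the estimate and both CI endpoints by the same multiple.
log_hr = diff * beta_hat
log_lo = diff * (beta_hat - 2 * se_hat)
log_hi = diff * (beta_hat + 2 * se_hat)

# Exponentiate to get the hazard ratio and its 95% CI.
hr, hr_lo, hr_hi = math.exp(log_hr), math.exp(log_lo), math.exp(log_hi)

print(f"log hazard ratio = {log_hr:.3f} ({log_lo:.3f}, {log_hi:.3f})")
print(f"hazard ratio = {hr:.2f} ({hr_lo:.2f}, {hr_hi:.2f})")
```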

So how could we actually take these results now and

translate them back to a survival function?

In other words, for a given value of bilirubin, could we translate these

results that are estimated on the log hazard scale back to the proportion who

would survive beyond a given point in time with a specific bilirubin level?

Well, the short answer is, it can be done, but it's mathematically involved.

I'll give you a little window into that in a minute.

But further, when we have a continuous measure like bilirubin,

it's not generally possible to display the estimated survival curves

for all possible values of x1 in a sample,

even if we could theoretically estimate the curve for each value.

You know, our range here was 0.3 to much larger.

So for display purposes, if we were presenting these results in a paper,

we might choose to present the estimated survival curves from

Cox regression for several specific values of bilirubin.

And here's an example of this graph.

I had to use the computer to do this, but what I've plotted here

are the estimated survival curves for

subjects with bilirubin levels at time of enrollment of 0.5, one, two, and three.

And you can see

these curves all have similar shapes after they diverge, and if we were to zoom in

on the early stages of the follow-up period,

we'd see similar shapes as well, but there is this ordered difference:

as bilirubin at baseline increases, the proportion surviving beyond

a certain point in the follow-up time decreases.

How did we get these estimates, or how does the computer do it?

It's not so simple as just plugging in a value of bilirubin and

cranking out a number, because we also have to consider the time element.

So the longer answer,

and I'm just giving you this for

some insight here, is that at any given time t, for any value of x1,

in our case bilirubin, we can plug in the bilirubin value

and get the log hazard of mortality at that time by adding whatever this thing

is at time t,

whatever this starting log hazard is, plus 0.15 times our bilirubin value.

And then we can exponentiate this to get the hazard of mortality

for a group with that bilirubin level at that specific time.

So the instantaneous rate of mortality at that time for

persons with that bilirubin level.

And again, this is just FYI, but

we can get something called the cumulative hazard for

any given group at a specific time in the follow-up period, which estimates the total

amount of risk they've accrued from time zero up through this given time.

And what this is equal to, and

if you haven't had calculus you may not recognize this symbol,

and if you have had calculus you may wish to forget this symbol, but

this is the integral from time zero to time t of the time-specific hazards.

And what this really means, for those of you who

are not intrigued by the wonders of calculus, and

certainly we don't require this for this course,

is that this basically sums up the time-specific risks or

hazards from the start of the study to the current time,

wherever we're looking. And then finally the survival function is equal

to e raised to the negative of the cumulative hazard.
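Written out in symbols, the chain from log hazard to survival looks roughly like this. The notation is an assumption, since the slide is not part of the transcript: lambda-naught of t is the baseline hazard, H is the cumulative hazard, S is the survival function, and 0.15 is the bilirubin slope from the lecture.

```latex
\begin{aligned}
\ln \lambda(t \mid x_1) &= \ln \lambda_0(t) + 0.15\, x_1 \\
\lambda(t \mid x_1) &= \lambda_0(t)\, e^{0.15\, x_1} \\
H(t \mid x_1) &= \int_0^t \lambda(u \mid x_1)\, du \\
S(t \mid x_1) &= e^{-H(t \mid x_1)}
\end{aligned}
```

Because the term involving x1 does not depend on time, it factors out of the integral, so the cumulative hazard for any bilirubin level is just the baseline cumulative hazard multiplied by e raised to 0.15 times x1.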

So let's just think logically about this for a minute.

This percentage, the percentage surviving beyond

a given time t, is going to be smaller

the larger this cumulative hazard is. So even if you don't appreciate this

mathematics, and you don't have to worry about the math, just think of the gist of

this, which says the more risk a group has incurred over time,

the smaller their chances of making it beyond a given time point.

And that's what the mathematics behind this does.

And I obviously won't expect you to do any of this by hand.

I can't do this by hand.

This requires a computer but I just wanted to give you some intuition as to

how we can translate the results from Cox regression back into a function that

tracks the proportion who have not had the outcome over time.

Because remember, our regression involves the element of

time through that baseline log hazard function.
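Here is a minimal Python sketch of that translation under proportional hazards. The slope of 0.15 is from the lecture, but the baseline cumulative hazard values are made up purely for illustration (the real ones come out of the software), and "baseline" here means a bilirubin level of zero. The function and variable names are hypothetical.

```python
import math

beta_hat = 0.15  # slope from the lecture (log hazard ratio per 1 mg/dL)

# HYPOTHETICAL baseline cumulative hazard at a few follow-up times (years).
# These values are invented for illustration, not estimated from any data.
baseline_cum_hazard = {1: 0.02, 3: 0.08, 5: 0.15, 10: 0.40}

def survival(t, bilirubin):
    """Proportion surviving beyond t: exp(-H0(t) * exp(beta * x))."""
    cum_hazard = baseline_cum_hazard[t] * math.exp(beta_hat * bilirubin)
    return math.exp(-cum_hazard)

# Estimated curves at the bilirubin levels plotted in the lecture.
for level in (0.5, 1, 2, 3):
    curve = [round(survival(t, level), 3) for t in sorted(baseline_cum_hazard)]
    print(level, curve)
```

At every time point the higher bilirubin level has more accumulated hazard and so a smaller proportion surviving, which is the ordered pattern seen in the plotted curves.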

Let's look at our example of gestational age and infant mortality.

Recall we had gestational age categorized into five categories so

our baseline group was those who were pre-term, less than 36 weeks.

The group that was 36 weeks to 38 weeks,

those who were 38 weeks to 39 weeks, those

who were 39 weeks to 41 weeks, and those who were greater than or equal to 41.

And so, we got the estimates,

we translated them into hazard ratio estimates before.

But here are the estimates with their accompanying standard errors

from the computer.

And I'm not going to go through the computations here, but

you could easily compute the confidence intervals for

each of these corresponding slopes and hazard ratios by hand.

Here's the end result though.

The reference group here is the pre-term group.

This gives the estimated hazard ratio of mortality anywhere in the follow-up period

for children in each of these other four gestational age categories,

and the resulting confidence interval.

And you could verify that based on the information I gave you before.

But you can see all of these comparisons are statistically significant, as evidenced

by the lack of one in the confidence intervals and the low p values.

So there is a significant reduction in mortality relative to

the reference group for each of these four additional gestational age categories.

However, if you look, you can see that the confidence intervals for

some of the gestational age categories overlap.

So, again, this bolsters the general idea that the big difference

in risk of mortality is making it to full term versus not.

We could, if we wanted to, go through the mathematics or

have the computer do it for us, and

it's actually easier to do in a computer package, and estimate the survival curves

based on these Cox model results for each of the gestational age categories.

And here it is for all five.

One of them is directly on top of another, so

it only looks like there are four curves.

But you can see these results look very similar to what we

got with the Kaplan Meier curves.

And in fact you might ask, how are these different?

Well, let me show you the Kaplan Meier estimates versus

the Cox Regression estimates for two of the groups.

This is for the greater than 41 week group.

And this is for the reference group of less than 36 weeks.

And if you look at these zoomed in,

the Cox regression estimates are the curves that tend to be smoother.

And the Kaplan-Meier are the choppier curves.

And you can see they look similar and they track similar territory, but

they don't exactly agree.

And that's because the Cox regression is estimating the survival curve

having estimated the log hazard under the assumption of proportional hazards.

Whereas the Kaplan Meier just takes the data as is and

doesn't make any assumptions structurally.

So we will get different estimates.
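For contrast, here is a rough Python sketch of what the Kaplan-Meier approach does with one group's data: it only steps down at observed event times and uses no model. The data are hypothetical toy values, not the infant mortality data, and the function name is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(time, estimated proportion surviving beyond time)] at each event time."""
    surv = 1.0
    curve = []
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    for t in event_times:
        # Everyone still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Number who had the event exactly at time t.
        deaths = sum(1 for u, d in zip(times, events) if u == t and d == 1)
        # Multiply by the proportion who made it past time t.
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

# HYPOTHETICAL toy data: follow-up times and event indicators (1 = died, 0 = censored).
times = [2, 3, 5, 7, 8, 11, 12, 15]
events = [1, 1, 0, 1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

The resulting step function is the "choppier" curve; the Cox-based curve is smoother because every group shares one estimated baseline and differs only through the hazard ratio.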

The nicety of using the Cox regression results, as we'll see

when we get to multiple regression, is that we can estimate specific survival

curves for multiple x inputs, not just one at a time, so we can get more

specific in whom we estimate the survival for based on more than one predictor.

So in general, Cox regression slopes are interpretable, as we noted, as log hazard

ratios, and the 95% confidence intervals for the slopes can be constructed, as usual,

by taking our estimate and adding and subtracting two standard errors.

Slopes and the confidence interval endpoints can be exponentiated to

get an estimated hazard ratio and its 95% confidence interval.

And the cumulative survival through a given time t can be estimated from

the results of a Cox regression, but the mathematics requires a computer.

And when our predictor of interest is nominal or ordinal categorical, and

we were to plot the results from the simple Cox regression,

the survival curve estimates versus the Kaplan–Meier, they would look similar but

may differ slightly.

The nice thing about Cox regression above and

beyond the Kaplan–Meier is that we can plot specific survival curve estimates for

very specific values of the continuous variable, like in our bilirubin example.

We didn't have to break bilirubin into groups to present survival curves,

which is what we would have had to do if we used the Kaplan-Meier approach.

So Cox regression allows us, when our predictor is continuous,

to estimate curves for a specific single value of the predictor.
