So, in that specific example, we looked at what's the probability that the random
variable was larger than six. But we might also want to look at the probability that the random variable's larger than seven, or smaller than six, or smaller than five, or smaller than 4.3, and so on.
So if you take a random variable, you could construct a function that, when you plug in a value, returns the probability that the random variable is less than that value. And you could construct a function that, when you plug in a value, returns the probability that the random variable is larger than that value. These functions are so inherently useful that we give them names. The cumulative distribution function, or CDF, is simply the function that takes any specific value and returns the probability that the random variable is less than that value.
And again, if the random variable is continuous, it doesn't matter whether we say less than or equal to, or strictly less than. But the cumulative distribution function is defined for both continuous and discrete random variables, so let's be specific and say that it's less than or equal to.
The survival function is the opposite: it is exactly the probability that the random variable is larger than any specific value.
So if you plug in x into the survival function, it returns the probability that
the random variable is larger than x. So on our previous slide, in this figure, imagine that on the horizontal axis, instead of looking at six, I was looking at some arbitrary value x. The gray area would be S of x, and the white area between the vertical axis and x would be F of x. Notice in this case that F is the probability of being less than or equal to x, and S is the probability of being strictly greater than x, so S of x and F of x have to add up to one, because they are the probabilities of complementary events: the probability that X is less than or equal to x, and the probability that X is strictly greater than x. So if you've calculated the cumulative distribution function, you've also calculated the survival function, because all you have to do is take one minus it; conversely, if you've calculated the survival function, then you've calculated the cumulative distribution function.
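To make that complement relationship concrete, here is a minimal sketch in Python, hard-coding the exponential density with mean 5 from the example we've been considering (the function names are just illustrative):

```python
import math

def cdf(x):
    """F(x) = P(X <= x) for the exponential density f(t) = exp(-t/5)/5."""
    return 1 - math.exp(-x / 5)

def survival(x):
    """S(x) = P(X > x), the complement of the CDF."""
    return math.exp(-x / 5)

# Complementary events: the two probabilities always add up to one.
for x in [0.5, 2.0, 6.0, 10.0]:
    assert abs(cdf(x) + survival(x) - 1) < 1e-12
```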
Next we'll just go through our previous example and calculate exactly the survival
function and the CDF. Let's actually go through an example of
calculating the survival function and cumulative distribution function, just
from the exponential density that we considered before.
Let's calculate the survival function first.
So recall the survival function is the probability that the random variable's
strictly greater than the value lowercase x.
So further recall that to calculate probabilities, we need to calculate areas
under the probability density function. In this case we want the probability of being x or larger, so let's take the integral from x to infinity of the probability density function. Here I use the dummy variable t for integration. The integral, or antiderivative, of e to the −t/5, divided by 5, is just −e to the −t/5.
We evaluate that antiderivative between the limits: as t goes to infinity it goes to zero, and at t equals x it is −e to the −x/5; taking the difference, we wind up with e to the −x/5. Now we could also go through the example
of calculating the cumulative distribution function: instead of the integral from x to infinity, we would be calculating the integral from zero up to x. But because we have already calculated the survival function, we know the CDF is one minus the survival function, so it's 1 − e to the −x/5. So the cumulative distribution function is
the integral from minus infinity to x of the probability density function.
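Written out, with the density f(t) = e^(−t/5)/5 from this example, the two calculations are:

```latex
S(x) = \int_x^\infty \frac{1}{5} e^{-t/5}\,dt
     = \Bigl[-e^{-t/5}\Bigr]_x^\infty
     = 0 - \bigl(-e^{-x/5}\bigr) = e^{-x/5},
\qquad
F(x) = \int_0^x \frac{1}{5} e^{-t/5}\,dt = 1 - e^{-x/5} = 1 - S(x).
```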
Again, here we're just starting from zero, because the integral from minus infinity to zero is zero. And so we can apply the fundamental theorem of calculus and note that the derivative of the CDF is exactly the density again. To go through our specific example: take 1 − e to the −x/5 and take the derivative of that, and we get exactly e to the −x/5 divided by 5.
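That fundamental-theorem relationship is easy to check numerically; here is a rough sketch in Python, using a central-difference derivative with the CDF and density of this example hard-coded:

```python
import math

def F(x):
    """CDF from the example: 1 - exp(-x/5)."""
    return 1 - math.exp(-x / 5)

def f(x):
    """Density from the example: exp(-x/5) / 5."""
    return math.exp(-x / 5) / 5

# The numerical derivative of F should match the density f.
h = 1e-6
for x in [0.5, 2.0, 7.0]:
    central_diff = (F(x + h) - F(x - h)) / (2 * h)
    assert abs(central_diff - f(x)) < 1e-8
```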
So we get the PDF back: derivatives of the cumulative distribution function exactly yield the probability density function. Quantiles are properties of distributions
or, equivalently, of density functions. When I talk about the distribution or density in general, I'll maybe just say the word distribution; so if I want to talk about the bell curve or the associated distribution, I will just talk about the Gaussian distribution or the normal distribution, and so on. When we are talking about the mathematics, I will be more specific.
The alpha quantile of a distribution is the point such that the probability of being less than that point is exactly alpha. So if x sub alpha is the alpha-th quantile of the distribution, we want the probability of being less than or equal to x sub alpha to be exactly alpha. Let's take a specific example: if alpha were 0.25, then x sub 0.25 is the point such that the probability of being less than it is 25%. So for example, in our cancer survival example, the 0.25 quantile of that distribution is the survival time such that 25 percent of the people survive less than that time.
The percentile is merely the quantile expressed as a percent, so the twenty-fifth percentile is the 0.25 quantile. And then the median, the population median, is exactly the fiftieth percentile. Let's just go through these concepts again with the density that we've been looking at, this exponential density.
Suppose we wanted to find the twenty-fifth percentile of the exponential survival distribution. What we want is to find the point x on the horizontal axis such that the white area to the left of it is 0.25. So let's actually go through this calculation. To find the point such that the area to the left of it is 0.25, we just want to solve the equation 0.25 = F of x. Recall that a couple of slides ago we found that F of x is 1 − e to the −x/5. If you simply solve that for x, you wind up with the solution x = −log(0.75) times 5, which is about 1.44.
How is that 1.44 interpreted? About 25 percent of the subjects from this population live less than 1.44 years.
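As a sanity check, here is a small sketch of that quantile calculation in Python, using only the standard library (R's quantile function for the exponential distribution would give the same number directly):

```python
import math

# Solve 0.25 = 1 - exp(-x/5) for x: the 0.25 quantile (lower quartile).
x_25 = -math.log(0.75) * 5
assert abs(x_25 - 1.44) < 0.01          # about 1.44 years

# Plugging it back into the CDF recovers exactly 0.25.
assert abs((1 - math.exp(-x_25 / 5)) - 0.25) < 1e-12
```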
You can get quantiles directly from R with the q-prefixed functions, qexp in this case, because we are talking about the exponential distribution. So qexp gives you quantiles from the exponential distribution, pexp gives you CDF or survival probabilities from the exponential, and dexp gives you the density itself, and R follows that rule for most of the common distributions. The median, to remind you, is the 0.5 quantile, the fiftieth percentile. And the 0.25 quantile that we just figured out is generally called the lower quartile. Now you might say: I've heard of the median before, and maybe I've heard of what a lower quartile is, and what those things mean to me is the middle of the data, the point in the data such that 50 percent of the observations are lower than it; the lower quartile is the point in the data such that 25 percent of the observations are below it. What in the world is Bryan talking about at this point? Well, the median that we are discussing in this lecture is the population median.
When you collect data and take a sample median, that's an estimate of something, so we should talk about what it's an estimator of. Right: it's an estimator, and it has to have an estimand. In the same way, if we take a sample mean of data, that's an estimator of something, and it has to have an estimand. So what we're talking about in this lecture is one way to construct estimands for these quantities.
In this case, the median, if you take the sample median, it is hopefully trying to
estimate the population median, that point in the population so that the probability
of being less than it is 50%. And you'll find in this class that there's
this simple rule. Sample things tend to estimate population
things. So sample medians estimate population
medians. Sample variances estimate population
variances. Sample means estimate population means,
and so on. And what we are going to see is that this probability modelling and the associated assumptions are the things that connect our data to the population, so that we can actually have estimands. If we didn't go through this exercise, we would still be able to take a median, but it would just be an entity in a sample. The whole point of probability modelling is that it connects your sample to the population, so that your sample median now has a population median that it's trying to estimate.
Now, this is kind of a difficult concept. I think the sample median is a very easy concept: you have a list of observations, you order them, and you take the middle one. The population median is a much more difficult concept. It says: I have described a population via this distribution, and this distribution has a point such that 50 percent of observations lie below it, and that's the population median.
And I think it's a good idea, whenever you're talking in this class, to put the word population or sample in front of the term to remind yourself. Now, people who work in statistics do this so much that they just kind of forget about these distinctions; even though they know them, they forget about them because they become second nature. But when you're first learning this, it seems quite odd. I also want to mention that the sample median is a well-defined quantity that doesn't require tons of assumptions. It's the probability modelling that's the delicate part, that requires assumptions.
So if you are going to say that the sample median estimates the population median, certain assumptions need to be taken into account for that to be true, especially when you want to do inference with your sample median or evaluate its uncertainty. And that's basically what we are going to spend nearly all of this class discussing: how we connect these probability and population concepts to sample data. Thanks, recruits.
This was mathematical statistics boot camp lecture two.
In the next lecture, we're going to expand on probability modeling and defining
characteristics of probabilities, and I look forward to seeing you.