We're looking at a couple of years of data here.
And then you have particulate matter levels down here at the bottom.
Now you can see there's not as much particulate matter data
as there is mortality data, because there's a lot of missing data.
So, essentially, you just want to ask: is
this top series correlated with the bottom series?
So the question is, can we encode
everything that we found in the statistical
and epidemiological research into a single package?
The answer is yes.
Time series studies like this don't have a huge range of variation,
so they typically involve similar types of data.
It might be hospitalization instead of
mortality, or what not, but it's often very similar.
So can we create a kind of
deterministic statistical machine for this area?
So the basic pipeline looks like this; it's a very simple pipeline.
This is not a very complicated analysis for the most part.
You want to check the data and see if there
are any outliers or high-leverage points.
Pollution data are often skewed, so you want to check for that.
You also want to look for overdispersion.
Should you fill in the missing data? The answer is absolutely not.
There's been a lot of work on that, and it doesn't turn out well.
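As a rough sketch of those checks, here is what they might look like in code. The data are simulated stand-ins, not from the study: a lognormal series for PM10 (to mimic skewed pollution levels) and Poisson counts for daily mortality.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical series: one year of daily PM10 and daily mortality counts
pm10 = rng.lognormal(mean=3.3, sigma=0.5, size=365)
deaths = rng.poisson(lam=40, size=365)

# pollution data are often right-skewed: a quick moment-based skewness check
skewness = np.mean((pm10 - pm10.mean()) ** 3) / pm10.std() ** 3
print(f"PM10 skewness: {skewness:.2f}")   # clearly positive for skewed data

# overdispersion check: for Poisson counts, the variance/mean ratio
# should be near 1; values well above 1 suggest overdispersion
dispersion = deaths.var() / deaths.mean()
print(f"mortality variance/mean ratio: {dispersion:.2f}")
```

In a real analysis these diagnostics would be run on the observed series, and an overdispersion ratio well above 1 would push you toward a quasi-Poisson or negative binomial model.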
The big question here is really model selection. One of
the things that we have to worry about in these types
of time series studies is called unmeasured confounding:
there are a lot of things that vary over time that you don't measure.
So I guess this is like your batch effects.
There are various approaches
to adjusting for these unmeasured confounders.
We use semiparametric regression methods to do this.
Estimating the degrees of freedom has the most profound effect
on any association that you might estimate, so this is critical.
But, you know, there is lots
of research on how to do this.
There have been a couple of papers (one is mine)
comparing the various approaches to estimating
this number of degrees of freedom, and
you can settle on one or
two approaches that are better than the others.
So we can just implement that.
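A minimal sketch of what that model-selection step involves: fit a Poisson regression of daily deaths on pollution plus a smooth function of time, and see how the pollution coefficient moves as the degrees of freedom of the smooth term change. Everything here is hypothetical and simplified — the fitter is a plain-numpy IRLS routine, and Fourier terms stand in for the spline bases typically used in this literature.

```python
import numpy as np

def fit_poisson(X, y, n_iter=20):
    """Fit a Poisson GLM with log link by IRLS (plain-numpy sketch)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu    # working response
        XtW = X.T * mu                  # working weights are mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

def time_basis(t, df):
    """Fourier terms as a simple, well-conditioned stand-in for splines."""
    n = len(t)
    cols = []
    for k in range(1, df + 1):
        cols.append(np.sin(2 * np.pi * k * t / n))
        cols.append(np.cos(2 * np.pi * k * t / n))
    return np.column_stack(cols)

rng = np.random.default_rng(1)
n = 730                                  # two years of daily data
t = np.arange(n)
pm = rng.lognormal(3.3, 0.5, n)          # skewed pollution series (simulated)
# simulated deaths: seasonal trend plus a small pollution effect on the log scale
log_mu = 3.5 + 0.3 * np.sin(2 * np.pi * t / 365) + 0.0005 * (pm - pm.mean())
deaths = rng.poisson(np.exp(log_mu))

# sensitivity of the pollution coefficient to the smoothness of the time trend
for df in (2, 4, 6):
    X = np.column_stack([np.ones(n), pm - pm.mean(), time_basis(t, df)])
    beta = fit_poisson(X, deaths)
    print(f"df={df}: PM log-relative-rate = {beta[1]:+.5f}")
```

The point of the loop is exactly the question raised above: if the estimated association is stable across a reasonable range of degrees of freedom, that is reassuring; if it swings wildly, the unmeasured time-varying confounding is not under control.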
Other aspects of the model tend not to be that important.
Again, whether you adjust for temperature, weather, and other
things, it doesn't really matter much how you do that.
There are other things that you're typically interested
in: multiple-lag analysis and sensitivity analysis.
So you can select a model here, but you want to see
whether, if you move the model back and forth
a little bit, your association changes dramatically, right?
Those are the typical things that you want to see in this kind of analysis.
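A multiple-lag analysis can be sketched very simply: regress today's outcome on pollution from 0, 1, 2, ... days earlier and compare the estimates. The series below are simulated with no real effect, and an ordinary least-squares slope stands in for the full GLM fit, just to show the mechanics of shifting the exposure.

```python
import numpy as np

rng = np.random.default_rng(2)
pm = rng.lognormal(3.3, 0.5, 365)       # daily pollution (hypothetical)
deaths = rng.poisson(40, 365)           # daily deaths (hypothetical, no true effect)
n = len(pm)

slopes = []
# single-lag analyses: today's deaths vs. pollution `lag` days earlier
for lag in range(4):
    x = pm[:n - lag]                    # exposure shifted back by `lag` days
    y = deaths[lag:]
    # least-squares slope as a lightweight stand-in for the full GLM fit
    slope = np.cov(x, y, ddof=0)[0, 1] / x.var()
    slopes.append(slope)
    print(f"lag {lag}: slope = {slope:+.4f}")
```

In a real analysis you would fit the same adjusted model at each lag (or a distributed-lag model across all of them) and report how the association varies with lag.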
And when I review, you know, one time series paper a month,
these are the things that I always ask for.
So. >> Roger.
>> Yeah.
>> Is there a 15-second response to why
imputation is so bad in this setting? >> Oh, because the data
are missing systematically. The pollution data are very difficult to
collect, so they typically only measure it once every six days.
So there are five days missing for every six: one observation every six days.
So you can try to impute it, but you just
add a lot of noise for very little benefit.