0:00
Hello everyone.
In this lecture, we will use the tools we have acquired so far
and try to fit an ARIMA model to a real-life dataset.
The dataset is called Daily Female Births in California in 1959.
So we're going to look at the time series for
the whole year, with one observation for every day.
It's going to be a daily time series,
and we'll have the number of female births on each day in California in 1959.
The objectives are to fit an ARIMA model to a real-world dataset,
to use and compare various fitting tools, such as the ACF, the PACF, and the AIC,
and to examine the Ljung-Box Q-statistic for
testing whether there is autocorrelation in a time series.
0:55
As you remember, this is basically our guideline for modeling a time series.
We're going to check whether there's a trend in the data,
whether there's a change in the variation, and whether we need a transformation or not.
And then we're going to look at the ACF and the PACF,
the Akaike information criterion (AIC), the sum of squared errors, and
also the Ljung-Box Q-statistic.
And at the end of the day,
we're going to use R to estimate the parameters in the model.
1:28
Okay, so this data comes from the Time Series Data Library, TSDL,
which was created by Rob Hyndman,
Professor of Statistics at Monash University, Australia.
There's a link here; you can go directly to the library.
The category that I chose was demography, so this dataset,
Daily Total Female Births in California, is under the category Demography.
It is a daily time series running from January 1, 1959,
to December 31, 1959, for 365 days.
4:21
And we're going to format the date as month, day, and year,
and then we're going to plot the data.
Once we have the births, let's plot our dataset.
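In R, this step looks roughly like the sketch below. The CSV file name and the column names (date and births) are assumptions about how the data was exported from TSDL, since they are not shown in the lecture; adjust them to match your own file.

```r
# Read the data; the file name and column names ("date", "births")
# are assumed here for illustration
births_data <- read.csv("daily-total-female-births-CA.csv")

# Format the date as month/day/year and pull out the counts
births_data$date <- as.Date(births_data$date, format = "%m/%d/%Y")
number_of_births <- births_data$births

# Plot the series: Date on the x-axis, Number of births on the y-axis
plot(births_data$date, number_of_births, type = "l",
     main = "Daily total female births in California, 1959",
     xlab = "Date", ylab = "Number of births")
```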
And when we plot it,
you see this series, Daily Total Female Births in California.
This is our time series:
the x-axis is the Date, and the y-axis is the Number of births.
As you can see, this is definitely not a stationary time series.
There's some kind of trend going on.
So maybe we need to remove the trend later on, but
let's first check the Q-statistic.
In other words, is there autocorrelation
among different lags of this time series?
We're going to use the Box test, as we said.
In Box.test, the first argument is the dataset, and for the lag we
said we're going to use the logarithm of the length of the time series.
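As a sketch, assuming the series is stored in number_of_births as above, the command is:

```r
# Ljung-Box test on the raw series, with the lag set to the
# logarithm of the series length
Box.test(number_of_births, lag = log(length(number_of_births)),
         type = "Ljung-Box")
```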
If we use this Ljung-Box test, we obtain a p-value
that is very, very small,
smaller than any significance level you would choose,
like 0.05 or 0.01.
So there are definitely autocorrelations, so we can fit some model here,
maybe an ARIMA model, and try to pull out the linear part of it.
And then we're going to look again at its residuals;
maybe there's autocorrelation left there, maybe there's not.
We hope that we will get white noise there.
6:12
So let's plot our differenced time series.
This is our differenced time series,
with the date on the x-axis, and we get what may be a stationary time series.
There is a small peak on September 2nd; if you
look at the time series, September 2nd is roughly
nine months after December 31st.
We can attribute this to randomness, so we can say that,
okay, maybe we can ignore this outlier.
Then we basically have a stationary time series.
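A minimal sketch of this differencing step, reusing number_of_births from before:

```r
# Difference the series to remove the trend, then plot the result
births_diff <- diff(number_of_births)
plot(births_diff, type = "l",
     main = "Differenced daily female births",
     ylab = "Change in number of births")
```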
And let's test for autocorrelation again; here
we're going to use, again, the Box test, this time on the differenced series.
And if we run this, we obtain, again, a very,
very small p-value, so there are definitely some autocorrelations.
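In R, that is the same command applied to the differenced series (a sketch, using the births_diff name introduced above):

```r
# Ljung-Box test on the differenced series
Box.test(births_diff, lag = log(length(births_diff)),
         type = "Ljung-Box")
```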
Let's look at the ACF; I told you before that the ACF will also suggest
that there might be some autocorrelations.
7:15
This is the ACF of the differenced data.
If I look at the ACF, we see that there's a significant spike at lag 1.
And then there's one peak at maybe lag 21.
Maybe we can ignore that one and attribute it to randomness;
maybe it's a random peak.
If we can ignore it, then we have two significant peaks:
lag 0, which is always one, and lag 1 here.
The autocorrelation function suggests the order of the moving average process.
If I look at the PACF, the partial autocorrelation function,
we obtain the following.
Even if we ignore lag 21, we still see a lot of significant peaks;
we can see that there are significant lags all the way up to lag 7.
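These two plots come from the standard acf and pacf functions; as a sketch:

```r
# Sample ACF and PACF of the differenced series
acf(births_diff, main = "ACF of differenced series")
pacf(births_diff, main = "PACF of differenced series")
```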
Okay, so we have various competing models we're going to try.
First we're going to fit an ARIMA model
to number_of_births with order (0,1,1); all of the models
will have d = 1, because we have differenced the data.
So this time, I'm basically looking at an MA(1) model for
the differenced data.
9:02
Okay, so model 1 is our ARIMA(0,1,1).
I'm going to pull out the sum of squared errors.
In other words, I'm going to take the residuals, square them, and
sum them up, and we're going to call this SSE1.
We want this to be as small as possible.
We're also going to look at the residuals themselves.
In other words, once we model our time series with ARIMA(0,1,1), let's say,
then we have residuals.
We hope not to have any autocorrelation in the residuals;
we would like the residuals to be white noise.
So we're going to take model 1's residuals,
run the Box test on them, call that model 1's test,
and look at its p-value.
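Put together, the steps just described look roughly like this in R (a sketch; the object names are my own choices):

```r
# Fit ARIMA(0,1,1); d = 1 because we work with the differenced data
model1 <- arima(number_of_births, order = c(0, 1, 1))

# Sum of squared errors: square the residuals and sum them up
SSE1 <- sum(resid(model1)^2)

# Ljung-Box test on the residuals; a large p-value suggests white noise
model1_test <- Box.test(resid(model1),
                        lag = log(length(resid(model1))),
                        type = "Ljung-Box")
model1_test$p.value
```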
And so we continue this process for the different models: ARIMA(0,1,1),
ARIMA(0,1,2), ARIMA(7,1,1), and ARIMA(7,1,2).
For each one we have the sum of squared errors and also the Ljung-Box test.
I'm going to define a data frame, df; it's going to have
the row names AIC, which is the Akaike information criterion,
SSE, the sum of squared errors, and p-value.
And then I'm going to take, for example, model 1's AIC value, SSE1,
the sum of squared errors for model 1, and model 1's test,
and I'm going to look at the p-value of it.
We're going to do this for every single model.
So, let's do this.
Let's go back here and run these commands:
model 1, model 2, model 3,
model 4, and now we have our data frame.
I'm going to set the column names to ARIMA(0,1,1), ARIMA(0,1,2), and so forth.
And then I'm going to format it without scientific notation, and
at this point I should get my results.
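A consolidated sketch of these steps, under the same assumptions as before; the lb_pvalue helper is my own shorthand, not from the lecture:

```r
# Fit the remaining competing models
model2 <- arima(number_of_births, order = c(0, 1, 2))
model3 <- arima(number_of_births, order = c(7, 1, 1))
model4 <- arima(number_of_births, order = c(7, 1, 2))

# Sum of squared errors for each model
SSE2 <- sum(resid(model2)^2)
SSE3 <- sum(resid(model3)^2)
SSE4 <- sum(resid(model4)^2)

# Ljung-Box p-value for the residuals of a fitted model
lb_pvalue <- function(m) {
  Box.test(resid(m), lag = log(length(resid(m))),
           type = "Ljung-Box")$p.value
}

# Collect AIC, SSE, and p-value, one column per model
df <- data.frame(
  row.names = c("AIC", "SSE", "p-value"),
  c(model1$aic, SSE1, lb_pvalue(model1)),
  c(model2$aic, SSE2, lb_pvalue(model2)),
  c(model3$aic, SSE3, lb_pvalue(model3)),
  c(model4$aic, SSE4, lb_pvalue(model4))
)
colnames(df) <- c("ARIMA(0,1,1)", "ARIMA(0,1,2)",
                  "ARIMA(7,1,1)", "ARIMA(7,1,2)")
format(df, scientific = FALSE)
```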
10:58
So as you can see here, in my data frame without scientific notation,
I have the AIC values, I have the sums of squared errors,
and I have the p-values.
Remember what the p-values are:
the p-values test whether there is autocorrelation in the residuals.
As you can see from all of the p-values, none of the residuals have
significant autocorrelation, so we cannot reject the null hypothesis
that there is no autocorrelation in the residuals.
Let's look at the AIC values.
The AIC values, as you can see, are 2462,
2459, 2464, and 2466.
The minimum value is 2459.
So if you are going with the minimum AIC value,
our model is going to be ARIMA(0,1,2).
If we are looking at the smallest SSE, the sum of squared errors,
it seems like the smallest is actually 17,574,
which would point to the ARIMA(7,1,2) model.
But if you take the ARIMA(7,1,2) model, with 7 plus 1 plus 2,
you have 10 parameters in the model.
So since we are trying to get the simplest model we can,
in this example we're going to use AIC as our criterion,
and we're going to go with the ARIMA(0,1,2) model.
Now that we have decided to fit an ARIMA(0,1,2) model to our dataset,
I'm going to use the sarima routine from the astsa package.
Once we have loaded the astsa package, we're going to call sarima; the S stands for
seasonal autoregressive integrated moving average models,
which are going to be our subject in week five.
In sarima, you pass number_of_births, and
then 0, 1, 2, which are basically p, d, and q; for
now let's just ignore the last three parameters, 0, 0, 0.
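As a sketch, the call is:

```r
# Load astsa and fit ARIMA(0,1,2); the last three arguments are the
# seasonal orders P, D, Q, which we leave at zero for now
library(astsa)
sarima(number_of_births, 0, 1, 2, 0, 0, 0)
```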
If I run this command, it gives me a lot of things.
For example, it gives me my call and
my coefficients.
And it actually gives me the coefficients here with their standard errors.
So the ma1 coefficient is -0.8511,
this is the ma2 coefficient, and this is our constant.
If you look at the p-values, the ma1 and
ma2 coefficients are significant at the 0.05 level; maybe
the constant is not significant, but we can keep it in our model.
13:52
Now that we have fit the ARIMA(0,1,2) model to our dataset, we have the following.
We have (1 - B)X_t, where 1 - B is the differencing operator,
because the order of differencing is d = 1.
And we have two moving average terms, which are right here, the Z_{t-1} and Z_{t-2} terms.
We have no autoregressive terms;
this is the constant, and Z_t is the current noise.
The numbers shown underneath the coefficients are their standard errors.
Thus, if I move X_{t-1} to the right-hand side,
we have the following ARIMA(0,1,2) model:
X_t equals X_{t-1} (this is because of the differencing), plus a constant,
plus the noise at the current time,
plus a dependence on the previous two noises, Z_{t-1} and Z_{t-2}.
Now, from the sarima routine we have seen that Z_t, which is assumed to
be normally distributed, has expectation 0 and variance 49.08.
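Written out, the fitted model has the following form. This is a sketch: only the ma1 estimate and the noise variance are quoted above, so the ma2 coefficient and the constant are left symbolic.

```latex
% ARIMA(0,1,2): differencing on the left, two MA terms on the right
(1 - B)X_t = c + Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2}
% Moving X_{t-1} to the right-hand side:
X_t = X_{t-1} + c + Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2},
\qquad \hat{\theta}_1 = -0.8511, \qquad Z_t \sim N(0,\, 49.08)
```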
So what have we learned?
We have learned how to fit an ARIMA model to a real-world dataset using
various fitting tools such as the ACF, the PACF, and the AIC.