In this part we cover the process of performing ARIMA with SAS, with a little theory in between.

Hope you have gone through **Part-1 of this series**; here comes Part-2.

**Data File Location :**

Library - SASHELP

Data set - AIR

**Data Preparation Steps For ARIMA Modeling**

- Check whether the variance changes with time (**Volatility**). For ARIMA, the volatility should not be very high.
- If the **volatility is very high**, we need to make the series non-volatile.
- Check for **stationarity** - a series should be stationary before performing ARIMA.
- If the data is **non-stationary**, we need to make it stationary.
- Check for **seasonality** in the data.

**Step 1 : Check the series**

As a matter of practice, we first plot the time series and take a cursory look at it. This can be done directly in SAS using the following code :

proc sgplot data = sashelp.AIR;

series x = date y = AIR;

run;

quit;

It gives the following plot in the result window :

[Figure : SAS : Time Series Modeling]

It is clear from the chart above that the AIR series has an **increasing trend and a consistent pattern over time**. The peaks occur at a constant time interval, which indicates the presence of seasonality in the series. This is a non-stationary series for sure, and hence we need to make it stationary first.

Practically, ARIMA works well for this type of series, with a clear trend and seasonality. We first separate and capture the trend and seasonality components of the time series, which leaves us with a stationary series. This stationary series is forecasted using ARIMA, and the final forecast then incorporates the pre-captured trend and seasonality.

We will look at this in detail in **Steps 4 to 6**.

**Step 2 : Check the volatility of the series**

Volatility is the degree of variation of a time series over time. For ARIMA, the volatility should not be very high. To check the volatility of the time series, we draw a scatter plot using the following SAS code :

Proc gplot data=SAShelp.AIR;

/* PLOT takes vertical * horizontal, so AIR is drawn against Date */

plot AIR * Date;

Run;

Quit;

It gives the following plot in the result window :

[Figure : Check the volatility of Series]

The highlighted area shows the diverging (fan-shaped) pattern of the scatter plot, which indicates that the data is volatile. Ideally, the edges of the highlighted pattern should be parallel for ARIMA modeling.

**Step 3 : Treatment of Volatile Series**

We need to make the series non-volatile before moving ahead, so we transform the AIR series to remove the volatility. Generally a hit-and-trial method is used to pick the transformation, but we suggest you do not waste your time on that.

The **Box-Cox Transformation** can help you out by recommending a suitable transformation.

Proc Transreg Data = sashelp.AIR;

Model BOXCOX (AIR) = Identity(Date);

Run;

You get the following plot along with the Lambda value, which is "0" in this case.

Now, based on this Lambda value, you can decide the transformation. Take help from the table provided below.

[Figure : Box cox Transformation]

In our case, it suggests a log transformation, so we apply it. In a new data set (Masterdata) we create a new variable (Log_AIR).

Data Masterdata;

Set SAShelp.AIR;

Log_AIR = log(AIR);

Run;

We can check the volatility of the transformed series again, just to be sure, using a scatter plot as described above.
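To see why a Lambda of 0 points to a log transformation, the Box-Cox family can be sketched in a few lines. This is a minimal illustration in plain Python rather than SAS. The function name `box_cox` and the sample values are assumptions made for the example, and the sketch does not reproduce how Proc Transreg estimates Lambda.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return math.log(y)          # lambda = 0 is the log transformation
    return (y ** lam - 1) / lam     # general power transformation

# Hypothetical series whose swings grow with its level (fan shape).
series = [100, 120, 200, 240, 400, 480]

logged = [box_cox(y, 0) for y in series]

# Raw jumps within each pair grow with the level of the series ...
raw_jumps = [series[i + 1] - series[i] for i in range(0, len(series), 2)]
# ... while after the log transform the jumps are all the same size.
log_jumps = [logged[i + 1] - logged[i] for i in range(0, len(series), 2)]

print(raw_jumps)                          # [20, 40, 80]
print([round(j, 4) for j in log_jumps])
```

The raw jumps double each time the level doubles, while the logged jumps stay constant, which is the "parallel" pattern ARIMA prefers.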

**Step 4 : Check For Non-Stationarity**

Now we check whether the transformed series is stationary or non-stationary.

To perform ARIMA, a series should be stationary; if the series is non-stationary, we make it stationary (for more explanation on stationarity, read Part 1 of this series).

Rather than judging the stationarity of the series visually, as we did in Step 1, we now use the **Augmented Dickey-Fuller Unit Root Test**.

**Unit Root -** Homogeneous Non-Stationarity Data

**Dickey-Fuller test**

The Dickey-Fuller test is used to test the null hypothesis that the time series exhibits a lag d unit root against the alternative of stationarity.

**Null Hypothesis :** Non-Stationary

**Alternative Hypothesis :** Stationary

There are three types by which you can calculate the test statistics of the Dickey-Fuller test.

- **Zero Mean - No Intercept.** Series is a random walk **without drift**.
- **Single Mean - Includes Intercept.** Series is a random walk **with drift.**
- **Trend - Includes Intercept and Trend.** Series is a random walk **with linear trend**.

All the above test statistics are computed from the OLS regression model.
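For reference, the three versions are usually written as the regressions below, where $\Delta x_t = x_t - x_{t-1}$ and the test checks whether $\gamma = 0$ (a unit root). The symbols $\alpha$, $\beta$, $\gamma$, $\delta_i$ and the number of lagged terms $p$ are standard textbook notation assumed here, not names taken from the SAS output.

```latex
\begin{aligned}
\text{Zero Mean:}   \quad \Delta x_t &= \gamma\, x_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta x_{t-i} + \varepsilon_t \\
\text{Single Mean:} \quad \Delta x_t &= \alpha + \gamma\, x_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta x_{t-i} + \varepsilon_t \\
\text{Trend:}       \quad \Delta x_t &= \alpha + \beta t + \gamma\, x_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta x_{t-i} + \varepsilon_t
\end{aligned}
```

The only difference between the three rows is whether the intercept $\alpha$ and the time trend $\beta t$ are included.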

**Drawback of ADF Test**

*Uncertainty about which version of the test to use, i.e. about whether to include the intercept and time trend terms.* Inappropriate inclusion or exclusion of these terms substantially affects the reliability of the test.

Using prior knowledge (for instance, from visual inspection of the time series) about whether the intercept and time trend should be included is the most commonly recommended way to overcome this difficulty.

We run **Proc ARIMA** with the **Stationarity = (ADF)** option to do so :

PROC ARIMA DATA= Masterdata ;

IDENTIFY VAR = log_Air STATIONARITY= (ADF) ;

RUN;

QUIT;

The above code produces many outputs, a part of which is used for checking stationarity :

[Figure : ARIMA : Check Stationary]

**Important Note :**

Check the Tau statistics (Pr < Tau) in the ADF Unit Root Tests table. Pr < Tau should be less than 0.05 for us to say the data is stationary at the 5% level of significance.

**Step 5 : Make Non-Stationary Data Stationary**

Having established the non-stationarity of the series, we need to make the series stationary.

The **differencing** process is used for making the series stationary.

**Differencing :** Transformation of the series to a new time series whose values are the differences between consecutive values of the original series.

The differencing procedure may be applied consecutively more than once, giving rise to the **"first differences"**, **"second differences"**, etc.

**Differencing Orders :**

1st order : $\nabla x_t = x_t - x_{t-1}$

**For example : Sales - lag1(Sales)**

2nd order : $\nabla^2 x_t = \nabla(x_t - x_{t-1}) = x_t - 2x_{t-1} + x_{t-2}$

It is unlikely that more than two differencing orders would ever be required.
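The first and second differences defined above can be checked numerically with a short sketch. It is written in plain Python rather than SAS, and the helper name `difference` and the sample values are assumptions made for illustration.

```python
def difference(series, lag=1):
    """Return x[t] - x[t - lag] for every t; the first `lag` points are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical values following a quadratic trend (t squared).
x = [1, 4, 9, 16, 25, 36]

first = difference(x)        # x[t] - x[t-1]
second = difference(first)   # differences of the first differences

# The second difference matches the expanded form x[t] - 2*x[t-1] + x[t-2].
expanded = [x[t] - 2 * x[t - 1] + x[t - 2] for t in range(2, len(x))]

print(first)               # [3, 5, 7, 9, 11]
print(second)              # [2, 2, 2, 2]
print(second == expanded)  # True
```

Each round of differencing shortens the series by one observation, and here two rounds are enough to turn a curved trend into a constant.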

**Note :** If there is a physical explanation for a trend or seasonal cycle, use **regression** to make the series stationary.

To decide the order of differencing, we use the output of the Step 4 code itself. When we ran that code, we got "Autocorrelation Check for White Noise" along with "Augmented Dickey-Fuller Unit Root Tests".

Looking at "Autocorrelation Check for White Noise", we decide the order(s) of differencing required.

[Figure : Stationary : Order of Differencing]

*A heat map has been made using Excel for demonstration; SAS output is black and white only.*

The first row of the above autocorrelation matrix shows the correlation of the time series with its 1st to 6th lags, the second row shows the same for the 7th to 12th lags, and so on. The same is visible in the ACF chart in the Step 4 output.

We can see in the above matrix that the highest autocorrelation is with the 1st lag; it then starts decreasing, but increases again to attain a local peak at the 12th lag.
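The numbers in such an autocorrelation matrix come from correlating the series with lagged copies of itself. The sketch below shows the calculation in plain Python on a made-up series. The function name `autocorrelation` and the synthetic data (an upward trend plus a 12-month cycle) are assumptions for illustration, not the AIR values.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag`."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]            # deviations from the mean
    denom = sum(d * d for d in dev)             # total sum of squares
    num = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return num / denom

# Hypothetical monthly series: upward trend plus a 12-month cycle.
x = [0.1 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(144)]

for lag in (1, 6, 12):
    print(lag, round(autocorrelation(x, lag), 3))
```

On this synthetic series the correlation at lag 12 is clearly higher than at lag 6, which is the same kind of local peak that signals seasonality in the SAS output.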

**Step 6 : Check Seasonality**

The highest correlation being with the 1st lag points to the presence of a trend, and the local peak at the **12th lag indicates an annual seasonality.** Hence we need to difference at lag 1 and at lag 12.

We perform the differencing and check the stationarity again.

PROC ARIMA DATA= masterdata ;

IDENTIFY VAR = Log_Air (1,12) STATIONARITY= (ADF) ;

RUN;

quit;

We have used 1 and 12 in brackets to specify differencing at lag 1 and at lag 12.

**Check whether data is stationary**

Check the Tau statistics (Pr < Tau) in the ADF Unit Root Tests table again and see whether the value is less than 0.05, so that we can say the data is stationary at the 5% level of significance.

**How this differencing actually worked :**

1. The first differencing (lag 1) removes the trend, but the seasonality still exists.

2. The second differencing (lag 12) removes the seasonality.

**How to do it with MS Excel :**

First subtract the first lag from each observation and plot the result. Then, in this new series, subtract the 12th lag from each observation.
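The same two subtractions can be sketched outside Excel as well. The plain-Python example below uses a made-up series with a linear trend and a repeating 12-month pattern. The helper `difference`, the `monthly_pattern` values and the trend slope are all assumptions for illustration.

```python
def difference(series, lag):
    """Return x[t] - x[t - lag] for every t; the first `lag` points are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly pattern repeated every year, on top of a linear trend.
monthly_pattern = [0, 1, 3, 2, 4, 7, 9, 9, 5, 2, 0, 1]
x = [2 * t + monthly_pattern[t % 12] for t in range(48)]

step1 = difference(x, 1)       # removes the linear trend, seasonality remains
step2 = difference(step1, 12)  # removes the 12-month seasonality

print(step1[:12] == step1[12:24])  # True: the pattern still repeats every 12 months
print(set(step2))                  # {0}: nothing but zeros is left
```

Real data such as Log_AIR will not reduce to exact zeros; what should remain is a series without visible trend or seasonality.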

**Step 7 : Split Data into Training and Validation**

Now we can break the data into **Training and Validation** samples. We cannot use **random sampling like we do in regression models** to split the data. Instead, we use the most recent data for validation and the remaining data to train the model. We will develop the ARIMA model on the Training part and check its forecasts on the Validation part.

Data Training Validation;

Set Masterdata;

If date >= '01Jan1960'd then output Validation;

Else output Training;

Run;

After these 7 steps, we will train the ARIMA model on the Training dataset in **Part 3** of this series.
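The effect of the time-based split in Step 7 can be sketched as follows. This plain-Python illustration assumes the monthly dates run from January 1949 to December 1960, and the variable names are hypothetical.

```python
from datetime import date

# Assumed monthly dates, Jan 1949 to Dec 1960 (144 observations).
dates = [date(year, month, 1) for year in range(1949, 1961) for month in range(1, 13)]

cutoff = date(1960, 1, 1)  # same cut-off as '01Jan1960'd in the SAS code

# Everything before the cut-off trains the model; the most recent year validates it.
training = [d for d in dates if d < cutoff]
validation = [d for d in dates if d >= cutoff]

print(len(training), len(validation))  # 132 12
print(training[-1], validation[0])     # 1959-12-01 1960-01-01
```

Because the split follows the calendar, every validation point lies after every training point, which is what a forecasting check needs.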

**Check out :** Time Series Forecasting - ARIMA [Part 3]

**About the Author -**

This article was originally written by Rajat Agarwal; later Deepanshu gave the final touch to the post. Rajat is an analytics professional with more than 8 years of work experience in diverse business domains. He has gained expert knowledge in Excel and SAS. He loves to create innovative and imaginative dashboards with Excel. He is founder and lead author cum editor at Ask Analytics.

Despite doing everything, including using MINIC, my autocorrelation is still significant. What should I do?

Autocorrelation Check of Residuals

| To Lag | Chi-Square | DF | Pr > ChiSq | Autocorrelations | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 19.46 | 4 | 0.0006 | -0.045 | -0.094 | 0.282 | -0.178 | 0.079 | 0.279 |
| 12 | 41.37 | 10 | <.0001 | -0.260 | 0.024 | 0.311 | -0.199 | 0.005 | -0.121 |
| 18 | 64.20 | 16 | <.0001 | -0.275 | 0.212 | -0.075 | -0.193 | 0.207 | -0.072 |
| 24 | 92.59 | 22 | <.0001 | -0.125 | 0.151 | -0.180 | -0.168 | 0.272 | -0.250 |