import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.stats.stattools import durbin_watson
from sklearn.linear_model import LinearRegression, LassoLars, lars_path, lasso_path
from sklearn.preprocessing import normalize
from statsmodels.tsa.arima_model import ARIMA
%matplotlib inline
Read retail gas prices from csv.
def custom_read_csv(filename):
    # Parse the date column and use it as the DatetimeIndex.
    df = pd.read_csv(filename)
    df['date'] = pd.to_datetime(df['date'])
    return df.set_index('date')
# note that all prices are released on Monday, dayofweek = 0
national_price = custom_read_csv('national.csv').dropna()
state_price = custom_read_csv('state.csv').dropna()
state_price = state_price.drop('Unnamed: 10', axis=1).dropna()
city_price = custom_read_csv('city.csv').dropna()
price = pd.concat([national_price, state_price, city_price], axis=1)
The training period is 1992 - 2010. Save 2011 - 2016 for cross-validation.
trainingdates = np.intersect1d(price.index, pd.date_range('1992-01-01','2010-12-31'))
The price series is not stationary: the Augmented Dickey-Fuller test fails to reject the null hypothesis of a unit root (non-stationarity).
pvalue = adfuller(price.national[trainingdates])[1]
pvalue
Difference the series by taking the one-period simple return, which approximates the log return.
logreturn = ((price.national - price.national.shift(1)) / price.national.shift(1)).dropna().rename('logreturn')
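A quick sanity check (an addition, reusing the price and logreturn objects defined above): for small weekly changes, the simple return computed above stays very close to the exact log return.
# Added check: compare the simple return with the exact log return log(P_t / P_{t-1}).
exact_logreturn = np.log(price.national).diff().dropna()
print((logreturn - exact_logreturn).abs().max())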
pvalue = adfuller(logreturn[trainingdates])[1]
pvalue
The first-order difference is stationary.
Therefore, we should choose d = 1 in the ARIMA(p,d,q) model.
logreturn[trainingdates].plot(figsize=(15,6))
plt.show()
First, look at the autocorrelation of the log returns over the first 10 lags.
g = sns.pointplot(x=np.arange(11),
                  y=acf(logreturn, nlags=10))
Next, we look at the partial autocorrelation of the log-return series. Partial autocorrelation gives the partial correlation of a time series with its own lagged values, controlling for the values of the time series at all shorter lags.
g = sns.pointplot(x=np.arange(11), y=pacf(logreturn, nlags=10))
Together, the autocorrelation and partial autocorrelation indicate whether the series has an autoregressive component, a moving average component, or both.
In an AR(p) model (no moving average), $$X_t - a_1X_{t-1} - \cdots - a_pX_{t-p} = \epsilon_t.$$ Theoretically, the partial autocorrelation cuts off after lag $p$, while the autocorrelation dies down slowly.
In an MA(q) model (moving average only, no autoregression), $$X_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}.$$ Theoretically, the autocorrelation cuts off after lag $q$, while the partial autocorrelation dies down slowly.
In an ARMA(p,q) model, with both autoregressive and moving average components, both the autocorrelation and the partial autocorrelation die down slowly.
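To see these signatures concretely, here is a small illustrative simulation (an addition to the original notebook; it uses statsmodels' ArmaProcess with arbitrarily chosen coefficients).
from statsmodels.tsa.arima_process import ArmaProcess
np.random.seed(0)
# ArmaProcess uses lag-polynomial conventions: [1, -a1, -a2, -a3] for the AR part,
# [1, theta1, theta2, theta3] for the MA part.
ar3_sample = ArmaProcess(ar=[1, -0.5, -0.2, -0.1], ma=[1]).generate_sample(nsample=5000)
ma3_sample = ArmaProcess(ar=[1], ma=[1, 0.5, 0.2, 0.1]).generate_sample(nsample=5000)
# The PACF of the AR(3) sample and the ACF of the MA(3) sample should both cut off after lag 3.
print(np.round(pacf(ar3_sample, nlags=6), 2))
print(np.round(acf(ma3_sample, nlags=6), 2))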
Based on the observed autocorrelation and partial autocorrelation, let's try an ARIMA(3,1,0) model.
ts = price.national.loc[trainingdates]
model = ARIMA(ts, order=(3,1,0)).fit()
print(model.summary())
price_prediction = model.predict()
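As an optional residual diagnostic (an addition, using the durbin_watson function already imported above), a Durbin-Watson statistic near 2 suggests little remaining first-order autocorrelation in the fitted residuals.
# Added diagnostic: Durbin-Watson statistic of the ARIMA residuals.
print(durbin_watson(model.resid))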
a1, a2, a3 = model.arparams
cvdates = np.intersect1d(price.index, pd.date_range('2011-01-01','2016-12-31'))
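The fitted AR coefficients are applied to the log-return series to form a one-step-ahead prediction, $$\hat{r}_t = a_1 r_{t-1} + a_2 r_{t-2} + a_3 r_{t-3},$$ where $r_t$ denotes the log return at week $t$.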
logreturn_prediction = a1 * logreturn.shift(1) + a2 * logreturn.shift(2) + a3 * logreturn.shift(3)
logreturn_prediction = logreturn_prediction.rename('predicted')
corr = pd.concat([logreturn, logreturn_prediction], axis=1).ewm(span=52 * 2).corr().dropna()
corr = corr.unstack()['logreturn']['predicted']
corr[cvdates].plot(
figsize=(15,6), title='2-year Exponentially Weighted Correlation between Predicted and Actual Values of Log Returns')
plt.show()
# Precision: fraction of predicted up-moves that were actually up.
1.0 * np.logical_and(logreturn[cvdates] > 0, logreturn_prediction[cvdates] > 0).sum() / \
    (logreturn_prediction[cvdates] > 0).sum()
# Recall: fraction of actual up-moves that were predicted up.
1.0 * np.logical_and(logreturn[cvdates] > 0, logreturn_prediction[cvdates] > 0).sum() / \
    (logreturn[cvdates] > 0).sum()
# Accuracy: fraction of weeks where the predicted and actual moves have the same (nonzero) sign.
1.0 * (logreturn[cvdates] * logreturn_prediction[cvdates] > 0).sum() / len(cvdates)
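The same hit rates can be cross-checked with scikit-learn's classification metrics (an added sketch; it casts the problem as a binary up/down classification, so weeks with exactly zero return are treated slightly differently than in the sign-product accuracy above).
# Added cross-check with sklearn classification metrics.
from sklearn.metrics import precision_score, recall_score, accuracy_score
y_true = (logreturn[cvdates] > 0).astype(int)
y_pred = (logreturn_prediction[cvdates] > 0).astype(int)
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred))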
| Model | Precision | Recall | Accuracy |
|---|---|---|---|
| Multivariate Rolling Regression | 66% | 78% | 72% |
| ARIMA | 68% | 71% | 72% |
In cross-validation, the two models achieve comparable accuracy, with the multivariate rolling regression recovering more of the actual upward moves (higher recall). This is not too surprising, because the multivariate model takes more data as input; the ARIMA model has the advantage of being simpler.
In the next notebook, the two models are compared with logistic regression, which turns out to be significantly better at classifying the direction of price moves.