Modeling Growth with Holt Winters¶
We've shown a few examples thus far of modeling trend and seasonality, and in this part we will expand this to include exponential growth and forecasting with our model. Before we dive in, I am going to create dummy data to use. We can think of this data as sales data for a local startup. Don't worry about what is happening in the code below.
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1)
growth = np.random.normal(.04, .08, 119)
growth = (growth + 1).cumprod()
growth = np.insert(growth, 0, 1)
sales = pd.Series(1000 * growth)
sales.index = pd.date_range(datetime(2010,1,1), datetime(2019,12,31), freq='MS')
sales.plot(kind='line')
plt.show()
Using a Log Transform¶
Often we have data that has a pattern of exponential growth. For this type of data, we can use a log transform to turn it into a more linear series of data.
log_sales = np.log(sales)
log_sales.plot(kind='line')
plt.show()
Holt Linear Trend Model¶
The Holt linear trend model is a useful model for forecasting. First, we will create and fit the model on our log-transformed data, then work through what the model is actually doing.
from statsmodels.tsa.holtwinters import Holt
#Create and fit the model to our log sales values
model = Holt(log_sales).fit()
The fitted model has an attribute fittedvalues, which returns the in-sample values fit by the model.
print(model.fittedvalues)
2010-01-01 6.917900
2010-02-01 6.947951
2010-03-01 7.093511
2010-04-01 7.098411
2010-05-01 7.096797
...
2019-08-01 11.580638
2019-09-01 11.638957
2019-10-01 11.583040
2019-11-01 11.611002
2019-12-01 11.528412
Freq: MS, Length: 120, dtype: float64
#Plot the actual and fitted values
log_sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.legend(['Actual', 'Fitted'])
plt.show()
np.exp() is the inverse of np.log(): it applies $ e^{x} $ to the data, where x is each data point. Let's plot the actual sales against the fitted values after applying np.exp().
sales.plot(kind='line')
np.exp(model.fittedvalues).plot(kind='line')
plt.legend(['Actual', 'Fitted'])
plt.show()
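As a quick standalone check (with made-up values, not the sales data) that np.exp really does undo np.log:

```python
import numpy as np

x = np.array([1.0, 10.0, 1000.0])
# Applying exp after log recovers the original values elementwise
roundtrip = np.exp(np.log(x))
print(np.allclose(roundtrip, x))  # True
```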
Holt Linear Trend Model Equation¶
The model is defined by three equations:
$ \hat{y_{t+1}} = l_{t} + b_{t}$
$ l_{t+1} = \alpha y_{t+1} + (1-\alpha)(l_{t} + b_{t})$
$ b_{t+1} = \beta (l_{t+1} - l_{t}) + (1-\beta)b_{t}$
where
$ y_{t} = \text{Actual Value at time t} $
$ \hat{y_{t}} = \text{Predicted Value at time t} $
$ l_{t} = \text{Level value at time t} $
$ b_{t} = \text{Trend value at time t} $
$ \alpha = \text{Level smoothing parameter} $
$ \beta = \text{Trend smoothing parameter} $
You can find the parameters in the model from the attribute params.
print(model.params)
{'smoothing_level': 0.9101797580094325, 'smoothing_slope': 0.0, 'smoothing_seasonal': nan, 'damping_slope': nan, 'initial_level': 6.878615835752972, 'initial_slope': 0.03928418793573401, 'initial_seasons': array([], dtype=float64), 'use_boxcox': False, 'lamda': None, 'remove_bias': False}
From here we will grab the alpha, beta, initial slope, and initial level.
alpha = model.params['smoothing_level']
beta = model.params['smoothing_slope']
initial_slope = model.params['initial_slope']
initial_level = model.params['initial_level']
We can also create a table to hold all the relevant information that we want to replicate.
table = pd.concat([log_sales, model.level, model.slope, model.fittedvalues], axis=1)
table.columns = ['Actual', 'Level', 'Slope', 'Predicted']
print(table)
Actual Level Slope Predicted
2010-01-01 6.907755 6.908666 0.039284 6.917900
2010-02-01 7.064714 7.054227 0.039284 6.947951
2010-03-01 7.055734 7.059127 0.039284 7.093511
2010-04-01 7.053477 7.057513 0.039284 7.098411
2010-05-01 7.006556 7.014661 0.039284 7.096797
... ... ... ... ...
2019-08-01 11.601551 11.599673 0.039284 11.580638
2019-09-01 11.534361 11.543756 0.039284 11.638957
2019-10-01 11.570601 11.571718 0.039284 11.583040
2019-11-01 11.477101 11.489128 0.039284 11.611002
2019-12-01 11.599065 11.592719 0.039284 11.528412
[120 rows x 4 columns]
Let's first compute the prediction for the first month as a check. We have the initial slope and initial level to use in the equations.
#Compute the predicted
predicted = initial_level + initial_slope
print("Predicted: {}".format(predicted))
#Compute the new level
level = alpha * log_sales.loc['2010-01-01'] + (1-alpha) * (initial_slope + initial_level)
print("Level: {}".format(level))
#Compute the new trend
b = beta * (level - initial_level) + (1-beta) * initial_slope
print("Trend: {}".format(b))
Predicted: 6.917900023688706
Level: 6.908666482406613
Trend: 0.03928418793573401
We get exactly the same values as the model reported. We can loop through each date to confirm the values of the full table.
#Use a list to keep track of each period
new_table = []
#Initialize the starting values of level and trend
b_prior = initial_slope
level_prior = initial_level
#Loop through all values
for y in log_sales.values:
#Compute predicted
predicted = level_prior + b_prior
#Compute the new level
level = alpha * y + (1-alpha) * (b_prior + level_prior)
#Compute the new trend
b = beta * (level - level_prior) + (1-beta) * b_prior
#Add all values to the table
new_table.append([y, level, b, predicted])
#Set the old values to the ones computed
b_prior = b
level_prior = level
#Turn it into a dataframe with the matching index and columns
new_table = pd.DataFrame(new_table, index=table.index, columns=table.columns)
print(new_table)
Actual Level Slope Predicted
2010-01-01 6.907755 6.908666 0.039284 6.917900
2010-02-01 7.064714 7.054227 0.039284 6.947951
2010-03-01 7.055734 7.059127 0.039284 7.093511
2010-04-01 7.053477 7.057513 0.039284 7.098411
2010-05-01 7.006556 7.014661 0.039284 7.096797
... ... ... ... ...
2019-08-01 11.601551 11.599673 0.039284 11.580638
2019-09-01 11.534361 11.543756 0.039284 11.638957
2019-10-01 11.570601 11.571718 0.039284 11.583040
2019-11-01 11.477101 11.489128 0.039284 11.611002
2019-12-01 11.599065 11.592719 0.039284 11.528412
[120 rows x 4 columns]
We have confirmed that we are able to replicate the model! This is great news. Now, what if we had a non-linear set of data? Would this still work? Let's fit the model on the version that is not log-transformed and see what happens.
model = Holt(sales).fit()
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.show()
Notice that the model still tracks the data well. Also notice that the slope now changes each period, because the underlying trend is no longer linear.
table = pd.concat([sales, model.level, model.slope, model.fittedvalues], axis=1)
table.columns = ['Actual', 'Level', 'Slope', 'Predicted']
print(table)
Actual Level Slope Predicted
2010-01-01 1000.000000 978.074327 51.203091 944.735108
2010-02-01 1169.947629 1114.138436 60.213753 1029.277418
2010-03-01 1159.487697 1165.385003 59.261605 1174.352189
2010-04-01 1156.874513 1183.762266 54.920448 1224.646608
2010-05-01 1103.846289 1157.340999 46.283470 1238.682714
... ... ... ... ...
2019-08-01 109267.176869 107853.520440 2678.707461 105703.976305
2019-09-01 102166.703656 105485.623519 2142.851913 110532.227901
2019-10-01 105937.100392 106608.132840 2034.510502 107628.475431
2019-11-01 96480.965355 101305.962983 1255.491466 108642.643342
2019-12-01 108995.841014 106443.076507 1667.647554 102561.454449
[120 rows x 4 columns]
Forecasting¶
This model can forecast into the future from the latest fitted values. To do so, call forecast on the model with the number of periods ahead that you want to forecast. Below is a forecast for 10 periods ahead.
forecast = model.forecast(10)
print(forecast)
2020-01-01 108110.724061
2020-02-01 109778.371614
2020-03-01 111446.019168
2020-04-01 113113.666722
2020-05-01 114781.314276
2020-06-01 116448.961829
2020-07-01 118116.609383
2020-08-01 119784.256937
2020-09-01 121451.904490
2020-10-01 123119.552044
Freq: MS, dtype: float64
#Plot the forecast with the actual vs. the model
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
forecast.plot(kind='line')
plt.legend(['Actual', 'Model', 'Forecasted'])
plt.show()
Validation with Out of Sample Data¶
It is good practice to validate models on out-of-sample data. This means holding back a portion of the data and testing different models against it to see which one performs best. Let's split the data so that the first 9 years are in-sample and the last year is out-of-sample.
#Split the data into in_sample and out_sample
in_sample = sales.loc[:'2018-12-31']
out_sample = sales.loc['2019-01-01':]
Let's do two forecasts: one with the Holt model, and another that simply assumes 6% growth every month. Then we can plot the two against the actual.
#Fit the model
model = Holt(in_sample).fit()
#Forecast with the model
forecast_model = model.forecast(12)
#Do a simple model
forecast_simple = np.cumprod([1.06]*12) * in_sample.iloc[-1]
forecast_simple = pd.Series(forecast_simple, index=out_sample.index)
out_sample.plot(kind='line')
forecast_model.plot(kind='line')
forecast_simple.plot(kind='line')
plt.legend(['Actual', 'Holt', 'Simple'])
plt.show()
Mean Squared Error¶
The mean squared error is a good measure of how close a forecast is to the actual data. The lower the number, the better the forecast performed. The formula is:
$ \frac{1}{N}\sum_{i=1}^{N}(Y_{i} - \hat{Y_{i}})^2$
where
$ N = \text{Number of data points} $
$ Y_{i} = \text{Actual i-th value} $
$ \hat{Y_{i}} = \text{Predicted i-th value} $
MSE1 = ((out_sample - forecast_model) ** 2).sum() / len(out_sample)
MSE2 = ((out_sample - forecast_simple) ** 2).sum() / len(out_sample)
print(MSE1)
print(MSE2)
354320706.44752467
2179517798.195563
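As an aside, the same computation is available as mean_squared_error in scikit-learn (a quick standalone check with made-up numbers, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([100.0, 110.0, 125.0])
predicted = np.array([102.0, 108.0, 130.0])

# Manual MSE: mean of squared differences
manual = ((actual - predicted) ** 2).sum() / len(actual)
# scikit-learn's version of the same formula
sk = mean_squared_error(actual, predicted)
print(manual, sk)  # both give 11.0
```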
Because the first MSE is so much lower, we can say that the Holt model performed better (at least in this case). Now let's work with seasonality. The code below sets up our data; don't worry about what is actually happening in it.
import math
np.random.seed(1)
growth = np.random.normal(.04, .08, 119)
growth = (growth + 1).cumprod()
growth = np.insert(growth, 0, 1)
sales = pd.Series(1000 * growth)
seasonality = np.sin(np.linspace(0, 2 * math.pi, 12))
seasonality = seasonality / 3 + 1
seasonality = np.tile(seasonality, 10)
sales = sales * seasonality
sales.index = pd.date_range(datetime(2010,1,1), datetime(2019,12,31), freq='MS')
ax = sales.plot(kind='line')
plt.title("Sales")
plt.show()
ax = np.log(sales).plot(kind='line')
plt.title("Log(Sales)")
plt.show()
Holt-Winters¶
The Holt-Winters model does very well at modeling both trend and seasonality. As we can see above, there is clearly seasonality in our data that we want to account for. The Holt-Winters model can take many forms, using either multiplicative or additive components for the trend and the seasonality. As before, we will work through the statsmodels version, but we won't derive or verify the equations this time, since they are more involved and depend on whether the components are additive or multiplicative. Instead, we will compare out-of-sample performance for the different setups.
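For reference, the fully additive form looks like the following (a sketch following standard references, extending the Holt equations above with a seasonal term; the multiplicative variants replace the additions and subtractions involving the seasonal term with multiplications and divisions):
$ \hat{y}_{t+1} = l_{t} + b_{t} + s_{t+1-m} $
$ l_{t+1} = \alpha (y_{t+1} - s_{t+1-m}) + (1-\alpha)(l_{t} + b_{t}) $
$ b_{t+1} = \beta (l_{t+1} - l_{t}) + (1-\beta)b_{t} $
$ s_{t+1} = \gamma (y_{t+1} - l_{t+1}) + (1-\gamma)s_{t+1-m} $
where
$ s_{t} = \text{Seasonal value at time t} $
$ m = \text{Number of seasonal periods (12 for monthly data)} $
$ \gamma = \text{Seasonal smoothing parameter} $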
First, let's set up just a basic model where both trend and seasonality are multiplicative.
#Import and fit the model
from statsmodels.tsa.holtwinters import ExponentialSmoothing
model = ExponentialSmoothing(sales, seasonal_periods=12, trend='mul', seasonal='mul').fit()
#Plot the model vs. the actual
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.legend(['Actual', 'Model'])
plt.show()
When we forecast forward, we see that the seasonality is preserved.
#Forecast 12 months forward
forecast = model.forecast(12)
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
forecast.plot(kind='line')
plt.show()
Now let's split into in-sample and out-of-sample data and see which model setup performs better.
in_sample = sales.loc[:'2018-12-31']
out_sample = sales.loc['2019-01-01':]
#Build 4 different models
model1 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='add').fit()
model2 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='mul').fit()
model3 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='add').fit()
model4 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='mul').fit()
/Users/seanmcowen/anaconda/envs/FinanceAndPython/lib/python3.8/site-packages/statsmodels/tsa/holtwinters.py:743: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
warn("Optimization failed to converge. Check mle_retvals.",
#Forecast ahead 12 periods with all of our new models
models = [model1, model2, model3, model4]
forecasts = pd.concat([model.forecast(12) for model in models], axis=1)
forecasts.columns = ['Model 1', 'Model 2', 'Model 3', 'Model 4']
#Plot the results
ax = out_sample.plot(kind='line')
forecasts.plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 1', 'Model 2', 'Model 3', 'Model 4'])
plt.show()
It quickly becomes obvious that models 1 and 3 are not close to the correct forecast, so let's zoom in on just models 2 and 4 vs. the actual.
ax = out_sample.plot(kind='line')
forecasts[['Model 2', 'Model 4']].plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 2', 'Model 4'])
plt.show()
The MSE confirms just how much better models 2 and 4 fit compared to models 1 and 3. We also see that within this specific timeframe, model 2 looks much better.
MSE = ((forecasts.subtract(out_sample, axis=0)) ** 2).mean()
print(MSE.sort_values())
Model 2 2.011727e+08
Model 4 6.880283e+08
Model 1 2.644744e+10
Model 3 7.780873e+11
dtype: float64
Something to consider with this kind of analysis is that which model performs better can at times come down to luck. Let's run the same analysis, except this time fit the models on data through 2017 and predict 2018.
in_sample = sales.loc[:'2017-12-31']
out_sample = sales.loc['2018-01-01':'2018-12-31']
#Fit the models
model1 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='add').fit()
model2 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='mul').fit()
model3 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='add').fit()
model4 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='mul').fit()
/Users/seanmcowen/anaconda/envs/FinanceAndPython/lib/python3.8/site-packages/statsmodels/tsa/holtwinters.py:743: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
warn("Optimization failed to converge. Check mle_retvals.",
#Forecast ahead with all models
models = [model1, model2, model3, model4]
forecasts = pd.concat([model.forecast(12) for model in models], axis=1)
forecasts.columns = ['Model 1', 'Model 2', 'Model 3', 'Model 4']
#Plot the models
ax = out_sample.plot(kind='line')
forecasts.plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 1', 'Model 2', 'Model 3', 'Model 4'])
plt.show()
We confirm what we saw in the last test: model 1 and model 3 are not great fits. Let's zoom in again.
ax = out_sample.plot(kind='line')
forecasts[['Model 2', 'Model 4']].plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 2', 'Model 4'])
plt.show()
Both stay very close to the actual values, which shows they are still good candidates. This time, when we compute the MSE, model 4 is actually the best choice. This goes to show that randomness is always present and may change which model looks best.
MSE = ((forecasts.subtract(out_sample, axis=0)) ** 2).mean()
print(MSE.sort_values())
Model 4 1.394074e+07
Model 2 2.769388e+07
Model 1 5.667049e+09
Model 3 1.356650e+11
dtype: float64