Modeling Growth with Holt Winters¶
We've shown a few examples thus far of modeling trend and seasonality, and in this part we will expand this to include exponential growth and forecasting with our model. Before we dive in, I am going to create dummy data to use. We can think of this data as sales data for a local startup. Don't worry about what is happening in the code below.
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1)
growth = np.random.normal(.04, .08, 119)
growth = (growth + 1).cumprod()
growth = np.insert(growth, 0, 1)
sales = pd.Series(1000 * growth)
sales.index = pd.date_range(datetime(2010,1,1), datetime(2019,12,31), freq='MS')
sales.plot(kind='line')
plt.show()
Using a Log Transform¶
Often we have data that has a pattern of exponential growth. For this type of data, we can use a log transform to turn it into a more linear series of data.
log_sales = np.log(sales)
log_sales.plot(kind='line')
plt.show()
Holt Linear Trend Model¶
The Holt linear trend model is a useful model for forecasting. First, we will create and fit the model on our log-transformed data, then work through what the model is actually doing.
from statsmodels.tsa.holtwinters import Holt
#Create and fit the model to our log sales values
model = Holt(log_sales).fit()
The fitted model has an attribute fittedvalues, which returns the in-sample values fit by the model.
print(model.fittedvalues)
2010-01-01 6.917900
2010-02-01 6.947951
2010-03-01 7.093511
2010-04-01 7.098411
2010-05-01 7.096797
...
2019-08-01 11.580638
2019-09-01 11.638957
2019-10-01 11.583040
2019-11-01 11.611002
2019-12-01 11.528412
Freq: MS, Length: 120, dtype: float64
#Plot the actual and fitted values
log_sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.legend(['Actual', 'Fitted'])
plt.show()
np.exp() is the inverse of np.log(): it applies $ e^{x} $ to the data, where x is each data point. Let's plot the actual sales against the fitted values after applying np.exp().
sales.plot(kind='line')
np.exp(model.fittedvalues).plot(kind='line')
plt.legend(['Actual', 'Fitted'])
plt.show()
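As a quick standalone check (with made-up values, not the sales data) that np.exp really does undo np.log:

```python
import numpy as np

x = np.array([1.0, 10.0, 1000.0])
# Applying exp after log recovers the original values elementwise
roundtrip = np.exp(np.log(x))
print(np.allclose(roundtrip, x))  # True
```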
Holt Linear Trend Model Equation¶
The model is defined by three equations:
$ \hat{y_{t+1}} = l_{t} + b_{t}$
$ l_{t+1} = \alpha y_{t+1} + (1-\alpha)(l_{t} + b_{t})$
$ b_{t+1} = \beta (l_{t+1} - l_{t}) + (1-\beta)b_{t}$
where
$ y_{t} = \text{Actual Value at time t} $
$ \hat{y_{t}} = \text{Predicted Value at time t} $
$ l_{t} = \text{Level value at time t} $
$ b_{t} = \text{Trend value at time t} $
$ \alpha = \text{Level smoothing parameter} $
$ \beta = \text{Trend smoothing parameter} $
You can find the parameters in the model from the attribute params.
print(model.params)
{'smoothing_level': 0.9101797580094325, 'smoothing_slope': 0.0, 'smoothing_seasonal': nan, 'damping_slope': nan, 'initial_level': 6.878615835752972, 'initial_slope': 0.03928418793573401, 'initial_seasons': array([], dtype=float64), 'use_boxcox': False, 'lamda': None, 'remove_bias': False}
From here we will grab the alpha, beta, initial slope, and initial level.
alpha = model.params['smoothing_level']
beta = model.params['smoothing_slope']
initial_slope = model.params['initial_slope']
initial_level = model.params['initial_level']
We can also create a table to hold all the relevant information that we want to replicate.
table = pd.concat([log_sales, model.level, model.slope, model.fittedvalues], axis=1)
table.columns = ['Actual', 'Level', 'Slope', 'Predicted']
print(table)
Actual Level Slope Predicted
2010-01-01 6.907755 6.908666 0.039284 6.917900
2010-02-01 7.064714 7.054227 0.039284 6.947951
2010-03-01 7.055734 7.059127 0.039284 7.093511
2010-04-01 7.053477 7.057513 0.039284 7.098411
2010-05-01 7.006556 7.014661 0.039284 7.096797
... ... ... ... ...
2019-08-01 11.601551 11.599673 0.039284 11.580638
2019-09-01 11.534361 11.543756 0.039284 11.638957
2019-10-01 11.570601 11.571718 0.039284 11.583040
2019-11-01 11.477101 11.489128 0.039284 11.611002
2019-12-01 11.599065 11.592719 0.039284 11.528412
[120 rows x 4 columns]
Let's first compute the prediction for the first month as a check. We have the initial slope and initial level to use in the equations.
#Compute the predicted
predicted = initial_level + initial_slope
print("Predicted: {}".format(predicted))
#Compute the new level
level = alpha * log_sales.loc['2010-01-01'] + (1-alpha) * (initial_slope + initial_level)
print("Level: {}".format(level))
#Compute the new trend
b = beta * (level - initial_level) + (1-beta) * initial_slope
print("Trend: {}".format(b))
Predicted: 6.917900023688706
Level: 6.908666482406613
Trend: 0.03928418793573401
We get exactly the same values as the model reported. We can loop through each date to confirm the values of the full table.
#Use a list to keep track of each period
new_table = []
#Initialize the starting values of level and trend
b_prior = initial_slope
level_prior = initial_level
#Loop through all values
for y in log_sales.values:
#Compute predicted
predicted = level_prior + b_prior
#Compute the new level
level = alpha * y + (1-alpha) * (b_prior + level_prior)
#Compute the new trend
b = beta * (level - level_prior) + (1-beta) * b_prior
#Add all values to the table
new_table.append([y, level, b, predicted])
#Set the old values to the ones computed
b_prior = b
level_prior = level
#Turn it into a dataframe with the matching index and columns
new_table = pd.DataFrame(new_table, index=table.index, columns=table.columns)
print(new_table)
Actual Level Slope Predicted
2010-01-01 6.907755 6.908666 0.039284 6.917900
2010-02-01 7.064714 7.054227 0.039284 6.947951
2010-03-01 7.055734 7.059127 0.039284 7.093511
2010-04-01 7.053477 7.057513 0.039284 7.098411
2010-05-01 7.006556 7.014661 0.039284 7.096797
... ... ... ... ...
2019-08-01 11.601551 11.599673 0.039284 11.580638
2019-09-01 11.534361 11.543756 0.039284 11.638957
2019-10-01 11.570601 11.571718 0.039284 11.583040
2019-11-01 11.477101 11.489128 0.039284 11.611002
2019-12-01 11.599065 11.592719 0.039284 11.528412
[120 rows x 4 columns]
We have confirmed that we are able to replicate the model! This is great news. Now, what if we had a non-linear set of data? Would this still work? Let's fit the model on the version that is not log-transformed and see what happens.
model = Holt(sales).fit()
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.show()
Notice that the model still tracks the data well. Also notice that the slope now changes each period, because the underlying trend is no longer linear.
table = pd.concat([sales, model.level, model.slope, model.fittedvalues], axis=1)
table.columns = ['Actual', 'Level', 'Slope', 'Predicted']
print(table)
Actual Level Slope Predicted
2010-01-01 1000.000000 978.074327 51.203091 944.735108
2010-02-01 1169.947629 1114.138436 60.213753 1029.277418
2010-03-01 1159.487697 1165.385003 59.261605 1174.352189
2010-04-01 1156.874513 1183.762266 54.920448 1224.646608
2010-05-01 1103.846289 1157.340999 46.283470 1238.682714
... ... ... ... ...
2019-08-01 109267.176869 107853.520440 2678.707461 105703.976305
2019-09-01 102166.703656 105485.623519 2142.851913 110532.227901
2019-10-01 105937.100392 106608.132840 2034.510502 107628.475431
2019-11-01 96480.965355 101305.962983 1255.491466 108642.643342
2019-12-01 108995.841014 106443.076507 1667.647554 102561.454449
[120 rows x 4 columns]
Forecasting¶
This model can forecast into the future from the latest fitted values. To do so, call forecast on the model with the number of periods ahead that you want to forecast. Below is a forecast for 10 periods ahead.
forecast = model.forecast(10)
print(forecast)
2020-01-01 108110.724061
2020-02-01 109778.371614
2020-03-01 111446.019168
2020-04-01 113113.666722
2020-05-01 114781.314276
2020-06-01 116448.961829
2020-07-01 118116.609383
2020-08-01 119784.256937
2020-09-01 121451.904490
2020-10-01 123119.552044
Freq: MS, dtype: float64
#Plot the forecast with the actual vs. the model
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
forecast.plot(kind='line')
plt.legend(['Actual', 'Model', 'Forecasted'])
plt.show()
Validation with Out of Sample Data¶
It is good practice to validate models on out-of-sample data. This means holding back a portion of the data and testing different models against it to see which one performs best. Let's split the data so that the first 9 years are in-sample and the last year is out-of-sample.
#Split the data into in_sample and out_sample
in_sample = sales.loc[:'2018-12-31']
out_sample = sales.loc['2019-01-01':]
Let's do two forecasts: one with the Holt model, and another that simply assumes 6% growth every month. Then we can plot the two against the actual.
#Fit the model
model = Holt(in_sample).fit()
#Forecast with the model
forecast_model = model.forecast(12)
#Do a simple model
forecast_simple = np.cumprod([1.06]*12) * in_sample.iloc[-1]
forecast_simple = pd.Series(forecast_simple, index=out_sample.index)
out_sample.plot(kind='line')
forecast_model.plot(kind='line')
forecast_simple.plot(kind='line')
plt.legend(['Actual', 'Holt', 'Simple'])
plt.show()
Mean Squared Error¶
The mean squared error is a good measure of how close a forecast is to the actual data. The lower the number, the better the forecast performed. The formula is:
$ \frac{1}{N}\sum_{i=1}^{N}(Y_{i} - \hat{Y_{i}})^2$
where
$ N = \text{Number of data points} $
$ Y_{i} = \text{Actual i-th value} $
$ \hat{Y_{i}} = \text{Predicted i-th value} $
MSE1 = ((out_sample - forecast_model) ** 2).sum() / len(out_sample)
MSE2 = ((out_sample - forecast_simple) ** 2).sum() / len(out_sample)
print(MSE1)
print(MSE2)
354320706.44752467
2179517798.195563
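As an aside, the same computation is available as mean_squared_error in scikit-learn (a quick standalone check with made-up numbers, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([100.0, 110.0, 125.0])
predicted = np.array([102.0, 108.0, 130.0])

# Manual MSE: mean of squared differences
manual = ((actual - predicted) ** 2).sum() / len(actual)
# scikit-learn's version of the same formula
sk = mean_squared_error(actual, predicted)
print(manual, sk)  # both give 11.0
```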
Because the first MSE is so much lower, we can say that the Holt model performed better (at least in this case). Now let's work with seasonality. The code below sets up our data; don't worry about what is actually happening in it.
import math
np.random.seed(1)
growth = np.random.normal(.04, .08, 119)
growth = (growth + 1).cumprod()
growth = np.insert(growth, 0, 1)
sales = pd.Series(1000 * growth)
seasonality = np.sin(np.linspace(0, 2 * math.pi, 12))
seasonality = seasonality / 3 + 1
seasonality = np.tile(seasonality, 10)
sales = sales * seasonality
sales.index = pd.date_range(datetime(2010,1,1), datetime(2019,12,31), freq='MS')
ax = sales.plot(kind='line')
plt.title("Sales")
plt.show()
ax = np.log(sales).plot(kind='line')
plt.title("Log(Sales)")
plt.show()
Holt-Winters¶
The Holt-Winters model does very well at modeling both trend and seasonality. As we can see above, there is clearly seasonality in our data that we want to account for. The Holt-Winters model can take many forms, using either multiplicative or additive components for the trend and the seasonality. As before, we will work through the statsmodels version, but we won't derive or verify the equations this time, since they are more involved and depend on whether the components are additive or multiplicative. Instead, we will compare out-of-sample performance for the different setups.
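For reference, the fully additive form looks like the following (a sketch following standard references, extending the Holt equations above with a seasonal term; the multiplicative variants replace the additions and subtractions involving the seasonal term with multiplications and divisions):
$ \hat{y}_{t+1} = l_{t} + b_{t} + s_{t+1-m} $
$ l_{t+1} = \alpha (y_{t+1} - s_{t+1-m}) + (1-\alpha)(l_{t} + b_{t}) $
$ b_{t+1} = \beta (l_{t+1} - l_{t}) + (1-\beta)b_{t} $
$ s_{t+1} = \gamma (y_{t+1} - l_{t+1}) + (1-\gamma)s_{t+1-m} $
where
$ s_{t} = \text{Seasonal value at time t} $
$ m = \text{Number of seasonal periods (12 for monthly data)} $
$ \gamma = \text{Seasonal smoothing parameter} $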
First, let's set up just a basic model where both trend and seasonality are multiplicative.
#Import and fit the model
from statsmodels.tsa.holtwinters import ExponentialSmoothing
model = ExponentialSmoothing(sales, seasonal_periods=12, trend='mul', seasonal='mul').fit()
#Plot the model vs. the actual
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
plt.legend(['Actual', 'Model'])
plt.show()
When we forecast forward, we see that the seasonality is preserved.
#Forecast 12 months forward
forecast = model.forecast(12)
sales.plot(kind='line')
model.fittedvalues.plot(kind='line')
forecast.plot(kind='line')
plt.show()
Now let's split into in-sample and out-of-sample data and see which model setup performs better.
in_sample = sales.loc[:'2018-12-31']
out_sample = sales.loc['2019-01-01':]
#Build 4 different models
model1 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='add').fit()
model2 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='mul').fit()
model3 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='add').fit()
model4 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='mul').fit()
/Users/seanmcowen/anaconda/envs/FinanceAndPython/lib/python3.8/site-packages/statsmodels/tsa/holtwinters.py:743: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
warn("Optimization failed to converge. Check mle_retvals.",
#Forecast ahead 12 periods with all of our new models
models = [model1, model2, model3, model4]
forecasts = pd.concat([model.forecast(12) for model in models], axis=1)
forecasts.columns = ['Model 1', 'Model 2', 'Model 3', 'Model 4']
#Plot the results
ax = out_sample.plot(kind='line')
forecasts.plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 1', 'Model 2', 'Model 3', 'Model 4'])
plt.show()
It quickly becomes obvious that models 1 and 3 are not close to the correct forecast, so let's zoom in on just models 2 and 4 vs. the actual.
ax = out_sample.plot(kind='line')
forecasts[['Model 2', 'Model 4']].plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 2', 'Model 4'])
plt.show()
The MSE confirms just how much better models 2 and 4 fit compared to models 1 and 3. We also see that within this specific timeframe, model 2 looks much better.
MSE = ((forecasts.subtract(out_sample, axis=0)) ** 2).mean()
print(MSE.sort_values())
Model 2 2.011727e+08
Model 4 6.880283e+08
Model 1 2.644744e+10
Model 3 7.780873e+11
dtype: float64
Something to consider with this kind of analysis is that which model performs better can at times come down to luck. Let's run the same analysis, except this time fit the models on data through 2017 and predict 2018.
in_sample = sales.loc[:'2017-12-31']
out_sample = sales.loc['2018-01-01':'2018-12-31']
#Fit the models
model1 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='add').fit()
model2 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='add', seasonal='mul').fit()
model3 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='add').fit()
model4 = ExponentialSmoothing(in_sample, seasonal_periods=12, trend='mul', seasonal='mul').fit()
/Users/seanmcowen/anaconda/envs/FinanceAndPython/lib/python3.8/site-packages/statsmodels/tsa/holtwinters.py:743: ConvergenceWarning: Optimization failed to converge. Check mle_retvals.
warn("Optimization failed to converge. Check mle_retvals.",
#Forecast ahead with all models
models = [model1, model2, model3, model4]
forecasts = pd.concat([model.forecast(12) for model in models], axis=1)
forecasts.columns = ['Model 1', 'Model 2', 'Model 3', 'Model 4']
#Plot the models
ax = out_sample.plot(kind='line')
forecasts.plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 1', 'Model 2', 'Model 3', 'Model 4'])
plt.show()
We confirm what we saw in the last test: model 1 and model 3 are not great fits. Let's zoom in again.
ax = out_sample.plot(kind='line')
forecasts[['Model 2', 'Model 4']].plot(kind='line', ax=ax)
plt.legend(['Actual', 'Model 2', 'Model 4'])
plt.show()
Both stay very close to the actual values, which shows they are still good candidates. This time, when we compute the MSE, model 4 is actually the best choice. This goes to show that randomness is always present and may change which model looks best.
MSE = ((forecasts.subtract(out_sample, axis=0)) ** 2).mean()
print(MSE.sort_values())
Model 4 1.394074e+07
Model 2 2.769388e+07
Model 1 5.667049e+09
Model 3 1.356650e+11
dtype: float64