Regression
Regression Modeling
We are going to model the search index time series along two dimensions: the trend over time and the seasonality. Our earlier exploratory analysis showed that this approach is promising. To begin, recreate the data sample for Airbnb from 2012 through 2019, and add columns for the year, month, and season.
#Get the sample data
airbnb = df.loc['2012-01-01':'2019-12-31']['Airbnb'].reset_index()
airbnb.columns = ['Date', 'Search']
#Create the columns for year, month and season
airbnb['Year'] = airbnb['Date'].dt.year
airbnb['Month'] = airbnb['Date'].dt.month
airbnb['Season'] = airbnb['Month'].map({1: 'Winter',
                                        2: 'Winter',
                                        3: 'Spring',
                                        4: 'Spring',
                                        5: 'Spring',
                                        6: 'Summer',
                                        7: 'Summer',
                                        8: 'Summer',
                                        9: 'Fall',
                                        10: 'Fall',
                                        11: 'Fall',
                                        12: 'Winter'})
print(airbnb)
Date Search Year Month Season
0 2012-01-01 2.0 2012 1 Winter
1 2012-02-01 3.0 2012 2 Winter
2 2012-03-01 4.0 2012 3 Spring
3 2012-04-01 4.0 2012 4 Spring
4 2012-05-01 5.0 2012 5 Spring
.. ... ... ... ... ...
91 2019-08-01 83.0 2019 8 Summer
92 2019-09-01 72.0 2019 9 Fall
93 2019-10-01 69.0 2019 10 Fall
94 2019-11-01 64.0 2019 11 Fall
95 2019-12-01 64.0 2019 12 Winter
[96 rows x 5 columns]
A Very Brief Introduction to Linear Regression
Linear regression is a basic modeling tool that describes the linear relationship between a group of explanatory variables and a response variable. Because this is not an econometrics course, we will not cover linear regression in depth. If you have no experience with it then the following resource will be helpful for learning:
If you do not have the background and choose not to read up on linear regression, you only need to know the following to understand the rest of the lecture. Given a set of coefficients and a set of variables, linear regression predicts the response variable as the sum of each coefficient multiplied by its variable.
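As a minimal sketch of that summation, the snippet below multiplies a set of coefficients by one observation's variables and adds up the results. The coefficient values are rounded from the regression output reported later in this lecture, and the observation (a summer month with T = 92) is chosen only for illustration.

```python
import numpy as np

# Coefficients in the order: Fall, Spring, Summer, T, Constant
# (rounded values taken from the regression output later in this lecture)
coefficients = np.array([-3.0751, 5.8567, 10.2033, 0.8984, -9.0144])

# One illustrative observation: a summer month with T = 92 (August 2019)
variables = np.array([0, 0, 1, 92, 1])

# The prediction is the sum of each coefficient times its variable
prediction = np.sum(coefficients * variables)
print(round(prediction, 2))
```

This comes to roughly 83.84, close to the fitted value the lecture reports for August 2019 (83.837622); the small gap comes from rounding the coefficients.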
Dummy Variables
Dummy variables are a way to represent numerically whether or not something is true for a given observation. Season is a categorical variable with 4 categories (Fall, Spring, Summer, Winter), so we can represent it as 4 dummy variables, each indicating whether or not an observation falls in a given season. The pandas function get_dummies returns one column for each category present in the data, with 1 where the observation belongs to that category and 0 where it does not.
#Get the dummy variables
#dtype=int keeps the output as 0/1; newer pandas versions return True/False by default
print(pd.get_dummies(airbnb['Season'], dtype=int))
Fall Spring Summer Winter
0 0 0 0 1
1 0 0 0 1
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
.. ... ... ... ...
91 0 0 1 0
92 1 0 0 0
93 1 0 0 0
94 1 0 0 0
95 0 0 0 1
[96 rows x 4 columns]
The Base Case
When we model our data, we need a base case to compare against, because including a dummy variable for every season would be redundant. If we know that a date is not in Fall, Spring, or Summer, then it must be in Winter! For modeling purposes we therefore leave out Winter as a dummy variable and treat it as the base case, so each remaining seasonal coefficient is measured relative to Winter. Below, we drop the Winter dummy variable and then join the dummy variables to the airbnb data.
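#Drop the Winter dummy (the base case) and join the rest to the data
#dtype=int keeps the dummies as 0/1; newer pandas versions return True/False by default
airbnb = airbnb.join(pd.get_dummies(airbnb['Season'], dtype=int).drop(columns=['Winter']))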
print(airbnb)
Date Search Year Month Season Fall Spring Summer
0 2012-01-01 2.0 2012 1 Winter 0 0 0
1 2012-02-01 3.0 2012 2 Winter 0 0 0
2 2012-03-01 4.0 2012 3 Spring 0 1 0
3 2012-04-01 4.0 2012 4 Spring 0 1 0
4 2012-05-01 5.0 2012 5 Spring 0 1 0
.. ... ... ... ... ... ... ... ...
91 2019-08-01 83.0 2019 8 Summer 0 0 1
92 2019-09-01 72.0 2019 9 Fall 1 0 0
93 2019-10-01 69.0 2019 10 Fall 1 0 0
94 2019-11-01 64.0 2019 11 Fall 1 0 0
95 2019-12-01 64.0 2019 12 Winter 0 0 0
[96 rows x 8 columns]
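To see why keeping all four dummies would be redundant once the model also has a constant column (added later in this lecture), here is a small sketch on a made-up season series, not the lecture's data. It compares the number of columns with the matrix rank; the names full and reduced are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up season labels, three years of four seasons each
seasons = pd.Series(['Winter', 'Spring', 'Summer', 'Fall'] * 3, name='Season')
dummies = pd.get_dummies(seasons, dtype=int)

# All four dummies plus a constant: the dummies sum to the constant column
full = dummies.assign(Constant=1)

# Winter dropped as the base case, constant kept
reduced = dummies.drop(columns=['Winter']).assign(Constant=1)

full_rank = np.linalg.matrix_rank(full.to_numpy())
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print("full:   ", full.shape[1], "columns, rank", full_rank)
print("reduced:", reduced.shape[1], "columns, rank", reduced_rank)
```

The full matrix has five columns but only rank four, so one column carries no new information. After dropping Winter, the column count and the rank agree.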
Trend Component
As for the trend component, we saw that the overall search index follows a linear trend over time. Because of this, we want a variable that increases linearly with time to describe how far along we are. If we take T=1 for the first month, January 2012, and add 1 for each subsequent month, we get a linear trend variable to use in modeling.
#Create the trend component
airbnb['T'] = (airbnb['Year'] - 2012) * 12 + airbnb['Month']
print(airbnb)
Date Search Year Month Season Fall Spring Summer T
0 2012-01-01 2.0 2012 1 Winter 0 0 0 1
1 2012-02-01 3.0 2012 2 Winter 0 0 0 2
2 2012-03-01 4.0 2012 3 Spring 0 1 0 3
3 2012-04-01 4.0 2012 4 Spring 0 1 0 4
4 2012-05-01 5.0 2012 5 Spring 0 1 0 5
.. ... ... ... ... ... ... ... ... ..
91 2019-08-01 83.0 2019 8 Summer 0 0 1 92
92 2019-09-01 72.0 2019 9 Fall 1 0 0 93
93 2019-10-01 69.0 2019 10 Fall 1 0 0 94
94 2019-11-01 64.0 2019 11 Fall 1 0 0 95
95 2019-12-01 64.0 2019 12 Winter 0 0 0 96
[96 rows x 9 columns]
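For monthly data with one row per month and no gaps, the formula above is equivalent to a simple running counter. The sketch below checks this on a date range built with pd.date_range, which stands in for the lecture's data; the names dates and counter are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the lecture's sample: one row per month, 2012 through 2019
dates = pd.DataFrame({'Date': pd.date_range('2012-01-01', '2019-12-01', freq='MS')})

# Trend variable computed the same way as in the lecture
dates['T'] = (dates['Date'].dt.year - 2012) * 12 + dates['Date'].dt.month

# A plain counter 1, 2, ..., N over the rows
counter = np.arange(1, len(dates) + 1)

# True when the two constructions agree on every row
matches = bool((dates['T'].to_numpy() == counter).all())
print(len(dates), matches)
```

The year-and-month formula has the advantage that it stays correct even if some months are missing from the sample, whereas a row counter would silently compress the gap.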
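Finally, we also need a constant, as usual: a column of ones that lets the model estimate an intercept. The statsmodels OLS class does not add one automatically.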
airbnb['Constant'] = 1
print(airbnb)
Date Search Year Month Season Fall Spring Summer T Constant
0 2012-01-01 2.0 2012 1 Winter 0 0 0 1 1
1 2012-02-01 3.0 2012 2 Winter 0 0 0 2 1
2 2012-03-01 4.0 2012 3 Spring 0 1 0 3 1
3 2012-04-01 4.0 2012 4 Spring 0 1 0 4 1
4 2012-05-01 5.0 2012 5 Spring 0 1 0 5 1
.. ... ... ... ... ... ... ... ... .. ...
91 2019-08-01 83.0 2019 8 Summer 0 0 1 92 1
92 2019-09-01 72.0 2019 9 Fall 1 0 0 93 1
93 2019-10-01 69.0 2019 10 Fall 1 0 0 94 1
94 2019-11-01 64.0 2019 11 Fall 1 0 0 95 1
95 2019-12-01 64.0 2019 12 Winter 0 0 0 96 1
[96 rows x 10 columns]
Set the date as the index.
#Set the index
airbnb = airbnb.set_index('Date')
We can run a regression model using statsmodels. First we need our X variables and the Y variable we are predicting. Then we create an OLS object, passing in Y and X, and fit the model. After fitting the model, we can print a summary.
import statsmodels.api as sm
#Fit the model
Y = airbnb['Search']
X = airbnb[['Fall', 'Spring', 'Summer', 'T', 'Constant']]
model = sm.OLS(Y,X).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Search   R-squared:                       0.953
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     460.0
Date:                Wed, 06 Oct 2021   Prob (F-statistic):           1.90e-59
Time:                        00:10:30   Log-Likelihood:                -301.82
No. Observations:                  96   AIC:                             613.6
Df Residuals:                      91   BIC:                             626.5
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Fall          -3.0751      1.668     -1.844      0.068      -6.388       0.237
Spring         5.8567      1.664      3.519      0.001       2.551       9.163
Summer        10.2033      1.665      6.129      0.000       6.896      13.510
T              0.8984      0.021     42.165      0.000       0.856       0.941
Constant      -9.0144      1.545     -5.834      0.000     -12.084      -5.945
==============================================================================
Omnibus:                        1.954   Durbin-Watson:                   0.797
Prob(Omnibus):                  0.376   Jarque-Bera (JB):                1.406
Skew:                           0.014   Prob(JB):                        0.495
Kurtosis:                       2.408   Cond. No.                         247.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
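As a rough illustration of what the least squares fit is doing, the sketch below builds the same kind of design matrix (three season dummies, a trend, and a constant) on synthetic data and solves for the coefficients with numpy. The data and the "true" coefficients are made up for the example and are not the Airbnb series; np.linalg.lstsq is used here in place of statsmodels only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly data (NOT the Airbnb series): 96 months
T = np.arange(1, 97)
month = (T - 1) % 12 + 1

# Season dummies with Winter as the base case
fall = np.isin(month, [9, 10, 11]).astype(float)
spring = np.isin(month, [3, 4, 5]).astype(float)
summer = np.isin(month, [6, 7, 8]).astype(float)
constant = np.ones(len(T))

# Columns in the same order as the lecture: Fall, Spring, Summer, T, Constant
X = np.column_stack([fall, spring, summer, T, constant])

# Made-up coefficients used to generate the response, plus random noise
true_coefs = np.array([-3.0, 6.0, 10.0, 0.9, -9.0])
y = X @ true_coefs + rng.normal(scale=1.0, size=len(T))

# Ordinary least squares: coefficients minimizing the sum of squared errors
estimated, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, value in zip(['Fall', 'Spring', 'Summer', 'T', 'Constant'], estimated):
    print(f"{name:<9}{value:8.3f}")
```

The estimates should land close to the made-up coefficients, which is the same recovery the model above performs on the real search index.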
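The coefficients confirm what we saw graphically: there are strong trend and seasonality components. The trend coefficient implies that the index rises by about 0.9 points per month, and Spring and Summer sit roughly 5.9 and 10.2 points above the Winter base case. The Fall coefficient (-3.08) is not statistically significant at the 5% level (p = 0.068). We can grab the fitted values from this model quite easily.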
#Get the fitted values
predicted = model.fittedvalues
print(predicted)
Date
2012-01-01 -8.116022
2012-02-01 -7.217666
2012-03-01 -0.462622
2012-04-01 0.435734
2012-05-01 1.334089
...
2019-08-01 83.837622
2019-09-01 71.457578
2019-10-01 72.355933
2019-11-01 73.254289
2019-12-01 77.227755
Length: 96, dtype: float64
import matplotlib.pyplot as plt
#Plot the actual and predicted search index
plt.plot(airbnb.index, airbnb['Search'])
plt.plot(predicted.index, predicted.values)
plt.xlabel("Time")
plt.ylabel("Google Search Index")
plt.title("Google Search Index for Airbnb Actual vs. Predicted")
plt.legend(['Actual', 'Predicted'])
plt.show()
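Alongside the plot, the gap between actual and predicted values can be inspected directly. The sketch below is only an illustration: it re-enters the last five months of actual and fitted values printed earlier in this lecture rather than using the full airbnb and predicted objects, and the name residuals is hypothetical.

```python
import pandas as pd

# Last five months of 2019, copied from the printed output in this lecture
dates = pd.to_datetime(['2019-08-01', '2019-09-01', '2019-10-01',
                        '2019-11-01', '2019-12-01'])
actual = pd.Series([83.0, 72.0, 69.0, 64.0, 64.0], index=dates)
fitted = pd.Series([83.837622, 71.457578, 72.355933, 73.254289, 77.227755],
                   index=dates)

# Residual = actual minus predicted; negative means the model over-predicts
residuals = actual - fitted
print(residuals.round(2))
```

In these five months the fitted values sit above the actual index from October onward, by roughly 13 points in December 2019. With the full data, the same subtraction would be airbnb['Search'] - predicted.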