Regression Modeling¶
We are going to model the search index time series along two dimensions: the trend over time and the seasonality. Our earlier exploratory analysis suggested that this approach is promising. To begin, recreate the Airbnb data sample from 2012 through 2019, and add columns for the year, month, and season.
#Get the sample data
airbnb = df.loc['2012-01-01':'2019-12-31']['Airbnb'].reset_index()
airbnb.columns = ['Date', 'Search']
#Create the columns for year, month and season
airbnb['Year'] = airbnb['Date'].dt.year
airbnb['Month'] = airbnb['Date'].dt.month
airbnb['Season'] = airbnb['Month'].map({1: 'Winter',
2: 'Winter',
3: 'Spring',
4: 'Spring',
5: 'Spring',
6: 'Summer',
7: 'Summer',
8: 'Summer',
9: 'Fall',
10: 'Fall',
11: 'Fall',
12: 'Winter'})
print(airbnb)
A Very Brief Introduction to Linear Regression¶
Linear regression is a basic modeling tool that describes the linear relationship between a set of explanatory variables and a response variable. Because this is not a course on econometrics, we will not cover linear regression in depth. If you have no experience with it then the following resource will be helpful for learning:
If you do not have the background and choose not to read further on linear regression, you only need to know the following to understand the rest of the lecture. Given a set of coefficients and a set of variables, linear regression predicts the response variable as the sum of each coefficient multiplied by its corresponding variable.
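As a minimal sketch of that sum, the snippet below computes one prediction by hand. The coefficient values and the choice of variables are hypothetical, picked only to make the arithmetic easy to follow; they are not results from any fitted model.

```python
import numpy as np

# Hypothetical coefficients for illustration only (not fitted values).
# Order: Fall dummy, Spring dummy, trend T, constant.
coefficients = np.array([2.0, -1.0, 0.5, 10.0])

# One observation: a Fall month that is 24 months into the sample.
variables = np.array([1.0, 0.0, 24.0, 1.0])

# The prediction is the sum of each coefficient times its variable:
# 2.0*1 + (-1.0)*0 + 0.5*24 + 10.0*1
prediction = float(np.dot(coefficients, variables))
print(prediction)
```

The constant's variable is always 1, so its coefficient is added to every prediction.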
Dummy Variables¶
Dummy variables are a way to represent numerically whether or not something is true for a given observation. If we have one categorical variable for season with 4 categories (Fall, Spring, Summer, Winter), then we can represent it as 4 dummy variables, each indicating whether or not an observation falls in a given season. The pandas function get_dummies returns one column for each category present in the data, with a 1 (or True) where the observation belongs to that category and a 0 (or False) otherwise.
#Get the dummy variables
print(pd.get_dummies(airbnb['Season']))
The Base Case¶
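When we model our data, we need a base case to compare against, because including every season would be redundant. If we know that a date is not in Fall, Spring, or Summer, then it must be in Winter! Combined with the constant term, a fourth dummy would be an exact linear combination of the other columns, so its coefficient could not be estimated separately. For our modeling purposes, we therefore leave out Winter as a dummy variable and treat it as the base case; each remaining seasonal coefficient then measures the difference from Winter. Below, we drop the Winter dummy variable, then join the dummy variables to the airbnb data.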
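#Request integer dummies: recent pandas versions return booleans by default,
#which statsmodels may reject when mixed with integer columns
airbnb = airbnb.join(pd.get_dummies(airbnb['Season'], dtype=int).drop(columns=['Winter']))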
print(airbnb)
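To see the redundancy concretely, the sketch below compares the rank of the design matrix with and without the Winter dummy. It uses a small made-up season series rather than the Airbnb data, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up season labels for illustration (three of each season).
seasons = pd.Series(['Winter', 'Spring', 'Summer', 'Fall'] * 3, name='Season')

# All four dummies plus a constant column of ones.
dummies = pd.get_dummies(seasons, dtype=int)
dummies['Constant'] = 1

# The four dummies always sum to the constant, so the 5 columns
# span only 4 independent directions.
full_rank = np.linalg.matrix_rank(dummies.to_numpy())
print(dummies.shape[1], full_rank)

# Dropping the base case (Winter) leaves 4 independent columns.
reduced = dummies.drop(columns=['Winter'])
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())
print(reduced.shape[1], reduced_rank)
```

When the rank is lower than the number of columns, the least-squares coefficients are not uniquely determined, which is why one category is left out.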
Trend Component¶
As for the trend component, we saw that the overall search index follows a linear trend over time. Because of this, we want a variable that increases linearly with time to describe how far along we are. If we set T=1 for the first month, January 2012, and add 1 for each subsequent month, we get a linear trend variable to use in modeling.
#Create the trend component
airbnb['T'] = (airbnb['Year'] - 2012) * 12 + airbnb['Month']
print(airbnb)
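As a quick sanity check, the formula above should match a simple running count of months. The sketch below verifies this on a stand-in date range, assuming the data has exactly one row per month with no gaps; the `demo` frame is hypothetical and is not the airbnb DataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in monthly dates (assumption: one row per month, no gaps).
dates = pd.date_range('2012-01-01', '2019-12-01', freq='MS')
demo = pd.DataFrame({'Date': dates})

# Same formula as in the lecture: months elapsed since the start of 2012.
demo['T'] = (demo['Date'].dt.year - 2012) * 12 + demo['Date'].dt.month

# A plain counter 1, 2, ..., n for comparison.
counter = np.arange(1, len(demo) + 1)
matches = bool((demo['T'].to_numpy() == counter).all())
print(len(demo), matches)
```

The year-and-month formula has the advantage that it stays correct even if some months are missing from the sample, whereas a plain counter would not.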
Finally, we also need a constant column of ones, as usual, so that the model estimates an intercept.
airbnb['Constant'] = 1
print(airbnb)
Set the date as the index.
#Set the index
airbnb = airbnb.set_index('Date')
We can run a regression model using statsmodels. First, we need our X variables and the Y variable we are predicting. Then we create an OLS object, passing in Y and X, and fit the model. After fitting the model, we can print a summary.
import statsmodels.api as sm
#Fit the model
Y = airbnb['Search']
X = airbnb[['Fall', 'Spring', 'Summer', 'T', 'Constant']]
model = sm.OLS(Y,X).fit()
print(model.summary())
The coefficients confirm what we saw graphically: there are strong seasonality and trend components. We can easily retrieve the fitted values from this model.
#Get the fitted values
predicted = model.fittedvalues
print(predicted)
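The fitted values are the linear combination described earlier, applied to every row of X. The sketch below illustrates this on synthetic data: the "true" coefficients and the noise are made up and are not the Airbnb results. It also uses NumPy's least-squares solver in place of statsmodels so that the arithmetic is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: 96 months with season dummies, trend, constant.
n = 96
T = np.arange(1, n + 1)
month = (T - 1) % 12 + 1
fall = np.isin(month, [9, 10, 11]).astype(float)
spring = np.isin(month, [3, 4, 5]).astype(float)
summer = np.isin(month, [6, 7, 8]).astype(float)
X = np.column_stack([fall, spring, summer, T, np.ones(n)])

# Made-up response: seasonal shifts, a linear trend, and random noise.
y = 3 * fall + 5 * spring + 10 * summer + 0.8 * T + 5 + rng.normal(0, 1, n)

# Ordinary least squares: the coefficients minimizing squared error.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values: each row of X multiplied by the coefficients and summed.
fitted = X @ coef
residuals = y - fitted
print(coef)
```

Because the model includes a constant, the residuals sum to approximately zero, which is one easy way to check a hand-computed fit.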
#Plot the results
plt.plot(airbnb.index, airbnb['Search'])
plt.plot(predicted.index, predicted.values)
plt.xlabel("Time")
plt.ylabel("Google Search Index")
plt.title("Google Search Index for Airbnb Actual vs. Predicted")
plt.legend(['Actual', 'Predicted'])
plt.show()