Prediction
The Impact of Outliers
Outliers can skew our results, especially when they come from extreme circumstances that we do not believe represent the normal state of the data. For Airbnb, 2020 clearly differs from every other year. Two major events distorted the data: the coronavirus pandemic, which drove searches down sharply in the spring, and Airbnb's IPO, which drove searches up toward the end of the year. We don't want to throw out data without a good reason, but in the case of 2020 it is justifiable to exclude that year.
# Drop the 2020 column, which we treat as an outlier year
airbnb = airbnb.drop(columns=[2020])
airbnb.plot(kind='line')
plt.show()
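One practical detail: calling `drop` a second time on a column that is already gone raises a `KeyError`, which is easy to hit when re-running notebook cells. The sketch below uses a small made-up table (the name `toy` and its values are illustrative assumptions, not the real search data) to show how `errors="ignore"` makes the step safe to repeat.

```python
import pandas as pd

# Hypothetical toy table in the same Month x Year layout; values are made up.
toy = pd.DataFrame(
    {2018: [40, 55, 45], 2019: [50, 70, 55], 2020: [30, 20, 90]},
    index=pd.Index([1, 2, 3], name="Month"),
)

# errors="ignore" turns the drop into a no-op when the column is missing,
# so running the same cell twice does not raise a KeyError.
toy = toy.drop(columns=[2020], errors="ignore")
toy = toy.drop(columns=[2020], errors="ignore")  # second run is harmless

print(toy.columns.tolist())
```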
Predicting the Trend and Seasonality
Now that we have a picture of those last 6 years, let's change the sample to cover all data between 2012 and 2019.
airbnb = df.loc['2012-01-01':'2019-12-31']['Airbnb'].reset_index()
airbnb.columns = ['Date', 'Search']
airbnb['Year'] = airbnb['Date'].dt.year
airbnb['Month'] = airbnb['Date'].dt.month
# pivot() takes keyword-only arguments in pandas 2.0 and later
airbnb = airbnb.pivot(index='Month', columns='Year', values='Search')
print(airbnb)
If we take the mean of each year's column, a clear and roughly linear upward trend over time becomes apparent.
ax = airbnb.mean().plot(kind='line')
ax.set_xlabel("Year")
ax.set_ylabel("Average Search Index")
ax.set_title("Google Search Index Airbnb Yearly Comparison")
plt.show()
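To put a number on the trend seen in the yearly means, one option is to fit a straight line by least squares. The sketch below is only an illustration: `yearly_mean` holds made-up values standing in for `airbnb.mean()`, and the choice of `np.polyfit` with a first-degree polynomial is an assumption, not necessarily the method used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly means standing in for airbnb.mean(); values are made up.
yearly_mean = pd.Series(
    [10.0, 18.0, 27.0, 35.0, 44.0, 52.0, 61.0, 69.0],
    index=pd.Index(range(2012, 2020), name="Year"),
)

# Measure time as years since 2012 so the intercept is the fitted 2012 level.
x = yearly_mean.index.to_numpy(dtype=float) - 2012
y = yearly_mean.to_numpy()

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 gives a rough sense of how well a straight line describes the data.
fitted = slope * x + intercept
ss_res = ((y - fitted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"slope per year: {slope:.2f}, R^2: {r_squared:.3f}")
```

An R^2 close to 1 would support describing the trend as linear; a noticeably lower value would suggest trying a different functional form.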
Normalizing Yearly Data
If we want to compare seasonality from year to year, we first need to control for the strong trend over time so that the years can be compared on a relative basis. Dividing each column by its sum gives each month's share of the total search index for that year.
airbnb2 = airbnb.divide(airbnb.sum())
print(airbnb2)
As an easy confirmation that we did this correctly, we can check to make sure that the sum of each year is 1, meaning every month is correctly a proportion of the overall yearly search volume.
print(airbnb2.sum())
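The normalization and the sum check can be seen on a tiny example. The table `toy` below is made up for illustration; the point is that `toy.sum()` returns one total per year, `divide` aligns those totals with the columns, and the check uses `np.allclose` because floating-point sums may differ from 1 in the last digits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table in Month x Year layout; values are made up.
toy = pd.DataFrame(
    {2018: [10.0, 30.0, 20.0], 2019: [20.0, 50.0, 30.0]},
    index=pd.Index([1, 2, 3], name="Month"),
)

# toy.sum() is a Series indexed by year; divide() matches it to the columns,
# so every column is divided by its own yearly total.
shares = toy.divide(toy.sum())

# Compare with a tolerance instead of == to allow for rounding error.
assert np.allclose(shares.sum(), 1.0)

print(shares.round(3))
```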
With the overall yearly search volume controlled for, the seasonal pattern stands out. Every year, the share of searches peaks around the summer months, which makes intuitive sense.
from matplotlib.ticker import PercentFormatter
ax = airbnb2.plot(kind='line')
ax.set_xlabel("Month")
ax.set_ylabel("% of Yearly Search Index")
ax.set_title("Google Search Index Airbnb Monthly Comparison")
ax.yaxis.set_major_formatter(PercentFormatter(1))
plt.show()
Averaging each month's share across all years (the row-wise mean, axis=1) gives a picture of the average seasonality and illustrates the overall pattern.
ax = airbnb2.mean(axis=1).plot(kind='line')
ax.set_xlabel("Month")
ax.set_ylabel("% of Yearly Search Index")
ax.set_title("Google Search Index Airbnb Monthly Comparison")
ax.yaxis.set_major_formatter(PercentFormatter(1))
plt.show()
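One possible way to turn a trend estimate and the average seasonal shares into monthly values is to multiply a projected yearly total by each month's share. The sketch below is a minimal illustration under stated assumptions: `projected_yearly_mean` and `avg_share` are made-up numbers, not results from the data above, and this may not be the approach the rest of the notebook takes.

```python
import pandas as pd

# Hypothetical inputs; both are made-up numbers for illustration only.
projected_yearly_mean = 80.0  # assumed average monthly index for a future year
avg_share = pd.Series(
    [0.07, 0.07, 0.08, 0.08, 0.09, 0.10, 0.11, 0.10, 0.08, 0.08, 0.07, 0.07],
    index=pd.Index(range(1, 13), name="Month"),
)

# The shares were defined relative to the yearly sum, so convert the
# projected monthly average back into a yearly total first.
projected_total = projected_yearly_mean * len(avg_share)

# Each month's projected value is its share of the projected yearly total.
monthly_forecast = avg_share * projected_total

print(monthly_forecast.round(1))
```

Because the shares sum to 1, the monthly values add back up to the projected yearly total, so the seasonal shape is applied without changing the overall level.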