Log Transformation
Data Transformations
In the real world, the data we receive often needs some preparation before we can draw insights from it. This lesson covers several data transformations you may want to apply to a dataset in the future.
Sales Data
Let’s introduce our first dataset: sales data from a fictional company. The x-axis is the month t (months since inception), and the y-axis is units sold, in thousands. As you will see, the company has really been taking off lately! You do not yet need to understand how the data is created.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Create the data, ignore for now
np.random.seed(1)
sales = np.cumprod(np.random.normal(1.02, .04, 200))
#Plot the sales
plt.plot(sales)
plt.xlabel("t")
plt.ylabel("Sales (Thousands)")
plt.title("Company XYZ Sales Data")
plt.show()
The Log Transformation
Exponential growth is a classic pattern in which a quantity grows by a roughly constant percentage each period, so the curve bends upward faster and faster. Many techniques assume a linear relationship, which is a problem when the true data is not linear. The log transformation applies the log function to each data point, which in many cases turns exponential growth into something close to a straight line. We can apply it with np.log, which computes the natural log of each element and returns a new array, leaving the original unchanged.
#Transform the sales to log sales
log_sales = np.log(sales)
#Plot the log sales
plt.plot(log_sales)
plt.xlabel("t")
plt.ylabel("Log Sales (Thousands)")
plt.title("Company XYZ Log Sales Data")
plt.show()
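To see why the log makes exponential growth look linear, here is a minimal sketch on a noise-free series. The starting value of 100 and the 5% growth rate are made-up numbers for illustration, not part of the sales data above.

```python
import numpy as np

# Hypothetical noise-free series: starts at 100 and grows 5% each period
t = np.arange(10)
y = 100 * 1.05 ** t

# On the raw scale, each step is bigger than the one before it
raw_steps = np.diff(y)

# On the log scale, every step is the same size: log(1.05)
log_steps = np.diff(np.log(y))

print(raw_steps)
print(log_steps)
print(np.log(1.05))
```

Equal-sized steps are exactly what a straight line looks like, so a constant percentage growth rate becomes a constant slope after the log transformation. The real sales data is noisy, so its log is only approximately linear.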
To reverse a log transformation, use np.exp, which raises e to the power of each value and so undoes the natural log. Notice that the first 5 values in the two printouts below match.
#Compare the reversed values
print(sales[:5])
print(np.exp(log_sales[:5]))
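Comparing printed values by eye only covers a few entries. As a sketch of one way to check the whole array, np.allclose tests whether two arrays match within a small floating-point tolerance. The name recovered is introduced here for illustration, and the data is recreated with the same seed so the block runs on its own.

```python
import numpy as np

# Recreate the sales data with the same seed and parameters as above
np.random.seed(1)
sales = np.cumprod(np.random.normal(1.02, .04, 200))

# Apply the log transformation, then reverse it
log_sales = np.log(sales)
recovered = np.exp(log_sales)

# Check every value, allowing for tiny floating-point rounding differences
print(np.allclose(sales, recovered))

# Largest absolute difference between the original and recovered values
print(np.max(np.abs(sales - recovered)))
```

The round trip works here because every sales value is positive. np.log returns -inf for zero and nan for negative numbers, so data containing those values would need different handling before a log transformation.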