Applying a Normalization
We see that our model doesn’t do a great job of classifying our points. Much of this is because the two features are on very different scales, so the feature with the larger spread dominates the distance calculation. One way to fix this is to standardize the data, expressing each value as the number of standard deviations it lies from its feature's mean. This controls for both the mean and the variance of the different features. The first step is to find the mean (mu) and the standard deviation (std) of each feature.
#Find mu
mu = X.mean()
#Find std
std = X.std()
Now we can scale all the different data points by subtracting the mean and dividing by the standard deviation.
#Scale the X features
X_scaled = (X - mu) / std
print(X_scaled)
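As a quick sanity check on the scaling step above, each standardized feature should end up with a mean of about 0 and a standard deviation of about 1. The sketch below uses a small synthetic frame (X_demo, with made-up values) as a stand-in for X, so the names and numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X: two features on very different scales
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'X1': rng.normal(loc=0.0, scale=1.0, size=300),
    'X2': rng.normal(loc=50.0, scale=40.0, size=300),
})

# Column-wise mean and standard deviation (pandas uses the sample std, ddof=1)
mu_demo = X_demo.mean()
std_demo = X_demo.std()

# Subtract the mean and divide by the standard deviation, column by column
X_demo_scaled = (X_demo - mu_demo) / std_demo

# Each column should now have mean close to 0 and std close to 1
print(X_demo_scaled.mean())
print(X_demo_scaled.std())
```

Note that pandas computes the sample standard deviation by default, so results can differ slightly from tools that divide by n instead of n - 1.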
#Plot the scaled data
fig, ax = plt.subplots()
X_scaled[:100].plot.scatter(x='X1', y='X2', label='A', ax=ax, color='red')
X_scaled[100:200].plot.scatter(x='X1', y='X2', label='B', ax=ax, color='green')
X_scaled[200:300].plot.scatter(x='X1', y='X2', label='C', ax=ax, color='blue')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.show()
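Of course, we can also easily convert from scaled values back to the original units by doing the opposite: multiply by the standard deviation and add the mean. This is helpful because it means that we can build our KMeans model on the scaled points and still show the results in the original units. Here, we scale the plotting grid the same way before predicting, then draw the predicted regions over the original data.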
#Fit the model
model = KMeans(n_clusters=3, random_state=1).fit(X_scaled)
#Convert the grids to be scaled (mu and std are indexed by column name)
x_grid_scaled = (x_grid - mu['X1']) / std['X1']
y_grid_scaled = (y_grid - mu['X2']) / std['X2']
#Grab the colors we want to map to
cmap = colors.ListedColormap(['green', 'blue', 'red'])
#Predict and reshape
z = model.predict(np.vstack([x_grid_scaled.ravel(), y_grid_scaled.ravel()]).T)
z = z.reshape(x_grid.shape)
#Set up our plots
fig, ax = plt.subplots()
#Plot all the actual data
A.plot.scatter(x='X1', y='X2', label='A', ax=ax, color='red')
B.plot.scatter(x='X1', y='X2', label='B', ax=ax, color='green')
C.plot.scatter(x='X1', y='X2', label='C', ax=ax, color='blue')
#Plot the regions
ax.imshow(z, interpolation='nearest',
          extent=(-125, 125, -125, 125),
          cmap=cmap,
          alpha=.4,
          aspect='auto', origin='lower')
plt.show()
Now we see much better classification because the features have been standardized to a common scale!
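The conversion from scaled values back to the original units mentioned earlier can be sketched as a round trip. The helper names scale and unscale and the X_demo values are hypothetical, chosen only for illustration. The same inverse step could be applied to any point expressed in scaled coordinates, for example a cluster center.

```python
import numpy as np
import pandas as pd

def scale(points, mu, std):
    """Standardize: subtract the mean, divide by the standard deviation."""
    return (points - mu) / std

def unscale(points_scaled, mu, std):
    """Invert the standardization: multiply by std, then add the mean."""
    return points_scaled * std + mu

# Hypothetical stand-in for X, with made-up values
X_demo = pd.DataFrame({
    'X1': [1.0, 2.0, 3.0, 4.0],
    'X2': [10.0, 50.0, 90.0, 130.0],
})
mu_demo = X_demo.mean()
std_demo = X_demo.std()

# Round trip: scaling and then unscaling recovers the original data
X_back = unscale(scale(X_demo, mu_demo, std_demo), mu_demo, std_demo)
print(np.allclose(X_back, X_demo))

# A point at the origin of the scaled space maps back to the feature means
center_scaled = np.array([0.0, 0.0])
center_original = unscale(center_scaled, mu_demo.to_numpy(), std_demo.to_numpy())
print(center_original)
```

Keeping mu and std around after fitting is what makes this possible: any new point must be scaled with the same values that were computed from the training data.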