Visualizing the Algorithm
To better understand how the algorithm works, we will visualize it. First, let's build three sets of data points, each with just two features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
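#Seed the random generator so the data is reproducible
np.random.seed(0)
#Build three sets of 100 points, each normally distributed around a different center
A = pd.DataFrame(np.random.normal(0,5, (100,2)) + (20,20), columns=['X1', 'X2'])
B = pd.DataFrame(np.random.normal(0,5, (100,2)) + (0,20), columns=['X1', 'X2'])
C = pd.DataFrame(np.random.normal(0,5, (100,2)) + (10,0), columns=['X1', 'X2'])
#Plot the three sets on the same axes
fig, ax = plt.subplots()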
A.plot.scatter(x='X1', y='X2', label='A', ax=ax, color='red')
B.plot.scatter(x='X1', y='X2', label='B', ax=ax, color='green')
C.plot.scatter(x='X1', y='X2', label='C', ax=ax, color='blue')
plt.title("Base Data")
plt.show()
Granted, in the real world we would not have the actual categories (and if we did, we would not need clustering at all), but here they are useful for understanding how the algorithm works. Regardless, fit the KMeans algorithm on the combined data.
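from sklearn.cluster import KMeans
#Combine the three sets into one dataset, without the category labels
X = pd.concat([A,B,C])
#Fit KMeans with three clusters
model = KMeans(n_clusters=3, random_state=0).fit(X)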
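Before plotting anything, it can help to check what the fitted model learned. The following is a minimal sketch that rebuilds the same data and prints the cluster centers and cluster sizes. Setting `n_init=10` explicitly is an assumption made here so the result does not depend on the library's default; the lesson's own call does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Rebuild the same three sets of points used in this lesson
np.random.seed(0)
A = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (20, 20), columns=['X1', 'X2'])
B = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (0, 20), columns=['X1', 'X2'])
C = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (10, 0), columns=['X1', 'X2'])
X = pd.concat([A, B, C])

# n_init is set explicitly (an assumption, not part of the lesson's call)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# One row per cluster, one column per feature
centers = pd.DataFrame(model.cluster_centers_, columns=['X1', 'X2'])
print(centers.round(1))

# Number of points assigned to each cluster label
cluster_sizes = pd.Series(model.labels_).value_counts().sort_index()
print(cluster_sizes)
```

The learned centers should land close to the centers used to generate the data, (20, 20), (0, 20) and (10, 0), though which label goes with which center is arbitrary.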
To visualize the result, we want to see where the boundaries between the different groups fall. To do this, we need to find which group each point in the plane would be assigned to. np.meshgrid takes two 1D arrays and builds every combination of their values, which lets us collect x and y values across a whole grid. Initialize two arrays with 1000 evenly spaced numbers between -15 and 35 (the approximate bounds of the data) and call np.meshgrid to get back all of the x and y points we want to test.
#Grab all the x and y points
x_grid, y_grid = np.meshgrid(np.linspace(-15, 35, 1000), np.linspace(-15, 35, 1000))
print(x_grid)
print()
print()
print(y_grid)
[[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]
[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]
[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]
...
[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]
[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]
[-15. -14.94994995 -14.8998999 ... 34.8998999 34.94994995
35. ]]
[[-15. -15. -15. ... -15. -15.
-15. ]
[-14.94994995 -14.94994995 -14.94994995 ... -14.94994995 -14.94994995
-14.94994995]
[-14.8998999 -14.8998999 -14.8998999 ... -14.8998999 -14.8998999
-14.8998999 ]
...
[ 34.8998999 34.8998999 34.8998999 ... 34.8998999 34.8998999
34.8998999 ]
[ 34.94994995 34.94994995 34.94994995 ... 34.94994995 34.94994995
34.94994995]
[ 35. 35. 35. ... 35. 35.
35. ]]
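The 1000 by 1000 grid is too large to read, so here is a small sketch of the same call on tiny arrays. The names `xs`, `ys`, `x_small` and `y_small` are made up for this illustration.

```python
import numpy as np

# Three x values and two y values
xs = np.array([0, 1, 2])
ys = np.array([10, 20])

# Each output has one row per y value and one column per x value
x_small, y_small = np.meshgrid(xs, ys)

print(x_small)
# [[0 1 2]
#  [0 1 2]]
print(y_small)
# [[10 10 10]
#  [20 20 20]]
```

Reading the two arrays at the same position gives one (x, y) pair, so together they cover all six combinations.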
For prediction, we need a 2D array with one row per grid point and one column per feature. The function ravel flattens each grid to 1D; from there we can stack the two arrays and transpose to get our data for prediction.
print(np.vstack([x_grid.ravel(), y_grid.ravel()]).T)
[[-15. -15. ]
[-14.94994995 -15. ]
[-14.8998999 -15. ]
...
[ 34.8998999 35. ]
[ 34.94994995 35. ]
[ 35. 35. ]]
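The same flatten, stack and transpose steps can be followed on a small grid. This is a sketch with made-up names (`x_small`, `y_small`, `points`); it also shows that `np.column_stack` gives the same result in one call, which is an alternative and not what the lesson uses.

```python
import numpy as np

x_small, y_small = np.meshgrid(np.array([0, 1, 2]), np.array([10, 20]))

# ravel flattens each grid row by row; vstack gives shape (2, 6); .T gives (6, 2)
points = np.vstack([x_small.ravel(), y_small.ravel()]).T
print(points.shape)  # (6, 2)
print(points)
# [[ 0 10]
#  [ 1 10]
#  [ 2 10]
#  [ 0 20]
#  [ 1 20]
#  [ 2 20]]

# Alternative: column_stack places the 1D arrays side by side as columns
same = np.column_stack([x_small.ravel(), y_small.ravel()])
print(np.array_equal(points, same))  # True
```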
Now predict the label for each point in the grid and reshape the result back to the shape of the grid.
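#Predict on a DataFrame with the training column names, then reshape to the grid
grid_points = pd.DataFrame(np.vstack([x_grid.ravel(), y_grid.ravel()]).T, columns=X.columns)
z = model.predict(grid_points)
z = z.reshape(x_grid.shape)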
print(z)
[[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]
...
[0 0 0 ... 2 2 2]
[0 0 0 ... 2 2 2]
[0 0 0 ... 2 2 2]]
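Reshaping works because the flat predictions come back in the same order that ravel produced the points. Here is a small sketch of that round trip. The labeling rule is a hypothetical stand-in for `model.predict`, chosen only so the result is easy to check by hand.

```python
import numpy as np

x_small, y_small = np.meshgrid(np.linspace(0, 2, 3), np.linspace(0, 1, 2))
points = np.vstack([x_small.ravel(), y_small.ravel()]).T

# Hypothetical stand-in for model.predict: label 1 where x > y, else 0
flat_labels = (points[:, 0] > points[:, 1]).astype(int)

# One label per grid point, put back into the grid's shape
z_small = flat_labels.reshape(x_small.shape)

# z_small[i, j] is the label of the point (x_small[i, j], y_small[i, j])
i, j = 1, 2
print(x_small[i, j], y_small[i, j], z_small[i, j])
```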
Now we can plot the regions, and we can even pass in a colormap to match the colors of the data we created before. We use imshow, passing in the prediction values along with a few other options. extent gives the boundaries of the area the grid covers. interpolation='nearest' draws each grid cell in the color of its own label instead of blending it with its neighbors. cmap is the color map that holds one color per cluster label, alpha makes the regions semi-transparent, origin='lower' puts the first row of z (the smallest y value) at the bottom, and aspect='auto' lets the image fill the axes.
from matplotlib import colors
#Create the color map
cmap = colors.ListedColormap(['green', 'blue','red'])
#Plot the regions
plt.imshow(z, interpolation='nearest',
           extent=(-15, 35, -15, 35),
           cmap=cmap,
           alpha=.4,
           aspect='auto', origin='lower')
plt.show()
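The color list above has to be in the same order as the cluster labels, and KMeans numbers its clusters arbitrarily. One way to avoid hard-coding the order is to give each cluster the color of the nearest known center. This is a minimal sketch of that idea, not part of the original lesson. The names `true_centers` and `color_order` are made up for illustration, and `n_init=10` is set explicitly as an assumption.

```python
import numpy as np
from matplotlib import colors
from sklearn.cluster import KMeans

# Rebuild the same data as plain arrays
np.random.seed(0)
A = np.random.normal(0, 5, (100, 2)) + (20, 20)
B = np.random.normal(0, 5, (100, 2)) + (0, 20)
C = np.random.normal(0, 5, (100, 2)) + (10, 0)
X = np.vstack([A, B, C])
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Known centers of the generated sets, keyed by the color used to plot each set
true_centers = {'red': (20, 20), 'green': (0, 20), 'blue': (10, 0)}

# For each cluster label (in label order), pick the color of the closest true center
color_order = []
for center in model.cluster_centers_:
    closest = min(true_centers,
                  key=lambda name: np.linalg.norm(center - np.array(true_centers[name])))
    color_order.append(closest)

cmap = colors.ListedColormap(color_order)
print(color_order)
```

This only works here because the true centers are known. With real data there are no true categories to match against, so the colors would simply be assigned per label.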
From here we can overlay the actual data to see how well the regions group the different pieces of data.
#Set up our plots
fig, ax = plt.subplots()
#Plot all the actual data
A.plot.scatter(x='X1', y='X2', label='A', ax=ax, color='red')
B.plot.scatter(x='X1', y='X2', label='B', ax=ax, color='green')
C.plot.scatter(x='X1', y='X2', label='C', ax=ax, color='blue')
#Plot the regions
ax.imshow(z, interpolation='nearest',
          extent=(-15, 35, -15, 35),
          cmap=cmap,
          alpha=.4,
          aspect='auto', origin='lower')
plt.show()
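The overlay shows the agreement visually. To put a number on it, one option is to cross-tabulate the set each point was generated from against the cluster it was assigned to. This is a sketch under the same assumptions as the lesson's data; `true_set`, `table` and `agreement` are illustrative names, and `n_init=10` is set explicitly as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Rebuild the same three sets of points used in this lesson
np.random.seed(0)
A = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (20, 20), columns=['X1', 'X2'])
B = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (0, 20), columns=['X1', 'X2'])
C = pd.DataFrame(np.random.normal(0, 5, (100, 2)) + (10, 0), columns=['X1', 'X2'])
X = pd.concat([A, B, C])
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# The set each row came from, in the same order as the rows of X
true_set = np.repeat(['A', 'B', 'C'], 100)

# Rows are the original sets, columns are the cluster labels
table = pd.crosstab(pd.Series(true_set, name='set'),
                    pd.Series(model.labels_, name='cluster'))
print(table)

# Share of points that fall in the dominant cluster of their own set
agreement = table.max(axis=1).sum() / len(X)
print(round(agreement, 3))
```

Points counted outside the dominant cluster of their set are the ones that sit on the wrong side of a boundary in the overlay plot.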