Sklearn

Using sklearn’s KMeans Algorithm

The sklearn library ships with a ready-made KMeans implementation that you can use directly. First we import it and create an instance of the model, passing the number of clusters we want to build as an argument.

Take a step back for a second: what does the number of clusters mean in the context of this project? It is the number of colors that we will find as the “centers” of different clusters of colors. For example, if the picture were mostly red and green, we would expect to get back two clusters, one with a red cluster center and one with a green cluster center. Let’s start with four clusters.

In [10]:

from sklearn.cluster import KMeans

#Create a model with 4 clusters
model = KMeans(n_clusters=4)
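
To make the red-and-green example concrete, here is a minimal sketch on a made-up set of pixels. The toy data and the names toy_X, toy_model and toy_centers are assumptions for illustration only and are not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy "image": 60 reddish and 40 greenish pixels (RGB in [0, 1])
rng = np.random.default_rng(0)
red = np.clip(np.array([0.9, 0.1, 0.1]) + rng.normal(0, 0.02, size=(60, 3)), 0, 1)
green = np.clip(np.array([0.1, 0.8, 0.1]) + rng.normal(0, 0.02, size=(40, 3)), 0, 1)
toy_X = np.vstack([red, green])

# Two clusters: one center should land near red, the other near green
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)
toy_centers = toy_model.cluster_centers_
print(toy_centers.round(2))
```

The order of the two rows is arbitrary, so the red center may come first or second.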

Once we have created the model, we fit it by calling fit and passing in our data. In this case, the X variable from before holds our data.

In [11]:

import numpy as np

#Set the seed
np.random.seed(1)

#Fit the model
model = model.fit(X)
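
KMeans starts from randomly chosen centers, which is why the seed matters. As an alternative to seeding NumPy globally, KMeans also accepts a random_state argument. The sketch below assumes a made-up array demo_X standing in for X, and checks that two fits with the same random_state agree.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for X: 300 random RGB rows
demo_X = np.random.default_rng(0).random((300, 3))

# The same random_state gives the same initialization, hence the same fit
fit_a = KMeans(n_clusters=4, n_init=10, random_state=1).fit(demo_X)
fit_b = KMeans(n_clusters=4, n_init=10, random_state=1).fit(demo_X)
same_centers = np.allclose(fit_a.cluster_centers_, fit_b.cluster_centers_)
print(same_centers)
```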

Let's go through a few of the attributes now. First of all, there are now 4 cluster centers. We can get them back from the attribute cluster_centers_.

In [12]:

#Report the cluster centers
print(model.cluster_centers_)

[[0.63052206 0.35656398 0.09485633]
 [0.14324513 0.11149619 0.06230063]
 [0.82655272 0.76054029 0.62328195]
 [0.34061374 0.38111133 0.34026588]]

There is also the attribute labels_, which holds a label for each of the points. Each label is the index of the cluster center closest to that point.

In [13]:

#Likewise, we can get labels
print(model.labels_)

[1 1 1 ... 3 1 1]
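
To see that a label really is the index of the nearest center, we can recompute the assignment by hand. This is a minimal sketch on made-up, well-separated data; the names pts, km, dists and nearest are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated blobs of RGB-like points
rng = np.random.default_rng(0)
blob_centers = np.array([[0.1, 0.1, 0.1], [0.5, 0.5, 0.5], [0.9, 0.9, 0.9]])
pts = np.vstack([c + rng.normal(0, 0.02, size=(50, 3)) for c in blob_centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pts)

# Distance from every point to every center, shape (150, 3)
dists = np.linalg.norm(pts[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

# Index of the closest center for each point
nearest = dists.argmin(axis=1)
print(np.array_equal(nearest, km.labels_))
```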

Now, if we index the cluster centers with the labels, we get back the cluster center assigned to each point! That lets us replace every point with the value of its cluster center, which is very useful for something like image compression, where you keep only a few colors to save space. Let's first find the Y values.

In [14]:

#Index the cluster centers with the model labels
Y = model.cluster_centers_[model.labels_]
print(Y)

[[0.14324513 0.11149619 0.06230063]
 [0.14324513 0.11149619 0.06230063]
 [0.14324513 0.11149619 0.06230063]
 ...
 [0.34061374 0.38111133 0.34026588]
 [0.14324513 0.11149619 0.06230063]
 [0.14324513 0.11149619 0.06230063]]
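
The indexing step is ordinary NumPy integer-array indexing. Here is a small sketch with made-up centers and labels (both hypothetical) showing that row i of the result is the center whose index is labels[i].

```python
import numpy as np

# Hypothetical cluster centers (two colors) and labels for five pixels
centers = np.array([[1.0, 0.0, 0.0],   # cluster 0: red
                    [0.0, 1.0, 0.0]])  # cluster 1: green
labels = np.array([0, 1, 1, 0, 1])

# Row i of the result is centers[labels[i]]
quantized = centers[labels]
print(quantized.shape)                    # (5, 3)
print(len(np.unique(quantized, axis=0)))  # 2
```

However many pixels there are, the result contains at most as many distinct colors as there are clusters.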

As we did before, we can reshape the array to get back to the original shape.

In [15]:

#Reshape the array
Y = Y.reshape(img_shape)
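
As a reminder of how the reshape round trip works, here is a sketch on a tiny made-up image. It assumes the image is stored as (height, width, 3) and was flattened to one row per pixel; toy_img and toy_shape are hypothetical names.

```python
import numpy as np

# Hypothetical 2x3 RGB image; the real img_shape comes from the picture loaded earlier
toy_img = np.arange(18, dtype=float).reshape(2, 3, 3) / 17
toy_shape = toy_img.shape

flat = toy_img.reshape(-1, 3)       # one row per pixel, shape (6, 3)
restored = flat.reshape(toy_shape)  # back to (2, 3, 3)
print(flat.shape, restored.shape)
print(np.array_equal(restored, toy_img))
```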

Finally we get to see what the image looks like when we only use four colors!

In [16]:

#Show the replicated images
plt.imshow(Y, vmin=0,vmax=1)
plt.show()

This uses only four different colors, but notice how it actually looks fairly similar to the original picture! Now we can experiment with a different number of clusters.

In [17]:

#Set the seed to make it easy to replicate
np.random.seed(0)

#Try different numbers of clusters
for n_clusters in range(2,11):
    #Create the model
    model = KMeans(n_clusters=n_clusters).fit(X)

    #Find the replicated image
    Y = model.cluster_centers_[model.labels_]
    Y = Y.reshape(img_shape)

    #Plot the replicated image
    plt.imshow(Y, vmin=0,vmax=1)
    plt.title("{} Clusters".format(n_clusters))
    plt.show()
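
Besides judging the pictures by eye, one way to put a number on how close each version is to the original is the mean squared difference between every pixel and its cluster color. The sketch below is an assumption-laden illustration on random made-up pixels (demo_pixels), not on the image from this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for X: 500 random RGB rows
demo_pixels = np.random.default_rng(0).random((500, 3))

errors = {}
for k in (2, 4, 8):
    km_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo_pixels)
    approx = km_k.cluster_centers_[km_k.labels_]
    # Mean squared difference between each pixel and its cluster color
    errors[k] = float(np.mean((demo_pixels - approx) ** 2))

for k, err in errors.items():
    print(k, round(err, 4))
```

More clusters generally means a smaller error, at the cost of storing more colors.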

One last cool thing we can do is show the cluster centers as a palette of colors by passing them to imshow as a nested list, like below.

In [18]:

#We can get the colors used if we were interested
plt.imshow([model.cluster_centers_])
plt.show()
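
The nested list matters because of shapes: wrapping the centers in a list adds a leading axis, so imshow sees an image that is one pixel tall with one RGB color per column. Without it, a two-dimensional array would be treated as scalar data and drawn with a colormap. A small sketch with made-up centers (palette_centers is a hypothetical name):

```python
import numpy as np

# Hypothetical cluster centers: three RGB colors
palette_centers = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])

# Wrapping in a list adds a leading axis: 1 row, 3 columns, 3 channels
palette = np.array([palette_centers])
print(palette_centers.shape, palette.shape)  # (3, 3) (1, 3, 3)
```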

Those are the 10 cluster-center colors from the last model fit in the loop above (n_clusters=10), which together summarize the colors in the image!
