Assign

Step 2: Assign

This part can be hard to follow if you are not comfortable with mathematical notation, so we are going to take it very slowly! Eventually we will have to go through every pixel and assign it to a group, so let's start by pulling a single pixel from our image to work with. I picked index 85000 arbitrarily for this example.

In [4]:

#Grab the pixel
pixel = X[85000]
print(pixel)

#Plot the pixel color (we need to nest it twice for this to work with just a 1D array)
plt.imshow([[pixel]])
plt.show()

[0.34901961 0.37647059 0.49019608]

For this pixel, we have to find the distance to each centroid that we currently have. Let's grab the first centroid color to compare with.

In [5]:

#Grab the centroid color
c = colors[0]
print(c)

#Plot the centroid color (we need to nest it twice for this to work with just a 1D array)
plt.imshow([[c]])
plt.show()

[0.5488135  0.71518937 0.60276338]

Euclidean Distance

The Euclidean distance between two points p and q is defined as:

$$d(p, q) = \sqrt{\sum_{i=1}^{n}(q_i-p_i)^2}$$

where

$d = \text{Euclidean distance} $

$p = \text{Point 1} $

$q = \text{Point 2} $

$i = \text{Component i of a point} $

$n = \text{Number of components in the points} $
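
Since each of our colors has three components (red, green, and blue), $n = 3$ here, and the sum can be written out in full as a quick sanity check on the notation:

$$d(p, q) = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + (q_3-p_3)^2}$$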

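Let's work through, piece by piece, how to compute the Euclidean distance between these two points. First, to find the term $q_i-p_i$, we can subtract one array from the other. It does not matter which point we treat as $p$ and which as $q$: swapping them only flips the sign, and the squaring in the next step removes it.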

In [6]:

#Finding euclidean distance piece by piece...
print(pixel-c)

[-0.1997939  -0.33871878 -0.1125673 ]

Now we move on to $(q_i-p_i)^2$.

In [7]:

#Finding euclidean distance piece by piece...
print((pixel-c) ** 2)

[0.0399176  0.11473041 0.0126714 ]

Next we need to sum the values, which can easily be done with the sum function. That gets us to $\sum_{i=1}^{n}(q_i-p_i)^2$.

In [8]:

#Finding euclidean distance piece by piece...
print(((pixel-c) ** 2).sum())

0.16731940807323972

And finally, we need to take the square root of that to get to our final answer for the distance.

In [9]:

#Finding euclidean distance piece by piece...
print(((pixel-c) ** 2).sum() ** .5)

0.40904695093991317
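
If you want to double-check the hand-built formula, one option is to compare it against `np.linalg.norm`, which computes the Euclidean (L2) length of a vector by default. The sketch below is not part of the original notebook; it retypes the pixel and centroid values from the printed outputs above, so the last few digits may differ slightly from the notebook's full-precision result.

```python
import numpy as np

# Values copied from the printed outputs above (rounded to 8 decimals)
pixel = np.array([0.34901961, 0.37647059, 0.49019608])
c = np.array([0.5488135, 0.71518937, 0.60276338])

# Manual computation, exactly as built up step by step above
manual = ((pixel - c) ** 2).sum() ** .5

# np.linalg.norm of the difference vector gives the same Euclidean distance
builtin = np.linalg.norm(pixel - c)

print(manual)
print(builtin)
```

Both lines should print approximately 0.409, matching the distance found above.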

Now the distance needs to be computed for every centroid color so that we can find the smallest of the bunch. Let's loop through each one and see what the distances are below.

In [10]:

#Plot the pixel color (we need to nest it twice for this to work with just a 1D array)
print("Pixel to assign:")
print(pixel)
plt.imshow([[pixel]])
plt.show()
print("--------")
print()

for c in colors:
    #Plot the centroid
    print("Centroid:")
    print(c)
    plt.imshow([[c]])
    plt.show()

    #Find the distance
    distance = ((pixel-c) ** 2).sum() ** .5
    print("Distance: {}".format(distance))

    print("--------")
    print()

Pixel to assign:
[0.34901961 0.37647059 0.49019608]

--------

Centroid:
[0.5488135  0.71518937 0.60276338]

Distance: 0.40904695093991317
--------

Centroid:
[0.54488318 0.4236548  0.64589411]

Distance: 0.25461886779807286
--------

Centroid:
[0.43758721 0.891773   0.96366276]

Distance: 0.7053733024307559
--------

Centroid:
[0.38344152 0.79172504 0.52889492]

Distance: 0.41847189438882065
--------

As we can tell, the second centroid (the bluish color, at a distance of about 0.2546) is the closest to our pixel. Now, to generalize this so that it runs fast, I want to build out a function that assigns the centroid. Once again, we will first work through how to compute the distances without a for loop, because numpy's vectorized array operations are much faster than looping in Python.

In [11]:

#Step by step finding the distance

print("Step 1 - Find difference")
distance = pixel - colors
print(distance)
print()
print("Step 2 - Square Values")
distance = distance ** 2
print(distance)
print()
print("Step 3 - Get the sum by summing across axis 1")
distance = distance.sum(axis=1)
print(distance)
print()
print("Step 4 - Take the square root")
distance = distance ** .5
print(distance)
print()

Step 1 - Find difference
[[-0.1997939  -0.33871878 -0.1125673 ]
 [-0.19586358 -0.04718421 -0.15569803]
 [-0.0885676  -0.51530241 -0.47346668]
 [-0.03442191 -0.41525445 -0.03869884]]

Step 2 - Square Values
[[0.0399176  0.11473041 0.0126714 ]
 [0.03836254 0.00222635 0.02424188]
 [0.00784422 0.26553658 0.2241707 ]
 [0.00118487 0.17243626 0.0014976 ]]

Step 3 - Get the sum by summing across axis 1
[0.16731941 0.06483077 0.4975515  0.17511873]

Step 4 - Take the square root
[0.40904695 0.25461887 0.7053733  0.41847189]

Combining all these steps we have the following.

In [12]:

#Find the distance to each centroid
distance = ((pixel - colors) ** 2).sum(axis=1) ** .5
print(distance)

[0.40904695 0.25461887 0.7053733  0.41847189]
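
The reason `pixel - colors` works without a loop is numpy broadcasting: the pixel has shape (3,) and the centroid array has shape (4, 3), so the pixel is subtracted from every row. Here is a small sketch, assuming the centroid values printed earlier (retyped by hand, so rounded), that checks the shapes and confirms the vectorized result matches the loop version.

```python
import numpy as np

# Values copied from the printed outputs above (rounded to 8 decimals)
pixel = np.array([0.34901961, 0.37647059, 0.49019608])
colors = np.array([
    [0.5488135, 0.71518937, 0.60276338],
    [0.54488318, 0.4236548, 0.64589411],
    [0.43758721, 0.891773, 0.96366276],
    [0.38344152, 0.79172504, 0.52889492],
])

# Shape (3,) broadcast against shape (4, 3) gives shape (4, 3)
diff = pixel - colors
print(diff.shape)

# Summing across axis 1 collapses the 3 color components: one distance per centroid
vectorized = (diff ** 2).sum(axis=1) ** .5
print(vectorized.shape)

# Same distances computed one centroid at a time, for comparison
looped = np.array([((pixel - c) ** 2).sum() ** .5 for c in colors])
print(np.allclose(vectorized, looped))
```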

The argmin function returns the index of the smallest distance in the array. For the distances printed above, that is index 1, the second centroid.

In [13]:

#Find the smallest distance
i = distance.argmin()
print(i)
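
One detail worth knowing about argmin: when two distances tie for the smallest, it returns the index of the first one. The sketch below uses the distances printed above plus a made-up array with a tie (the `tied` values are purely illustrative).

```python
import numpy as np

# Distances copied from the printed output above
distance = np.array([0.40904695, 0.25461887, 0.7053733, 0.41847189])
i = distance.argmin()
print(i, distance[i])

# Made-up distances with a tie between index 1 and index 2
tied = np.array([0.5, 0.2, 0.2, 0.9])
print(tied.argmin())  # the first of the tied indices is returned
```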

Now we are going to make a small change. Instead of going pixel by pixel, we are going to go centroid by centroid. The following code shows how to get the distance of every point from the first centroid.

In [14]:

#Find each point's distance from the first centroid
c = colors[0]
print(((X - c) ** 2).sum(axis=1) ** .5)

[1.07987628 1.07987628 1.08444261 ... 0.54093431 1.07790283 1.06856875]

Using a list comprehension, we can create a list of 4 arrays, where each array holds the distance of every point from the corresponding centroid.

In [15]:

#Get the distance
distance = [((X - c) ** 2).sum(axis=1) ** .5 for c in colors]
print(distance)

[array([1.07987628, 1.07987628, 1.08444261, ..., 0.54093431, 1.07790283,
       1.06856875]), array([0.94127289, 0.94127289, 0.94528317, ..., 0.41424507, 0.9390246 ,
       0.93058987]), array([1.38021447, 1.38021447, 1.38397528, ..., 0.87929775, 1.37898733,
       1.36997904]), array([1.02195868, 1.02195868, 1.02644326, ..., 0.52630349, 1.02050884,
       1.01097913])]

The vstack function from numpy stacks these arrays into a single 2D array with one row per centroid and one column per point. Each column then holds one point's distance to each of the 4 centroids.

In [16]:

#Stack the distances
distance = np.vstack(distance)
print(distance)

[[1.07987628 1.07987628 1.08444261 ... 0.54093431 1.07790283 1.06856875]
 [0.94127289 0.94127289 0.94528317 ... 0.41424507 0.9390246  0.93058987]
 [1.38021447 1.38021447 1.38397528 ... 0.87929775 1.37898733 1.36997904]
 [1.02195868 1.02195868 1.02644326 ... 0.52630349 1.02050884 1.01097913]]
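
As an aside, the same matrix can be built in one shot with broadcasting instead of a list comprehension plus vstack. This is only a sketch of an alternative, not what the notebook uses: `X` and `colors` here are small random stand-ins, and the intermediate array has shape (centroids, points, 3), so it uses noticeably more memory on a full image.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 3))       # hypothetical stand-in for the flattened image
colors = rng.random((4, 3))  # hypothetical stand-in for the 4 centroids

# List comprehension + vstack, as in the cells above: shape (4, 6)
stacked = np.vstack([((X - c) ** 2).sum(axis=1) ** .5 for c in colors])

# Broadcasting alternative: (1, 6, 3) minus (4, 1, 3) gives (4, 6, 3)
diff = X[None, :, :] - colors[:, None, :]
broadcast = (diff ** 2).sum(axis=2) ** .5

print(stacked.shape, broadcast.shape)
print(np.allclose(stacked, broadcast))
```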

Finally, use argmin with axis=0 to find, for each column (point), the row (centroid) with the smallest distance. Those indices are the labels!

In [17]:

#Find the labels
labels = distance.argmin(axis=0)
print(labels)

[1 1 1 ... 1 1 1]
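
Putting the last few cells together, the assign step could be wrapped in a function along these lines. Treat it as a minimal sketch: the name `assign_labels` is my own placeholder, and the tiny `X` and `colors` arrays are hand-made so the result is easy to check by eye.

```python
import numpy as np

def assign_labels(X, colors):
    """Return, for each row of X, the index of the nearest row of colors."""
    # One row of distances per centroid, one column per point
    distance = np.vstack([((X - c) ** 2).sum(axis=1) ** .5 for c in colors])
    # For each column (point), pick the row (centroid) with the smallest distance
    return distance.argmin(axis=0)

# Hand-made example: each point sits right next to a different centroid
X = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.9, 0.1, 0.1]])
colors = np.array([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [1.0, 0.0, 0.0]])

labels = assign_labels(X, colors)
print(labels)
```

Here the three points are closest to centroids 0, 1, and 2 respectively.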

As before, you can index colors with the labels to see the image that this assignment produces before moving on to the next step.

In [18]:

#Show the image
Y = colors[labels]
Y = Y.reshape(img_shape)
plt.imshow(Y)
plt.show()
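
Indexing `colors` with an array of labels swaps each label for its centroid color, which is why a reshape back to the image shape is all that is left to do. A tiny sketch with made-up values (a 2x2 "image" and two centroids) shows the shapes involved.

```python
import numpy as np

# Made-up example: two centroid colors and four pixel labels
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
labels = np.array([0, 1, 1, 0])
img_shape = (2, 2, 3)  # hypothetical height, width, channels

# Each label is replaced by the matching row of colors: shape (4, 3)
flat = colors[labels]
print(flat.shape)

# Reshape the flat list of pixels back into image form
Y = flat.reshape(img_shape)
print(Y.shape)
```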

If you want to see the number of pixels with each label, you can call np.unique on the labels with return_counts=True. It will give you back the unique labels and the count for each.

In [19]:

#Find the unique label counts
print(np.unique(labels, return_counts=True))

(array([0, 1, 2, 3]), array([ 25058, 238195,   1747,   5000]))
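
One caveat: np.unique only reports labels that actually occur, so a centroid with no pixels assigned to it would simply be missing from the result. If you want one count per centroid regardless, `np.bincount` with `minlength` is an option. The labels below are made up so that centroid 2 ends up empty.

```python
import numpy as np

labels = np.array([1, 1, 0, 3, 1])  # made-up labels; centroid 2 got no pixels
k = 4                               # number of centroids

# np.unique skips label 2 entirely
unique_labels, unique_counts = np.unique(labels, return_counts=True)
print(unique_labels, unique_counts)

# np.bincount returns one count per centroid, including zeros
counts = np.bincount(labels, minlength=k)
print(counts)
```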
