K means Clustering – Introduction

Last Updated : 13 May, 2025

K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into clusters, organizing the data points by their similarity.

Understanding K-means Clustering

For example, an online store can use K-Means to group customers by purchase frequency and spending, creating segments such as Budget Shoppers, Frequent Buyers and Big Spenders for personalised marketing.

The algorithm starts by randomly picking some central points called centroids. Each data point is then assigned to its closest centroid, forming a cluster. After all the points are assigned, each centroid is updated to the average position of the points in its cluster. This process repeats until the centroids stop changing. The goal is to divide the data points into clusters so that similar data points belong to the same group.

How K-means Clustering Works

We are given a data set of items, each described by a vector of feature values. The task is to categorize those items into groups. To achieve this we will use the K-means algorithm. The 'K' in the name of the algorithm represents the number of groups/clusters we want to divide our items into.

The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity we will use the Euclidean distance as a measurement. The algorithm works as follows:
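For reference, the Euclidean distance and the quantity that K-means tries to reduce (the within-cluster sum of squared distances) can be written as follows. The symbols are notation introduced here for illustration and do not come from the original text: x is an item with n features, \mu_j is the mean of cluster j, and C_j is the set of items assigned to cluster j.

```latex
d(x, \mu) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_i)^2}
\qquad
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
```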

1. First, we randomly initialize k points called means or cluster centroids.
2. We assign each item to its closest mean, then update each mean's coordinates to the average of the items assigned to that cluster so far.
3. We repeat the process for a given number of iterations, and at the end we have our clusters.
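The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans_sketch`, the default of 10 iterations, the seed and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    """Toy K-means: random-item initialization, fixed number of iterations."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct items as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 2a: distance from every item to every mean, shape (n_items, k)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: move each mean to the average of its assigned items
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's mean unchanged
                means[j] = members.mean(axis=0)
    return means, labels

# Two well-separated toy groups (made-up values)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means_demo, labels_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
```

On this toy data the first three items should end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.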

The "points" mentioned above are called means because they are the mean values of the items categorized in them. There are several ways to initialize these means. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values within the boundaries of the data set. For example, if the items have values in [0,3] for a feature x, we initialize the means with values for x in [0,3].

Selecting the right number of clusters is important for meaningful segmentation. A common way to do this is the Elbow Method, a graphical tool used to determine the optimal number of clusters (k) in K-means.
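The two initialization options described above might look like this in NumPy. The data set `X_demo`, the seed and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
# Made-up data: feature 0 lies in [0, 3), feature 1 lies in [-1, 1)
X_demo = rng.uniform(low=[0.0, -1.0], high=[3.0, 1.0], size=(50, 2))
k = 3

# Option 1: use k distinct items from the data set as the initial means
means_from_items = X_demo[rng.choice(len(X_demo), size=k, replace=False)]

# Option 2: draw each feature uniformly between its observed min and max
lower, upper = X_demo.min(axis=0), X_demo.max(axis=0)
means_in_bounds = rng.uniform(lower, upper, size=(k, X_demo.shape[1]))

print(means_from_items)
print(means_in_bounds)
```

Option 1 guarantees that every initial mean sits on an actual data point. Option 2 can place a mean in an empty region, so that cluster may receive no items in the first pass.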
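One way to apply the Elbow Method is to fit K-means for several values of k and record the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The sketch below assumes scikit-learn's `KMeans` rather than the from-scratch code in this article, and the data set and the range of k are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 blobs (parameters chosen only for this example)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

k_values = list(range(1, 8))
inertias = []
for k_try in k_values:
    model = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k_try, inertia in zip(k_values, inertias):
    print(k_try, round(inertia, 1))
```

Plotting `inertias` against `k_values` gives the elbow curve. The value of k where the curve stops dropping sharply is usually taken as a reasonable choice.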

Implementation of K-Means Clustering in Python

We will use a synthetic blobs dataset and show how the clusters are formed.

Step 1: Importing the necessary libraries

We import NumPy, Matplotlib and scikit-learn.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

Step 2: Create custom dataset with make_blobs and plot it

Python

# 500 two-dimensional points drawn around 3 centers; fixed seed for reproducibility
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

Output:

Figure: Clustering dataset

Step 3: Initializing random centroids

The code initializes three clusters for K-means clustering. It sets a random seed, generates each cluster center with coordinates drawn uniformly from [-2, 2), and creates an empty list of points for each cluster.

Python

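k = 3  # number of clusters

clusters = {}
np.random.seed(23)  # fixed seed so the random centers are reproducible

for idx in range(k):
    # Random center with each coordinate drawn uniformly from [-2, 2)
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []  # data points assigned to this cluster
    }

    clusters[idx] = cluster

clusters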

Output:

Figure: Random Centroids

Step 4: Plotting the randomly initialized centers with the data points

Python

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
# Mark each initial center with a red star
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()

Output:

Figure: Data points with random center

The scatter plot shows the data points (X[:,0], X[:,1]) with grid lines, together with the initial cluster centers (red stars) generated for K-means clustering.

Step 5: Defining Euclidean distance

Python

def distance(p1,p2):
    return np.sqrt(np.sum((p1-p2)**2))
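As a quick sanity check, the function can be tried on a 3-4-5 right triangle and compared with NumPy's built-in norm. The points `a` and `b` are made up for this example.

```python
import numpy as np

def distance(p1, p2):
    # Same formula as in Step 5
    return np.sqrt(np.sum((p1 - p2) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(distance(a, b))         # 3-4-5 right triangle, so 5.0
print(np.linalg.norm(a - b))  # NumPy's built-in L2 norm, same value
```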

Step 6: Creating functions to assign points and update the cluster centers

The E-step assigns each data point to its nearest cluster center, and the M-step updates each cluster center to the mean of the points assigned to it.

Python

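def assign_clusters(X, clusters):
    # E-step: add every point to the 'points' list of its nearest center
    for idx in range(X.shape[0]):
        dist = []

        curr_x = X[idx]

        for i in range(len(clusters)):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

def update_clusters(X, clusters):
    # M-step: move each center to the mean of its assigned points
    for i in range(len(clusters)):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:  # skip clusters that received no points
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center

            # Reset the list so the next assignment pass starts empty
            clusters[i]['points'] = []
    return clusters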

Step 7: Creating a function to predict the cluster for the data points

Python

def pred_cluster(X, clusters):
    # Return, for every point, the index of its nearest cluster center
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(len(clusters)):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
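The nested loops are easy to follow but slow on large data sets. A vectorized alternative using NumPy broadcasting might look like the sketch below. The function name `pred_cluster_vectorized` and the toy arrays are hypothetical, and the function takes the centers as a plain array instead of the `clusters` dictionary used in this article.

```python
import numpy as np

def pred_cluster_vectorized(X, centers):
    """Index of the nearest center for every row of X.

    centers is an array of shape (k, n_features).
    """
    # Pairwise differences, shape (n_points, k, n_features)
    diffs = X[:, None, :] - centers[None, :, :]
    # Euclidean distance from every point to every center, shape (n_points, k)
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.argmin(axis=1)

X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]])
centers_demo = np.array([[0.0, 0.0], [5.0, 5.0]])
print(pred_cluster_vectorized(X_demo, centers_demo))  # [0 0 1 1]
```

To use it with the dictionary from Step 3, the centers could first be stacked into an array, for example with `np.array([clusters[j]['center'] for j in range(k)])`.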

Step 8: Assign, update and predict the cluster centers

Python

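# One pass of the algorithm: assign points (E-step), then update centers (M-step)
clusters = assign_clusters(X, clusters)
clusters = update_clusters(X, clusters)
# Label every point with the index of its nearest updated center
pred = pred_cluster(X, clusters)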
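The code above runs the assign and update steps once. As described earlier, K-means normally repeats them until the centroids stop changing. A self-contained sketch of that loop is shown below. It stores the centers in an array rather than the `clusters` dictionary, and the helper names `assign_points` and `update_centers`, the tolerance `tol` and the cap `max_iters` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

def assign_points(X, centers):
    # E-step: index of the nearest center for each point
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update_centers(X, labels, centers):
    # M-step: mean of the assigned points; empty clusters keep their center
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return new_centers

X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
rng = np.random.default_rng(23)
# Random starting centers with coordinates in [-2, 2)
centers = 2 * (2 * rng.random((3, X_demo.shape[1])) - 1)

max_iters, tol = 100, 1e-6  # hypothetical stopping settings
for n_iter in range(1, max_iters + 1):
    labels = assign_points(X_demo, centers)
    new_centers = update_centers(X_demo, labels, centers)
    shift = np.linalg.norm(new_centers - centers)  # how far the centers moved
    centers = new_centers
    if shift < tol:  # centers stopped changing
        break

print("stopped after", n_iter, "iterations")
```

Stopping when the centers no longer move avoids guessing a fixed iteration count, while `max_iters` guards against unexpectedly long runs.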

Step 9: Plotting data points with their predicted cluster center

Python

# Color each point by its predicted cluster index
plt.scatter(X[:, 0], X[:, 1], c=pred)
# Mark each updated center with a red triangle
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()

Output:

Figure: K-means Clustering

The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after one pass of the E and M steps of the K-means algorithm.

kartik
