K-Means Clustering Algorithm: Applications, Types, Demos and Use Cases

Every machine learning engineer wants to achieve accurate predictions with their algorithms. Learning algorithms are generally divided into two types: supervised and unsupervised. K-means clustering is an unsupervised algorithm, meaning the available input data does not have a labeled response.

What is Clustering?

Clustering is like sorting a bunch of similar items into different groups based on their characteristics. In data mining and machine learning, it’s a powerful technique used to group similar data points together, making it easier to find patterns or understand large datasets. Essentially, clustering helps identify natural groupings in your data. The two most common types of clustering methods are described below.

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

The various types of clustering are:

  • Hierarchical clustering
  • Partitioning clustering 

Hierarchical clustering is further subdivided into:

  • Agglomerative clustering
  • Divisive clustering

Partitioning clustering is further subdivided into:

  • K-Means clustering 
  • Fuzzy C-Means clustering 

Hierarchical Clustering

Hierarchical clustering uses a tree-like structure, like so:

[Image: hierarchical clustering tree structure]

Agglomerative clustering takes a bottom-up approach. We begin with each element as a separate cluster and merge them into successively larger clusters, as shown below:

[Image: agglomerative clustering, elements merged bottom-up]

Divisive clustering is a top-down approach. We begin with the whole set and proceed to divide it into successively smaller clusters, as you can see below:

[Image: divisive clustering, the whole set split top-down]
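
To make the bottom-up idea concrete, here is a minimal sketch of agglomerative clustering using SciPy (the six 2-D points and the choice of Ward linkage are illustrative assumptions, not part of the original example):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six 2-D points; agglomerative clustering merges them bottom-up into a tree
points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1], [9.5, 1.5]])
Z = linkage(points, method='ward')  # Ward linkage merges the pair that minimizes within-cluster variance
dendrogram(Z)
plt.title('Agglomerative clustering dendrogram')
plt.show()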

Partitioning Clustering 

Partitioning clustering is split into two subtypes: K-Means clustering and Fuzzy C-Means clustering.

In k-means clustering, the objects are divided into a number of clusters specified by the value ‘K.’ So if we say K = 2, the objects are divided into two clusters, c1 and c2, as shown:

[Image: objects divided into two clusters, c1 and c2]

Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together. 

Fuzzy c-means is very similar to k-means in that it clusters objects with similar characteristics together. In k-means clustering, a single object cannot belong to two different clusters, but in c-means, an object can belong to more than one cluster, each with a degree of membership, as shown below.

[Image: fuzzy c-means, objects belonging to more than one cluster]
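
As a rough illustration of that soft membership (the toy point, centroids, and fuzzifier value are assumptions for the example), the standard fuzzy c-means membership formula weights each cluster by the point's relative distances:

import numpy as np

point = np.array([2.0, 3.0])
centroids = np.array([[1.0, 2.0], [5.0, 6.0]])
d = np.linalg.norm(centroids - point, axis=1)  # distance of the point to each centroid

m = 2.0  # fuzzifier; values above 1 make the assignment softer
u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2 / (m - 1)), axis=1)
print(u)  # [0.9, 0.1]: the point mostly belongs to cluster 1, partly to cluster 2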

What is K-Means Clustering?

K-means clustering is a way of grouping data based on how similar or close the data points are to each other. Imagine you have a bunch of points, and you want to group them into clusters. The algorithm works by first randomly picking some central points (called centroids) and then assigning every data point to the nearest centroid. 

Once that’s done, it recalculates the centroids based on the new groupings and repeats the process until the clusters stabilize. It’s a pretty fast and efficient method, but it works best when the clusters are distinct and not too mixed up. One challenge, though, is figuring out the right number of clusters (K) beforehand. Plus, if there’s a lot of noise or overlap in the data, K-means might not perform as well.
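
As a quick illustration, here is a minimal sketch using scikit-learn's KMeans on a few made-up 2-D points (the data and the choice K = 2 are assumptions for the example):

import numpy as np
from sklearn.cluster import KMeans

# Six made-up points forming two rough groups
points = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
                   [8, 8], [9, 11], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(points)
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroid positions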

Objective of K-Means Clustering

K-Means clustering primarily aims to organize similar data points into distinct groups. Here’s a look at its key objectives:

  • Grouping Similar Data Points

K-Means is designed to cluster data points that share common traits, allowing patterns or trends to emerge. Whether analyzing customer behavior or images, the method helps reveal hidden relationships within your dataset.

  • Minimizing Within-Cluster Distance

Another objective is to keep data points in each group as close to the cluster's centroid as possible. Reducing this internal distance results in compact, cohesive clusters, enhancing the accuracy of your results.

  • Maximizing Between-Cluster Distance

K-Means also aims to maintain clear separation between different clusters. By maximizing the distance between groups, the algorithm ensures that each cluster remains distinct, providing a better understanding of data categories without overlap.

Properties of K-Means Clustering

Now, let’s look at the key properties that make the K-means clustering algorithm effective in creating meaningful groups:

  • Similarity Within a Cluster

One of the main things K-means aims for is that all the data points in a cluster should be pretty similar to each other. Imagine a bank that wants to group its customers based on income and debt. If customers within the same cluster have vastly different financial situations, then a one-size-fits-all approach to offers might not work. For example, a customer with high income and high debt might have different needs compared to someone with low income and low debt. By making sure the customers in each cluster are similar, the bank can create more tailored and effective strategies.

  • Differences Between Clusters

Another important aspect is that the clusters themselves should be as distinct from each other as possible. Going back to our bank example, if one cluster consists of high-income, high-debt customers and another cluster has high-income, low-debt customers, the differences between the clusters are clear. This separation helps the bank create different strategies for each group. If the clusters are too similar, it can be challenging to treat them as separate segments, which can make targeted marketing less effective.
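
Sticking with the bank example, a small sketch (the income and debt figures are synthetic, assumed purely for illustration) shows how K-means separates such customers into distinct segments:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers: [annual income in $k, debt in $k]
customers = np.array([[120, 90], [130, 85], [125, 95],   # high income, high debt
                      [115, 10], [125, 15], [140, 12],   # high income, low debt
                      [30, 5], [35, 8], [28, 6]])        # low income, low debt

# Scale first, since income and debt sit on different ranges
scaled = StandardScaler().fit_transform(customers)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(scaled)
print(labels)  # customers in the same segment share a label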

Applications of K-Means Clustering

The first item below covers the distance measures K-means relies on; the rest are interesting ways K-means clustering is put to work across different fields:

  • Distance Measures

At the heart of K-Means clustering is the concept of distance. Euclidean distance, for example, is a simple straight-line measurement between points and is commonly used in many applications. Manhattan distance, however, follows a grid-like path, much like how you'd navigate city streets. Squared Euclidean distance makes calculations easier by squaring the values, while cosine distance is handy when working with text data because it measures the angle between data vectors. Picking the right distance measure really depends on what kind of problem you’re solving and the nature of your data.
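
Here is a small sketch of the four distance measures mentioned above, computed between two example vectors (the vectors are assumed for illustration):

import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = norm(a - b)                               # straight-line distance
manhattan = np.sum(np.abs(a - b))                     # grid-like, city-block distance
sq_euclidean = np.sum((a - b) ** 2)                   # squared Euclidean
cosine_dist = 1 - np.dot(a, b) / (norm(a) * norm(b))  # based on the angle between the vectors
print(euclidean, manhattan, sq_euclidean, cosine_dist)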

  • K-Means for Geyser Eruptions

K-Means clustering has even been applied to studying the eruptions of the Old Faithful geyser in Yellowstone. The data collected includes eruption duration and the waiting time between eruptions. By clustering this information, researchers can uncover patterns that help predict the geyser’s behavior. For instance, you might find clusters of similar eruption durations and intervals, which could improve predictions for future eruptions.

  • Customer Segmentation

One of the most popular uses of K-means clustering is customer segmentation. From banks to e-commerce, businesses use K-means clustering to group customers based on their behaviors. For example, in the telecom or sports industries, companies can create targeted marketing campaigns by understanding different customer segments better. This allows for personalized offers and communications, boosting customer engagement and satisfaction.

  • Document Clustering

When dealing with a vast collection of documents, K-Means can be a lifesaver. It groups similar documents together based on their content, which makes it easier to manage and retrieve relevant information. For instance, if you have thousands of research papers, clustering can quickly help you find related studies, improving both organization and efficiency in accessing valuable information.

  • Image Segmentation

In image processing, K-Means clustering is commonly used to group pixels with similar colors, which divides the image into distinct regions. This is incredibly helpful for tasks like object detection and image enhancement. For instance, clustering can help separate objects within an image, making analysis and processing more accurate. It’s also widely used to extract meaningful features from images in various visual tasks.

  • Recommendation Engines

K-Means clustering also plays a vital role in recommendation systems. Say you want to suggest new songs to a listener based on their past preferences; clustering can group similar songs together, helping the system provide personalized suggestions. By clustering content that shares similar features, recommendation engines can deliver a more tailored experience, helping users discover new songs that match their taste.

  • K-Means for Image Compression

K-Means can even help with image compression by reducing the number of colors in an image while keeping the visual quality largely intact. By clustering similar colors and replacing each pixel with the average color of its cluster, K-Means shrinks the image without losing much detail. It’s a practical method for compressing images for easier storage and transmission, all while maintaining visual clarity.
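
A rough sketch of that idea (the file name 'photo.jpg' and the choice of 16 colors are assumptions): cluster the pixel colors, then repaint every pixel with the average color of its cluster.

import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

img = np.asarray(Image.open('photo.jpg').convert('RGB'), dtype=np.float64) / 255
pixels = img.reshape(-1, 3)  # one row per pixel, columns = R, G, B

# Cluster pixel colors into 16 groups and replace each pixel with its cluster's mean color
kmeans = KMeans(n_clusters=16, random_state=0, n_init=4).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)

Image.fromarray((compressed * 255).astype(np.uint8)).save('photo_16_colors.png')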

Advantages of K-means

  1. Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
  2. Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
  3. Scalability: K-means can handle large datasets with many data points and can be easily scaled to handle even larger datasets.
  4. Flexibility: K-means can be easily adapted to different applications and can be used with varying metrics of distance and initialization methods.

Disadvantages of K-Means

  1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
  2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
  3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

Different Evaluation Metrics for Clustering

When it comes to evaluating how well your clustering algorithm is working, there are a few key metrics that can help you get a clearer picture of your results. Here’s a rundown of the most useful ones:

  • Silhouette Analysis

Silhouette analysis is like a report card for your clusters. It measures how well each data point fits into its own cluster compared to other clusters. A high silhouette score means that your points are snugly fitting into their clusters and are quite distinct from points in other clusters. Imagine a score close to 1 as a sign that your clusters are well-defined and separated. Conversely, a score close to 0 indicates some overlap, and a negative score suggests that the clustering might need some work.

  • Inertia

Inertia is a bit like a gauge of how tightly packed your data points are within each cluster. It calculates the sum of squared distances from each point to the cluster's center (or centroid). Think of it as measuring how snugly the points are huddled together. Lower inertia means that points are closer to the centroid and to each other, which generally indicates that your clusters are well-formed. For most numeric data, you'll use Euclidean distance, but if your data includes categorical features, Manhattan distance might be better.

  • Dunn Index

The Dunn Index takes a broader view by considering both the distance within and between clusters. It’s calculated as the ratio of the smallest distance between any two clusters (inter-cluster distance) to the largest distance within a cluster (intra-cluster distance). A higher Dunn Index means that clusters are not only tight and cohesive internally but also well-separated from each other. In other words, you want your clusters to be as far apart as possible while being as compact as possible.
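
In Python, silhouette and inertia come straight from scikit-learn, while the Dunn Index has no built-in function there; below is a rough sketch with a hand-rolled Dunn Index on assumed synthetic data:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # synthetic data
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
labels = kmeans.labels_

print('Silhouette:', silhouette_score(X, labels))  # closer to 1 is better
print('Inertia:', kmeans.inertia_)                 # lower is better

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # smallest distance between points of two different clusters
    inter = min(cdist(a, b).min() for i, a in enumerate(clusters) for b in clusters[i + 1:])
    # largest distance between two points inside the same cluster
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra

print('Dunn Index:', dunn_index(X, labels))  # higher is better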

How Does K-Means Clustering Work?

The flowchart below shows how k-means clustering works:

[Image: k-means clustering flowchart]

The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use the trial-and-error method by specifying a value of K (e.g., 3, 4, or 5) and changing it as we progress until we get the best clusters.

Another method is to use the Elbow technique to determine the value of K. Once we have the value of K, the system assigns that many centroids randomly and measures the distance of each data point from these centroids. Each data point is then assigned to the centroid closest to it, giving us K initial clusters.

Next, the algorithm calculates the new centroid position for each newly formed cluster. Each centroid's position shifts relative to the randomly allocated one.

Once again, the distance of each point is measured from the new centroids. If required, the data points are reassigned to the nearest new centroid, and the mean position, or new centroid, is calculated once again.

If the centroids move, the iteration continues, indicating that the algorithm has not yet converged. Once the centroids stop moving (which means the clustering process has converged), the result is final.
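
In code, "the centroid stops moving" is usually checked by comparing consecutive centroid positions, roughly like this (the centroid arrays are assumed for illustration):

import numpy as np

old_centroids = np.array([[1.0, 2.0], [5.0, 6.0]])  # positions before the update
new_centroids = np.array([[1.0, 2.0], [5.0, 6.0]])  # positions after the update

# Converged once the centroids barely move between two consecutive iterations
converged = np.allclose(old_centroids, new_centroids, atol=1e-6)
print(converged)  # True -> stop iterating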

Let's use a visualization example to understand this better. 

We have a data set for a grocery shop, and we want to find out how many clusters the data should be spread across. To find the optimum number of clusters, we break it down into the following steps:

Step 1:

The Elbow method is a popular way to find the number of clusters. It consists of running K-Means clustering on the dataset for a range of K values.

Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters for a given data set. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.

[Image: within-cluster sum of squares (WSS)]

The WSS is measured for each value of K. Because WSS keeps decreasing as K increases, the optimum value is not the minimum WSS but the point where the decrease levels off: the "elbow."

Now, we draw a curve between WSS and the number of clusters. 

[Image: elbow curve of WSS versus number of clusters]

Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2. 

So, you can take the elbow point as the optimal value of K: it should be two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS; it stabilizes.

Step 2:

Let's assume that these are our delivery points:

[Image: delivery points]

We can randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly. 

Step 3:

Now the distance of each location from the centroids is measured, and each data point is assigned to the centroid closest to it.

This is how the initial grouping is done:

[Image: initial grouping of points around C1 and C2]

Step 4: 

Compute the actual centroid of data points for the first group.

Step 5:

Reposition the random centroid to the actual centroid.

[Image: the random centroid repositioned to the actual centroid of the first group]

Step 6: 

Compute the actual centroid of data points for the second group.

Step 7:

Reposition the random centroid to the actual centroid.

[Image: the random centroid repositioned to the actual centroid of the second group]

Step 8: 

Once the clusters become static, the k-means algorithm is said to have converged. 

The final cluster with centroids c1 and c2 is as shown below:

[Image: final clusters with centroids c1 and c2]

K-Means Clustering Algorithm 

Let's say we have x1, x2, x3, …, xn as our inputs, and we want to split them into K clusters. 

The steps to form clusters are:

Step 1: Choose K random points as cluster centers called centroids. 

Step 2: Assign each x(i) to the closest cluster by computing the Euclidean distance (i.e., calculating its distance to each centroid)

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Keep repeating step 2 and step 3 until convergence is achieved

Let's take a detailed look at each of these steps.

Step 1:

We randomly pick K points as our centroids. We name them c1, c2, …, ck, and we can say that 

C = {c1, c2, …, ck}

Where C is the set of all centroids.

Step 2:

We assign each data point to its nearest center, which is accomplished by calculating the Euclidean distance.

arg min_{ci ∈ C} dist(ci, x)²

Where dist() is the Euclidean distance.

Here, we calculate each x value's distance from each c value, i.e., the distance between x1 and c1, x1 and c2, x1 and c3, and so on. Then we find the lowest value and assign x1 to that particular centroid.

Similarly, we find the minimum distance for x2, x3, etc. 

Step 3: 

We identify the actual centroid by taking the average of all the points assigned to that cluster. 

ci = (1 / |Si|) Σ_{xi ∈ Si} xi

Where Si is the set of all points assigned to the ith cluster.     

It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups. 

Step 4:

Keep repeating step 2 and step 3 until convergence is achieved.
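
Putting the four steps together, here is a compact from-scratch sketch in NumPy (the toy data is an assumption; production implementations such as scikit-learn's add smarter initialization like k-means++ and handling for empty clusters):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid is the average of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until convergence (the centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6], [8.0, 8.0], [9.0, 11.0], [8.5, 9.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)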

How to Choose the Value of "K number of clusters" in K-Means Clustering?

Although many methods are available for choosing the optimal number of clusters, the Elbow Method is one of the most popular and appropriate. The Elbow Method uses the idea of the WCSS value, which is short for Within-Cluster Sum of Squares. WCSS measures the total variation within a cluster. This is the formula used to calculate the value of WCSS (for three clusters), provided courtesy of Javatpoint:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²
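
This quantity is what scikit-learn reports as inertia_; a small sketch (with assumed toy data) computes WCSS by hand and checks that the two values agree:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.5]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# WCSS: for each cluster, sum the squared distances from its points to its centroid
wcss = sum(((X[kmeans.labels_ == j] - c) ** 2).sum()
           for j, c in enumerate(kmeans.cluster_centers_))
print(wcss, kmeans.inertia_)  # the two values match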

Python Implementation of the K-Means Clustering Algorithm

Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the steps you need to take:

  • Data pre-processing
  • Finding the optimal number of clusters using the elbow method
  • Training the K-Means algorithm on the training data set
  • Visualizing the clusters

1. Data Pre-Processing. Import the libraries and the dataset, then extract the independent variables.

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values  # columns 3 and 4: Annual Income (k$) and Spending Score (1-100)

2. Find the optimal number of clusters using the elbow method. Here’s the code you use:

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)  # inertia_ is the WCSS for this value of K

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()

3. Train the K-means algorithm on the training dataset. Use the same KMeans call as in the previous section, but instead of looping over i, set n_clusters to 5, because the elbow plot for this dataset suggests five clusters. Here’s the code:

# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)  # cluster index (0-4) for each customer

4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.

# Visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Challenges With K-Means Clustering Algorithm

K-Means is a powerful tool, but it’s not without its hiccups. Here are a couple of common challenges you might face:

  • Uneven Cluster Sizes

One issue you might run into with K-means is when clusters vary in size. Picture this: you have clusters that are small and spread out on the edges, and a larger, more central cluster. When K-means is applied, it can struggle to evenly distribute the data. The algorithm might create clusters that don’t quite match the actual data distribution, leading to some clusters being too small or too large compared to others.

  • Different Densities

Another challenge comes up when the clusters have different densities. Imagine you have some clusters with tightly packed points and others where the points are more scattered. K-means might have trouble with this. It tends to group points based on distance from the cluster center, so tightly packed points might end up in one cluster, while scattered points could be split across different clusters, even if they actually belong together. This can lead to clusters that don’t accurately reflect the true structure of your data.

Demo: K-Means Clustering

Problem statement: Walmart wants to open a chain of stores across the state of Florida and find the optimal store locations to maximize revenue.

The issue here is that if they open too many stores close to each other, they will not make a profit. But if the stores are too far apart, they will not have enough sales coverage. 

Solution: Walmart is an e-commerce giant. Its database already contains customers' addresses, which it can use to perform K-Means Clustering and find the optimal store locations.
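
A rough sketch of that idea (the file 'customer_addresses.csv' with latitude and longitude columns, and the choice of 10 stores, are assumptions for illustration): cluster the customer coordinates and treat each centroid as a candidate store location.

import pandas as pd
from sklearn.cluster import KMeans

# Assumed input: one row per customer with geocoded coordinates
customers = pd.read_csv('customer_addresses.csv')  # columns: latitude, longitude
coords = customers[['latitude', 'longitude']].values

# Each centroid falls near a dense group of customers: a candidate store site.
# Note: Euclidean distance on raw lat/long is only a rough approximation,
# even over a state-sized region.
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10).fit(coords)
print(kmeans.cluster_centers_)  # proposed store coordinates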

Conclusion 

The Post Graduate Program in AI and Machine Learning, offered by Simplilearn in collaboration with Purdue University, is designed to equip you with advanced skills in artificial intelligence and machine learning. The course covers key topics such as deep learning, natural language processing, computer vision, and more, providing hands-on experience through real-world projects. With expert instructors and a curriculum aligned to industry demands, this program prepares learners to excel in AI-driven roles across various sectors. You will receive a Purdue certification upon completion, enhancing your career prospects in the rapidly growing AI and ML fields. Explore and enroll today!

FAQs

1. What is the K-Means clustering principle?

K-Means clustering is a method that sorts data into a set number of clusters, denoted as K. It starts by randomly placing cluster centers (centroids), then assigns each data point to the closest centroid. The algorithm updates the centroids based on the points assigned to them and repeats this until the clusters stabilize.

2. What is the difference between KNN and K-Means?

KNN (K-Nearest Neighbours) is a supervised algorithm used for classifying or predicting values based on the nearest neighbors of a data point. K-Means, on the other hand, is an unsupervised algorithm that groups data into clusters by finding similarities among data points without any pre-labeled categories.

3. What is an example of k-means in real life?

K-Means clustering can be used in real life for customer segmentation. For instance, a company might use K-Means to group customers based on their purchasing habits. This helps in tailoring marketing strategies and promotions to different customer groups, improving engagement and sales efficiency.

4. What does K stand for in the algorithm?

In K-Means clustering, "K" stands for the number of clusters you want to divide your data into. You set this number before running the algorithm, and K determines how many groups or clusters the algorithm will create by grouping similar data points together.

5. Which function is used for K-Means clustering?

For K-Means clustering, the “KMeans” function is commonly used in Python’s scikit-learn library, and “kmeans” is used in MATLAB. These functions help to organize data into clusters by repeatedly adjusting cluster centers until the best grouping is found.

About the Author

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in machine learning and artificial intelligence with Python.
