Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-means clustering is one of the unsupervised algorithms, where the available input data does not have a labeled response.
What is k-means clustering?
K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means divides objects into clusters so that objects within a cluster are similar to each other and dissimilar to the objects in other clusters. The K-Means clustering algorithm is an iterative process that tries to minimize the distance between each data point and the centroid (the average point) of its cluster.
The term ‘K’ is a number: you tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There are also ways of finding the best, or optimum, value of K for a given data set.
For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by the player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.
What is unsupervised learning?
Unsupervised learning is where you train a machine learning algorithm, but you don’t give it the answer to the problem. In unsupervised learning, the machine uses unlabeled data and learns on its own without any supervision. The machine tries to find a pattern in the unlabeled data and gives a response.
The distance measure determines the similarity between two elements and influences the shape of the clusters.
K-Means clustering supports various kinds of distance measures, such as:
- Euclidean distance measure
- Manhattan distance measure
- Squared Euclidean distance measure
- Cosine distance measure
Euclidean Distance Measure
The most common case is determining the distance between two points. If we have a point P and a point Q, the Euclidean distance is the length of the ordinary straight line between them in Euclidean space.
The formula for the distance between two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) is:

d(P, Q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)

To compute it:
- Take each feature (column) of the two points from your dataset;
- Subtract the corresponding features, for example p1 − q1 = X, p2 − q2 = Y, and so on;
- Square each difference and add the results: X² + Y² + Z² + …;
- Finally, take the square root of the sum.
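As a quick sketch of these steps in code (using NumPy; the helper name `euclidean_distance` is our own, not a library function):

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between points p and q:
    subtract corresponding features, square, sum, square root."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

# Example: the distance between (0, 0) and (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```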
Squared Euclidean Distance Measure
This is identical to the Euclidean distance measurement but does not take the square root at the end.
Manhattan Distance Measure
The Manhattan distance is the simple sum of the horizontal and vertical components or the distance between two points measured along axes at right angles.
Cosine Distance Measure
In this case, we take the angle between the two vectors formed by joining each point to the origin.
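The three remaining measures can also be sketched in a few lines of NumPy (the helper names below are ours, chosen for illustration):

```python
import numpy as np

def squared_euclidean(p, q):
    # Same as Euclidean distance, but without the final square root.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum((p - q) ** 2)

def manhattan(p, q):
    # Sum of the absolute differences along each axis.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

def cosine_distance(p, q):
    # One minus the cosine of the angle between the two vectors.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

print(squared_euclidean([0, 0], [3, 4]))  # 25.0
print(manhattan([0, 0], [3, 4]))          # 7.0
print(cosine_distance([1, 0], [0, 1]))    # 1.0 (orthogonal vectors)
```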
How Does K-Means Clustering Work?
The flowchart below shows how k-means clustering works:
The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use trial and error, specifying a value of K (e.g., 3, 4, 5) and changing it as we progress until we get the best clusters.
Another method is to use the Elbow technique to determine the value of K. Once we have the value of K, the system randomly assigns that many centroids and measures the distance of each data point from these centroids. Each data point is then assigned to the centroid closest to it. Thereby we have K initial clusters.
For the newly formed clusters, it calculates the new centroid position; the centroids move away from their randomly allocated starting positions. The distance of each point is then measured from these new centroids, data points are reassigned to the closest centroid if required, and the mean position (the new centroid) is calculated once again.
As long as the centroids keep moving, the iteration continues; the algorithm has not yet converged. Once the centroids stop moving, the clustering process has converged, and the result is reflected in the final clusters.
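The assign-then-reposition loop described above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the function name `k_means`, the random initialization scheme, and the toy cricket data are our own choices:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Distance of every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reposition each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Toy cricket data: (runs scored, wickets taken) per player.
data = [[90, 1], [85, 0], [80, 2], [10, 9], [15, 8], [12, 10]]
labels, centroids = k_means(data, k=2)
print(labels)     # batsmen and bowlers end up in different clusters
print(centroids)
```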
Let’s use a visualization example to understand this better.
We have a data set for a grocery shop, and we want to find out how many clusters this has to be spread across. To find the optimum number of clusters, we break it down into the following steps:
The Elbow method is the best way to find the number of clusters. It consists of running K-Means clustering on the dataset for a range of K values.
Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters for a given data set. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
The WSS is measured for each value of K. Because WSS keeps decreasing as K grows, the optimum value of K is not the one with the least WSS, but the ‘elbow’ point, beyond which adding more clusters brings only a small reduction in WSS.
Now, we draw a curve between WSS and the number of clusters.
Here, WSS is on the y-axis and the number of clusters on the x-axis.
You can see that the value of WSS changes only gradually as the K value increases beyond the elbow.
So, you can take the elbow point value as the optimal value of K; here it is two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS value; it stabilizes.
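A sketch of the elbow computation in NumPy follows. The three-group sample data, the helper names, and the compact k-means loop are our own, for illustration only:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    # Minimal k-means loop: assign points to the nearest centroid,
    # then move each centroid to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def wss(points, labels, centroids):
    # Sum of squared distances between each point and its own centroid.
    return sum(float(np.sum((points[labels == j] - c) ** 2))
               for j, c in enumerate(centroids))

# Made-up data with three well-separated groups (e.g. customer features).
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(loc, 1.0, size=(20, 2)) for loc in (0, 10, 20)])

for k in range(1, 6):
    labels, centroids = k_means(points, k)
    print(k, round(wss(points, labels, centroids), 1))
# Plot K against WSS and look for the bend in the curve: the elbow.
```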
Let’s assume that these are our delivery points:
We can randomly initialize two points called the cluster centroids.
Here, C1 and C2 are the centroids assigned randomly.
Now the distance of each location from the centroid is measured, and each data point is assigned to the centroid, which is closest to it.
This is how the initial grouping is done:
Compute the actual centroid of data points for the first group.
Reposition the random centroid to the actual centroid.
Compute the actual centroid of data points for the second group.
Reposition the random centroid to the actual centroid.
Once the cluster becomes static, the k-means algorithm is said to be converged.
The final cluster with centroids c1 and c2 is as shown below:
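The assign-and-reposition steps above can be sketched with two centroids. The delivery-point coordinates below are made up for illustration:

```python
import numpy as np

# Hypothetical delivery points (x, y) and two randomly chosen centroids.
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
c1, c2 = np.array([1.0, 1.0]), np.array([5.0, 7.0])

for step in range(10):
    # Assign each point to the closer of the two centroids.
    d1 = np.linalg.norm(points - c1, axis=1)
    d2 = np.linalg.norm(points - c2, axis=1)
    group1, group2 = points[d1 <= d2], points[d1 > d2]
    # Reposition each centroid to the actual mean of its group.
    new_c1, new_c2 = group1.mean(axis=0), group2.mean(axis=0)
    if np.allclose(new_c1, c1) and np.allclose(new_c2, c2):
        break  # centroids stopped moving: converged
    c1, c2 = new_c1, new_c2

print(c1, c2)  # final centroids, roughly (1.83, 2.33) and (4.33, 5.67)
```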
1. Spam filter
Do you know the junk folder in your email inbox? It is where emails that the algorithm has identified as spam end up.
Many machine learning courses, such as Andrew Ng’s famed Coursera course, use the spam filter as an introductory example.
What the problem is: Spam emails are at best an annoying part of modern-day marketing techniques, and at worst an example of people phishing for your personal data. To avoid getting these emails in your main inbox, email companies use algorithms whose purpose is to correctly flag an email as spam or not.
How clustering works: K-Means clustering techniques have proven to be an effective way of identifying spam. The way that it works is by looking at the different sections of the email (header, sender, and content). The data is then grouped.
These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%. This is excellent news for people who want to be sure they’re not missing out on their favorite newsletters and offers.
2. Classifying network traffic
Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.
What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important you know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on areas driving growth. However, it is hard to know which is which when it comes to classifying the traffic.
How clustering works: K-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. The process is faster and more accurate than the previous Autoclass method. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively.