Machine Learning was one of the hottest technologies of 2020, and as data grows day by day, the need for Machine Learning grows with it. Machine Learning is a vast field, with different algorithms and use cases in every domain and industry. One of its branches is Unsupervised Learning, where Clustering plays a central role.
Unsupervised learning is a technique in which the machine learns from unlabeled data. Since there are no labels, there is no "right answer" for the machine to learn from; instead, the machine finds patterns in the given data on its own to arrive at answers to the business problem.
Clustering is an Unsupervised Learning technique that groups unlabeled data. Given a cleaned data set, a clustering algorithm partitions the data points into groups. The algorithm assumes that data points in the same cluster have similar properties, while data points in different clusters have highly dissimilar properties.
In this article, we are going to learn about the need for clustering and about different types of clustering algorithms, along with their pros and cons.
What is the Need for Clustering?
Clustering is a widely used ML technique that allows us to find hidden relationships between the data points in our dataset. Some common applications are:
1) Customer segmentation: customers are grouped by similarity to previous customers, and the segments can be used for recommendations.
2) Organizing a collection of text documents by content similarity in order to build a topic hierarchy.
3) Image processing, mainly in biology research, for identifying underlying patterns.
4) Spam filtering.
5) Identifying fraudulent and criminal activities.
6) Player and team analysis in fantasy football and other sports.
Types of Clustering
There are many types of clustering algorithms in Machine Learning. We are going to discuss the following three in this article:
1) K-Means Clustering.
2) Mean-Shift Clustering.
3) Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
1. K-Means Clustering
K-Means is the most popular clustering algorithm in Machine Learning. It is used across many top industries and appears in most introductory courses, and it is one of the easiest models to start with, both to implement and to understand. It works as follows:
Step-1 We first select a value for k, the number of clusters, and randomly initialize the respective center points.
Step-2 Each data point is then classified by computing the distance (Euclidean or Manhattan) between that point and each group center, and assigning the point to the cluster whose center is closest to it.
Step-3 We recompute each group center by taking the mean of all the vectors in that group.
Step-4 We repeat these steps for a set number of iterations or until the group centers stop changing significantly.
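The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation (in practice you would reach for a library such as scikit-learn); the `init` parameter is an addition of this sketch, used only to keep the demo deterministic:

```python
import math
import random

def kmeans(points, k, iters=100, init=None, seed=0):
    """Minimal K-Means sketch: assign points to the nearest center, recompute means."""
    # Step 1: pick k random points as the initial centers (or use a supplied init)
    centers = list(init) if init else random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Step 2: assign each point to the closest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centers[i]))].append(p)
        # Step 3: recompute each center as the mean of the vectors in its group
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points]
    return centers, labels

# Two well-separated blobs; an explicit init keeps the demo reproducible
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, labels = kmeans(pts, k=2, init=[pts[0], pts[3]])
```

With a random initialization, different runs can converge to different centers, which is exactly the consistency drawback listed below.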
Pros:
1) Very fast.
2) Very few computations.
3) Linear complexity, O(n).
Cons:
1) The value of k must be selected manually.
2) Different runs can produce different cluster centers.
3) Lack of consistency across runs.
2. Mean-Shift Clustering
Mean-Shift clustering is a sliding-window-based algorithm that tries to identify dense areas of data points. It is a centroid-based algorithm: the goal is to locate the center point of each class, which it does by iteratively updating candidate center points to be the mean of the points inside the sliding window.
These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centers and their corresponding classes.
Step-1 We begin with a circular sliding window centered at a randomly selected point C and having radius r as the kernel. Mean shift is a hill-climbing type of algorithm that iteratively shifts this kernel to a higher-density region at each step until convergence.
Step-2 At each iteration the sliding window is shifted toward regions of higher density by moving the center point to the mean of the points within the window. The density within the window increases with the number of points inside it, so shifting to the mean of the points in the window gradually moves it toward areas of higher point density.
Step-3 We continue to shift the sliding window according to the mean until there is no direction in which a shift can capture more points inside the kernel.
Step-4 Steps 1-3 are carried out with many sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is kept. The data points are then clustered according to the sliding window in which they reside.
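The steps above can be sketched in plain Python. This is a toy flat-kernel version for illustration only: the radius `r`, the convergence tolerance, and the merge rule in the post-processing stage are all simplifications of this sketch:

```python
import math

def mean_shift(points, r, iters=100, tol=1e-6):
    """Minimal Mean-Shift sketch: slide a window of radius r from each point to a density mode."""
    modes = []
    for p in points:
        center = p
        for _ in range(iters):
            # collect the points inside the current window (flat circular kernel)
            window = [q for q in points if math.dist(q, center) <= r]
            # shift the window center to the mean of the points inside it
            new_center = tuple(sum(coord) / len(window) for coord in zip(*window))
            if math.dist(new_center, center) < tol:
                break  # converged: no shift captures a denser region
            center = new_center
        modes.append(center)
    # post-processing: merge modes that ended up within r of each other
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= r:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

# Two dense blobs: each point's window climbs to its blob's mode
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, labels = mean_shift(pts, r=1.0)
```

Note that the number of clusters (two, here) falls out of the data; only the window radius `r` had to be chosen, which is the trade-off listed below.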
Pros:
1) No need to select the number of clusters.
2) The clusters fit the data in a naturally data-driven sense.
Cons:
1) The main drawback is that selecting the window size (radius r) can be non-trivial.
3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is similar to Mean-Shift clustering in that it is also a density-based algorithm, but with a few key differences.
Step-1 It begins with an arbitrary starting point; the neighborhood of this point is extracted using a distance threshold called epsilon.
Step-2 If the neighborhood contains enough points, clustering starts and the data point becomes the first point of a new cluster. If there are not enough points, the point is labelled as noise and marked as visited.
Step-3 The points within the epsilon distance become part of the cluster. This procedure is then repeated for the points newly added to the cluster.
Step-4 Steps 2 and 3 are repeated until all points in the cluster have been visited and labelled.
Step-5 Once the current cluster is complete, a new unvisited point is processed, leading either to a new cluster or to the point being labelled as noise.
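A minimal sketch of these steps in plain Python follows. It is illustrative only: `eps` is the epsilon distance and `min_pts` is the minimum neighborhood size that makes a point a core point (the "enough points" test in Step-2):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, or -1 for noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        # Step 1: extract the epsilon-neighborhood of the point
        neighbours = [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]
        if len(neighbours) < min_pts:
            labels[i] = -1  # Step 2: not enough points -> noise (for now)
            continue
        cluster_id += 1  # Step 2: enough points -> start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:  # Steps 3-4: grow the cluster until every member is visited
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise point absorbed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neigh = [m for m, q in enumerate(points) if math.dist(points[j], q) <= eps]
            if len(j_neigh) >= min_pts:  # j is a core point: keep expanding through it
                queue.extend(m for m in j_neigh if labels[m] is None)
    return labels

# Two dense blobs plus one isolated outlier, which DBSCAN labels as noise (-1)
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (10.0, 10.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```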
Pros:
1) No need to set the number of clusters.
2) Identifies outliers as noise.
3) Finds arbitrarily sized and arbitrarily shaped clusters quite well.
Cons:
1) Does not perform well on clusters of varying density.
2) Does not perform well with high-dimensional data.
In this article, we learned about the need for clustering in the current market and about different types of clustering algorithms, along with their pros and cons. Clustering is a very interesting topic in Machine Learning, and there are many other clustering algorithms worth learning.
If you're interested in learning more about machine learning, check out IIIT-B & upGrad's PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.
What is meant by gaussian mixture clustering?
Gaussian mixture models are typically used to perform either hard or soft clustering on query data. A Gaussian mixture model makes a few assumptions in order to cluster well and, based on those assumptions, groups together the data points that belong to a single distribution. These are probabilistic models, and they use a soft clustering approach: each point receives a probability of belonging to each cluster rather than a single hard assignment.
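To make "soft clustering" concrete, here is a small sketch of how a two-component 1-D Gaussian mixture assigns a point a membership probability (responsibility) for each component instead of one hard label. The component means, standard deviations, and mixing weights here are hypothetical values chosen for illustration:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def soft_assign(x, components, weights):
    """Responsibility of each component for x: weighted density, normalised to sum to 1."""
    scores = [w * gaussian_pdf(x, m, s) for (m, s), w in zip(components, weights)]
    total = sum(scores)
    return [s / total for s in scores]

# two hypothetical components (mean, std) with equal mixing weights
components = [(0.0, 1.0), (5.0, 1.0)]
weights = [0.5, 0.5]

# a point near the first component gets almost all of its membership there
resp = soft_assign(0.2, components, weights)
```

In a full Gaussian mixture model these responsibilities are recomputed while the component parameters are re-estimated (the EM algorithm); the sketch shows only the soft-assignment step.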
What is the silhouette coefficient in clustering?
In order to measure how well the clustering has been carried out, we use the silhouette coefficient. For each point, its average distance to the other points in its own cluster is compared with its average distance to the nearest other cluster, and the silhouette width is calculated from these two values. Averaging over all points lets us compare clusterings, estimate the optimal number of clusters in the given data, and thus assess the efficiency of the clustering.
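A short plain-Python sketch of the silhouette computation: for each point, `a` is the mean distance to its own cluster, `b` is the mean distance to the nearest other cluster, and the point's silhouette is `(b - a) / max(a, b)`; the score is the average over all points.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, averaged."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette conventionally 0
            continue
        # a: mean distance to the point's own cluster
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for l2, other in clusters.items() if l2 != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# compact, well-separated blobs score close to the maximum of 1
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
score = silhouette_score(pts, [0, 0, 0, 1, 1, 1])
```

Running this for several candidate values of k and keeping the k with the highest score is one common way to choose the number of clusters.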
What is meant by fuzzy clustering in machine learning?
When data points can belong to more than one cluster or group, a fuzzy clustering method is used, typically the Fuzzy C-Means (also called fuzzy K-Means) algorithm. It is a soft clustering method. Based on the distance between each data point and each cluster center, the method assigns that point a membership value for every cluster center.
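The membership step described above can be sketched as follows, using the standard Fuzzy C-Means membership formula with fuzzifier `m` (the cluster centers in the demo are hypothetical values):

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster center (closer => higher)."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:  # point coincides with a center: full membership there
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)); memberships sum to 1
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# a point three times closer to the first of two hypothetical centers
u = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

Unlike a hard K-Means assignment, the point keeps a nonzero membership in the farther cluster, which is what makes the method "soft".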