Whenever we talk about data analysis, outliers come up. They may indicate errors or genuine novelty. Data mining and analysis involve examining data and predicting the information it holds. Sometimes certain objects in a dataset deviate from the others; these deviating objects are termed outliers. They are mostly generated by errors in measurement or execution. In 1969, Grubbs gave the first definition of an outlier.
Errors such as computational errors or the incorrect entry of an object cause outliers. Outliers differ from noise in the following ways:
- Whenever a random error occurs in a measured variable, or the variable shows natural variance, this is termed noise.
- Before detecting the outliers present in a dataset, it is advisable to remove the noise.
Outlier Types
Broadly, outliers may be classified into univariate outliers and multivariate outliers.
- Outliers that occur in a single-dimensional feature space are known as univariate outliers.
- Outliers that occur in an n-dimensional feature space are known as multivariate outliers. Observing an n-dimensional distribution is a difficult task for the human brain, so a model is trained to observe such distribution patterns.
Based on their nature, outliers may also be classified into:
- Point outliers: single data points that lie far away from the rest of the distribution.
- Contextual outliers: as the name suggests, these outliers occur within a context, such as background noise in a speech-recognition signal. They arise when a data instance is anomalous within a specific context or condition. Data objects have two types of attributes: contextual attributes and behavioural attributes. The former define the context, whereas the latter define the object’s characteristics.
- Collective outliers: these occur when a collection of data points behaves anomalously as a group. They can indicate novelty in the data.
Analysis of Outliers
Outliers are mostly discarded when data mining methods are applied, but outlier analysis is still used in applications such as fraud detection. This is mainly because rare events can carry much more interesting information than events that occur regularly.
Other applications where outlier detection plays a major role are:
- Detection of fraud in the insurance, credit card, and healthcare sectors.
- Fraud detection in telecom.
- In cybersecurity for detecting any form of intrusion.
- In the field of medical analysis.
- Detection of faults in safety-critical systems.
- In marketing, outlier analysis helps in identifying customers’ spending patterns.
- Any unusual responses to certain medical treatments can be analyzed through outlier analysis in data mining.
The process of identifying the anomalous behaviour of outliers in a dataset is known as outlier analysis. Also known as “outlier mining”, it is considered an important task of data mining.
Outlier Detection Techniques
Various techniques, combined with different approaches, are applied to detect anomalous behaviour in a dataset. A few techniques used for outlier detection are:
1. Sorting
- It is one of the easiest ways of detecting outliers in data mining.
- The method involves sorting the data by magnitude in any data-manipulation tool.
- Observing the sorted data then helps identify objects whose values are unusually high or unusually low.
- These objects can be treated as outliers.
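For instance, a minimal pandas sketch of this idea (the column name and the sample values here are made up) could look like the following:
```python
import pandas as pd

# Hypothetical single-column dataset with a couple of extreme values mixed in.
df = pd.DataFrame({"value": [12, 15, 14, 13, 11, 16, 120, 14, 13, -45]})

# Sort the values in ascending order so unusually small or large
# observations end up at either end.
sorted_df = df.sort_values("value")

# Inspect both ends of the sorted data; values far from the bulk
# (here -45 and 120) are candidates to be treated as outliers.
print(sorted_df.head(3))
print(sorted_df.tail(3))
```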
2. Data graphing for detecting outliers
- The technique involves plotting all the data points on a graph, which allows the observer to see which data points diverge from the other objects in the dataset.
- The outliers are then easier to spot.
- The types of plots that can be used for detecting outliers in data mining include the histogram, the scatter plot, and the box plot.
- In a histogram, bars that sit far away from the bulk of the data represent the outliers.
- If we consider two numerical variables, their degree of association is well understood through a scatter plot; an observation lying far away from that association represents an outlier.
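As a rough illustration, here is a small matplotlib sketch (the sample data and the injected extreme values are hypothetical) that draws a histogram and a box plot of the same values:
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical sample: mostly well-behaved values plus a few injected extremes.
values = np.concatenate([rng.normal(50, 5, 200), [95, 5, 110]])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: isolated bars lying far from the main mass of the data suggest outliers.
axes[0].hist(values, bins=30)
axes[0].set_title("Histogram")

# Box plot: points drawn beyond the whiskers are flagged as potential outliers.
axes[1].boxplot(values)
axes[1].set_title("Box plot")

plt.tight_layout()
plt.show()
```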
3. Z-Score for detecting outliers
- The Z-score identifies how far a data point deviates from the sample mean, measured in standard deviations. A Gaussian distribution is assumed in this case.
- Where a Gaussian distribution does not describe the data, transformations such as scaling are applied first.
- Python libraries such as Scikit-Learn and SciPy have built-in functions for easy implementation of these transformations, and they build on NumPy and Pandas.
- A Z-score of 2 indicates that an object lies two standard deviations above the mean, while a value of -2 indicates that the observation lies two standard deviations below the mean.
- For any point x in the dataset, the Z-score is calculated as z = (x − μ) / σ, where μ is the sample mean and σ is the sample standard deviation. A standard threshold is then defined for the Z-score: it is unusual for a value to lie far from zero, and a Z-score beyond roughly ±3 is commonly used to identify outliers.
- If a parametric distribution is assumed in a low-dimensional feature space, the Z-score is a powerful method for removing outliers from a dataset. Methods such as Isolation Forest and DBSCAN can be used where the distribution is non-parametric.
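A minimal sketch of this Z-score rule, assuming roughly Gaussian data and the usual ±3 threshold (the sample values are made up and the threshold is adjustable), could look like this with NumPy and SciPy:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample assumed to be roughly Gaussian, with one extreme value appended.
values = np.concatenate([rng.normal(loc=10, scale=1, size=50), [25.0]])

# Z-score: how many standard deviations each point lies from the sample mean.
z = stats.zscore(values)

# Commonly used (but adjustable) rule: flag points whose |z| exceeds 3.
outliers = values[np.abs(z) > 3]
print(outliers)  # expected to contain only the appended 25.0
```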
4. DBSCAN
- This is a clustering approach; DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
- Clustering methods are useful for better visualization and understanding of the data.
- DBSCAN can be used to represent graphically the relationships between features and the trends in the dataset.
- This density-based clustering algorithm identifies neighbouring objects by density within an n-dimensional sphere of radius ‘ε’. A cluster identified in the feature space through this method is a set of points connected through density.
- DBSCAN defines three classes of data points: core points, border points, and outliers.
- A core point is a point whose neighbourhood contains at least ‘MinPts’ points.
- A border point is a point that lies within a cluster but has fewer than ‘MinPts’ points in its neighbourhood; it can still be density-reached from other points in the cluster.
- An outlier is a point that is not present in any cluster and is not density-connected to other points.
- Two properties must be satisfied for a cluster to be defined: its points must be mutually density-connected, and any point that is density-reachable from a point of the cluster is also part of the cluster.
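A minimal scikit-learn sketch of DBSCAN-based outlier flagging, using hypothetical two-dimensional data and illustrative values for eps (ε) and min_samples (MinPts), might look like this:
```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical 2-D data: two dense clusters plus a few isolated points.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
isolated = np.array([[2.5, 2.5], [8.0, 0.0], [-4.0, 6.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

# eps is the neighbourhood radius (epsilon); min_samples plays the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points assigned the label -1 belong to no cluster and are treated as outliers.
outliers = X[labels == -1]
print(f"{len(outliers)} points flagged as outliers")
```
The points labelled -1 by scikit-learn’s DBSCAN are exactly those that belong to no density-connected cluster, which matches the definition of an outlier given above.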
5. Isolation Forests
- For detecting novelties or outliers, this is one of the most effective methods.
- The method is based on the application of binary trees.
- The basic principle behind an isolation forest is that outliers are few in number and deviate far from the other observations in the data.
- The algorithm picks a feature and performs a random split at a value lying between that feature’s minimum and maximum. A forest of such trees is built over the observations in the dataset.
- Predictions are made by comparing how many splits are needed to isolate an observation.
- The number of splittings the algorithm generates for an instance is defined as its ‘path length’.
- An outlier has a shorter path length than the other observations in the dataset.
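A minimal scikit-learn sketch of an Isolation Forest, with made-up data and an assumed contamination fraction, might look like this:
```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Hypothetical data: mostly normal 2-D observations with a few extreme rows appended.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(300, 2)),
    [[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]],
])

# contamination is the assumed fraction of outliers in the data; tune it per dataset.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
preds = model.fit_predict(X)  # +1 = inlier, -1 = outlier

outliers = X[preds == -1]
print(f"{len(outliers)} observations isolated as outliers")
```
The contamination parameter encodes the assumption that outliers are few; it is an illustrative choice here and should be adjusted to the dataset at hand.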
The approaches for outlier analysis in data mining can also be grouped into statistical methods, supervised methods, and unsupervised methods of outlier detection.
- Statistical methods include techniques such as data graphing and the Z-score for identifying outliers. When a single outlier is to be detected, the Grubbs test is recommended.
- Supervised methods use a training dataset whose labelled instances identify the classes within the data, including the outliers.
- In an unsupervised method, there are no labelled instances; predictions are made on the assumption that the majority of instances in the dataset are normal.
Conclusion
In data analysis, most people tend to remove outliers so that they do not distort predictions. However, there are scenarios where outlier detection plays an important role, such as fraud detection. Either way, detecting outliers is important for data mining. Several methods used for detecting these anomalies in the data are covered in this article.
If you are interested in mastering other aspects of data mining, you can check out the “Executive PG Programme in Data Science” offered by upGrad. Open to entry-level professionals between 21 and 45 years of age, the online course offers training by top industry leaders. The twelve-month course covers more than 60 industry projects and one-on-one interaction with industry partners, thereby building a great future ahead of you. If you have any queries, contact our assistance team for help.
Frequently Asked Questions
Why are outliers caused and what are the ways to handle them?
Outliers are data values causing extreme deviations in a dataset. These deviations are different from errors, and significant outliers can cause misleading outcomes in data analysis. They are generally caused by faulty computation or errors in entering values into the dataset. There are three simple methods to handle outliers: the univariate method, the multivariate method, and the Minkowski error. These techniques are applied according to the underlying cause of the outliers, and sometimes more than one technique has to be combined to deal with them. Outliers can also affect data analysis results positively, since they can have significant effects on statistical results, so whether an outlier should be removed depends entirely on its type.
What are the key steps to detect all outliers?
The following are the three key steps to detect all outliers in data mining:
1. Choose the right model and distribution for each time series. This is important because a time series can be stationary, non-stationary, discrete, and so on, and the models for each of these types are different.
2. Identify the seasonal and trend patterns. If they are not found, it is nearly impossible to identify contextual and collective outliers. For an automated anomaly detection system, the pattern must be identified automatically, not manually.
3. Determine the relationships between the different time series and patterns, and identify the anomalies accordingly.
Differentiate between univariate and multivariate methods.
The univariate method is the simplest way to handle an outlier. Since it deals with a single variable, it does not examine any relationships; its main purpose is to analyze the data and determine the pattern associated with it. Mean, median, and mode are examples of patterns found in univariate data. The multivariate method, on the other hand, analyzes three or more variables. It is more precise than the univariate method since, unlike the univariate method, it deals with relationships and patterns. Additive trees, canonical correlation analysis, and cluster analysis are some of the ways to perform multivariate analysis.