Summary:
In this article, you will learn about data preprocessing in Machine Learning: 7 easy steps to follow.
- Acquire the dataset
- Import all the crucial libraries
- Import the dataset
- Identifying and handling the missing values
- Encoding the categorical data
- Splitting the dataset
- Feature scaling
Read more to know each in detail.
Data preprocessing in Machine Learning is a crucial step that enhances the quality of data and promotes the extraction of meaningful insights from it. It refers to the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models. In simple words, data preprocessing is a data mining technique that transforms raw data into an understandable and readable format.
Data Preprocessing In Machine Learning: What Is It?
Data preprocessing steps are a part of the data analysis and mining process responsible for converting raw data into a format understandable by the ML algorithms.
Text, photos, video, and other types of unprocessed, real-world data are disorganized. Such data is often inaccurate and inconsistent, is frequently incomplete, and rarely follows a regular, consistent design. Machines prefer to process neat, orderly information; ultimately, they read data as binary – 1s and 0s.
Structured data like whole numbers and percentages is therefore simple to work with. But unstructured data, such as text and photos, must be prepped and formatted before analysis – and that is exactly what data preprocessing in Machine Learning does.
Now that you know what data preprocessing in machine learning is, explore the major tasks it involves.
Data Preprocessing Steps In Machine Learning: Major Tasks Involved
Data cleaning, Data transformation, Data reduction, and Data integration are the major steps in data preprocessing.
Data Cleaning
Data cleaning, one of the major preprocessing steps in machine learning, locates and fixes errors or discrepancies in the data. From duplicates and outliers to missing numbers, it fixes them all. Methods like transformation, removal, and imputation help ML professionals perform data cleaning seamlessly.
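For a quick illustration (a minimal sketch, assuming a small hypothetical pandas DataFrame named df), removal and imputation might look like this:
import pandas as pd
# hypothetical raw data with a duplicate row and missing values
df = pd.DataFrame({'age': [25, 25, None, 40], 'salary': [50000, 50000, 62000, None]})
df = df.drop_duplicates()                                 # removal: discard duplicate rows
df['age'] = df['age'].fillna(df['age'].median())          # imputation: fill missing ages with the median
df['salary'] = df['salary'].fillna(df['salary'].mean())   # imputation: fill missing salaries with the mean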
Data Integration
Data integration is among the major responsibilities of data preprocessing in machine learning. This process integrates (merges) information extracted from multiple sources to outline and create a single dataset. The fact that you need to handle data in multiple forms, formats, and semantics makes data integration a challenging task for many ML developers.
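As a minimal sketch of integration, assuming two hypothetical tables from different sources that share a customer_id key, pandas can merge them into a single dataset:
import pandas as pd
customers = pd.DataFrame({'customer_id': [1, 2], 'country': ['India', 'France']})
orders = pd.DataFrame({'customer_id': [1, 2], 'amount': [250.0, 120.0]})
# join the two sources on the shared key to form one combined dataset
merged = pd.merge(customers, orders, on='customer_id', how='inner')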
Data Transformation
ML programmers must pay close attention to data transformation when it comes to data preprocessing steps. This process entails putting the data in a format that allows for analysis. Normalization, standardization, and discretization are common data transformation procedures: normalization scales data to a common range, standardization transforms data to have zero mean and unit variance, and discretization converts continuous data into discrete categories.
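Here is a rough sketch of the three transformations, assuming a hypothetical one-column age array and scikit-learn's preprocessing utilities:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer
ages = np.array([[30.0], [38.0], [43.0], [50.0]])                           # hypothetical continuous feature
normalized = MinMaxScaler().fit_transform(ages)                             # normalization: scale to [0, 1]
standardized = StandardScaler().fit_transform(ages)                         # standardization: zero mean, unit variance
binned = KBinsDiscretizer(n_bins=2, encode='ordinal').fit_transform(ages)   # discretization: continuous values to bins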
Data Reduction
Data reduction is the process of lowering the dataset’s size while maintaining crucial information. It can be accomplished through feature selection and feature extraction algorithms. Feature selection chooses a subset of pertinent features from the dataset, while feature extraction translates the data into a lower-dimensional space while keeping the crucial information.
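A minimal sketch of both approaches, using scikit-learn's SelectKBest (feature selection) and PCA (feature extraction) on the built-in iris dataset purely as an illustration:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
X, y = load_iris(return_X_y=True)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)   # feature selection: keep the 2 most relevant columns
X_reduced = PCA(n_components=2).fit_transform(X)               # feature extraction: project onto 2 components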
Why Data Preprocessing in Machine Learning?
When it comes to creating a Machine Learning model, data preprocessing is the first step marking the initiation of the process. Typically, real-world data is incomplete, inconsistent, inaccurate (contains errors or outliers), and often lacks specific attribute values/trends. This is where data preprocessing enters the scenario – it helps to clean, format, and organize the raw data, thereby making it ready-to-go for Machine Learning models. Let’s explore various steps of data preprocessing in machine learning.
Steps in Data Preprocessing in Machine Learning
There are seven core steps in data preprocessing in Machine Learning, along with a few closely related tasks (handling outliers, imbalanced datasets, and feature engineering) covered below:
1. Acquire the dataset
Acquiring the dataset is the first step in data preprocessing in machine learning. To build and develop Machine Learning models, you must first acquire the relevant dataset. This dataset comprises data gathered from multiple, disparate sources and combined in a proper format to form a dataset. Dataset formats differ according to use cases. For instance, a business dataset will be entirely different from a medical dataset. While a business dataset will contain relevant industry and business data, a medical dataset will include healthcare-related data.
There are several online sources from which you can download datasets, such as https://www.kaggle.com/uciml/datasets and https://archive.ics.uci.edu/ml/index.php. You can also create a dataset by collecting data via different Python APIs. Once the dataset is ready, save it in a CSV, HTML, or XLSX file format.
2. Import all the crucial libraries
Since Python is the most extensively used and most preferred language among Data Scientists around the world, we’ll show you how to import Python libraries for data preprocessing in Machine Learning. Python’s predefined libraries can perform specific data preprocessing jobs. Importing all the crucial libraries is the second step in data preprocessing in machine learning. The three core Python libraries used for data preprocessing in Machine Learning are:
- NumPy – NumPy is the fundamental package for scientific computation in Python. Hence, it is used for performing any type of mathematical operation in the code. Using NumPy, you can also work with large multidimensional arrays and matrices in your code.
- Pandas – Pandas is an excellent open-source Python library for data manipulation and analysis. It is extensively used for importing and managing the datasets. It packs in high-performance, easy-to-use data structures and data analysis tools for Python.
- Matplotlib – Matplotlib is a Python 2D plotting library used to plot any type of chart in Python. It can deliver publication-quality figures in numerous hard-copy formats and interactive environments across platforms (IPython shells, Jupyter notebooks, web application servers, etc.).
3. Import the dataset
In this step, you need to import the dataset/s that you have gathered for the ML project at hand. Importing the dataset is one of the important steps in data preprocessing in machine learning. However, before you can import the dataset/s, you must set the current directory as the working directory. You can set the working directory in Spyder IDE in three simple steps:
- Save your Python file in the directory containing the dataset.
- Go to the File Explorer option in the Spyder IDE and choose the required directory.
- Now, press F5 or click the Run option to execute the file.
Once you’ve set the working directory containing the relevant dataset, you can import the dataset using the “read_csv()” function of the Pandas library. This function can read a CSV file (either locally or through a URL) and also perform various operations on it. The read_csv() is written as:
data_set= pd.read_csv('Dataset.csv')
In this line of code, “data_set” denotes the name of the variable wherein you stored the dataset. The function contains the name of the dataset as well. Once you execute this code, the dataset will be successfully imported.
During the dataset importing process, there’s another essential thing you must do – extracting dependent and independent variables. For every Machine Learning model, it is necessary to separate the independent variables (matrix of features) and dependent variables in a dataset.
Consider this dataset:
This dataset contains three independent variables – country, age, and salary, and one dependent variable – purchased.
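If you do not have the original Dataset.csv file, the following sketch reconstructs the same ten records from the arrays printed below (the column names are assumed) so you can recreate the file and follow along:
import pandas as pd
data = {
'Country': ['India', 'France', 'Germany', 'France', 'Germany', 'India', 'Germany', 'France', 'India', 'France'],
'Age': [38, 43, 30, 48, 40, 35, None, 49, 50, 37],
'Salary': [68000, 45000, 54000, 65000, None, 58000, 53000, 79000, 88000, 77000],
'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
}
pd.DataFrame(data).to_csv('Dataset.csv', index=False)   # None becomes an empty cell, read back as NaN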
How to extract the independent variables?
To extract the independent variables, you can use “iloc[ ]” function of the Pandas library. This function can extract selected rows and columns from the dataset.
x= data_set.iloc[:,:-1].values
In the line of code above, the first colon(:) considers all the rows and the second colon(:) considers all the columns. The code contains “:-1” since you have to leave out the last column containing the dependent variable. By executing this code, you will obtain the matrix of features, like this –
[['India' 38.0 68000.0]
['France' 43.0 45000.0]
['Germany' 30.0 54000.0]
['France' 48.0 65000.0]
['Germany' 40.0 nan]
['India' 35.0 58000.0]
['Germany' nan 53000.0]
['France' 49.0 79000.0]
['India' 50.0 88000.0]
['France' 37.0 77000.0]]
How to extract the dependent variable?
You can use the “iloc[ ]” function to extract the dependent variable as well. Here’s how you write it:
y= data_set.iloc[:,3].values
This line of code considers all the rows but only the last column. By executing the above code, you will get the array of dependent variables, like so –
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
4. Identifying and handling the missing values
In data preprocessing, it is pivotal to identify and correctly handle missing values; failing to do this, you might draw inaccurate and faulty conclusions and inferences from the data. Needless to say, this will hamper your ML project.
Basically, there are two ways to handle missing data:
- Deleting a particular row – In this method, you remove a specific row that has a null value for a feature, or a particular column where more than 75% of the values are missing. However, this method is not 100% efficient, and it is recommended that you use it only when the dataset has adequate samples. You must also ensure that deleting the data does not introduce bias.
- Calculating the mean – This method is useful for features with numeric data like age, salary, year, etc. Here, you calculate the mean, median, or mode of the particular feature, column, or row that contains a missing value and substitute the result for the missing value. This can add variance to the dataset, but any loss of data is efficiently negated, so it yields better results than simply omitting rows/columns. Another way of approximation is using the deviation of neighbouring values, which works best for linear data. A minimal imputation sketch follows this list.
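Here is a minimal mean-imputation sketch, assuming the feature matrix x extracted earlier and scikit-learn's SimpleImputer (the current replacement for the library's older Imputer class):
import numpy as np
from sklearn.impute import SimpleImputer
# replace missing Age and Salary entries (columns 1 and 2 of x) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])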
5. Encoding the categorical data
Categorical data refers to the information that has specific categories within the dataset. In the dataset cited above, there are two categorical variables – country and purchased.
Machine Learning models are primarily based on mathematical equations. Thus, you can intuitively understand that keeping categorical data in the equation will cause certain issues, since the equations need only numbers.
How to encode the country variable?
As seen in our dataset example, the country column will cause problems, so you must convert it into numerical values. To do so, you can use the LabelEncoder() class from the scikit-learn library. The code will be as follows –
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
And the output will be –
Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Here we can see that the LabelEncoder class has successfully encoded the categories into digits. However, the country categories are encoded as 0, 1, and 2 in the output shown above, so the ML model may assume that there is some ordering or correlation between them, thereby producing faulty output. To eliminate this issue, we will now use Dummy Encoding.
Dummy variables are those that take the values 0 or 1 to indicate the absence or presence of a specific categorical effect that can shift the outcome. In this case, the value 1 indicates the presence of that variable in a particular column while the other variables become of value 0. In dummy encoding, the number of columns equals the number of categories.
Since our dataset has three categories, it will produce three columns having the values 0 and 1. For Dummy Encoding, we will use the OneHotEncoder class of the scikit-learn library, applied to the country column through a ColumnTransformer. The input code will be as follows –
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#(OneHotEncoder's old categorical_features argument was removed from newer
# scikit-learn versions, so ColumnTransformer selects the column to encode)
onehot_encoder= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= onehot_encoder.fit_transform(x).astype(float)
On execution of this code, you will get the following output –
array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])
In the output shown above, the country variable is divided into three columns encoded with the values 0 and 1, followed by the age and salary columns.
How to encode the purchased variable?
For the second categorical variable, that is, purchased, you can use the “labelencoder” object of the LabelEncoder class. We are not using the OneHotEncoder class since the purchased variable has only two categories, yes or no, which are encoded into 0 and 1.
The input code for this variable will be –
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
The output will be –
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
6. Handling Outliers in Data Preprocessing
Outliers are data points that significantly deviate from the rest of the dataset. These anomalies can skew the results of machine learning models, leading to inaccurate predictions. In the context of data preprocessing, identifying and handling outliers is crucial. Outliers can arise due to measurement errors, data corruption, or genuinely unusual observations.
Detecting outliers often involves using statistical methods such as the Z-score, which measures how many standard deviations a data point lies away from the mean. Another method is the Interquartile Range (IQR), which flags data points falling outside a set multiple of the range between the first and third quartiles.
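As a rough sketch of both detection methods on a hypothetical salary array (the cutoffs used here are common conventions, not fixed rules):
import numpy as np
salaries = np.array([45000, 54000, 58000, 65000, 68000, 77000, 79000, 88000, 250000])
# Z-score: flag points far from the mean in standard-deviation units (a cutoff of 2 or 3 is typical)
z_scores = (salaries - salaries.mean()) / salaries.std()
z_outliers = salaries[np.abs(z_scores) > 2]
# IQR: flag points more than 1.5 * IQR beyond the first and third quartiles
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
iqr_outliers = salaries[(salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)]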
Once outliers are detected, there are several ways to handle them:
- Removal
Outliers can be removed from the dataset if erroneous or irrelevant. However, this should be done cautiously, as removing outliers can impact the representativeness of the data.
- Transformation
Transforming the data using techniques like log transformation or winsorization can reduce the impact of outliers without completely discarding them (a minimal clipping sketch follows this list).
- Imputation
Outliers can be replaced with more typical values through mean, median, or regression-based imputation methods.
- Binning or Discretization
Binning involves dividing the range of values into a set of intervals or bins and then assigning the outlier values to the nearest bin. This technique can help mitigate the effect of extreme values by grouping them with nearby values.
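To make the transformation option concrete, here is a minimal winsorization-by-clipping sketch on the same hypothetical salary array, capping values at the (assumed) 5th and 95th percentiles:
import numpy as np
salaries = np.array([45000, 54000, 58000, 65000, 68000, 77000, 79000, 88000, 250000])
# winsorization via clipping: cap extreme values at chosen percentile bounds instead of removing them
low, high = np.percentile(salaries, [5, 95])
salaries_winsorized = np.clip(salaries, low, high)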
7. Splitting the dataset
Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset for a Machine Learning model must be split into two separate sets – a training set and a test set.
The training set denotes the subset of the dataset that is used for training the machine learning model; here, you already know the expected output. The test set, on the other hand, is the subset of the dataset used for testing the trained model; the model predicts outcomes for this set.
Usually, the dataset is split in a 70:30 or 80:20 ratio. This means that you take 70% or 80% of the data for training the model while leaving out the remaining 30% or 20% for testing. The exact split varies according to the shape and size of the dataset in question.
To split the dataset, you have to write the following line of code –
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Here, the first line imports the train_test_split function, and the second line splits the arrays of the dataset into random train and test subsets, assigning them to four variables:
- x_train – features for the training data
- x_test – features for the test data
- y_train – dependent variables for training data
- y_test – dependent variables for the test data
Thus, the train_test_split() function takes four parameters, the first two of which are the arrays of data. The test_size parameter specifies the size of the test set. Its value may be 0.5, 0.3, or 0.2 – this sets the dividing ratio between the training and test sets. The last parameter, “random_state”, sets the seed for the random number generator so that the output is always the same.
8. Dealing with Imbalanced Datasets in Machine Learning
In many real-world scenarios, datasets are imbalanced, meaning that one class has significantly fewer examples than another. Imbalanced datasets can lead to biased models that perform well on the majority class but struggle with the minority class.
Dealing with imbalanced datasets involves various strategies:
- Resampling
Oversampling the minority class (creating duplicates) or undersampling the majority class (removing instances) can balance the class distribution. However, these methods come with potential risks like overfitting (oversampling) or loss of information (undersampling). A minimal oversampling sketch follows this list.
- Synthetic Data Generation
Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic samples by interpolating between existing instances of the minority class.
- Cost-Sensitive Learning
Assigning different misclassification costs to the different classes during model training encourages the model to focus on correctly classifying the minority class.
- Ensemble Methods
Ensemble techniques like Random Forest or Gradient Boosting can handle imbalanced data by combining multiple models to perform better on both classes.
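As a minimal resampling sketch (assuming a small hypothetical DataFrame with a Purchased label), random oversampling of the minority class with scikit-learn's resample utility might look like this:
import pandas as pd
from sklearn.utils import resample
df = pd.DataFrame({'Age': [38, 43, 30, 48, 40, 35],
'Purchased': ['No', 'No', 'No', 'No', 'No', 'Yes']})   # imbalanced label
majority = df[df['Purchased'] == 'No']
minority = df[df['Purchased'] == 'Yes']
# oversample the minority class with replacement up to the majority-class size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])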
9. Feature scaling
Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds.
Consider the dataset used above as an example. The age and salary columns do not have the same scale. In such a scenario, if you compute any two values from the age and salary columns, the salary values will dominate the age values and deliver incorrect results. Thus, you must remove this issue by performing feature scaling for Machine Learning.
Many ML models are based on Euclidean distance, which for two points (x1, y1) and (x2, y2) is represented as distance = √((x2 − x1)² + (y2 − y1)²). A feature with a much larger numeric range dominates this distance, which is why scaling matters.
You can perform feature scaling in Machine Learning in two ways:
- Standardization
- Normalization
For our dataset, we will use the standardization method. To do so, we will import the StandardScaler class of the scikit-learn library using the following line of code:
from sklearn.preprocessing import StandardScaler
The next step will be to create the object of StandardScaler class for independent variables. After this, you can fit and transform the training dataset using the following code:
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, you can directly apply the transform() function (you need not use fit_transform() because the scaler has already been fitted on the training set). The code will be as follows –
x_test= st_x.transform(x_test)
After execution, x_train and x_test will hold the scaled values. All the variables in the output are now on a common, small scale centred around zero, so they can be compared on equal footing.
Now, to combine all the steps we’ve performed so far, you get:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
#Extracting Independent Variable
x= data_set.iloc[:, :-1].values
#Extracting Dependent variable
y= data_set.iloc[:, 3].values
#handling missing data (Replacing missing data with the mean value)
#(SimpleImputer is the current replacement for the older Imputer class)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables (ColumnTransformer one-hot encodes column 0)
onehot_encoder= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= onehot_encoder.fit_transform(x).astype(float)
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
10. Feature Engineering for Improved Model Performance
Feature engineering involves creating new features from existing ones to improve the performance of machine learning models. It aims to enhance the predictive power of models by providing them with more relevant and informative input variables.
Common techniques in feature engineering include:
- Feature Scaling: Scaling features to a similar range can improve the convergence and performance of algorithms sensitive to input variables’ scale.
- Feature Extraction: Techniques like Principal Component Analysis (PCA) can reduce the dimensionality of datasets while retaining most of the original information.
- One-Hot Encoding: Converting categorical variables into binary indicators (0s and 1s) to ensure compatibility with algorithms that require numerical input.
- Polynomial Features: Generating higher-degree polynomial features can capture non-linear relationships between variables.
- Domain-Specific Features: Incorporating domain knowledge to create features more relevant to the problem at hand.
Effective feature engineering requires a deep understanding of the dataset and the problem domain and iterative experimentation to identify which engineered features lead to improved model performance.
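To make the polynomial-features idea above concrete, here is a minimal sketch using scikit-learn's PolynomialFeatures on two hypothetical numeric features:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[2.0, 3.0], [4.0, 5.0]])            # two hypothetical numeric features
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)                    # columns: 1, x1, x2, x1^2, x1*x2, x2^2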
Best Practices For Data Preprocessing In Machine Learning
An overview of the best data preprocessing practices is outlined here:
- Knowing your data is among the initial steps in data preprocessing.
- You can get a sense of what needs to be your main emphasis by simply glancing through your dataset.
- Run a data quality assessment to determine the number of duplicates, the proportion of missing values, and outliers in the data.
- Utilise statistical techniques or ready-made tools to assist you in visualising the dataset and provide a clear representation of how your data appears with reference to class distribution.
- Eliminate any fields you believe will not be used in modelling or that are closely related to (redundant with) other attributes.
- Dimensionality reduction is a crucial component of data preprocessing. Remove the fields that don’t make intuitive sense. Reduce the dimension by using dimension reduction and feature selection techniques.
- Do some feature engineering to determine which characteristics affect model training most.
So, that’s data processing in Machine Learning in a nutshell!
What is the importance of data preprocessing?
Because errors, redundancies, missing values, and inconsistencies all jeopardize the dataset's integrity, you must address all of them for a more accurate result. Assume you're using a defective dataset to train a Machine Learning system to deal with your clients' purchases. The system is likely to generate biases and deviations, resulting in a bad user experience. As a result, before you use that data for your intended purpose, it must be as organized and 'clean' as feasible. Depending on the type of difficulty you're dealing with, there are numerous options.
What is data cleaning?
There will almost certainly be missing and noisy data in your data sets. Because the data collection procedure isn't ideal, you'll have a lot of useless and missing information. Data cleaning is the method you should employ to deal with this problem. It can be divided into two categories. The first one deals with missing data: you can choose to ignore the records (tuples) that have missing values. The second data cleaning method is for noisy data: it's critical to get rid of useless data that the systems can't read if you want the entire process to run smoothly.
What do you mean by data transformation and reduction?
Data preprocessing moves on to the transformation stage after dealing with those concerns. You use it to convert data into forms suitable for analysis. Normalization, attribute selection, discretization, and concept hierarchy generation are some of the approaches that can be used to accomplish this. Even for automated methods, sifting through large datasets can take a long time. That is why the data reduction stage is so crucial: it reduces the size of data sets by limiting them to the most important information, increasing storage efficiency while lowering the financial and time costs of working with them.