
Random Forest Hyperparameter Tuning: Processes Explained with Coding

Random Forest is a machine learning algorithm that uses decision trees as its base learners. It is flexible and easy to use, which is why it is so widely applied, and it often gives good results on classification tasks even without much hyperparameter tuning. One of its most important features is that it handles both regression and classification: in regression the target is a continuous variable, while in classification the target is categorical. Because the trees are built independently of each other, training parallelises well, and a forest usually predicts more accurately than a single decision tree, even when a large dataset is involved. Keep reading to learn more about random forests and about hyperparameter tuning of a random forest classifier in Python.

In this article, we focus mainly on how Random Forest works and on the hyperparameters that can be controlled for optimal results. We also look at why hyperparameter tuning of a random forest classifier in Python matters, and at the advantages and disadvantages of random forests. The need for hyperparameter tuning arises because every dataset has its own characteristics.


These characteristics include the types of variables, the size of the data, whether the target variable is binary or multiclass, the number of categories in categorical variables, the standard deviation of numerical data, how close the data is to a normal distribution, and so on. Hence tuning the model to the data is imperative for maximizing its performance.

Construction and Working

A Random Forest works as a large collection of decorrelated decision trees. It is a bagging technique. Bagging is a form of ensemble learning based on the idea that many noisy but approximately unbiased models can be averaged to produce a model with low variance. Let us understand how a Random Forest is constructed.
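Why decorrelation matters can be illustrated with a small simulation. The sketch below is not part of the random forest algorithm itself. It assumes each "model" has an error with the same variance and that every pair of models shares the same correlation rho, in which case the variance of their average is the standard result rho*sigma^2 + (1 - rho)*sigma^2 / B for B models. The function names are made up for this illustration.

```python
import numpy as np

def theoretical_variance(n_models, rho, sigma=1.0):
    # Variance of the average of n_models equally correlated errors.
    return rho * sigma**2 + (1 - rho) * sigma**2 / n_models

def simulated_variance(n_models, rho, sigma=1.0, n_trials=20000, seed=0):
    rng = np.random.default_rng(seed)
    # A shared component creates the correlation; an individual component adds independent noise.
    shared = rng.normal(size=(n_trials, 1))
    own = rng.normal(size=(n_trials, n_models))
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    # Average the models in each trial, then measure the variance of that average.
    return errors.mean(axis=1).var()

for rho in (0.0, 0.3, 0.9):
    print(f"rho={rho}: theory={theoretical_variance(50, rho):.3f}  "
          f"simulated={simulated_variance(50, rho):.3f}")
```

With uncorrelated models the variance keeps shrinking as models are added, while with highly correlated models it levels off near rho*sigma^2. This is the motivation for making the trees in a forest as different from each other as possible.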

Let S be the matrix of data used for random forest classification. It contains N instances, and A, B, C are the features of the data. From this data, random subsets are created, and a decision tree is built on each one. As the figure below shows, one decision tree is created per subset of data, and the number of decision trees can be increased with the size of the data.


The outputs of all the trained decision trees are put to a vote, and the class with the majority of votes is the output of the Random Forest. The need for Random Forest arises because individual decision trees tend to overfit the data: they may have low bias, but they mostly have high variance. Random Forest is used to reduce this variance error on the test set.
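The construction described above can be sketched by hand: draw random subsets of the rows, fit one decision tree per subset, and let the trees vote. This is a minimal illustration on synthetic data (an assumption, not the article's dataset), and it uses hard majority voting as described in the text. As far as I know, scikit-learn's own RandomForestClassifier averages the trees' predicted class probabilities instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the data matrix S (binary target with labels 0 and 1).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap subset: draw N row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Every tree votes; the majority class is the forest's output.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_tree_acc = (trees[0].predict(X_test) == y_test).mean()
forest_acc = (forest_pred == y_test).mean()
print(f"single tree accuracy: {single_tree_acc:.3f}")
print(f"forest accuracy:      {forest_acc:.3f}")
```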

Hyperparameters

There are various hyperparameters that can be controlled in a random forest:

  1. n_estimators: The number of decision trees built in the forest. The default in sklearn is 100. The right value mostly depends on the size of the data: more trees are needed to capture the trends in larger data.
  2. criterion: The function used to measure the quality of a split in a decision tree. For classification, the supported criteria include gini (Gini impurity) and entropy (information gain). For regression, Mean Squared Error (MSE) or Mean Absolute Error (MAE) can be used. The defaults are gini and MSE; recent sklearn versions name the regression criteria squared_error and absolute_error.
  3. max_depth: The maximum number of levels allowed in a decision tree. If set to None, the tree keeps splitting until every leaf is pure or holds fewer samples than min_samples_split.
  4. max_features: The maximum number of features considered when splitting a node. Options include sqrt and log2: if the total number of features is n_features, then sqrt(n_features) or log2(n_features) features are considered at each split.
  5. bootstrap: If True, each decision tree is built on a bootstrap sample of the data; otherwise the whole dataset is used for every tree.
  6. min_samples_split: The minimum number of samples required to split an internal node. The default is 2. The problem with such a small value is that any node holding at least 2 samples can still be split, so the tree keeps growing until it fits noise. With a more lenient value like 6, splitting stops earlier and the decision tree is less likely to overfit the data.
  7. min_samples_leaf: The minimum number of samples that must end up in a leaf (terminal) node. It helps control the depth of the tree. If a split would leave fewer than min_samples_leaf samples in either child node, that split is rejected.
  8. max_leaf_nodes: Sets an upper limit on the number of leaf nodes in each tree, which automatically restricts the growth of the tree.
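As a minimal sketch of how these hyperparameters are passed in scikit-learn, the snippet below sets each of the eight on a RandomForestClassifier and prints them back next to the library defaults. The specific values are arbitrary illustrations, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    criterion="gini",       # split quality measure
    max_depth=10,           # maximum levels per tree
    max_features="sqrt",    # features considered at each split
    bootstrap=True,         # bootstrap sample per tree
    min_samples_split=6,    # samples needed to split a node
    min_samples_leaf=2,     # samples required in each leaf
    max_leaf_nodes=50,      # upper limit on leaves per tree
)

tuned = rf.get_params()
defaults = RandomForestClassifier().get_params()
names = ("n_estimators", "criterion", "max_depth", "max_features", "bootstrap",
         "min_samples_split", "min_samples_leaf", "max_leaf_nodes")
for name in names:
    print(f"{name:>18}: set to {tuned[name]!r} (default {defaults[name]!r})")
```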


There are other less important parameters that can also be considered during the hyperparameter tuning process.

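n_jobs: the number of CPU cores used in parallel for training and prediction (-1 uses all available cores).

max_samples: the number (or fraction) of samples drawn from the data to train each decision tree when bootstrap is True.

random_state: the seed for the randomness in the model. A model fitted on the same data with the same random_state produces the same outputs, which makes results reproducible.

class_weight: a dictionary of weights per class (or the string "balanced"), used to handle imbalanced data sets.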
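A short sketch of these four parameters in use, on synthetic imbalanced data (an assumption made for the example). It fits the same configuration twice with the same random_state to show that the outputs match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic data: roughly 90% of samples in class 0, 10% in class 1.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

def fit_forest(seed):
    return RandomForestClassifier(
        n_estimators=50,
        n_jobs=-1,                # use all available cores
        max_samples=0.8,          # each tree sees 80% of the rows (needs bootstrap=True)
        class_weight="balanced",  # weight classes inversely to their frequency
        random_state=seed,        # fixed seed for reproducibility
    ).fit(X, y)

a, b = fit_forest(42), fit_forest(42)
same = np.allclose(a.predict_proba(X), b.predict_proba(X))
print("same outputs with the same random_state:", same)
```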


Advantages and Disadvantages Of Random Forest Classifiers

Mentioned below are some of the strengths and weaknesses of random forest classifiers.

Advantages

  • They work well across a wider range of data than a single decision tree.
  • They are very flexible and often deliver highly accurate results.
  • They have much less variance than a single decision tree.
  • Even in the face of disruptions, especially when large portions of the data go missing, random forests can still maintain good accuracy.

Disadvantages

  • Random forest models are more complex than single decision trees.
  • Constructing a random forest usually requires much more time and effort than constructing a decision tree.
  • They tend to be less intuitive to interpret, especially when a large collection of decision trees is involved.
  • Running the random forest algorithm usually requires considerable computational resources.

Importance of Hyperparameter Tuning For Random Forest

Before delving into the different kinds of processes available for hyperparameter tuning in Random Forest, let’s take a look at the importance of hyperparameter tuning for random forest first. 

Hyperparameter tuning is essential for the overall performance of a random forest model. Hyperparameters are set before the learning process begins and sit outside the model: they are not learned from the data. So what happens when a random forest is not tuned? The model may fit poorly and give less accurate results, because the default settings are not necessarily suited to the data and the loss is not minimized as far as it could be. The ultimate goal of hyperparameter tuning is to find the set of hyperparameter values that maximizes the model's performance, minimizing the loss and producing better output.

Now that you have understood why hyperparameter tuning matters for a random forest classifier, let's take a closer look at the different processes available for it.

Hyperparameter Tuning Processes

There are various ways of performing hyperparameter tuning. After a base model has been created and evaluated, its hyperparameters can be tuned to improve specific metrics such as accuracy or F1 score.

One must check for overfitting and for bias and variance errors before and after the adjustments. The model should be tuned according to the real-world requirement. An overfitting model can be very sensitive to fluctuations in the validation data, so the cross-validation scores and their standard deviation should be checked for possible overfitting before and after model tuning.
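One way to run the check described above is with scikit-learn's cross_validate, comparing the training score with the cross-validation score and looking at the spread across folds. The sketch below uses synthetic regression data and an untuned base model; both are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)

base_model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = cross_validate(base_model, X, y, cv=5, scoring="r2", return_train_score=True)

train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()
test_std = cv["test_score"].std()

print(f"mean train R^2:      {train_mean:.3f}")
print(f"mean validation R^2: {test_mean:.3f} (std {test_std:.3f})")
# A large gap between the two, or a large std across folds, hints at overfitting.
print(f"train/validation gap: {train_mean - test_mean:.3f}")
```

Running the same check again on the tuned model shows whether tuning narrowed the gap or only moved the headline score.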

The methods for tuning a Random Forest in Python are covered next.


Randomised Search CV

We can use scikit-learn's RandomizedSearchCV: we define a grid of candidate values, and the random forest model is fitted over and over with parameter combinations selected at random from the grid. Because not every combination is tried, we are not guaranteed to find the best parameters in the grid, but we do get the best model among those that were fitted and tested.

Source Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Create a search grid of parameters that will be sampled from at random
random_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # 'auto' was removed in scikit-learn 1.3; for a regressor it meant "all features" (1.0)
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

# Use the random grid to search for the best hyperparameters
rf = RandomForestRegressor()  # create the base model

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)

# train_features and train_labels are the training data, prepared beforehand
rf_random.fit(train_features, train_labels)  # fit starts the search

The randomised search samples 100 parameter combinations from the grid and scores each one with 5-fold cross-validation (500 model fits in total), then keeps the best-scoring combination as the best parameters.
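For readers who want to try the idea end to end, here is a scaled-down sketch that runs in seconds. The synthetic data, the smaller grid, and the reduced n_iter and cv values are all assumptions made to keep it quick; they are not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
train_features, test_features, train_labels, test_labels = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A reduced grid: 3 * 2 * 3 * 3 * 2 = 108 possible combinations.
random_grid = {
    "max_depth": [5, 10, None],
    "max_features": [1.0, "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 50],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=random_grid,
    n_iter=8,          # sample only 8 of the 108 combinations
    cv=3,              # 3-fold cross-validation for each
    random_state=42,
    n_jobs=-1,
)
search.fit(train_features, train_labels)

n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_
print("candidates tried:", n_candidates)
print("model fits during the search:", n_fits)
print("best parameters:", search.best_params_)
print(f"best mean CV R^2: {search.best_score_:.3f}")
print(f"R^2 on held-out data: {search.score(test_features, test_labels):.3f}")
```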


Grid Search CV

Grid search is used after randomised search to narrow down the search for the best hyperparameters. Now that we know where to focus, we can explicitly run a small range of values around the randomised search result through grid search, which evaluates every combination, to get the final value of each hyperparameter.

Source Code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(train_features, train_labels)

Results after execution:

grid_search.best_params_

{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 5,
 'min_samples_split': 12,
 'n_estimators': 100}

best_grid = grid_search.best_estimator_
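Two things are worth checking around a grid search: how many models it will fit, and whether the tuned model actually beats the base model on data it has not seen. The sketch below first counts the combinations in a grid with the same values as the one in this section, using scikit-learn's ParameterGrid. It then runs a much smaller grid on synthetic data (an assumption made so that it runs quickly) and compares the base and tuned models on a held-out test set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# Same values as the grid in this section: 1 * 4 * 2 * 3 * 3 * 4 combinations.
article_grid = {
    "bootstrap": [True],
    "max_depth": [80, 90, 100, 110],
    "max_features": [2, 3],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [8, 10, 12],
    "n_estimators": [100, 200, 300, 1000],
}
n_candidates = len(ParameterGrid(article_grid))
print("combinations:", n_candidates, "| fits with cv=3:", n_candidates * 3)

# Scaled-down search on synthetic data, then a comparison on held-out data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

small_grid = {"max_depth": [5, None], "min_samples_leaf": [1, 3], "n_estimators": [30, 60]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

base_model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
best_grid = grid_search.best_estimator_
print("best parameters:", grid_search.best_params_)
print(f"base model R^2 on the test set:  {base_model.score(X_test, y_test):.3f}")
print(f"tuned model R^2 on the test set: {best_grid.score(X_test, y_test):.3f}")
```

The tuned model is not guaranteed to win on the test set. If it does not, the grid may be centred on the wrong region, or the base model may already be close to what the data allows.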


Conclusion

We went through the working of a random forest model and how each hyperparameter alters the decision trees and hence the random forest model as a whole. We also looked at an efficient technique that combines randomised search and grid search to arrive at the best parameters for our model. Hyperparameter tuning is very important because it helps us control the bias and variance of our model.


Which hyperparameters can be tuned in random forest?

In random forest, the main hyperparameters are the number of trees (n_estimators), the number of features considered at each split (max_features), and the parameters that limit tree growth (max_depth, min_samples_split, min_samples_leaf and max_leaf_nodes). The number of features is important and should be tuned; random forest does not tune it automatically. Adding more trees generally does not hurt accuracy, but beyond a certain point the gains flatten out while training time keeps growing. Generally speaking, all of these are tuned according to the data.

How do you optimize a Random Forest model?

The two main things to get right in the Random Forest algorithm (and other decision tree ensembles) are the selection of features and the tree structure. Regarding tree structure, you will have to experiment with the number of trees and the number of features used at each split. Most importantly, you need to find the sweet spot where your model is accurate enough and does not overfit.

What is Random Forest in machine learning?

Random forests are an ensemble of decision trees. They are powerful and flexible models which can be used in many different ways. In fact, random forests have become very popular over the last decade. The model is used in many different fields (biology, marketing, finance, text mining etc.). It has been used in major competitions and has produced state-of-the-art results. The most common use of random forests is to classify (or label) data. But, they can also be used to regress continuous values (estimate a value) and to cluster similar data points.
