Suppose you’ve built a machine learning program and trained it with a random forest model. However, the output of the program is not as accurate as you want it to be. So what do you do?
There are three ways to improve the output of a machine learning program:
- Improve the input data quality and feature engineering
- Hyperparameter tuning of the algorithm
- Using different algorithms
But what if you have already used all the data sources available? The next logical step is hyperparameter tuning. Thus, if you have created a machine learning program with a random forest model, used the best data source, and want to improve the output of the program further, you should opt for random forest hyperparameter tuning.
Before we delve into random forest hyperparameter tuning, let’s first have a look at hyperparameters and hyperparameter tuning in general.
What are Hyperparameters?
In the context of machine learning, hyperparameters are parameters whose values control the learning process of the model. They are external to the model: they are set before training begins, and their values cannot be estimated from the data.
In a random forest, hyperparameters include the number of decision trees and the number of features each tree considers when splitting a node.
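To make the distinction concrete, here is a minimal sketch, assuming scikit-learn's RandomForestClassifier and a synthetic dataset generated purely for illustration. The specific values (25 trees, 3 features) are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters: chosen by us when the model object is created.
forest = RandomForestClassifier(n_estimators=25, max_features=3, random_state=0)
chosen = forest.get_params()
print("set before training:", chosen["n_estimators"], chosen["max_features"])

# Learned state: the fitted trees only exist after fit() has seen the data.
fitted_before = hasattr(forest, "estimators_")
forest.fit(X, y)
fitted_after = hasattr(forest, "estimators_")
print("trees exist before fit:", fitted_before, "| after fit:", fitted_after)
```

The hyperparameters can be read back before any training happens, while the trees themselves are produced from the data.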
What is Hyperparameter Tuning?
Hyperparameter tuning is the process of searching for an ideal set of hyperparameters for a machine learning problem.
Now that we have seen what hyperparameters and hyperparameter tuning are, let us have a look at the hyperparameters of a random forest and how to tune them.
What is Random Forest Hyperparameter Tuning?
To understand what random forest hyperparameter tuning is, we will have a look at five hyperparameters and how each one is tuned.
Hyperparameter 1: max_depth
max_depth is the maximum allowed length of the path between the root node and a leaf node in each tree of the random forest. By tuning this hyperparameter, we can limit how deep each tree is allowed to grow. It restricts the growth of the decision tree at a macro level, capping the whole tree rather than setting a condition on individual nodes.
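As a rough illustration, the sketch below fits one forest without a depth limit and one with max_depth set to 3, then reports the deepest tree in each. It assumes scikit-learn and a synthetic dataset; the limit of 3 is an arbitrary value picked for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# max_depth=None lets every tree grow until its leaves cannot be split further.
unlimited = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0).fit(X, y)
# max_depth=3 caps every tree at three levels below the root.
limited = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

deepest_unlimited = max(tree.get_depth() for tree in unlimited.estimators_)
deepest_limited = max(tree.get_depth() for tree in limited.estimators_)
print("deepest tree without a limit:", deepest_unlimited)
print("deepest tree with max_depth=3:", deepest_limited)
```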
Hyperparameter 2: max_terminal_nodes
This hyperparameter restricts the growth of a decision tree in the random forest by setting a condition on the splitting of nodes in the tree. Splitting stops, and the tree ceases to grow, once a further split would leave the tree with more terminal nodes than the specified number. In scikit-learn, this hyperparameter is named max_leaf_nodes.
For instance, suppose that we have a single node in the tree, and the maximum number of terminal nodes is set to four. Since there is only one node to begin with, the node will be split, and the tree will grow further. Once the tree has four terminal nodes, splitting is terminated and the tree does not grow further. Tuning max_terminal_nodes helps prevent overfitting. However, if the value is very small, the forest is likely to underfit.
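The four-terminal-node example can be sketched as follows. This assumes scikit-learn, where the corresponding argument is called max_leaf_nodes, and uses a synthetic dataset for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Allow at most four terminal (leaf) nodes per tree.
capped = RandomForestClassifier(n_estimators=20, max_leaf_nodes=4, random_state=0).fit(X, y)

# Count the leaves of every tree in the forest.
leaf_counts = [tree.get_n_leaves() for tree in capped.estimators_]
print("largest number of leaves in any tree:", max(leaf_counts))
```

Comparing the training and validation accuracy of such a heavily capped forest with an uncapped one is a simple way to check whether the cap is causing underfitting.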
Hyperparameter 3: n_estimators
A data scientist always faces the dilemma of how many decision trees to use. One may say that choosing more trees is the way to go. This may hold true, but it also increases the training and prediction time of the random forest algorithm.
With the n_estimators hyperparameter, we can decide the number of trees in the random forest model. In scikit-learn, the default value of n_estimators was ten in older releases and has been 100 since version 0.22. This means that, unless you specify otherwise, that many decision trees are constructed. By tuning this hyperparameter, we can change the number of trees that will be constructed.
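The trade-off between the number of trees and the training time can be observed with a short experiment. The sketch below assumes scikit-learn and a synthetic dataset; the tree counts tried are arbitrary, and the timings will differ from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The default depends on the installed scikit-learn version.
print("default n_estimators here:", RandomForestClassifier().n_estimators)

tree_counts = {}
for n_trees in (10, 50, 100):
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    elapsed = time.perf_counter() - start
    # One fitted decision tree is stored per estimator.
    tree_counts[n_trees] = len(forest.estimators_)
    print(f"n_estimators={n_trees}: {tree_counts[n_trees]} trees, fit in {elapsed:.3f}s")
```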
Hyperparameter 4: max_features
With this hyperparameter, we can decide the maximum number of features each tree in the forest considers when looking for the best split at a node. No single value is best for every problem: a setting such as six may give the highest performance on one dataset and not on another, so the value should be chosen by evaluating the model. You can also leave max_features at its default, which for a scikit-learn random forest classifier is the square root of the number of features present in the dataset.
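To see how different max_features settings behave, here is a small sketch that assumes scikit-learn and a synthetic dataset with 16 features. It compares the square-root setting, the value six, and None (which in scikit-learn means all features); the cross-validated scores it prints apply to this toy dataset only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with 16 features, used only for illustration.
X, y = make_classification(n_samples=300, n_features=16, n_informative=6, random_state=0)

per_split = {}
for setting in ("sqrt", 6, None):
    forest = RandomForestClassifier(n_estimators=10, max_features=setting, random_state=0)
    # Mean accuracy over 3 cross-validation folds.
    score = cross_val_score(forest, X, y, cv=3).mean()
    forest.fit(X, y)
    # max_features_ is the resolved number of features tried at each split.
    per_split[setting] = forest.estimators_[0].max_features_
    print(f"max_features={setting!r}: {per_split[setting]} features per split, accuracy {score:.3f}")
```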
Hyperparameter 5: min_samples_split
This hyperparameter decides the minimum number of samples required to split an internal node. By default, the value of this parameter is two. It means that to split an internal node, there must be at least two samples present in it. Larger values stop splitting earlier and therefore produce smaller trees.
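The effect of this hyperparameter can be checked by inspecting the fitted trees. The sketch below assumes scikit-learn and a synthetic dataset, and uses 50 as an arbitrary threshold; it reads the low-level tree_ attribute, where a children_left value of -1 marks a leaf.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

default_forest = RandomForestClassifier(n_estimators=10, min_samples_split=2, random_state=0).fit(X, y)
strict_forest = RandomForestClassifier(n_estimators=10, min_samples_split=50, random_state=0).fit(X, y)

# Total node count across the forest, to compare tree sizes.
nodes_default = sum(tree.tree_.node_count for tree in default_forest.estimators_)
nodes_strict = sum(tree.tree_.node_count for tree in strict_forest.estimators_)
print("total nodes with min_samples_split=2:", nodes_default)
print("total nodes with min_samples_split=50:", nodes_strict)

# Smallest number of samples found in any node that was actually split.
smallest_split_node = min(
    int(tree.tree_.n_node_samples[tree.tree_.children_left != -1].min())
    for tree in strict_forest.estimators_
)
print("smallest split node holds", smallest_split_node, "samples")
```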
How To Do Random Forest Hyperparameter Tuning?
You carry out random forest hyperparameter tuning by passing the values you want to the function that creates the model. Random forest hyperparameter tuning is more of an experimental approach than a theoretical one. Thus, you may need to try out different combinations of hyperparameter values and evaluate the performance of each before deciding on one.
For example, suppose you have to tune the number of estimators and the minimum split of a tree in a random forest algorithm. You can use the following command to set these hyperparameters:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=1, n_estimators=20, min_samples_split=2)
In the above example, the number of estimators is set to twenty instead of the library default (ten in older scikit-learn releases, 100 since version 0.22). Thus, the algorithm will create twenty trees in the random forest. Similarly, min_samples_split is set to two, so an internal node will be split only if it has at least two samples.
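Trying out several combinations and evaluating each one can be sketched as a simple loop. This is a minimal example under stated assumptions: it uses scikit-learn, a synthetic dataset, 5-fold cross-validated accuracy as the measure of performance, and an arbitrary set of candidate values. None of these choices comes from a particular project.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Candidate values to try (arbitrary choices for this sketch).
candidates = {"n_estimators": [10, 20, 50], "min_samples_split": [2, 10]}

results = {}
for n_trees, min_split in product(candidates["n_estimators"], candidates["min_samples_split"]):
    forest = RandomForestClassifier(
        random_state=1, n_estimators=n_trees, min_samples_split=min_split
    )
    # Mean accuracy over 5 cross-validation folds.
    results[(n_trees, min_split)] = cross_val_score(forest, X, y, cv=5).mean()

# Pick the combination with the highest mean accuracy.
best_combo = max(results, key=results.get)
print("best (n_estimators, min_samples_split):", best_combo)
print("mean accuracy:", round(results[best_combo], 3))
```

scikit-learn's GridSearchCV class automates this kind of loop if you prefer not to write it by hand.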
We hope that this blog helped you understand random forest hyperparameter tuning. There are many other hyperparameters that you can tune to improve the output of the machine learning program. In many instances, hyperparameter tuning is enough to bring the output of the program to the accuracy you need.
However, in rare cases, even random forest hyperparameter tuning might not prove helpful. In such situations, you will need to consider a different machine learning algorithm such as linear or logistic regression, KNN, or any other algorithm that you deem fit.
Why use the random forest algorithm?
The random forest algorithm is one of the most widely used supervised learning algorithms in machine learning. It can solve both classification and regression problems. It is based on ensemble learning, the concept of combining several models to solve a complicated problem so that the overall performance is better than that of any single model. The random forest algorithm is popular because its trees can be trained independently of one another, which often keeps training time reasonable even on large datasets. It can also offer highly accurate predictions for massive sets of data; whether it can cope with missing values depends on the implementation you use.
What is the difference between a decision tree and a random forest?
A decision tree algorithm is a supervised learning technique in machine learning which models a single tree constituting a series of subsequent decisions that lead to a specific outcome. A decision tree is simple to interpret and understand. But it is often inadequate for solving more complex problems. This is where the random forest algorithm becomes useful – it leverages several decision trees to resolve specific problems. In other words, the random forest algorithm randomly generates multiple decision trees and combines their results to produce the final outcome. Although the random forest is more difficult to interpret than the decision tree, it produces accurate results when massive volumes of data are involved.
What are the advantages of using a random forest algorithm?
The greatest advantage of using the random forest algorithm lies in its flexibility. You can use this technique for both classification and regression tasks. Apart from its versatility, this algorithm is also convenient: its default parameters often produce good accuracy without any tuning. Moreover, machine learning classification models are well known for problems like overfitting. Because a random forest averages the predictions of many trees, it is much less prone to overfitting than a single decision tree, provided the forest contains enough trees.