Optimization is one of the most important parts of Machine Learning. Almost every Machine Learning algorithm has an optimization algorithm at its core, which searches for the parameter values that make the model perform best. More broadly, getting the best possible result is a goal in many settings, from real-life events to technology-based products in the market.
Many optimization algorithms are in use today in applications such as face recognition, self-driving cars, and market-based analysis. One of the most widely used in Machine Learning is the Gradient Descent algorithm, which we shall go through in this article.
What is Gradient Descent?
In Machine Learning, Gradient Descent is one of the most widely used algorithms, and yet it confuses many newcomers. Mathematically, Gradient Descent is a first-order iterative optimization algorithm that is used to find a local minimum of a differentiable function. In simple terms, it is used to find the values of a function’s parameters (or coefficients) that make a cost function as small as possible. The cost function quantifies the error between the values predicted by a Machine Learning model and the real values.
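As an illustration of what a cost function can look like, here is a minimal sketch of the mean squared error, one common choice. The article does not name a specific cost function, so this choice, the function name `mean_squared_error`, and the sample values are all assumptions made for the example.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between real and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values, for illustration only.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
cost = mean_squared_error(actual, predicted)
print(cost)
```

The closer the predictions are to the real values, the smaller this number becomes, and it is exactly 0 only when every prediction is correct.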
Gradient Descent Intuition
Consider a large bowl, like one you would normally keep fruit in or eat cereal from. This bowl will be the cost function (f).
Now, a random position on the surface of the bowl represents the current values of the coefficients, and the height of that position is the cost for those coefficients. The bottom of the bowl is the best set of coefficients, and it is the minimum of the function.
Here, the goal is to try different values of the coefficients with each iteration, evaluate the cost, and keep the coefficients that give a better (lower) cost. After enough iterations, the coefficients reach the bottom of the bowl, where the cost function is at its minimum.
In this way, the Gradient Descent algorithm works its way toward the minimum cost.
Gradient Descent Procedure
Gradient descent begins by assigning initial values to the coefficients of the cost function. These could be 0.0 or small random values.
coefficient = 0.0
Next, the cost for the current coefficient is calculated by evaluating the cost function at that value.
cost = f(coefficient)
Then, the derivative of the cost function is calculated at the current coefficient. The derivative, a concept from differential calculus, gives the slope of the function at a given point. The sign of this slope tells us in which direction the coefficient should move in the next iteration to get a lower cost: if the slope is positive the coefficient should decrease, and if it is negative the coefficient should increase.
delta = derivative(f, coefficient)
Once the derivative tells us which direction is downhill, we update the coefficient values. For this, a parameter known as the learning rate, alpha (α), is used. It controls how much the coefficients can change with every update.
coefficient = coefficient - (alpha * delta)
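This process is repeated until the cost stops improving, which in practice means that the derivative is close to zero or that a fixed number of iterations has been reached. The cost at the minimum may be 0.0, but it does not have to be. This is the procedure for the gradient descent algorithm.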
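The steps of the procedure can be put together in a short, runnable sketch. The cost function f(c) = (c - 3)², the stopping tolerance, and the iteration limit are assumptions chosen for illustration; they do not come from the article.

```python
def f(coefficient):
    """Illustrative cost function (an assumption), smallest at coefficient = 3."""
    return (coefficient - 3.0) ** 2


def derivative(coefficient):
    """Slope of f at the given coefficient."""
    return 2.0 * (coefficient - 3.0)


def gradient_descent(alpha=0.1, tolerance=1e-8, max_iterations=1000):
    coefficient = 0.0  # initial value
    for _ in range(max_iterations):
        delta = derivative(coefficient)
        if abs(delta) < tolerance:
            break  # the slope is nearly flat, so we are at (or very near) a minimum
        coefficient = coefficient - (alpha * delta)
    return coefficient, f(coefficient)


coefficient, cost = gradient_descent()
print(coefficient, cost)
```

With these settings the coefficient settles very close to 3, where the slope of this particular cost function is zero.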
Types of Gradient Descent Algorithms
Three basic types of Gradient Descent are used in modern machine learning and deep learning algorithms. They differ mainly in how much data is used to compute each update, which in turn affects their computational cost, time complexity, and accuracy. The three types are:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini Batch Gradient Descent
Batch Gradient Descent
This is the most basic version of the Gradient Descent algorithm, in which the entire dataset is used at once to compute the cost function and its gradient. As the entire dataset is used for a single update, calculating the gradient can be very slow, and it is impractical for datasets that do not fit in the device’s memory.
Thus, Batch Gradient Descent is generally used only for smaller datasets. When the number of training examples is large, the Stochastic and Mini Batch Gradient Descent algorithms are preferred instead.
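A minimal sketch of Batch Gradient Descent follows. It assumes a one-coefficient model y = w * x, a mean squared error cost, and a small made-up dataset; none of these come from the article.

```python
# Hypothetical training data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def batch_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x over the whole dataset."""
    m = len(xs)
    return (2.0 / m) * sum((w * x - y) * x for x, y in zip(xs, ys))


def batch_gradient_descent(xs, ys, alpha=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # One update per pass: every training example contributes to it.
        w = w - alpha * batch_gradient(w, xs, ys)
    return w


w_batch = batch_gradient_descent(xs, ys)
print(w_batch)
```

Each update here needs a full pass over `xs` and `ys`, which is why the cost per update grows with the size of the dataset.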
Stochastic Gradient Descent
This is another type of gradient descent algorithm, in which only one training example is processed per iteration. The first step is to shuffle the entire training dataset. Then, a single training example is used for each update of the coefficients. This is in contrast to Batch Gradient Descent, in which the parameters (coefficients) are updated only after all the training examples have been evaluated.
Stochastic Gradient Descent (SGD) has the advantage that these frequent updates give a detailed picture of the rate of improvement. However, because each update is based on only one example, the updates are noisy and the number of iterations can become very large, which in certain cases makes it computationally expensive.
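The per-example update can be sketched as follows. As an assumption for illustration, the model is y = w * x with a squared error cost and a small made-up dataset; the fixed `seed` is only there to make the shuffling repeatable.

```python
import random

# Hypothetical training data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def stochastic_gradient_descent(xs, ys, alpha=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # randomize the order of the training examples
        for i in indices:
            # Gradient of the squared error of a single example.
            delta = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w = w - alpha * delta  # one update per example
    return w


w_sgd = stochastic_gradient_descent(xs, ys)
print(w_sgd)
```

Note that the coefficient is updated inside the inner loop, once for every training example, rather than once per pass over the data.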
Mini Batch Gradient Descent
This algorithm is often faster in practice than both the Batch and Stochastic Gradient Descent algorithms. It is usually preferred because it combines the two: it separates the training set into several mini-batches and performs an update for each of these batches after calculating the gradient of that batch (like in SGD, but with more than one example per update).
Commonly, the batch size varies between 30 and 500, but there is no fixed size, as it varies between applications. Hence, even if the training dataset is huge, the algorithm processes it in mini-batches of size ‘b’. Thus, it is suitable for large datasets and needs fewer iterations than SGD.
If ‘m’ is the number of training examples, then with b == m Mini Batch Gradient Descent is the same as Batch Gradient Descent, and with b == 1 it is the same as Stochastic Gradient Descent.
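A minimal sketch of the mini-batch update is shown below, again assuming a one-coefficient model y = w * x, a mean squared error cost, and a made-up dataset. The parameter `batch_size` plays the role of b.

```python
import random

# Hypothetical training data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mini_batch_gradient_descent(xs, ys, batch_size, alpha=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient over the examples in this mini-batch only.
            delta = (2.0 / len(batch)) * sum((w * xs[i] - ys[i]) * xs[i] for i in batch)
            w = w - alpha * delta  # one update per mini-batch
    return w


w_mini = mini_batch_gradient_descent(xs, ys, batch_size=2)
w_full = mini_batch_gradient_descent(xs, ys, batch_size=len(xs))  # b == m
w_single = mini_batch_gradient_descent(xs, ys, batch_size=1)      # b == 1
print(w_mini, w_full, w_single)
```

Setting `batch_size` to the dataset size gives one update per pass, as in Batch Gradient Descent, while a size of 1 gives one update per example, as in SGD.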
Variants of Gradient Descent in Machine Learning
Several other algorithms have been developed on the basis of Gradient Descent. A few of them are summarized below.
Vanilla Gradient Descent
This is the simplest form of the Gradient Descent technique. The name vanilla means pure, or without any additions. Small steps are taken in the direction of the minimum by calculating the gradient of the cost function. As in the procedure described above, the update rule is given by,
coefficient = coefficient - (alpha * delta)
Gradient Descent with Momentum
In this case, the algorithm takes the previous step into account before taking the next one. This is done by introducing a new term, which is the product of the previous change applied to the coefficient and a constant known as the momentum. The weight update rule is given by,
update = alpha * delta
velocity = previous_change * momentum
change = velocity - update
coefficient = coefficient + change
Here previous_change is the value of change from the previous iteration (0.0 for the first iteration).
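The momentum update can be sketched as a runnable loop. The cost function f(c) = (c - 3)², the variable names `change` and `previous_change`, and the values of `alpha` and `momentum` are assumptions for illustration; `previous_change` holds the full change applied to the coefficient in the previous iteration.

```python
def derivative(coefficient):
    """Derivative of the illustrative cost f(c) = (c - 3) ** 2."""
    return 2.0 * (coefficient - 3.0)


def gradient_descent_with_momentum(alpha=0.1, momentum=0.9, iterations=500):
    coefficient = 0.0
    previous_change = 0.0  # no movement before the first step
    for _ in range(iterations):
        delta = derivative(coefficient)
        # Carry over part of the previous change, then step downhill.
        change = momentum * previous_change - alpha * delta
        coefficient = coefficient + change
        previous_change = change
    return coefficient


coefficient = gradient_descent_with_momentum()
print(coefficient)
```

Because part of each change is carried into the next one, the coefficient can overshoot the minimum and swing back before settling, which is the expected behaviour of momentum on a bowl-shaped cost.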
ADAGRAD
The term ADAGRAD stands for Adaptive Gradient Algorithm. As the name says, it uses an adaptive technique to update the weights, and it is well suited to sparse data. It gives each parameter its own learning rate, which changes during training according to the history of that parameter’s gradients. Parameters that have accumulated large gradients get a smaller learning rate, so that we do not end up overshooting the minimum. Similarly, parameters with small or infrequent gradients get a larger learning rate, so that they are trained more quickly.
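A minimal sketch of the ADAGRAD update is given below. It keeps a running sum of squared gradients for each parameter and divides the learning rate by the square root of that sum. The two-parameter cost function, the starting point, and the value of `alpha` are assumptions chosen for illustration.

```python
import math


def gradient(params):
    """Gradient of the illustrative cost f(x, y) = x**2 + 10 * y**2."""
    x, y = params
    return [2.0 * x, 20.0 * y]


def adagrad(params, alpha=1.0, iterations=500, epsilon=1e-8):
    params = list(params)
    squared_gradient_sums = [0.0] * len(params)
    for _ in range(iterations):
        grads = gradient(params)
        for i, g in enumerate(grads):
            squared_gradient_sums[i] += g * g
            # A larger gradient history means a smaller step for this parameter.
            adjusted_alpha = alpha / (math.sqrt(squared_gradient_sums[i]) + epsilon)
            params[i] -= adjusted_alpha * g
    return params


x, y = adagrad([5.0, 5.0])
print(x, y)
```

In this example the gradient for `y` is ten times larger than the gradient for `x`, but its accumulated history is larger by the same proportion, so both parameters end up taking steps of similar size.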
ADAM
Yet another adaptive optimization algorithm that has its roots in Gradient Descent is ADAM, which stands for Adaptive Moment Estimation. It combines the ideas of ADAGRAD and SGD with Momentum: like ADAGRAD it adapts the learning rate of each parameter, although it uses a decaying average of past squared gradients instead of their full sum, and like Momentum it uses a running average of past gradients. In simple terms, ADAM = ADAGRAD + Momentum.
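The commonly published form of the ADAM update can be sketched as follows. The cost function f(c) = (c - 3)² and the value of `alpha` are assumptions for illustration; `beta1`, `beta2`, and `epsilon` are set to values that are widely used as defaults.

```python
import math


def derivative(coefficient):
    """Derivative of the illustrative cost f(c) = (c - 3) ** 2."""
    return 2.0 * (coefficient - 3.0)


def adam(alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8, iterations=1000):
    coefficient = 0.0
    m = 0.0  # running average of gradients (the momentum part)
    v = 0.0  # running average of squared gradients (the adaptive part)
    for t in range(1, iterations + 1):
        delta = derivative(coefficient)
        m = beta1 * m + (1 - beta1) * delta
        v = beta2 * v + (1 - beta2) * delta * delta
        # Correct the bias caused by starting both averages at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        coefficient -= alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return coefficient


coefficient = adam()
print(coefficient)
```

The `m` term plays the role of momentum, while dividing by the square root of `v_hat` scales the step per parameter in the spirit of ADAGRAD.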
Several other variants of Gradient Descent have been developed, and more continue to be developed, such as AMSGrad and AdaMax.
In this article, we have seen how Gradient Descent, one of the most commonly used optimization algorithms in Machine Learning, works, along with its types and the variants that have been developed from it.
Where can the Gradient Descent algorithm contribute the most?
Optimisation is central to how well any machine learning algorithm performs. The Gradient Descent algorithm helps minimise the cost function by iteratively improving the algorithm’s parameters. Although Gradient Descent is used widely in Machine Learning and Deep Learning, its effectiveness depends on the quantity of data, the number of iterations, the accuracy required, and the time available. For small datasets, Batch Gradient Descent works well. Stochastic Gradient Descent (SGD) tends to be more efficient for larger datasets, while Mini Batch Gradient Descent is commonly used when quicker optimisation is needed.
What are the challenges faced in gradient descent?
Gradient Descent is widely used to optimise machine learning models by reducing the cost function. However, it has its shortcomings as well. In models with many layers, the gradient can shrink as it passes back through the layers, which is known as the vanishing gradient problem. In that case the iterations are not effective, because the weights and biases receive updates that are too small for the model to keep learning. The opposite can also happen: the error gradient accumulates across layers and becomes too large to manage, producing very large and unstable updates to the weights and biases. This is called an exploding gradient. In addition, the infrastructure requirements, the choice of learning rate, and the momentum setting all need to be addressed.
Does gradient descent always converge?
Convergence is when the gradient descent algorithm settles at a point where the cost function no longer decreases. Gradient Descent tries to minimise the cost function by adjusting the algorithm’s parameters. However, the point it reaches is not necessarily the global minimum: it may be a local minimum or another point where the gradient is close to zero. One reason for poor convergence is the step size. A step size that is too large causes oscillations around the minimum and can make the algorithm move away from it altogether. Hence, gradient descent does not always converge to the best solution, but with a suitable step size it generally reaches a nearby minimum.
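The effect of the step size can be seen in a small experiment. This is a sketch under assumptions: the cost function f(x) = x², the starting point, and the two values of alpha are chosen only to illustrate the point.

```python
def run_gradient_descent(alpha, iterations=50, start=1.0):
    """Minimise f(x) = x ** 2, whose derivative is 2 * x, from a fixed start."""
    x = start
    for _ in range(iterations):
        x = x - alpha * 2.0 * x
    return x


small_step = run_gradient_descent(alpha=0.1)  # shrinks toward the minimum at 0
large_step = run_gradient_descent(alpha=1.1)  # overshoots further on every step
print(small_step, large_step)
```

With alpha = 0.1 each update multiplies x by 0.8, so it moves steadily toward 0. With alpha = 1.1 each update multiplies x by -1.2, so it jumps from one side of the minimum to the other and gets further away every time.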