Chain Rule Derivative in Machine Learning: Explained

Machine learning has become one of the most talked-about and researched fields in recent years, and for good reason. New models and applications are being discovered every day, and researchers around the globe are working towards the next big thing.

As a result, professionals from varied backgrounds are increasingly interested in switching to machine learning and being part of this ongoing revolution. If you're one such enthusiast looking to take your first steps, the journey begins with understanding the basics of mathematics and statistics.

One such vital topic in mathematics that is highly relevant to machine learning is derivatives. From basic calculus, you'll remember that the derivative of a function is its instantaneous rate of change. In this blog, we'll dive deeper into derivatives and explore the chain rule. We'll see how a function's output changes when we change the independent variables it depends on, whether directly or through other functions. With the chain rule, you'll be able to differentiate the more complex functions that you are sure to encounter in machine learning.

Understanding the Chain Rule Derivative

The chain rule is a formula for calculating the derivative of a composite function, that is, a function built by applying one function to the output of another. So, if f and g are two functions, the chain rule helps us find the derivative of composites such as f ∘ g or g ∘ f.

Considering the composite function f ∘ g, here's what the chain rule derivative looks like:

$$(f \circ g)' = (f' \circ g) \cdot g'$$

The above rule can also be written as:

$$F'(x) = f'(g(x)) \, g'(x)$$

where the function F is the composition of f and g, that is, F(x) = f(g(x)).
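To make the formula concrete, here is a minimal sketch that checks it numerically. The functions f(u) = sin(u) and g(x) = x² are illustrative choices of ours, not something the rule depends on, and the finite-difference helper is only there as an independent check.

```python
import math

def f(u):
    # Outer function (illustrative choice): f(u) = sin(u)
    return math.sin(u)

def f_prime(u):
    # Derivative of the outer function: f'(u) = cos(u)
    return math.cos(u)

def g(x):
    # Inner function (illustrative choice): g(x) = x^2
    return x * x

def g_prime(x):
    # Derivative of the inner function: g'(x) = 2x
    return 2.0 * x

def chain_rule_derivative(x):
    # F'(x) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

def numerical_derivative(func, x, eps=1e-6):
    # Central finite difference, used only to cross-check the formula
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)

x0 = 1.5
analytic = chain_rule_derivative(x0)
numeric = numerical_derivative(lambda x: f(g(x)), x0)
print(analytic, numeric)  # the two values should agree to several decimal places
```

If the two printed values agree closely, the chain rule result matches the slope of the composite function at that point.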

Now, suppose we have three variables such that the third variable (z) depends on the second variable (y), which in turn depends on the first variable (x). In that case, the chain rule derivative looks like this:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

In deep learning, this is the formula that backpropagation applies repeatedly, layer by layer. Since z depends on y and y on x, we can write z = f(y) and y = g(x). Substituting these into the equation gives:

$$\frac{dz}{dx} = f'(y) \, g'(x) = f'(g(x)) \, g'(x)$$
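The x → y → z chain can be sketched in a few lines. The concrete functions y = 3x + 1 and z = y² are assumptions made purely for illustration; the pattern of a forward pass followed by multiplying local derivatives is a simplified picture of what backpropagation does.

```python
def g(x):
    # Inner step (illustrative choice): y = g(x) = 3x + 1
    return 3.0 * x + 1.0

def f(y):
    # Outer step (illustrative choice): z = f(y) = y^2
    return y * y

def dz_dx(x):
    y = g(x)          # forward pass: compute the intermediate value
    dz_dy = 2.0 * y   # local derivative of z with respect to y
    dy_dx = 3.0       # local derivative of y with respect to x
    return dz_dy * dy_dx  # chain rule: dz/dx = dz/dy * dy/dx

x0 = 2.0
grad = dz_dx(x0)
print(grad)  # → 42.0
```

At x = 2 the intermediate value is y = 7, so dz/dy = 14 and dz/dx = 14 × 3 = 42, which matches differentiating z = (3x + 1)² directly.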

Now, let’s look at some examples of chain rule derivatives to better understand the maths behind them. 

Examples and Applications of Chain Rule Derivative

Let us take a well-known example from Wikipedia to understand the chain rule derivative better. Assume you're in free fall from the sky. The atmospheric pressure you experience changes continuously as you fall. Here is a graph that plots this change of atmospheric pressure with elevation:

[Figure: atmospheric pressure plotted against elevation]

Suppose your fall starts at 4000 meters above sea level. Your initial velocity is zero, and gravity accelerates you at 9.8 meters per second squared.

Now, let's map this situation onto the chain rule setup from earlier. In this example, we'll use the variable t for time instead of x.

Then the variable y = g(t), the distance fallen since the beginning of the fall, is given by:

$$g(t) = \tfrac{1}{2} \cdot 9.8\, t^2 = 4.9\, t^2$$

The height above sea level, h, is the starting height minus the distance fallen, so h = 4000 − g(t).

Assume that, based on a model, we can also write the function of the atmospheric pressure at any height h as: 

$$f(h) = 101325\, e^{-0.0001h}$$

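Now, you can differentiate each equation with respect to its own independent variable. The distance fallen has derivative g′(t) = 9.8t, which is your downward speed at time t. Because the height is the starting height minus g(t), it changes at the rate:

$$h'(t) = -g'(t) = -9.8t$$

Here, h′(t) is your vertical velocity at time t. It is negative because you are losing height.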

$$f'(h) = -10.1325\, e^{-0.0001h}$$

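Here, f′(h) is the rate of change of atmospheric pressure with respect to height h. Now, the question is: can we combine these two derivatives to get the rate of change of atmospheric pressure with respect to time? Let's see using the chain rule, writing h(t) for your height at time t and h′(t) = −9.8t for its rate of change:

$$\frac{d}{dt} f(h(t)) = f'(h(t)) \cdot h'(t) = \left(-10.1325\, e^{-0.0001\,h(t)}\right)(-9.8t) = 99.2985\, t\, e^{-0.0001\,h(t)}$$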

The final equation gives us the rate of change of atmospheric pressure with respect to the time elapsed since the fall began. In machine learning, a neural network's weights are repeatedly updated based on the error in its predictions. The chain rule tells us how that error changes with each weight, which is what lets us adjust the weights and bring the model closer to the correct output.
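The free-fall calculation can be sketched in code as follows. This is a minimal sketch under the article's own model (a 4000-meter starting height, 9.8 m/s² acceleration, and the exponential pressure formula). The function names are ours, and reading the pressure in pascals is an assumption, since the article does not state units.

```python
import math

START_HEIGHT = 4000.0  # meters above sea level, as in the example
GRAVITY = 9.8          # meters per second squared

def height(t):
    # h(t) = 4000 - 0.5 * 9.8 * t^2
    return START_HEIGHT - 0.5 * GRAVITY * t * t

def height_rate(t):
    # h'(t) = -9.8 t (negative: height decreases during the fall)
    return -GRAVITY * t

def pressure(h):
    # f(h) = 101325 * exp(-0.0001 h); units assumed to be pascals
    return 101325.0 * math.exp(-0.0001 * h)

def pressure_rate_wrt_height(h):
    # f'(h) = -10.1325 * exp(-0.0001 h)
    return -10.1325 * math.exp(-0.0001 * h)

def pressure_rate_wrt_time(t):
    # Chain rule: d/dt f(h(t)) = f'(h(t)) * h'(t)
    return pressure_rate_wrt_height(height(t)) * height_rate(t)

t0 = 10.0  # seconds into the fall
analytic = pressure_rate_wrt_time(t0)

# Independent check with a central finite difference on f(h(t))
eps = 1e-5
numeric = (pressure(height(t0 + eps)) - pressure(height(t0 - eps))) / (2.0 * eps)
print(analytic, numeric)
```

The result is positive, which makes sense: as you fall, the height decreases and the pressure around you rises.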

As you can see, the chain rule is useful well beyond textbook exercises. In machine learning and deep learning especially, it is used to work out how to update the weights of the neurons so that the model's predictions improve.

Now that you're aware of the basics of the chain rule, go ahead and try a few problems on your own. Look up a few composite functions and try to find their derivatives. The more you practice, the clearer the concepts will become, and the easier it will be to understand what happens when you train your machine learning models.


How are gradients used in machine learning?

The gradient of a function is the vector of its partial derivatives, and it points in the direction of steepest increase. Gradient descent is an optimization algorithm that finds a local minimum of a differentiable function by repeatedly stepping against the gradient. It is used extensively in machine learning, for both classification and regression problems, to find the parameters that minimize a model's cost function.
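As a small illustration, here is a sketch of gradient descent on a one-parameter model. The squared-error cost, the single training point (x = 2, target = 6), and the learning rate are all assumptions chosen to keep the example tiny. The gradient itself comes from the chain rule.

```python
def cost(w, x, target):
    # Squared error of a one-weight model: C(w) = (w * x - target)^2
    return (w * x - target) ** 2

def cost_gradient(w, x, target):
    # Chain rule: dC/dw = dC/d(error) * d(error)/dw = 2 * error * x
    error = w * x - target
    return 2.0 * error * x

def gradient_descent(w, x, target, learning_rate=0.1, steps=50):
    # Repeatedly step against the gradient to reduce the cost
    for _ in range(steps):
        w -= learning_rate * cost_gradient(w, x, target)
    return w

w_final = gradient_descent(w=0.0, x=2.0, target=6.0)
print(w_final, cost(w_final, 2.0, 6.0))
```

With these values the weight moves towards 3, the value at which w × 2 equals the target 6 and the cost reaches zero.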

What is the purpose of using activation functions in neural networks?

The purpose of an activation function is to introduce non-linearity into a neural network, which lets the network learn complicated patterns in data. Without activation functions, a neural network could only perform linear mappings from inputs to outputs, because forward propagation would reduce to dot products between input vectors and weight matrices, and a composition of linear maps is itself linear. With activation functions, the model can represent non-linear relationships between inputs and outputs.
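The point about linear mappings can be checked with a toy example. Here is a minimal sketch using two one-dimensional "layers" with made-up weights and biases. ReLU is used as the activation purely as an example.

```python
# Hypothetical weights and biases for two one-dimensional layers
W1, B1 = 2.0, -1.0
W2, B2 = 3.0, 0.5

def relu(v):
    # A common activation function: max(0, v)
    return max(0.0, v)

def stacked_linear(x):
    # Two layers with no activation in between
    return W2 * (W1 * x + B1) + B2

def collapsed_linear(x):
    # The same mapping written as ONE linear layer:
    # W2 * (W1 * x + B1) + B2 = (W2 * W1) * x + (W2 * B1 + B2)
    return (W2 * W1) * x + (W2 * B1 + B2)

def stacked_with_relu(x):
    # Same two layers, but with a ReLU between them
    return W2 * relu(W1 * x + B1) + B2

inputs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
for x in inputs:
    print(x, stacked_linear(x), collapsed_linear(x), stacked_with_relu(x))
```

Without the activation, the two layers always match the single collapsed layer. With ReLU in between, the outputs differ for inputs where the first layer's output is negative, so the network is no longer a single linear map.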

Is it important to have a good knowledge of calculus for machine learning?

Calculus is essential for understanding the inner workings of machine learning algorithms such as gradient descent, which minimizes an error function using its rate of change. As a beginner, you do not need to master every idea in calculus to get started; the fundamentals of algebra and calculus will take you a long way. But if you're a data scientist who wants to know what's going on behind the scenes in your machine learning project, you'll need to understand calculus in depth.
