The last decade has seen tremendous growth in Artificial Intelligence and smarter machines. The field has given rise to many sub-disciplines that are specializing in distinct aspects of human intelligence. For instance, natural language processing tries to understand and model human speech, while computer vision aims to provide human-like vision to machines.
Since we’ll be talking about Convolutional Neural Networks, our focus will mostly be on computer vision. Computer vision aims to enable machines to view the world as we do and to solve problems such as image recognition and image classification. Convolutional Neural Networks, also known as CNNs or ConvNets, are used for many of these tasks. Their architecture is loosely modeled on the way neurons in the brain’s visual system are connected and communicate.
The biological significance of a Convolutional Neural Network
CNNs are inspired by our visual cortex. It is the area of the cerebral cortex that is involved in visual processing in our brain. The visual cortex has various small cellular regions that are sensitive to visual stimuli.
This idea was expanded in 1962 by Hubel and Wiesel, whose experiment showed that distinct neuronal cells respond (fire) to the presence of edges of a specific orientation. For instance, some neurons fire on detecting horizontal edges, others on detecting diagonal edges, and still others on detecting vertical edges. Through this experiment, Hubel and Wiesel found that the neurons are organized in a modular manner, and that all the modules together are required to produce visual perception.
This modular approach – the idea that specialized components inside a system have specific tasks – is what forms the basis of the CNNs.
With that settled, let’s move on to how CNNs learn to perceive visual inputs.
Convolutional Neural Network Learning
Images are composed of individual pixels, and each pixel is represented by a number between 0 and 255 (a colour image stores one such number per channel, for example red, green, and blue). So, any image that you see can be converted into a digital representation using these numbers, and that is how computers work with images.
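To make the pixel representation concrete, here is a minimal sketch of a tiny grayscale image stored as a grid of numbers. The 3x3 values are made up for illustration, and the scaling to [0, 1] is an assumption about typical preprocessing rather than something this article prescribes.

```python
# A tiny 3x3 grayscale "image": each entry is one pixel intensity,
# where 0 is black and 255 is white (hypothetical values).
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 100],
]

# Scale every pixel to the range [0, 1]; this is a common preprocessing
# step, assumed here purely for illustration.
normalized = [[pixel / 255 for pixel in row] for row in image]

height, width = len(image), len(image[0])
print(height, width)      # dimensions of the image grid
print(normalized[0])      # first row after scaling
```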
Here are some major operations that go into making a CNN learn for image detection or classification. This will give you an idea of how learning takes place in CNNs.
1. Convolution
Mathematically, convolution is an operation that combines two functions by integration to show how one function modifies the other. Here’s how it can be defined in mathematical terms:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

For images, the integral becomes a sum over the pixels covered by the filter.
The purpose of convolution is to detect different visual features in the images, like lines, edges, colors, shadows, and more. This is a very useful property because once your CNN has learned the characteristics of a particular feature in the image, it can later recognize that feature in any other part of the image.
CNNs use kernels, or filters, to detect the features present in an image. A kernel is just a matrix of values (known as weights in the world of Artificial Neural Networks) trained to detect a specific feature. The filter moves over the entire image to check whether that feature is present. At each position, it carries out the convolution operation to produce a value that represents how strongly the feature is present there.
If a feature is present in the image, the result of the convolution operation is a positive number with a high value. If the feature is absent, the convolution operation results in either 0 or a very low-valued number.
Let’s understand this better using an example. In the below image, a filter has been trained for detecting a plus sign. Then, the filter is passed over the original image. Since a part of the original image contains the same features that the filter is trained for, the value in each cell where the feature exists is a positive number. As a result, the convolution operation at that position also produces a large number.
However, when the same filter is passed over an image with a different set of features and edges, the output of a convolution operation will be lower – implying there wasn’t any strong presence of any plus sign in the image.
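The plus-sign example can be sketched in a few lines. The kernel and the two 3x3 patches below are hypothetical hand-written values, not trained weights; the point is only that the sum of element-wise products is large when the patch matches the filter and small when it does not.

```python
def filter_response(patch, kernel):
    """Sum of element-wise products between an image patch and a kernel."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )

# Hypothetical 3x3 kernel that responds to a plus sign.
plus_kernel = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# A patch containing a plus sign, and a patch containing a diagonal line.
plus_patch = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
diagonal_patch = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

plus_score = filter_response(plus_patch, plus_kernel)          # 5
diagonal_score = filter_response(diagonal_patch, plus_kernel)  # 1
print(plus_score, diagonal_score)
```

The matching patch scores 5 because all five active kernel cells line up with active pixels, while the diagonal patch only overlaps at the centre.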
So, in the case of complex images having various features like curves, edges, colours, and so on, we’ll need N such feature detectors, one per feature.
When a filter is passed over the image, a feature map is generated, which is the output matrix that stores the convolutions of this filter over different parts of the image. With many filters, the feature maps are stacked and we end up with a 3D output. Each filter must have the same number of channels as the input image for the convolution operation to take place.
Further, a filter can be slid over the input image at different intervals, using a stride value. The stride value specifies how many pixels the filter moves at each step.
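Sliding a filter over an image with a stride can be sketched as follows. This assumes a single-channel image, a square kernel, and no padding, and it computes what deep learning libraries usually call convolution (the kernel is not flipped). The function name `convolve2d` and the sample values are illustrative choices.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a 2D image and return the feature map."""
    k = len(kernel)  # assumes a square kernel
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            top, left = i * stride, j * stride
            # Sum of element-wise products for the window at (top, left).
            value = sum(
                image[top + a][left + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(value)
        feature_map.append(row)
    return feature_map

# 5x5 image with a plus sign in the top-left corner (hypothetical data).
image = [
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
plus_kernel = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

map_stride_1 = convolve2d(image, plus_kernel, stride=1)
map_stride_2 = convolve2d(image, plus_kernel, stride=2)
print(map_stride_1)  # [[5, 2, 1], [2, 2, 0], [1, 0, 0]]
print(map_stride_2)  # [[5, 1], [1, 0]]
```

The largest value sits where the filter lines up with the plus sign, and a larger stride produces a smaller feature map.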
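The size of the output of a given convolutional block can therefore be determined using the following formula, where n is the input size, f the filter size, p the padding (covered in the next section), and s the stride:

$$n_{out} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$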
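A commonly used form of this calculation gives the width or height of the feature map from the input size, filter size, stride, and padding. The helper below is a sketch of that standard expression; the name `conv_output_size` and its parameter names are assumptions made for this example.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension.

    Standard expression: floor((n + 2p - f) / s) + 1.
    """
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(5, 3))                       # 3
print(conv_output_size(5, 3, stride=2))             # 2
print(conv_output_size(5, 3, padding=1))            # 5
print(conv_output_size(7, 3, stride=2, padding=1))  # 4
```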
2. Padding
One issue while working with convolutional layers is that some pixels tend to be lost on the perimeter of the original image. Since the filters used are generally small, only a few pixels are lost per layer, but this adds up as we apply successive convolutional layers, resulting in many pixels lost.
Padding means adding extra pixels around the image before a filter of a CNN processes it. Typically the image is padded with zeroes, which gives the kernel room to cover the entire image, including its border. With zero padding, the filter can be centred on border pixels too, so the feature map can keep the same size as the input.
Check the image above – padding has been done by adding additional zeroes at the boundary of the input image. This enables the capture of all the distinct features without losing any pixels.
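Zero padding can be sketched as wrapping the image in a border of zeroes. The helper name `zero_pad` and the 3x3 sample values are illustrative assumptions.

```python
def zero_pad(image, pad=1):
    """Return a copy of a 2D image surrounded by `pad` rows/columns of zeroes."""
    padded_width = len(image[0]) + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]       # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)    # left/right border
    padded.extend([0] * padded_width for _ in range(pad))   # bottom border
    return padded

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, pad=1)
for row in padded:
    print(row)
```

With one pixel of padding, the 3x3 image becomes 5x5, so a 3x3 filter with stride 1 produces a 3x3 feature map, the same size as the original input.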
3. Activation Map
The feature maps need to be passed through a mapping function that is non-linear in nature. A bias term is added to the feature maps, and the result is passed through an activation function, typically ReLU, which outputs zero for negative inputs and leaves positive inputs unchanged. This brings nonlinearity into the CNN, which it needs because the relationship between pixel values and the objects in an image is itself non-linear.
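The activation step can be sketched as adding a bias to every value in a feature map and then applying ReLU. The feature map values and the bias of -1.0 are hypothetical; in a trained network the bias is learned.

```python
def relu(x):
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return max(0.0, x)

def activate(feature_map, bias=0.0):
    """Add a bias to each value of a feature map, then apply ReLU."""
    return [[relu(value + bias) for value in row] for row in feature_map]

feature_map = [
    [5, -2, 1],
    [-3, 2, 0],
    [1, 0, -4],
]
activation_map = activate(feature_map, bias=-1.0)
print(activation_map)  # [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```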
4. Pooling Stage
Once the activation phase is over, we move on to the pooling step, wherein the CNN down-samples the convolved features, which helps save processing time. This also reduces the overall size of the feature maps and helps limit overfitting and other issues that would occur if the Convolutional Neural Network were fed a lot of information, especially information that is not very relevant to classifying or detecting the image.
Pooling is basically of two types here: max pooling and min pooling. In the former, a window is passed over the feature map according to a set stride value, and at each step, the maximum value inside the window is written to the output matrix. In min pooling, the minimum values are written to the output matrix instead. (Average pooling, which takes the mean of each window, is another widely used variant.)
The new matrix that’s formed as a result of the outputs is called a pooled feature map.
Out of min and max pooling, one benefit of max pooling is that it allows the CNN to focus on a few neurons which have high values instead of focusing on all the neurons. Such an approach makes the network less likely to overfit the training data and helps it generalize well.
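Both pooling variants can be sketched with one helper that slides a window over a feature map and reduces each window to a single value. The function name `pool2d`, the 2x2 window with stride 2, and the sample values are assumptions for illustration.

```python
def pool2d(feature_map, window=2, stride=2, reducer=max):
    """Down-sample a 2D feature map by reducing each window to one value."""
    out_rows = (len(feature_map) - window) // stride + 1
    out_cols = (len(feature_map[0]) - window) // stride + 1
    pooled = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Collect every value covered by the window at this position.
            values = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(window)
                for b in range(window)
            ]
            row.append(reducer(values))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
max_pooled = pool2d(feature_map, reducer=max)
min_pooled = pool2d(feature_map, reducer=min)
print(max_pooled)  # [[6, 4], [8, 9]]
print(min_pooled)  # [[1, 1], [1, 0]]
```

Either way, the 4x4 feature map shrinks to a 2x2 pooled feature map.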
5. Flattening
After pooling is done, the 3D representation of the image is flattened into a feature vector. This is then passed into a multi-layer perceptron to produce the output. Check out the image below to better understand the flattening operation:
As you can see, the rows of the matrix are concatenated into a single feature vector. If multiple input layers are present, all the rows are connected to form a longer flattened feature vector.
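Flattening can be sketched as concatenating the rows of every pooled feature map into one list. The two 2x2 pooled maps are hypothetical values.

```python
def flatten(pooled_maps):
    """Concatenate the rows of every pooled map into one feature vector."""
    return [value for fmap in pooled_maps for row in fmap for value in row]

# Two hypothetical 2x2 pooled feature maps (one per filter).
pooled_maps = [
    [[6, 4], [8, 9]],
    [[1, 1], [1, 0]],
]
feature_vector = flatten(pooled_maps)
print(feature_vector)  # [6, 4, 8, 9, 1, 1, 1, 0]
```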
6. Fully Connected Layer (FCL)
In this step, the flattened map is fed to a neural network. The complete network includes an input layer, the fully connected layers, and a final output layer. The fully connected layers play the role of the hidden layers in an ordinary Artificial Neural Network: every neuron is connected to every output of the previous layer. The information passes through the entire network, and a prediction error is calculated. This error is then sent back through the network (backpropagation) to adjust the weights and make the final output more accurate.
The final output obtained from the above layer of the neural network doesn’t generally add up to one. These outputs need to be converted to numbers in the range [0,1] that sum to one, which will then represent the probabilities of each class. For this, the Softmax function is used.
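The forward pass of one fully connected layer can be sketched as a weighted sum of all inputs plus a bias for each output neuron. The weights and biases below are hand-picked hypothetical values; in practice they are learned through backpropagation, which this sketch does not implement.

```python
def dense(inputs, weights, biases):
    """One output per row of `weights`: weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

feature_vector = [6, 4, 8, 9]
weights = [
    [0.5, 0.0, -0.25, 0.0],  # hypothetical weights for class 0
    [0.0, 0.25, 0.0, 0.5],   # hypothetical weights for class 1
]
biases = [1.0, -1.0]

scores = dense(feature_vector, weights, biases)
print(scores)  # [2.0, 4.5]
```

Each score depends on every element of the feature vector, which is what "fully connected" refers to.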
The output obtained from the dense layer is fed to the Softmax activation function. Through this, all the final outputs are mapped to a vector where the sum of all the elements comes out to be one.
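A minimal sketch of the Softmax function follows. Subtracting the largest score before exponentiating is a standard numerical-stability trick added here as an assumption; it does not change the result. The input scores are made up.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to one."""
    largest = max(scores)
    # Shift by the largest score so math.exp never overflows.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 4.5, 1.0])
print(probabilities)
print(sum(probabilities))
```

The class with the highest raw score keeps the highest probability, and the values can be read as the network’s confidence in each class.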
The fully connected layer works by looking at the previous layer’s output and then determining which features most correlate to a specific class. Thus, if the program predicts whether or not an image contains a cat, it will have high values in the activation maps that represent features like four legs, paws, tail, and so on. Likewise, if the program is predicting something else, it will have different types of activation maps. The fully connected layer learns weights that tie each feature to the classes it strongly correlates with, so that combining the weights with the previous layer’s output yields correct probabilities for the distinct output classes.
A quick summary of the working of CNNs
Here’s a quick summary of the entire process of how a CNN works and helps in computer vision:
- The different pixels from the image are fed to the convolutional layer, where a convolution operation is performed.
- The previous step results in a convolved map.
- This map is passed through a rectifier function to give rise to a rectified map.
- The image is processed with different convolutions and activation functions for locating and detecting different features.
- Pooling layers down-sample the rectified maps, keeping the strongest responses from each region of the image.
- The pooled layer is flattened and used as an input to the fully connected layer.
- The fully connected layer calculates the probabilities and gives an output in the range of [0,1].
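The steps above can be strung together into a single forward pass. This is a sketch under several assumptions: a single-channel 6x6 image, square 3x3 kernels, no padding, 2x2 max pooling, and hand-picked (untrained) kernels and weights. All function names are illustrative, and no training is performed.

```python
import math

def convolve2d(image, kernel):
    """Slide a square kernel over a square image with stride 1, no padding."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(size)
        ]
        for i in range(size)
    ]

def relu_map(fmap):
    """Rectify a feature map: negative values become 0."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, window=2):
    """Non-overlapping max pooling over window x window regions."""
    size = len(fmap) // window
    return [
        [
            max(fmap[i * window + a][j * window + b]
                for a in range(window) for b in range(window))
            for j in range(size)
        ]
        for i in range(size)
    ]

def flatten(maps):
    """Concatenate all pooled maps into one feature vector."""
    return [v for fmap in maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per class."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]

def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(image, kernels, weights, biases):
    """Convolution -> ReLU -> pooling -> flattening -> dense -> softmax."""
    pooled = [max_pool(relu_map(convolve2d(image, k))) for k in kernels]
    return softmax(dense(flatten(pooled), weights, biases))

# Hypothetical 6x6 image with a plus sign in the top-left corner.
image = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernels = [
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],  # plus-sign detector
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # diagonal detector
]
# Hand-picked weights: class 0 ("plus") favours the first filter's response,
# class 1 ("diagonal") favours the second filter's response.
weights = [
    [1, 0, 0, 0, -1, 0, 0, 0],
    [-1, 0, 0, 0, 1, 0, 0, 0],
]
biases = [0.0, 0.0]

probabilities = forward(image, kernels, weights, biases)
print(probabilities)  # class 0 ("plus") receives almost all the probability
```

In a real CNN the kernels and weights are learned from data through backpropagation; here they are fixed by hand only to show how data flows through the stages.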
In Conclusion
The inner functioning of CNNs is very exciting and opens a lot of possibilities for innovation and creation. Likewise, other technologies under the umbrella of Artificial Intelligence are fascinating and are trying to bridge the gap between human capabilities and machine intelligence. Consequently, people from all over the world, belonging to different domains, are realizing their interest in this field and are taking the first steps.
Luckily, the AI industry is exceptionally welcoming and doesn’t distinguish based on your academic background. All you need is working knowledge of the technologies along with basic qualifications, and you’re all set!
If you wish to master the nitty-gritty of ML and AI, the ideal course of action would be to enroll in a professional AI/ML program. For instance, our Executive Programme in Machine Learning and AI is the perfect course for data science aspirants. The program covers subjects like statistics and exploratory data analytics, machine learning, and natural language processing. Also, it includes over 13 industry projects, 25+ live sessions, and 6 capstone projects. The best part about this course is that you get to interact with peers from across the world. It facilitates the exchange of ideas and helps learners build lasting connections with people from diverse backgrounds. Our 360-degree career assistance is just what you need to excel in your ML and AI journey!