Convolutional Neural Network Architecture: What You Need To Know?

Convolutional Neural Networks, usually called ConvNets or CNNs, are one of the most commonly used neural network architectures. CNNs are generally used for image-based data. Image recognition, image classification and object detection are some of the areas where CNNs are widely used.

The branch of applied AI that deals specifically with image data is termed Computer Vision. There has been monumental growth in Computer Vision since the introduction of CNNs. The first part of a CNN extracts features from images using convolution, followed by an activation function that introduces non-linearity.

The last block feeds these features into a neural network to solve a specific problem. For example, a classification problem will have ‘n’ output neurons, where n is the number of classes. Let us try to understand the architecture and working of a CNN.


Convolution is an image processing technique in which a weighted kernel (a small square matrix) slides over the image; at each position the kernel elements are multiplied by the image pixels they cover and the products are added up. This method can be easily visualised by the image shown below.

Image by: Peltarion

Convolution filter and output

As we can see, when we use a 3×3 convolution kernel, a 3×3 part of the image is operated on, and after multiplication and subsequent addition one value comes out as the output. So on a 4×4 image we’ll get a 2×2 convolved output matrix, given that the kernel size is 3×3.
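The sliding multiply-and-add described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a stride of 1 and no padding; the function name `conv2d_valid` and the sample image and kernel values are made up for the example. Like most deep learning libraries, it does not flip the kernel before sliding it.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    k = len(kernel)                  # kernel is assumed to be square (k x k)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each kernel element with the pixel it covers, then add up.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image and 3x3 kernel (values chosen only for illustration).
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-2, -2], [2, -2]]
```

The 4×4 input and 3×3 kernel give a 2×2 output, matching the sizes discussed above.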

The size of the convolved output depends on the size of the kernel used for convolution. This is the typical starting layer of a CNN. The convolved output represents the features found in the image, and what it captures is directly related to the kernel size being used.

If the characteristics of the images are such that even small differences will make an image fall into a different output category, then a small kernel size is used for feature extraction. Otherwise a bigger kernel can be used. The values in the kernel are often termed convolutional weights. These are initialised and then updated through backpropagation using gradient descent.

Convolutional Neural Network Architecture

Convolutional neural network architecture is the term for a CNN’s general framework and system of organisation. It consists of several interconnected layers for feature extraction, transformation, and classification. The main elements of CNN architecture for image classification or other applications are convolution layers, non-linearity layers, pooling layers, and fully connected layers.

Convolution Layer

The foundation of CNNs is the convolution layer. By applying several filters or kernels, it uses convolutions to extract local features from the input image. These filters detect patterns that are useful for classifying images, such as edges, corners, and textures.

Motivation behind Convolution

Convolution is used in CNNs because visual data is believed to display local spatial correlations. CNNs can recognise complex image patterns by effectively capturing and preserving these correlations through convolution processes.

Non-Linearity Layers

Non-linearity layers such as ReLU (Rectified Linear Unit) introduce non-linear activation functions into the network. As a result, CNNs can learn complex representations and become more expressive. Non-linearity layers improve the network’s capacity to represent and identify complex image features.

Designing a Convolutional Neural Network

The right number of convolutional layers, kernel sizes, pooling techniques, and fully connected layers must be chosen when designing an effective CNN. The architecture should balance the network’s ability to capture high-level features against the danger of overfitting.

For example, let’s use a convolutional neural network (CNN) to classify handwritten digit images from the MNIST dataset. Here is an illustration of the design procedure:

Identify the Issue: 

The challenge is to correctly categorise pictures of handwritten digits into one of ten classes (0–9).

Data Preparation: 

Divide the MNIST dataset into training and testing sets after bringing the images to a standard size and normalising pixel values to between 0 and 1.

Build the architecture by selecting its components: convolutional layers, pooling layers, and fully connected layers. As an illustration, we may begin with two convolutional layers with 32 filters each, followed by a pooling layer. After that, add another convolutional layer with 64 filters and a second pooling layer. Finish with fully connected layers of appropriate sizes.

Determine Layer Sizes: 

List the number of filters, kernel sizes, and strides for each layer. For the convolutional layers in this illustration, we can use 3×3 filters, and the pooling windows can be 2×2. Depending on the desired level of complexity for the model, the number of neurons in the fully connected layers can be chosen.
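To see how these choices determine the feature map sizes, the layer sizes can be traced by hand or with a small script. The sketch below assumes 28×28 greyscale MNIST inputs, a stride of 1 and no padding for the convolutions, and a stride of 2 for the 2×2 pooling windows; none of these are fixed by the text above, and the layer names are made up for the example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution layer."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, cell, stride):
    """Output width/height of a pooling layer."""
    return (size - cell) // stride + 1


# (layer name, kind, number of filters) for the example architecture.
layers = [
    ("conv1", "conv", 32),
    ("conv2", "conv", 32),
    ("pool1", "pool", None),
    ("conv3", "conv", 64),
    ("pool2", "pool", None),
]

side, depth = 28, 1  # assumed input: 28x28 greyscale image
shapes = [("input", side, side, depth)]
for name, kind, filters in layers:
    if kind == "conv":
        side = conv_out(side, kernel=3)   # 3x3 filters
        depth = filters                   # one output channel per filter
    else:
        side = pool_out(side, cell=2, stride=2)  # pooling keeps the depth
    shapes.append((name, side, side, depth))

flattened = side * side * depth  # inputs seen by the first fully connected layer
for shape in shapes:
    print(shape)
print("flattened size:", flattened)
```

Under these assumptions the feature maps shrink from 28×28×1 to 5×5×64, so the first fully connected layer receives 1600 values.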

Choose hyperparameters like learning rate, batch size, and regularisation strategies. Set the learning rate to 0.001, the batch size to 64, and the dropout regularisation rate to 0.25, for example.

Since this is a multi-class classification problem, configure a loss function that quantifies the difference between predicted and actual labels. The output layer should employ a softmax activation function.
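As a rough illustration of what such a loss function measures, the sketch below uses cross-entropy, a common choice for multi-class classification that is assumed here rather than prescribed by the text. The probability vectors are made-up softmax outputs for a three-class problem.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log of the probability the model assigned to the correct class."""
    return -math.log(probabilities[true_class])


# Hypothetical softmax outputs for one image whose true class is 1.
confident_and_right = [0.1, 0.7, 0.2]
confident_and_wrong = [0.7, 0.1, 0.2]

loss_good = cross_entropy(confident_and_right, true_class=1)
loss_bad = cross_entropy(confident_and_wrong, true_class=1)
print(loss_good, loss_bad)
```

The loss is small when the correct class receives a high probability and grows as that probability shrinks, which is the signal backpropagation uses to adjust the weights.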

Train the CNN on the training set using backpropagation and stochastic gradient descent. Track the accuracy and loss on the validation set to keep an eye on the training process. If necessary, modify the model and hyperparameters to enhance performance.

Testing and deployment: 

Once you are satisfied with the model’s performance on the validation set, assess how well it generalises by evaluating its accuracy on the testing set. Finally, use the CNN to classify fresh, previously unseen images of handwritten digits.


The pooling layer is placed between convolution layers. It is responsible for performing pooling operations on the feature maps sent by a convolution layer. Pooling reduces the spatial size of the features, which is a form of dimensionality reduction.

One of the major reasons for pooling is to decrease the computational power required to process the data. Although a pooling layer reduces the size of the feature maps, it preserves their important characteristics. It works much like a convolution filter: the kernel moves over the features and aggregates the values it covers.

Various aggregation functions can be used. Average and max pooling are the most commonly used pooling operations. Pooling reduces the dimensions of the features but keeps their main characteristics intact.

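Because pooling shrinks the feature maps, the number of parameters and calculations in the following layers also goes down. This reduces over-learning (overfitting) and increases the efficiency of the network. Max pooling is the most commonly used: after pooling, the positions of the maximum values are known less precisely than in the feature maps coming out of the convolution, which makes the network less sensitive to the exact location of a feature.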

This is good for many cases. For example, to recognise a dog, its ears do not need to be located as precisely as possible; knowing that they are located roughly next to the head is enough.

Max Pooling also acts as a noise suppressant. It discards noisy activations altogether, so it performs de-noising along with dimensionality reduction. Average Pooling, on the other hand, only performs dimensionality reduction as its noise-suppressing mechanism. Hence, Max Pooling often performs better than Average Pooling in practice.
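The two aggregation functions can be compared side by side with a small sketch. It assumes a 2×2 cell moved with a step of 2; the function name `pool2d` and the feature map values are hypothetical.

```python
def pool2d(feature_map, cell=2, stride=2, mode="max"):
    """Aggregate each cell x cell window of the feature map with max or average."""
    if mode == "max":
        aggregate = max
    else:
        aggregate = lambda values: sum(values) / len(values)

    pooled = []
    for i in range(0, len(feature_map) - cell + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - cell + 1, stride):
            # Collect the values covered by the window at position (i, j).
            window = [feature_map[i + a][j + b]
                      for a in range(cell) for b in range(cell)]
            row.append(aggregate(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map produced by a convolution layer.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.5, 1.0], [1.0, 6.0]]
```

Both variants turn the 4×4 map into a 2×2 map; max pooling keeps only the strongest activation in each cell, while average pooling blends all four values.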

Activation Function

ReLU (Rectified Linear Units) is the most commonly used activation function layer. 

Equation for the same is: ReLU(x)=max(0,x) 

And graphical representation is given below:

Source: Medium

ReLU representation

ReLU maps negative values to zero and keeps positive values as they are.
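A minimal sketch of this behaviour, applied element by element to a few made-up activation values:

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


pre_activations = [-2.0, -0.5, 0.0, 1.5, 3.0]  # hypothetical inputs
activations = [relu(v) for v in pre_activations]
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```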

Fully Connected Layer

A fully connected layer is usually the last layer of a CNN. This layer receives an input vector and produces a new output vector. The output layer has n neurons, where n is the number of classes in the image classification task. Each element of the output vector gives the probability of the image belonging to a certain class. Hence the sum of all the elements of the output vector is always 1.

The calculations happening in the output layer are as follows:

  1. Each input element is multiplied by the corresponding weight of the neuron, and the products are summed
  2. An activation function is applied to the layer (sigmoid, i.e. logistic, when n=2; softmax when n>2)

The output will now be the probability of the image belonging to a certain class. The weights of the layer are learnt during training by backpropagation of the gradient.
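The two steps can be sketched as follows for n=3 classes. The weights, inputs, and bias terms are made-up values; the bias is a common addition that the description above does not mention, so it is set to zero here.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum of the inputs for each output neuron (one weight row per neuron)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [1.0, 2.0]                            # hypothetical input vector
weights = [[0.5, -0.5], [1.0, 0.0], [-1.0, 1.0]]  # one row per class
biases = [0.0, 0.0, 0.0]                         # assumption: biases left at zero

scores = dense(features, weights, biases)
probabilities = softmax(scores)
print(scores)
print(probabilities, sum(probabilities))
```

The class with the highest score receives the highest probability, and the probabilities add up to 1 as described above.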

Dropout Layer

Dropout layers work as regularisation layers that reduce overfitting and improve generalisation. Overfitting is a major concern when using a neural network. Dropout, as the name suggests, drops out some percentage of the neurons in the layer after which it is applied.

Dropout regularises by approximating the training of a large number of neural networks with different architectures in parallel. During training, some of the layer outputs are randomly dropped or ignored. This makes the layer behave like a layer with a different number of nodes, with some neurons turned off. Hence its connectivity to the previous layer also changes.
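A minimal sketch of the random dropping is shown below. It uses the "inverted dropout" convention, in which the surviving outputs are scaled up during training so that nothing needs to change at test time; this scaling is an assumption of the sketch and is not described in the text. The function name and values are hypothetical.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Randomly zero a fraction `rate` of the values during training."""
    if not training or rate == 0.0:
        return list(values)          # dropout is switched off at test time
    keep = 1.0 - rate
    # Keep each value with probability `keep`, scaling survivors by 1/keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)               # fixed seed so the run is repeatable
layer_output = [1.0] * 10            # hypothetical outputs of a layer
dropped = dropout(layer_output, rate=0.25, rng=rng)
print(dropped)
print(dropout(layer_output, rate=0.25, training=False))
```

Each training step draws a fresh random mask, so the layer effectively looks like a different, thinner layer every time.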


There are certain parameters which can be controlled according to the image data being dealt with. Each layer of a CNN can be parameterised, be it a convolution layer or a pooling layer. These parameters affect the size of the feature map that is the output of that specific layer.

Each image (input) or feature map (output of a subsequent layer) has dimensions W x H x D, where W x H is width x height, i.e. the size of the map or image, and D is the depth, i.e. the number of channels. Monochrome images have D=1 and RGB, i.e. coloured, images have D=3.

Convolution Layer hyperparameters

  1. Number of filters (K)
  2. Size of the filter (F) of the dimension FxFxD
  3. Stride (S): Number of pixels by which the kernel shifts over the image at each step. S=1 means that the kernel moves one pixel at a time.
  4. Zero padding (P): Zeros are added around the border of the input. This is useful for small images, because convolution and max-pool layers reduce the size of the feature map at every layer.

Source: XRDS

Zero padding increased the size of the input image

For each input image of size W×H×D, the convolution layer returns a matrix of dimensions Wc×Hc×Dc, where

Wc= (W-F+2P)/S+1

Hc= (H-F+2P)/S+1

Dc= K

Solving these equations so that the output keeps the same width and height as the input gives Padding (P) = (F-1)/2 with Stride (S) = 1.

In general, we then choose F=3,P=1,S=1 or F=5,P=2,S=1
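These formulas are easy to check numerically. The sketch below assumes a hypothetical 32-pixel-wide input and uses integer division, which matches the formula whenever (W-F+2P) is divisible by S.

```python
def conv_output_size(w, f, p, s):
    """Wc = (W - F + 2P) / S + 1 for one spatial dimension."""
    return (w - f + 2 * p) // s + 1


width = 32  # hypothetical input width

same_3 = conv_output_size(width, f=3, p=1, s=1)      # F=3, P=1, S=1
same_5 = conv_output_size(width, f=5, p=2, s=1)      # F=5, P=2, S=1
no_padding = conv_output_size(width, f=3, p=0, s=1)  # without zero padding

print(same_3, same_5, no_padding)  # 32 32 30
```

With P=(F-1)/2 and S=1 the output keeps the input width, while the unpadded 3×3 convolution loses two pixels.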

Pooling Layer hyperparameters

  1. Cell size (F): The square cell size in which the map will be divided for pooling. FxF
  2. Step size (S): Cells are separated by S pixels

For each input image of size W×H×D, the pooling layer returns a matrix of dimensions Wp×Hp×Dp, where

Wp= (W-F)/S+1

Hp= (H-F)/S+1

Dp= D

For the pooling layer, F=2 and S=2 is widely chosen; with these values, 75% of the input pixels are eliminated. One can also choose F=3 and S=2. A larger cell size results in a larger loss of information, and is hence suitable only for very large input images.
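The 75% figure follows directly from the formulas: halving both the width and the height leaves a quarter of the values. A quick check, assuming a hypothetical 24×24 feature map:

```python
def pool_output_size(w, f, s):
    """Wp = (W - F) / S + 1 for one spatial dimension."""
    return (w - f) // s + 1


w = h = 24  # hypothetical feature map size

wp = pool_output_size(w, f=2, s=2)
hp = pool_output_size(h, f=2, s=2)
fraction_kept = (wp * hp) / (w * h)
print(wp, hp, fraction_kept)  # 12 12 0.25

overlapping = pool_output_size(w, f=3, s=2)  # the F=3, S=2 alternative
print(overlapping)  # 11
```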

General hyperparameters

  • Learning rate: Controls the step size of each weight update. Optimizers like SGD, AdaGrad or RMSProp can be chosen; AdaGrad and RMSProp adapt the effective learning rate during training.
  • Epochs: The number of epochs should be increased until a gap between training and validation error shows up.
  • Batch size: 16 to 128 can be selected, depending on the amount of processing power available.
  • Activation function: Introduces non-linearity to the model. ReLU is typically used for ConvNets. Other options are sigmoid and tanh.
  • Dropout: A dropout value of 0.1 drops 10% of the neurons. 0.5 is a good starting point; 0.25 is a good final option.
  • Weight initialisation: Weights can be initialised with small random values to reduce the possibility of dead neurons, but not so small that gradient descent stalls. A uniform distribution is suitable.
  • Hidden layers: The number of hidden layers can be increased as long as the test error keeps decreasing. Adding hidden layers increases computation and may require regularisation.


We now have the basic information needed to create a CNN from scratch. Although this article covers everything at a basic level, each parameter or layer can be explored in more depth. Understanding the maths behind every concept will also help in building better models.
