2 minutes to begin Convolutional Neural Network

5 min read · Nov 13, 2019

Hey all, this is for beginners who have some knowledge of neural networks (ANNs) and want to know what a CNN is. It gives only a brief view of CNNs and their important steps. I would like to thank superdatascience.com for making this much simpler for me to understand.

(This is very basic and at a very high level.) Convolutional Neural Network (CNN):

-It is a kind of neural network where the input is an image and the output is usually a label that identifies it (e.g. in the figure above, the label is “Happy”).

-The different attributes derived from the image are called “features”.

Example: In the following image you can see the person in two ways: as a man looking straight ahead, or as a person looking sideways. The brain can interpret the image either way. Consider these as two features. Apart from these, the eyes, nose, mouth, hair and all the other features are also computed. All the relevant features are then collected together (usually the full set of attributes lives in the hidden layers) and mapped to the final output label to be predicted. For example, given a full-body picture, a model that detects eyebrows, nose, eyes, mouth and so on will predict that the image contains a face.

In this way the network can extract different features from a single image. (Everything is stored as a NumPy array: a 3D array for a colour image and a 2D array for a B/W image.)
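To make the 2D versus 3D point concrete, here is a minimal sketch. The 4x4 size and the all-zero pixel values are made up purely for illustration; a real image would be loaded from a file.

```python
import numpy as np

# Hypothetical 4x4 images filled with zeros, only to show the array shapes.
gray_image = np.zeros((4, 4), dtype=np.uint8)       # 2D: height x width
color_image = np.zeros((4, 4, 3), dtype=np.uint8)   # 3D: height x width x 3 colour channels

print(gray_image.ndim, gray_image.shape)
print(color_image.ndim, color_image.shape)
```

The extra third axis in the colour image holds the red, green and blue values of each pixel.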

The four important steps to know are:

1) Convolution Layer:

Makes the image smaller. (Why? So that computation is easier and the processing time is reduced.)
Keeps the needed features and ignores the others. (The network can usually detect many features, but it should keep only the valid and required ones.)
Takes the input image as an array and multiplies it, element by element, with a 3x3 matrix (the feature detector, or filter), then sums the products to get the convolved image as shown below.

Explanation: Consider only the first 3x3 subset of the input image. Multiply it element by element with the feature detector and sum the products; the result goes in the first box of the matrix on the right. Now move the 3x3 window on the input image one column to the right (ignoring the first column), multiply again, and you get 1. The window keeps moving one step at a time, column-wise and row-wise, until the whole feature map is obtained. Note that there will be several feature maps for a single image, each corresponding to a different feature.
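The sliding-window step described above can be sketched in a few lines of NumPy. The 5x5 image, the diagonal feature detector and the function name `convolve2d` are assumptions made up for this example; they are not the values from the figure.

```python
import numpy as np

def convolve2d(image, feature_detector):
    """Slide the feature detector over the image one step at a time (no padding)."""
    k_rows, k_cols = feature_detector.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    feature_map = np.zeros((out_rows, out_cols), dtype=image.dtype)
    for r in range(out_rows):
        for c in range(out_cols):
            # Take the patch under the window, multiply element-wise, then sum.
            patch = image[r:r + k_rows, c:c + k_cols]
            feature_map[r, c] = np.sum(patch * feature_detector)
    return feature_map

# Hypothetical 5x5 black-and-white image (1 = pixel on, 0 = pixel off).
image = np.array([
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
])

# Hypothetical 3x3 feature detector that responds to diagonal lines.
feature_detector = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

feature_map = convolve2d(image, feature_detector)
print(feature_map)
# [[0 3 2]
#  [0 0 3]
#  [3 0 0]]
```

The 5x5 input becomes a 3x3 feature map, which is the "make the image smaller" effect, and the 3s mark the places where the diagonal pattern matches completely.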

2) ReLU Layer (Rectifier Layer):

This layer just passes the convolved image through a rectified linear unit (ReLU). Why? To increase the non-linearity in the image, since convolution on its own is a linear operation.

What does that mean in practice? Consider the image on the right. It is a convolved image of buildings. It contains black and darker black regions. These different shades of black correspond to different values (imagine the image as a matrix).

The image on the left is the result after the ReLU layer: all the black has been removed, i.e. set to zero. This is more suitable for training and prediction.
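In matrix terms, ReLU keeps every positive value and replaces every negative value with zero. A minimal sketch, assuming a small made-up feature map in which the negative numbers stand for the black regions:

```python
import numpy as np

# Hypothetical feature map; negative entries play the role of the black regions.
feature_map = np.array([
    [-3, 0, 2],
    [1, -1, 4],
    [-2, 5, -6],
])

# ReLU: element-wise max(value, 0).
rectified = np.maximum(feature_map, 0)
print(rectified)
# [[0 0 2]
#  [1 0 4]
#  [0 5 0]]
```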

3) Max Pooling Layer:

Shrinks the matrix further while preserving the features.
Helps prevent over-fitting as well. It removes many of the unwanted values and keeps only the important ones. How and why do we do this? Images are usually taken at different angles, shifted or rotated. For example, an image rotated slightly to the left should give the model the same features as the original one. The model should not detect new features in it. We do max pooling to reduce this problem.
For example, a cheetah can sit in different poses, and the feature maps should still be the same for a cheetah.

You can visualise pooling, fully connected layers and convolution at http://scs.ryerson.ca/~aharley/vis/conv/flat.html

The convolved image is taken, a 2x2 window is moved across it, and the maximum value inside the window at each position is written into the new pooled feature map. E.g. in the above calculation the value “4” in the 2x1 position stays the same even if the image is slightly rotated, because when the window reaches that position the maximum is still 4. (In the above image the window is the blue box; the maximum inside it is 2, so 2 is written into the pooled feature map.)
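Here is a minimal sketch of max pooling. The 4x4 feature map, the function name `max_pool` and the choice of moving the 2x2 window 2 cells at a time are all assumptions for illustration; the values are not the ones from the figure.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window (incomplete edge windows are dropped)."""
    rows = (feature_map.shape[0] - size) // stride + 1
    cols = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((rows, cols), dtype=feature_map.dtype)
    for r in range(rows):
        for c in range(cols):
            window = feature_map[r * stride:r * stride + size,
                                 c * stride:c * stride + size]
            pooled[r, c] = window.max()
    return pooled

# Hypothetical 4x4 feature map.
feature_map = np.array([
    [0, 1, 0, 0],
    [0, 4, 2, 1],
    [1, 2, 1, 0],
    [0, 1, 3, 1],
])

pooled = max_pool(feature_map)
print(pooled)
# [[4 2]
#  [2 3]]

# Move the 4 to a neighbouring cell inside the same window, as a small
# shift or rotation of the image might do.
shifted = feature_map.copy()
shifted[1, 1] = 0
shifted[0, 0] = 4

# The pooled feature map is unchanged, because the window maximum is still 4.
print(np.array_equal(max_pool(shifted), pooled))
# True
```

This only illustrates tolerance to small movements: if the 4 moved into a different window, the pooled feature map would change.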