Residual Networks (ResNet)

Pratima Rathore
6 min read · Oct 13, 2020


Until about 2014 (when GoogLeNet was introduced), the most significant improvements in deep learning had appeared in the form of increased network depth, from AlexNet (8 layers) to GoogLeNet (22 layers). Some other networks with around 30 layers were also introduced around that time.

Researchers observed that, for convolutional neural networks, deeper seemed to be better. Intuitively this makes sense: deeper models have a larger parameter space to explore, so they should be more flexible and more capable of fitting complex functions.

However, it was noticed that beyond a certain depth, performance degrades. This was one of the bottlenecks of VGG: it could not go as deep as desired, because it started to lose generalization capability.

This degradation is not caused by overfitting. If it were, the deeper network would have achieved a much lower training error while its test error stayed high; instead, the deeper plain networks show higher training error as well.

The key motivator for the ResNet architecture was the observation that, empirically, adding more layers was not improving results monotonically. This was counterintuitive, because a network with n + 1 layers should be able to learn at least what a network with n layers can learn, plus something more: in the worst case, the extra layer could simply learn the identity mapping.

ResNet

ResNet, short for Residual Network, is a classic neural network used as a backbone for many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The fundamental breakthrough with ResNet was that it allowed us to successfully train extremely deep neural networks with 150+ layers.

ResNet proposed a solution to the “vanishing gradient” problem.

This is because when the network is too deep, the gradients flowing back from the loss function shrink towards zero after repeated applications of the chain rule. As a result, the weights never update their values, and therefore no learning is performed. With ResNets, the gradients can flow directly back through the identity connections.
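A schematic way to see why the shortcut helps (treating everything as scalars for simplicity, which is my own simplification): if a residual block computes y = f(x) + x, then by the chain rule

    y = f(x) + x
    dy/dx = df/dx + 1

Even when df/dx becomes tiny, the "+1" contributed by the identity path gives the upstream gradient an unobstructed route back through the block.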

Idea behind ResNet

ResNet helps to build deeper neural networks by utilizing skip connections, or shortcuts, to jump over some layers 🏃‍♀️

To create a residual block, add a shortcut to the main path in the plain neural network.

Skip Connection:

The figure on the left stacks convolution layers one after the other. On the right we still stack convolution layers as before, but we now also add the original input to the output of the convolution block. This is called a skip connection.
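As a minimal sketch of the two figures (assuming PyTorch; the conv_block below stands in for any small stack of layers and is not a name from the article):

    import torch
    from torch import nn

    conv_block = nn.Sequential(                       # stands in for the stacked convolution layers
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
    )

    x = torch.rand(1, 64, 32, 32)
    out_plain = conv_block(x)        # left figure: plain stacking
    out_skip = conv_block(x) + x     # right figure: add the original input to the block's output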

ResNet solves this using identity shortcut connections, i.e. shortcuts that initially contribute nothing extra. Early in training, the skipped layers effectively reuse the activations from the previous layers, which collapses the network into only a few layers and speeds up learning. As training progresses, the skipped layers expand and help the network explore more of the feature space.

How does it happen? 🤷‍♀️

In this network we use a technique called skip connections. A skip connection bypasses a few layers and feeds its input directly into a later layer's output.

It is fairly easy to calculate a[l+2] knowing just the value of a[l].
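Written out in the article's notation, and assuming the shortcut skips two weight layers with activation g (a fully-connected sketch; the convolutional case is analogous):

    z[l+1] = W[l+1] a[l] + b[l+1]
    a[l+1] = g(z[l+1])
    z[l+2] = W[l+2] a[l+1] + b[l+2]
    a[l+2] = g(z[l+2] + a[l])        <- the shortcut adds a[l] before the activation

If W[l+2] and b[l+2] are pushed towards zero, this collapses to a[l+2] = g(a[l]), which equals a[l] when g is ReLU (since a[l] is already non-negative). So the block can fall back to the identity with almost no effort.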

The advantage of adding this type of skip connection is that if any layer hurts the architecture's performance, it can effectively be skipped, since regularization will push its weights towards zero. As a result, we can train very deep neural networks without the problems caused by vanishing/exploding gradients.

Residual Blocks 🧱

Let's understand this with an illustration.

We assume that the desired underlying mapping we want to obtain by learning is f(x), to be used as the input to the activation function on top. On the left of the figure, the portion within the dotted-line box must directly learn the mapping f(x). On the right, the portion within the dotted-line box only needs to learn the residual mapping f(x) − x, which is how the residual block derives its name.

If the identity mapping f(x) = x is the desired underlying mapping, the residual mapping is easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., a fully-connected or convolutional layer) within the dotted-line box to zero.

The right figure illustrates the residual block of ResNet, where the solid line carrying the layer input x to the addition operator is called a residual connection (or shortcut connection). With residual blocks, inputs can forward propagate faster through the residual connections across layers.

ResNet follows VGG's full 3×3 convolutional layer design. The residual block has two 3×3 convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function.

Then, we skip these two convolution operations and add the input directly before the final ReLU activation function.

This kind of design requires that the output of the two convolutional layers be of the same shape as the input, so that the two can be added together.
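Putting that description into code, here is a minimal PyTorch sketch of such a residual block (the class name and the num_channels argument are illustrative, not taken from the paper):

    import torch
    from torch import nn
    from torch.nn import functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convs, each followed by batch norm; the input is added back before the final ReLU."""
        def __init__(self, num_channels):
            super().__init__()
            self.conv1 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(num_channels)
            self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(num_channels)

        def forward(self, x):
            y = F.relu(self.bn1(self.conv1(x)))
            y = self.bn2(self.conv2(y))
            return F.relu(y + x)     # skip connection: add the input before the final ReLU

    x = torch.rand(4, 64, 56, 56)
    print(ResidualBlock(64)(x).shape)   # torch.Size([4, 64, 56, 56]): output shape matches the input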

Types of Residual Blocks

There are two main types of blocks used in ResNet, depending mainly on whether the input and output dimensions are the same or different.

  • Identity Block: When the input and output activation dimensions are the same.
  • Convolution Block: When the input and output activation dimensions are different from each other.

One important thing to note here is that the skip connection is applied before the ReLU activation, as shown in the diagram above. Research has found that this gives the best results.
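When the dimensions do differ, the usual trick is to project the input with a 1×1 convolution on the shortcut path so the shapes match again. A sketch in the same PyTorch style as above (class and attribute names are mine):

    import torch
    from torch import nn
    from torch.nn import functional as F

    class ConvBlock(nn.Module):
        """Residual block whose shortcut uses a 1x1 conv to match the changed channels/spatial size."""
        def __init__(self, in_channels, out_channels, stride=2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
            self.bn1 = nn.BatchNorm2d(out_channels)
            self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(out_channels)
            self.project = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)

        def forward(self, x):
            y = F.relu(self.bn1(self.conv1(x)))
            y = self.bn2(self.conv2(y))
            return F.relu(y + self.project(x))   # projected input, still added before the final ReLU

    x = torch.rand(1, 64, 56, 56)
    print(ConvBlock(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28]): channels doubled, spatial size halved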

ResNet-152

ResNet was the first network demonstrated to scale to hundreds or even thousands of layers while still outperforming shallower networks. Although newer architectures invented since its introduction in 2015 have beaten ResNet's performance, it is still a very popular choice for computer vision tasks.

A primary strength of the ResNet architecture is its ability to generalize well to different datasets and problems.

Conclusion

  • Learning an additional layer in a deep neural network as an identity function (though this is an extreme case) should be made easy.
  • The residual mapping can learn the identity function more easily, such as pushing parameters in the weight layer to zero.
  • We can train an effective deep neural network by having residual blocks. Inputs can forward propagate faster through the residual connections across layers.
  • ResNet had a major influence on the design of subsequent deep neural networks, both convolutional and sequential in nature.

If you liked the article, show your support by clapping for it. This article is essentially a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc., which can help beginners find all the topics in one place.

If you are also learning machine learning like me, follow me for more articles. Let's go on this trip together :)

You can also follow me on LinkedIn.
