Batch Normalization and its Advantages

Ramji Balasubramanian
3 min read · Mar 6, 2021

Recently, I was reading about NFNets, a state-of-the-art image-classification architecture from DeepMind that trains without normalization. Understanding what Batch Normalization does in deep neural networks, and what effects it has, was one of the key elements in understanding the NFNets work, so I thought I would share my understanding of Batch Normalization.

Training deep neural networks with many layers is challenging because they are sensitive to choices such as the initial random weights and the learning rate. One possible reason is that the distribution of the inputs to layers deep in the network may change after each mini-batch as the weights are updated by backpropagation. This change in the distribution of inputs to layers deep in the network is referred to as “internal covariate shift”.

What is Batch Normalization?

Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
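To make the definition concrete, here is a minimal NumPy sketch of the core operation: standardizing a layer's inputs across the mini-batch dimension so each feature has mean 0 and variance 1. The array shapes and values are my own toy example, not from the paper.

```python
import numpy as np

# Toy mini-batch of layer inputs: 32 examples, 64 features per example.
x = np.random.randn(32, 64) * 5.0 + 3.0    # deliberately not zero-mean / unit-variance

# Standardize each feature using statistics computed over the mini-batch.
mean = x.mean(axis=0)                      # per-feature mean, shape (64,)
var = x.var(axis=0)                        # per-feature variance, shape (64,)
x_hat = (x - mean) / np.sqrt(var + 1e-5)   # small epsilon avoids division by zero

print(x_hat.mean(axis=0).round(3))         # ~0 for every feature
print(x_hat.var(axis=0).round(3))          # ~1 for every feature
```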

Algorithm

Algorithm 1: the Batch Normalizing Transform, applied to activation x over a mini-batch (image from the paper).

The BN transform can be added to a network to manipulate any activation. The notation y = BN_γ,β(x) indicates that the parameters γ and β are to be learned, but note that the BN transform does not process the activation of each training example independently. Rather, BN_γ,β(x) depends both on the training example and on the other examples in the mini-batch. The scaled and shifted values y are passed on to other network layers. The normalized activations x̂ are internal to the transformation, but their presence is crucial: the distribution of values of any x̂ has expected value 0 and variance 1, as long as the elements of each mini-batch are sampled from the same distribution and we neglect ε. Each normalized activation x̂^(k) can be viewed as an input to a sub-network composed of the linear transform y^(k) = γ^(k) x̂^(k) + β^(k), followed by the other processing done by the original network. These sub-network inputs all have fixed means and variances, and although the joint distribution of the normalized x̂^(k) can change over the course of training, we expect that the introduction of normalized inputs accelerates the training of the sub-network and, consequently, of the network as a whole.
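Below is a hedged sketch of the transform described above: the inputs are normalized with the mini-batch statistics and then scaled and shifted by the learnable parameters γ and β. The function and variable names are my own; only the formula y = γ·x̂ + β comes from the paper.

```python
import numpy as np

def batch_norm_transform(x, gamma, beta, eps=1e-5):
    """BN transform over one mini-batch.

    x:     (batch, features) pre-activations
    gamma: (features,) learnable scale (gamma in the paper)
    beta:  (features,) learnable shift (beta in the paper)
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: mean 0, variance 1
    return gamma * x_hat + beta            # scale and shift: y = gamma * x_hat + beta

# Usage: gamma is typically initialized to 1 and beta to 0, then both are
# updated by backpropagation like any other network parameters.
x = np.random.randn(32, 64)
y = batch_norm_transform(x, gamma=np.ones(64), beta=np.zeros(64))
```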

Accelerating Batch Normalization Networks

Simply adding Batch Normalization to a network does not take full advantage of the method. To do so, the authors applied the following modifications (a rough training-setup sketch follows the list):

  • Increase the learning rate: In a batch-normalized model, the authors were able to achieve a training speedup from higher learning rates, with no ill side effects.
  • Remove Dropout: The Google research team found that removing Dropout from BN-Inception allows the network to achieve higher validation accuracy. They conjecture that Batch Normalization provides regularization benefits similar to Dropout, since the activations observed for a training example are affected by the random selection of examples in the same mini-batch.
  • Shuffle training examples more thoroughly: The team enabled within-shard shuffling of the training data, which prevents the same examples from always appearing together in a mini-batch. This led to about a 1% improvement in validation accuracy, which is consistent with the view of Batch Normalization as a regularizer: the randomization inherent in the method should be most beneficial when it affects an example differently each time it is seen.
  • Reduce the L2 weight regularization: While in Inception an L2 loss on the model parameters controls overfitting, in the modified BN-Inception the weight of this loss is reduced by a factor of 5. The authors found that this improves accuracy on the held-out validation data.
  • Accelerate the learning rate decay: In training Inception, the learning rate was decayed exponentially. Because the batch-normalized network trains faster than Inception, the learning rate is decayed 6 times faster.
  • Remove Local Response Normalization: While Inception and other networks (Srivastava et al., 2014) benefit from it, the authors found that with Batch Normalization it is not necessary.
  • Reduce the photometric distortions: Because batch-normalized networks train faster and observe each training example fewer times, the trainer is allowed to focus on more “real” images by distorting them less.
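As a rough illustration of how these modifications might translate into code, here is a hedged PyTorch sketch of a training setup in this spirit: a batch-normalized model with no Dropout and no Local Response Normalization, a higher learning rate, reduced weight decay, a faster exponential decay schedule, and a shuffled data loader. The architecture and the specific hyperparameter values are placeholders of my own, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical batch-normalized network: Conv -> BN -> ReLU, no Dropout, no LRN.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),          # BN after the convolution, before the nonlinearity
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

# Higher learning rate and reduced L2 weight regularization (weight_decay).
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9, weight_decay=1e-5)

# Exponential learning-rate decay; a smaller gamma gives a faster decay schedule.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.94)

# A DataLoader with shuffle=True re-shuffles the training set every epoch, so the
# same examples do not always appear together in a mini-batch, e.g.:
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```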

This article is a compilation of the key points from the paper. I hope it helps you understand Batch Normalization better and faster.

Reference

Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43442.pdf
