Regularization

Ramji Balasubramanian
4 min read · Mar 7, 2021

“Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as regularization.” — Goodfellow et al.

In neural networks, obtaining a generalized model by choosing the right set of parameters (weights and biases) is key to lessening the effect of overfitting. How can we obtain such parameters? The answer is regularization. Regularization can be split into two buckets.

  • Implicit Regularization Methods: Data augmentation and early stopping
  • Explicit Regularization Methods: Dropout

There are various types of regularization techniques, such as L1 regularization, L2 regularization (commonly called “weight decay”), and Elastic Net, which work by updating the loss function itself, adding an extra term that constrains the capacity of the model.
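As a minimal sketch of these three penalties, here is how each could be computed on a small weight matrix with NumPy (the values of `W`, `lam`, and `alpha` below are illustrative, not from the article):

```python
import numpy as np

# Hypothetical weight matrix, for illustration only.
W = np.array([[0.5, -1.2],
              [2.0,  0.3]])

lam = 0.01  # regularization strength (lambda)

# L1: penalizes the sum of absolute weight values (encourages sparsity).
l1_penalty = lam * np.sum(np.abs(W))

# L2 ("weight decay"): penalizes the sum of squared weight values.
l2_penalty = lam * np.sum(W ** 2)

# Elastic Net: a weighted mix of the L1 and L2 penalties.
alpha = 0.5  # mixing ratio between L1 and L2
elastic_net = lam * (alpha * np.sum(np.abs(W))
                     + (1 - alpha) * np.sum(W ** 2))
```

Each penalty is then added to the data loss, scaled by the hyperparameter λ discussed later in the article.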

Why do we need Regularization?

Regularization helps us control our model capacity, ensuring that our models are better at making (correct) classifications on data points that they were not trained on, which we call the ability to generalize. If we don’t apply regularization, our classifiers can easily become too complex and overfit to our training data, in which case we lose the ability to generalize to our testing data (and data points outside the testing set as well, such as new images in the wild).

However, too much regularization can be a bad thing. We can run the risk of underfitting, in which case our model performs poorly on the training data and is not able to model the relationship between the input data and output class labels.

blue line: overfit, orange line: underfit

The goal of regularization is to obtain these types of “green functions” that fit our training data nicely, but avoid overfitting to our training data (blue) or failing to model the underlying relationship (orange).

Loss and Weight Update to include Regularization

The loss over the entire training set can be written as:
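The equation here did not survive the export as an image; a standard form, consistent with the per-example losses discussed below, is:

```latex
L = \frac{1}{N} \sum_{i=1}^{N} L_i
```

where \(L_i\) is the loss on the \(i\)-th training example (for instance a hinge or cross-entropy loss) and \(N\) is the number of training examples.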

Now, let’s say that we have obtained a weight matrix W such that every data point in our training set is classified correctly, which means that our loss L = 0. Awesome, we’re getting 100% accuracy — but let me ask you a question about this weight matrix — is it unique? Or, in other words, are there better choices of W that will improve our model’s ability to generalize and reduce overfitting?

If there is such a W, how do we know? And how can we incorporate this type of penalty into our loss function? The answer is to define a regularization penalty, a function that operates on our weight matrix. The regularization penalty is commonly written as a function, R(W).

L2 Regularization

To mitigate the outsized effect that a few large-valued dimensions can have on our output classifications, we apply regularization, thereby seeking W values that take all of the dimensions into account rather than just the few with large values. In practice you may find that regularization hurts your training accuracy slightly, but actually increases your testing accuracy.
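The regularized loss was also an image in the original; a standard form of the L2-regularized loss, using the penalty function R(W) defined earlier, is:

```latex
L = \frac{1}{N} \sum_{i=1}^{N} L_i \;+\; \lambda \, R(W),
\qquad
R(W) = \sum_{i}\sum_{j} W_{i,j}^{2}
```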

The second term is new — this is our regularization penalty. The λ variable is a hyperparameter that controls the amount or strength of the regularization we are applying. In practice, both the learning rate α and the regularization term λ are the hyperparameters that you’ll spend the most time tuning.

weight updating during backpropagation
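The update rule itself was an image in the original; with learning rate α, a standard gradient-descent step under the L2 penalty above is:

```latex
W \leftarrow W - \alpha \left( \nabla_W L_{\text{data}} + 2\lambda W \right)
```

where \(\nabla_W L_{\text{data}}\) is the gradient of the data loss and the \(2\lambda W\) term comes from differentiating the squared-weight penalty.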

Here gradient descent adds a negative linear term (proportional to −W) to each weight update, penalizing large weights, with the end goal of making it easier for our model to generalize.
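A minimal sketch of this update in NumPy (the weight values are illustrative, and the data-loss gradient is set to zero here to isolate the decay term):

```python
import numpy as np

# Illustrative values; in practice these come from backpropagation.
W = np.array([[1.0, -2.0],
              [0.5,  3.0]])
grad_data = np.zeros_like(W)  # pretend the data-loss gradient is zero

alpha, lam = 0.1, 0.01  # learning rate and regularization strength

# L2-regularized gradient step: the extra 2 * lam * W term shrinks
# every weight toward zero by a factor of (1 - 2 * alpha * lam).
W_new = W - alpha * (grad_data + 2 * lam * W)
```

With a zero data gradient, each step simply multiplies the weights by 0.998, which is why L2 regularization is often called “weight decay.”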

Reference:

Deep Learning for Computer Vision with Python by Adrian Rosebrock (Starter Bundle)
