“Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as regularization.” — Goodfellow et al.

In neural networks, obtaining a generalized model by choosing the right set of parameters (weights and biases) is a key factor in lessening the effect of overfitting. How can we obtain such parameters? The answer is regularization. Regularization can be split into two buckets.

  • Implicit Regularization Methods: Data augmentation and early stopping
  • Explicit Regularization Methods: Dropout

There are various types of regularization techniques, such as L1 regularization, L2…
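As a quick taste before the details, here is a minimal sketch, assuming TensorFlow/Keras (the layer sizes and coefficients below are illustrative choices of mine, not values from any article), showing an explicit L2 weight penalty alongside dropout:

```python
import tensorflow as tf

# A tiny fully connected network with two common regularizers:
# an L2 penalty on the Dense weights and a Dropout layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu", input_shape=(784,),
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 weight decay
    tf.keras.layers.Dropout(0.5),  # randomly zero 50% of activations during training
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```

The L2 term is added to the training loss, while dropout only perturbs the forward pass at training time; both push the model toward lower test error, possibly at the expense of training error.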


Recently, I was reading about NFNets, a state-of-the-art image classification architecture from DeepMind that works without normalization. Understanding what batch normalization does in deep neural networks, and what its effects are, was one of the key elements in understanding the NFNets algorithm, so I thought of sharing my understanding of batch normalization.

Training a deep neural network with many layers is challenging because such networks are sensitive to choices like the initial random weights and the learning rate. One possible reason is that the distribution of the inputs to layers deep in the network may change after each mini-batch as backpropagation updates the weights. …
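To make the mechanics concrete, here is a minimal NumPy sketch of the batch-normalization forward pass (my own illustration, not code from the NFNets paper): each feature is normalized using the mini-batch mean and variance, then rescaled by learnable parameters gamma and beta:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (batch_size, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 4) * 5 + 3         # a shifted, scaled mini-batch
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```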


Gradient descent is a first-order optimization algorithm used to learn the weights of a classifier. However, the vanilla implementation of gradient descent is computationally slow to reach the global minimum. If you want to read more about gradient descent, see my previous article, “Understanding Gradient Descent”.

Stochastic gradient descent (SGD) is a simple modification to standard gradient descent in which the weight matrix W is updated for each small batch of the dataset instead of only once at the end of every epoch.

Mini-batch SGD

Reviewing the gradient descent algorithm, it should be obvious that the method will run very slowly on large datasets. The…
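Here is a minimal NumPy sketch of the mini-batch update loop (a toy linear model of my own, not code from the article): rather than one gradient step per pass over all of X, the weights W are updated once per batch:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    """Fit linear weights with mini-batch SGD on squared error."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)                    # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ W - yb) / len(batch)  # gradient on this batch only
            W -= lr * grad                                # update after EVERY batch
    return W
```

With many updates per epoch, the loss decreases much faster in wall-clock time on large datasets, at the cost of noisier individual steps.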


R-CNN is a region-based object detection algorithm developed by Girshick et al. at UC Berkeley in 2014. Before jumping into the algorithm, let's try to understand what object detection actually means and how it differs from image classification.


We have been studying matrix equations like Ax = b since school, but have we ever realized how they simplify the linear regression problem (y = mx + c)? Today we are going to use matrix operations to find the slope (m) and intercept (c) for predicting car resale values.

Before getting into the problem, let us recap Ax = b with an example, as shown below.

where A is a matrix of size m × n that is linearly combined with x (n × 1) to obtain the target vector b (m × 1).

How does this relate to the linear regression problem?

In our car sales data, we have…
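As a sketch of where this is heading (using made-up toy numbers, not the actual car sales data), we stack a column of ones next to the feature to form A, so that A[c, m]ᵀ = y is exactly y = mx + c, and solve it in the least-squares sense with NumPy:

```python
import numpy as np

age = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # car age in years (toy data)
price = np.array([9.0, 8.1, 7.2, 6.1, 5.0])    # resale value (toy data)

A = np.column_stack([np.ones_like(age), age])  # A = [1, x] so A @ [c, m] = c + m*x
(c, m), *_ = np.linalg.lstsq(A, price, rcond=None)  # least-squares solution
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```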


If you were dropped in some part of the Himalayas around 8 AM, how would you get to a safe place before sunset? Gradient descent is the idea behind your thought process: starting from an initial position, you repeatedly pick the direction to move in, so that you reach the safe place on or before time.
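In code, that thought process is just "repeatedly step downhill". A minimal one-dimensional sketch (my own toy example, minimizing f(x) = x²):

```python
def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Repeatedly move against the gradient (i.e., downhill)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step in the direction of steepest descent
    return x

# Minimize f(x) = x**2, whose gradient is f'(x) = 2x.
print(gradient_descent(lambda x: 2 * x, x0=10.0))  # converges toward 0
```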

In my previous articles, I explained parameterized learning and basic loss function techniques. If you have not read them, do so for a better understanding of gradient descent.

In Deep Learning…


On 27 December 2020, I was travelling to my native town for vacation. During the trip we crossed four toll plazas, and I wondered how FASTag works. Coincidentally, in today's news the Indian government has mandated the use of FASTag from January 1 for all vehicles passing through toll plazas. Nitin Gadkari, Union Minister for Road Transport, Highways and MSMEs, has confirmed that FASTags will be mandatory for paying at toll plazas, enabling contactless as well as electronic toll payments.

What is FASTag?

A FASTag is a sticker that is attached to the windshield of your…


In the early 2000s, my father used to drive me to my home town for vacation, a trip of approximately 8 hours from where I live. During the journey, my primary role was to sit beside him and wake him up whenever his eyes started to close. What if I were not near him? Isn't that scary? Yeah, it was. After reading a case study from OpenCV showing how a small camera can keep track of your eyes, I felt happy and imagined how helpful it would be in avoiding accidents on highways.

Steps to track your eye

  1. Detect your face
  2. Crop the face…
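A minimal sketch of the first two steps, using the Haar cascades that ship with OpenCV (the image path is a hypothetical placeholder and the detection parameters are illustrative, not from the case study):

```python
import cv2

# Standard Haar cascade files bundled with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

frame = cv2.imread("driver.jpg")  # hypothetical input frame from the camera
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Step 1: detect the face.
for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
    # Step 2: crop the face region, then look for eyes inside it.
    face_roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_roi)
    print(f"face at ({x}, {y}), {len(eyes)} eye(s) found")
```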


Why is the loss function so important in machine learning applications? In my last article we discussed parameterized learning; such learning takes some data and class labels as input and learns a function that maps the input to predicted class labels by defining a few parameters, such as weights and biases.

First, we will see what a loss curve looks like and build some intuition before getting into the SVM and cross-entropy loss functions.

Loss curve

The graph above shows the loss curves for two different models trained on the CIFAR-10 dataset. …
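Before the details, here is a minimal NumPy sketch of the two losses for a single example (my own illustration; the scores and the label are made-up values):

```python
import numpy as np

scores = np.array([2.0, 1.0, -0.5])  # raw class scores for one example (toy values)
y = 0                                # index of the correct class

# Multi-class SVM (hinge) loss: penalize classes whose score is
# within a margin of 1 of the true class's score.
margins = np.maximum(0, scores - scores[y] + 1)
margins[y] = 0
hinge_loss = margins.sum()

# Cross-entropy loss: softmax the scores, then take -log of the
# probability assigned to the true class.
exp = np.exp(scores - scores.max())  # shift scores for numerical stability
probs = exp / exp.sum()
ce_loss = -np.log(probs[y])

print(hinge_loss, ce_loss)
```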


Parameterized learning is a learning model that summarizes the data with a set of parameters of fixed size. No matter how much data you throw at the parametric model, it won't change its mind about how many parameters it needs. — Russell and Norvig (2009)

A parameterized model includes four major components: data, a scoring function, a loss function, and weights and biases. We are going to see each component with a practical example.

Data

Data is the input component that we are going to use to train our model. It includes the input examples as well as their corresponding class labels (supervised learning data).

Our dataset includes three different classes: cats, dogs, and pandas (each with 1,000 images), but in this article we are not going to see…
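For the scoring component mentioned above, the standard linear form is f(x, W, b) = Wx + b. Here is a minimal NumPy sketch with randomly initialized parameters (the shapes assume a flattened 32×32×3 image and our three classes; the values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3072))  # one row of weights per class (cats, dogs, pandas)
b = rng.normal(size=3)          # one bias per class
x = rng.normal(size=3072)       # a flattened 32x32x3 input image (placeholder)

scores = W @ x + b              # scoring function: f(x, W, b) = Wx + b
print("predicted class:", ["cats", "dogs", "pandas"][int(np.argmax(scores))])
```

Training then consists of using the loss function to judge these scores and nudging W and b toward better values.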

Ramji Balasubramanian

Machine Learning Engineer
