Gradient descent vs mini-batch gradient descent

The complexity of both algorithms looks the same to me, i.e. (n^2)*m, where n is the number of features and m is the sample size. So what's the difference?

Hi Shubham,
This is a very common question that comes up when studying gradient descent.

So, the time complexity you have given is for the specific code you have implemented. But we don't decide whether mini-batch gradient descent or batch gradient descent is faster based on a particular implementation, because it is possible to implement gradient descent without a single for loop; this is called a vectorized implementation.
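For example, here is a minimal sketch of one fully vectorized gradient step, assuming NumPy and a plain linear-regression model with a squared-error loss (your course code may look different); notice there is no for loop over the samples at all:

    import numpy as np

    def vectorized_gradient_step(X, y, w, lr=0.01):
        m = X.shape[0]              # number of training examples
        errors = X @ w - y          # all m predictions and errors at once
        grad = (X.T @ errors) / m   # gradient over the whole dataset
        return w - lr * grad        # single weight update

    # Example usage on random data, just to show the shapes
    X = np.random.randn(500, 3)     # m = 500 samples, n = 3 features
    y = np.random.randn(500)
    w = vectorized_gradient_step(X, y, np.zeros(3))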

As you know, there are 3 variants of GD:

  • SGD: it calculates the error and updates the model for each example in the training dataset.
    Since it updates the weights on every single example, it is quite fast compared to the others in terms of how often the weights are updated. But it can also get stuck in a local minimum.
    However, updating the model so frequently is more computationally expensive overall than the other variants of gradient descent, and training on large datasets can take significantly longer.

  • Batch GD: it calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated. One cycle through the entire training dataset is called a training epoch.
    Since the weights are updated only once per pass over the training data, it is computationally more efficient than SGD. But model updates may become very slow for large datasets.

  • Mini-batch GD: it splits the training dataset into small batches that are used to calculate the model error and update the model coefficients.
    Advantage: the model update frequency is higher than batch gradient descent, which allows for more robust convergence and helps avoid local minima.
    The batched updates are also computationally more efficient than stochastic gradient descent.

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

So, mini-batch GD takes the advantages of both the other variants of GD. But batch_size is a hyperparameter that we need to set wisely; the sketch below shows how this one number decides which variant you are actually running.
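As a rough illustration (again assuming NumPy and a squared-error linear model; the function name train and its parameters are just made up for this sketch, not something from your course code), one training loop can express all three variants, and batch_size alone decides which one you get:

    import numpy as np

    def train(X, y, lr=0.01, epochs=10, batch_size=32):
        m, n = X.shape
        w = np.zeros(n)
        for epoch in range(epochs):
            idx = np.random.permutation(m)           # shuffle once per epoch
            for start in range(0, m, batch_size):
                batch = idx[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                errors = Xb @ w - yb                 # errors for this batch only
                grad = (Xb.T @ errors) / len(batch)  # gradient over the batch
                w -= lr * grad                       # the weight update happens HERE
        return w

    # batch_size = 1      -> SGD       : m updates per epoch
    # batch_size = m      -> batch GD  : 1 update per epoch
    # 1 < batch_size < m  -> mini-batch GD (common choices: 32, 64, 128)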

Final note: it is not about the for-loop implementation; it is how often the weights are updated that plays the major role.

I hope I’ve cleared your doubt. Please rate your experience here;
your feedback is very important. It helps us improve our platform and provide you
the learning experience you deserve.

On the off chance that you still have some questions or do not find the answer satisfactory, you may reopen
the doubt.