What is the concept behind this?

Since we are still traversing the whole dataset, just in parts, how does that improve the time spent?

Hi @Deepanshu_garg,

Suppose we have 1000 examples in our dataset.
In batch gradient descent, it goes through all 1000 examples, calculates the average error and gradient, and only then updates the parameters for the first time. It then goes through all the training examples once again before updating the weights a second time… So you can see it is quite slow at updating the parameters, in other words slow to learn. But the updates it makes are very accurate, and the loss will always decrease, since every example has been taken into account.
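
Here is a minimal sketch of one way this looks in code, assuming a toy linear-regression problem with 1000 examples (the data, learning rate, and squared-error loss are illustrative choices, not something from the original post):

```python
import numpy as np

# Toy dataset: 1000 examples, 3 features (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)   # parameters
lr = 0.1          # learning rate (assumed value)

for epoch in range(100):
    # One single parameter update per full pass over all 1000 examples
    grad = X.T @ (X @ w - y) / len(X)   # average gradient over the whole dataset
    w -= lr * grad
```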

On the other hand, with mini-batch gradient descent (batch size = 32), it considers 32 examples and updates the weights, then takes the next 32 examples and updates again, and so on. So the learning is quite fast compared to batch gradient descent. The path might fluctuate around the global minimum, but it will still take you to its neighbourhood.
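
Reusing the same toy setup, the inner loop changes so that an update happens after every 32 examples (again just a sketch, with an assumed learning rate and the batch size of 32 from above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(100):
    idx = rng.permutation(len(X))              # shuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient over just 32 examples
        w -= lr * grad                         # ~32 updates per pass over the data
```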

Another variant, called stochastic gradient descent, updates the weights after every single example. This is the fastest at learning, but since each update is based on only one example, you get even more fluctuation.
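
In the same sketch, stochastic gradient descent shrinks the batch down to a single example (the smaller learning rate here is an assumption, just to keep the single-example updates stable):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr = 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):
        xi, yi = X[i], y[i]
        grad = xi * (xi @ w - yi)   # gradient from a single example
        w -= lr * grad              # 1000 updates per pass over the data
```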

Even so, all three will take you towards the minimum…

From a practical point of view, mini-batch gradient descent is the most popular, since it gives you close to the speed of stochastic gradient descent while its updates are almost as accurate as those of batch gradient descent…

This image summarizes what I have said here.