Mini Batch Gradient Descent

Is the mini-batch just a trick to divide the summation in order to optimize computation?

For example: if I have 400 samples,
then that summation is further split into 4 mini-batches of 100 samples each.

And finally, we add everything up to get the total gradient.

Hi @mananaroramail,

In mini-batch Gradient Descent, the gradient calculated for a single mini-batch is used to update the weights the moment it is calculated. We do not calculate the gradients for all the mini-batches before updating the original weights, because that would defeat the purpose of introducing mini-batch Gradient Descent.
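A minimal sketch of that idea (hypothetical names, plain NumPy, a squared loss just for illustration): the weights are updated inside the mini-batch loop, right after each batch's gradient is computed.

import numpy as np

def minibatch_gd(X, Y, batch_size=100, lr=0.01, epochs=10):
    m, n = X.shape
    W, b = np.zeros(n), 0.0
    for _ in range(epochs):
        ids = np.random.permutation(m)            # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = ids[start:start + batch_size]
            # gradient of the mean squared error on this mini-batch only
            err = X[batch].dot(W) + b - Y[batch]
            gradw = X[batch].T.dot(err) / len(batch)
            gradb = err.mean()
            # update immediately, before the next mini-batch is seen
            W -= lr * gradw
            b -= lr * gradb
    return W, b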

Hope this helps!

But, in the code:

for batch_start in range(0, m, batch_size):

    # Assume zero gradient for the batch
    gradw = 0
    gradb = 0

    # Iterate over all samples in the mini-batch
    for j in range(batch_start, batch_start + batch_size):

        if j < m:
            idx = ids[j]

            ti = Y[idx] * (np.dot(W, X[idx].T) + b)

            if ti > 1:
                gradw += 0
                gradb += 0
            else:
                gradw += self.c * Y[idx] * X[idx]
                gradb += self.c * Y[idx]

    # Gradient Descent update using this mini-batch's gradient
    # (the -learning_rate*W term comes from the regularization part of the loss)
    W = W - learning_rate * W + learning_rate * gradw
    b = b + learning_rate * gradb

The total number of samples taken is 400.
The batch size is kept at 100.
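The outer range() call therefore produces four batch start indices (a quick check):

list(range(0, 400, 100))   # -> [0, 100, 200, 300]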

So, after shuffling of ids and all that stuff,
we come to this loop:

for batch_start in range(0, m, batch_size):

Then, we take the first 100 ids (index 0 to 99), compute the gradient for them, and update the weights accordingly.

Then we jump to the second mini-batch (index 100 to 199) and update again.

and so on…

Therefore, we could have done it directly,
like it was done in the Linear Regression algorithm!
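By "directly" I mean something like this rough sketch (same hinge-loss gradient and variable names as the code above): accumulate the gradient over all m samples and update only once per pass.

# Batch Gradient Descent: one update per pass over the data
gradw = 0
gradb = 0
for idx in ids:                                   # all m samples
    ti = Y[idx] * (np.dot(W, X[idx].T) + b)
    if ti <= 1:                                   # only margin violators contribute
        gradw += self.c * Y[idx] * X[idx]
        gradb += self.c * Y[idx]
W = W - learning_rate * W + learning_rate * gradw
b = b + learning_rate * gradb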

According to what I found when searching about mini-batch Stochastic Gradient Descent:

We take a random sample from a very big dataset and treat its gradient as if it were the gradient over the whole dataset (to keep the computation light, which also introduces noise).

The assumption we make while applying Gradient Descent on mini-batches is that d(loss_sample)/dW is approximately equal to d(loss_dataset)/dW.
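For example, a quick toy check of that assumption (made-up data and a squared loss, just to compare the two gradients side by side):

import numpy as np

np.random.seed(0)
X = np.random.randn(400, 3)
Y = X.dot(np.array([1.0, -2.0, 0.5])) + 0.1 * np.random.randn(400)
W = np.zeros(3)

def grad(Xs, Ys, W):
    # gradient of the mean squared error with respect to W
    return 2 * Xs.T.dot(Xs.dot(W) - Ys) / len(Ys)

print(grad(X, Y, W))              # gradient over the whole dataset
print(grad(X[:100], Y[:100], W))  # gradient over one mini-batch of 100: close, but not identical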

This is different from updating directly because, as we can see, W is already different by the time the second mini-batch is processed, and again for the third, and so on. The model improves after every batch (each mini-batch is treated as a representative of the whole dataset). In Batch Gradient Descent, the model stays the same until all the data samples have been evaluated.
This has plenty of benefits:
Computational Efficiency: Many times, the whole dataset cannot fit in memory.
Increased Update Speed: You can update the weights faster (without waiting for all the batches to be processed).
The only downside is increased fluctuation during training (the noise you mentioned).
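To put numbers on the update speed point, with your 400 samples and a batch size of 100:

m, batch_size = 400, 100
minibatch_updates = len(range(0, m, batch_size))   # 4: W changes four times per pass over the data
batch_updates = 1                                   # Batch GD: W changes only once per pass
print(minibatch_updates, batch_updates)             # -> 4 1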

Ok, so in layman's terms it is just normal gradient descent, but we divide it into subgroups for the above-mentioned benefits.

Yes, exactly Manan. And if we select the batch size as 1, then it is called Stochastic Gradient Descent.
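For example, with batch_size = 1 the same loop (same assumed variable names as your code) reduces to one update per sample:

batch_size = 1
for batch_start in range(0, m, batch_size):        # m iterations, one sample each
    idx = ids[batch_start]
    ti = Y[idx] * (np.dot(W, X[idx].T) + b)
    gradw = self.c * Y[idx] * X[idx] if ti <= 1 else 0
    gradb = self.c * Y[idx] if ti <= 1 else 0
    # the weights are updated after every single sample
    W = W - learning_rate * W + learning_rate * gradw
    b = b + learning_rate * gradb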