In the online theory video “MLP08 - Vectorizing Backpropagation for m examples”, the gradient of the loss with respect to the bias is given as dL/db = (1/m) * np.sum(delta, axis=0), i.e. we take the average gradient over all m examples.
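Just to state my understanding of the theory version, here is a minimal sketch of that computation for a single layer (the shapes, names, and random data are placeholders I picked, not the course code):

import numpy as np

# Minimal sketch of the averaged gradients from MLP08, assuming one layer:
# X has shape (m, n_in) and delta has shape (m, n_out). All values are placeholders.
m, n_in, n_out = 4, 3, 2
X = np.random.randn(m, n_in)
delta = np.random.randn(m, n_out)        # error term for this layer

dW = np.dot(X.T, delta) / float(m)       # averaged weight gradient, shape (n_in, n_out)
db = np.sum(delta, axis=0) / float(m)    # averaged bias gradient, shape (n_out,)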
In the online video “NN- Implementation BackPropogation”, this gradient was computed as:
dw1 = np.dot(X.T,delta1)
db1 = np.sum(delta1,axis=0)/float(m)
But in the video “NN-Training Your Model”, the above code was changed to:
dw1 = np.dot(X.T,delta1)
db1 = np.sum(delta1,axis=0)
When I kept the float(m), my loss curve fluctuated and accuracy was low, but when I removed the float(m), I got exactly the same result as in the “NN-Training Your Model” video.
Why did this happen? Why are we not taking the average gradient as explained in the theory, but just the sum of the errors?
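For reference, here is a minimal sketch of the two versions of db1 side by side (the shapes, data, and learning rate lr are all placeholders I made up); as far as I can tell they differ only by the constant factor 1/m:

import numpy as np

# Minimal sketch comparing the two versions of db1 I tried; m, delta1 and lr
# are placeholders, not values from the course code.
m, n_out = 4, 2
delta1 = np.random.randn(m, n_out)
lr = 0.1

db1_avg = np.sum(delta1, axis=0) / float(m)   # averaged gradient (theory video)
db1_sum = np.sum(delta1, axis=0)              # plain sum ("NN-Training Your Model")

# The two differ only by the constant factor 1/m, so with the same lr
# the sum version effectively takes an m-times larger update step.
print(np.allclose(db1_sum, m * db1_avg))      # prints True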