At 21:20 in the video, why are we dividing dw by the total number of examples, i.e. float(m)? I didn't understand the reason.
Code not understood
Hey @saksham_thukral, it's a way of normalizing the gradient. We divide the gradients by float(m) so that the size of each update does not depend on how many training examples we have, which means we don't have to retune the learning rate for every dataset size.
For example, suppose one training dataset contains 100 examples and a second one contains 1B examples, and in both cases we set the learning rate to 0.01.
- If we do not divide by float(m), the model works fine for 100 examples, but for 1B examples the summed gradient can become very large. That causes sudden jumps in the parameters, so the model may overshoot and never reach the optimal parameters.
- If we do divide by float(m), the gradient becomes an average per example, so the same learning rate works in both cases and the model converges to the optimal parameters (see the sketch below).
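Here is a minimal sketch of the idea, assuming linear regression with mean squared error; the function and variable names are only for illustration and are not the exact code from the video:

```python
import numpy as np

def gradient_step(X, y, w, b, lr=0.01):
    """One gradient-descent step for linear regression (MSE loss).

    Dividing by float(m) turns the summed gradient into an average
    per example, so the step size stays comparable whether m is 100
    or 1 billion.
    """
    m = X.shape[0]                    # number of training examples
    y_hat = X.dot(w) + b              # predictions, shape (m,)
    err = y_hat - y                   # residuals, shape (m,)

    dw = X.T.dot(err) / float(m)      # average gradient w.r.t. weights
    db = np.sum(err) / float(m)       # average gradient w.r.t. bias

    w = w - lr * dw
    b = b - lr * db
    return w, b
```

With the division, dw and db stay on roughly the same scale no matter how many examples you have, so a learning rate like 0.01 behaves consistently. Without it, the gradients for the 1B-example dataset would be roughly m times larger than for the 100-example dataset, and the updates would blow up unless you shrank the learning rate accordingly.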
Hope this cleared your doubt.
Don’t forget to mark it as resolved