What is meant, logically, by saying a function is not differentiable (without giving the mathematical definition)? Here Prateek Bhaiya said that y = mod(x) is not differentiable, so he started discussing squaring of the equation. Why is squaring a better option to consider for a good measure of error? And what problem would we face if we considered the mod function itself?
Good Measure Of Error - Linear Regression
Let's first answer why squared error is a good measure of performance. There are actually multiple interpretations of this; here I will provide two of them. If you go through the Multivariate Regression ML Notes - Linear & Logistic Regression, you will find the probabilistic interpretation of regression, where it is shown (with proof) that maximizing the likelihood of the observed data is equivalent to minimizing the squared error.
Secondly, as you wanted, let's keep the math away and get the logic on the table (although you should know, math is the supreme logic).
In machine learning, datasets almost always contain some sort of noise. In such a situation, we cannot get a 0-error model. So we must keep in mind that there will be some error accompanying our model's predictions; rephrasing, there will be some mistakes our model makes while predicting target values. Now comes the tricky part: how do you want to manage those errors, i.e., how do you want your model to react to them?
One option is to let the model treat all errors as they are, small or large, while training. Or, you can let the model work harder on places where the error is very large and silently rule out small errors.
Logically, the second one seems better, keeping in mind that the predictions will always have some error.
Let's keep this aside and do some calculations:
Let y1 = 0.2 and y2 = 2. Then,
y1² = 0.04 while y2² = 4
Also,
|y1| = 0.2 and |y2| = 2.
If we look at these calculations, we can see that concentrating on large errors, which is what we logically wanted, can be accomplished by squaring the error values: an error with a low value is reduced even further, while high error values are escalated. Hence we use the squared error loss.
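To see the same effect over a range of values, here is a tiny sketch (the specific error values are my own, chosen just for illustration) that prints the absolute and squared versions side by side:

```python
# A minimal sketch showing how squaring shrinks small errors
# and magnifies large ones compared to the absolute error.

errors = [0.2, 0.5, 1.0, 2.0, 5.0]

print(f"{'error':>7} {'|error|':>9} {'error^2':>9}")
for e in errors:
    print(f"{e:>7.2f} {abs(e):>9.2f} {e**2:>9.2f}")

# Errors below 1 get smaller when squared (0.2 -> 0.04),
# while errors above 1 are amplified (5 -> 25), so the squared loss
# pushes the model to fix large mistakes first.
```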
NOTE :- This interpretation is not a definitive, "ground truth" answer. What we have discussed here is only meant to give you a better intuitive understanding of the concept. The real reason why squared error loss is considered more suitable than absolute error relates to the link between frequentist and Bayesian inference, which is provided in the notes under the Probabilistic Interpretation section. It is therefore strongly suggested that you go through it for a conceptual understanding.
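(If you want a one-glance preview of that argument before opening the notes: the standard sketch assumes Gaussian noise on the targets; the notation below is mine, the full derivation is in the notes.)

```latex
% Sketch of the standard argument, assuming Gaussian noise:
%   y_i = w^T x_i + \epsilon_i,  \epsilon_i \sim \mathcal{N}(0, \sigma^2)
\log P(y \mid X, w)
  = \sum_{i=1}^{m} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) \right]
  = \text{const} \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - w^T x_i)^2
% so maximizing the likelihood is exactly minimizing the squared error.
```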
Coming to what is meant by a function not being differentiable:
For optimization, we are using Gradient Descent. Gradient Descent works only for a differentiable function, as it needs the gradient at every step to decide in which direction to move. The function |x| is differentiable everywhere except at 0 (assuming you know the reason why), which means that at x = 0 you will not have a direction to move in.
So we can't use Gradient Descent, right? Wrong. There are places where you will use the L1-norm or absolute value function with Gradient Descent; in those cases, we use something called a Sub-Gradient Vector, but that's a discussion for another topic. A tiny illustrative sketch follows below.
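Just to make the sub-gradient idea concrete, here is a small sketch (the function name and learning rate are my own, not from the lecture): sign(x) is a valid sub-gradient of |x|, and at the kink x = 0 any value in [-1, 1] works, so we can still run gradient-descent-style updates:

```python
# Minimal illustrative sketch: sub-gradient descent on f(x) = |x|.

def subgradient_abs(x):
    if x > 0:
        return 1.0
    elif x < 0:
        return -1.0
    return 0.0  # at the kink, 0 is one valid choice from [-1, 1]

x = 3.0               # starting point
learning_rate = 0.1
for step in range(50):
    x -= learning_rate * subgradient_abs(x)

print(x)  # ends up close to 0, the minimizer of |x|
```

Notice that the step size does not shrink as we approach the minimum (the sub-gradient stays at magnitude 1), which is exactly why the iterates bounce around the optimum instead of settling smoothly, unlike with the squared loss.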
And now, what problems might we face if we consider absolute error?
The main problem that will bother you with absolute error is slow convergence compared to squared error. To give you and all the other readers a better understanding of this, I have written a script which you can find here. (Make sure you try all sorts of things with the script, like changing max_steps, learning_rate, etc., and observe how it changes the training procedure.)
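In case the link is not accessible to you, here is a rough stand-in for that kind of experiment (this is my own sketch, not the exact script linked above): gradient descent on a 1-D linear regression with squared error versus absolute error, so you can see how differently the updates behave near the optimum.

```python
# Rough sketch comparing gradient descent with squared vs absolute error
# on 1-D linear regression y ~ w * x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)   # true slope 3, small noise

def train(loss, learning_rate=0.1, max_steps=200):
    w = 0.0
    for _ in range(max_steps):
        residual = w * x - y
        if loss == "squared":
            grad = np.mean(2 * residual * x)          # d/dw of mean (w*x - y)^2
        else:  # "absolute": use sign(residual) as a sub-gradient of |.|
            grad = np.mean(np.sign(residual) * x)
        w -= learning_rate * grad
    return w

print("squared :", train("squared"))    # homes in on 3 quickly, steps shrink near the optimum
print("absolute:", train("absolute"))   # approaches 3 more slowly, then bounces around it
```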
Hope you got a better understanding as to why we choose squared error over absolute error.
Do rate the answer.
Happy Learning
Thank you, great job! But let me know what is meant by noise in the dataset. Great explanation. Now I am feeling pretty confident about the topic.
Here, confining the answer to linear regression only: noise is the error in estimating y with the best-fit line there is, i.e., the error you get when you try to estimate the target value with the best line that is possible for the data.
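To make that concrete, here is a tiny illustrative sketch (the numbers and variable names are my own): even the best possible straight line leaves a non-zero residual, and that leftover part is the noise.

```python
# Illustrative sketch: even the best-fit line leaves some error behind;
# that leftover part is what we call noise.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
noise = rng.normal(0, 1.0, size=200)        # randomness a line cannot explain
y = 2.0 * x + 5.0 + noise                   # true relation plus noise

# Fit the best possible straight line (ordinary least squares).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print("best-fit line: y = %.2f * x + %.2f" % (slope, intercept))
print("mean squared residual:", np.mean(residuals ** 2))  # stays near 1 (the noise variance), never 0
```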