Suppose that instead of using the negative log-likelihood as the loss, you decide to use the least-squares loss, defined as Σ(y - y_pred)^2. Assuming binary classification, which of the following statements is true about this loss?
- The loss function can't be optimized using gradient descent
- We can't perform classification using least squares
- The loss function will converge to a local minimum
- The loss function will converge to the global minimum
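The answer turns on convexity: if y_pred comes from a sigmoid (as in logistic regression, which the negative log-likelihood setup suggests), the squared-error loss is not convex in the weights, so gradient descent is only guaranteed to reach a local minimum. Below is a minimal numerical sketch of this, assuming a one-weight logistic-regression model and a single training point x = 1 with label y = 1 (these choices are illustrative, not part of the original question):

```python
import numpy as np

# Assumed setup: logistic regression on one training point x = 1, y = 1,
# so the least-squares loss as a function of the weight w is
#   L(w) = (y - sigmoid(w * x))^2 = (1 - sigmoid(w))^2.
# A convex function has a non-negative second derivative everywhere, so a
# sign change in L''(w) is enough to show the loss is not convex.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.linspace(-6.0, 6.0, 2001)          # grid of weight values
loss = (1.0 - sigmoid(w)) ** 2            # least-squares loss on the grid

# Numerical second derivative via repeated central differences.
second_deriv = np.gradient(np.gradient(loss, w), w)

print("min L''(w):", second_deriv.min())  # negative -> concave somewhere
print("max L''(w):", second_deriv.max())  # positive -> convex somewhere
# Both signs appear, so the loss is neither convex nor concave overall,
# and gradient descent is only guaranteed a local (not global) minimum.
```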
How can we prove the shape of the loss surface, i.e., whether it is convex, concave, or a mix of the two, in which case multiple minima and maxima will be present?
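One standard way to settle this is the second-derivative test: a twice-differentiable function is convex exactly when its second derivative is non-negative everywhere. Here is a sketch for the simplest case, assuming (as above) one weight and a single training point x = 1, y = 1, and writing p = sigmoid(w) so that sigmoid'(w) = p(1 - p):

L(w) = (1 - p)^2
L'(w) = -2p(1 - p)^2
L''(w) = -2p(1 - p)^2 (1 - 3p)

The factor (1 - 3p) changes sign at p = 1/3, so L''(w) is negative for p < 1/3 and positive for p > 1/3. The loss is therefore neither convex nor concave but a mix of both, and with many weights and data points the surface can contain multiple local minima and maxima.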