Normalization of Parameters

  1. First of all, Sir normalized the X_train but not the Y_train.

  2. If we normalize both X_train and Y_train, it doesn't change the value of theta to be calculated. Is that correct?

  3. I have just understood that normalization is used to bring the ranges of the parameters close to each other.

Then what will happen if only the larger parameter (like the salary) is normalized and not the smaller parameter (like the age)?

Hello @mananaroramail,

You have already explained one reason why normalization is used, in your 3rd point.

Here is the reason why the y_train is not normalized.

Normalizing the response variable (the dependent variable) has nothing to do with why normalization was introduced in the first place. The only assumption on the dependent variable is that the residuals follow a normal distribution. So normalizing the dependent variable (y_train) has no benefit other than making it extra work to un-normalize the predictions (if you want to predict the real values). In regression tasks there are some useful transforms people do apply to the dependent variable, like log transforms, but as I said, normalizing the dependent variable is rarely done. Crunching the range of your predicted variable down to 0-1 can even make the model underperform in certain cases (if not in all).
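
To make that concrete, here is a minimal sketch (just a plain least-squares fit with numpy; the data and numbers are made up for illustration) showing that scaling y_train only adds an extra un-normalization step and changes nothing about the predictions:

```python
import numpy as np

# Toy data: X already standardized, y left in its original (salary-like) units
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 50000 + 8000 * X[:, 0] + 3000 * X[:, 1] + rng.normal(scale=500, size=100)

Xb = np.c_[np.ones(len(X)), X]               # add a bias column

# Fit on the raw y: predictions come out directly in real units
theta = np.linalg.lstsq(Xb, y, rcond=None)[0]
pred_real = Xb @ theta

# Fit on a standardized y: every prediction must be "un-normalized" afterwards
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std
theta_s = np.linalg.lstsq(Xb, y_scaled, rcond=None)[0]
pred_back = (Xb @ theta_s) * y_std + y_mean  # extra inverse step, no accuracy gain

print(np.allclose(pred_real, pred_back))     # True -- same predictions, more bookkeeping
```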

Check this notebook.
You can see there how scaling with the same value (lambda) results in the same theta.
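
I can't reproduce the notebook itself here, but a minimal sketch of the same idea (assuming a linear regression solved by least squares, without an intercept, on made-up data) looks like this: scaling both X and y by the same lambda leaves theta unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

lam = 7.3                                    # any positive constant

theta = np.linalg.lstsq(X, y, rcond=None)[0]
theta_scaled = np.linalg.lstsq(lam * X, lam * y, rcond=None)[0]

print(np.allclose(theta, theta_scaled))      # True -- same theta after scaling both by lambda
```

(If you add an intercept column, the slope coefficients still match; only the intercept picks up the factor lambda.)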

I believe this is intuitive enough for you to get the answer. Just focus on why normalization was introduced in the first place and what would happen in your case. If you still fail to understand, let me know here.

Okay, from this code, I observed that normalization scales the variables into a smaller range, in which each feature contributes equally.
And the reason for normalization is that a bigger feature should not solely decide the output on behalf of all the other features which are smaller in magnitude/scale compared to it.
Is this conclusion alright?

That means the LOSS/ERROR (i.e. y(hypothesis) - y(train)) would also be normalized.

Yes, you could kind of say that. Also, in methods involving gradient descent, scaling the features can increase convergence speed.
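
A rough illustration of that convergence point (a sketch only, assuming plain batch gradient descent on MSE with a bias column and made-up age/salary data, not the course notebook): with raw features the safe step size is dictated by the huge salary scale, so the fit is still poor after many steps, while scaled features converge almost immediately.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
age = rng.uniform(20, 60, n)                      # small-scale feature
salary = rng.uniform(20_000, 100_000, n)          # large-scale feature
y = 3 * age + 0.001 * salary + rng.normal(scale=1.0, size=n)

def gd_mse(X, y, iters=1000):
    """Batch gradient descent on MSE with a step size that is safe for this X."""
    theta = np.zeros(X.shape[1])
    # 1 / Lipschitz constant of the gradient, i.e. the largest eigenvalue of 2*X^T X / n
    lr = 1.0 / np.linalg.eigvalsh(2 * X.T @ X / len(y)).max()
    for _ in range(iters):
        theta -= lr * 2 * X.T @ (X @ theta - y) / len(y)
    return np.mean((X @ theta - y) ** 2)

z = lambda v: (v - v.mean()) / v.std()
raw = np.c_[np.ones(n), age, salary]              # bias + raw features
scaled = np.c_[np.ones(n), z(age), z(salary)]     # bias + standardized features

print("MSE after 1000 steps, raw features   :", gd_mse(raw, y))     # still large
print("MSE after 1000 steps, scaled features:", gd_mse(scaled, y))  # close to the noise variance (~1)
```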

You are getting confused between NORMAL DISTRIBUTION and NORMALIZATION. Normalization brings your scale into the range 0-1. This is not the behaviour of a normal distribution. The thing you are looking for is standardization. The residuals following a normal distribution is our assumption, and it is a mandatory one in the case of OLS; if it does not hold, the approximation won't be reliable and/or correct. One simply won't scale their model's errors, or the output/response variable, but one can transform it towards a normal distribution if it is a bit skewed or has abnormal kurtosis; that only makes sense if the response variable was in fact supposed to follow a normal distribution.

I hope this doesn’t confused you much. Its okay if you dont fully understand this now.
Happy Learning :slightly_smiling_face:

Yeah !!!

Some conclusions which I made are:

  1. Normalisation is a scaling technique, such that the training data can be viewed as a NORMAL DISTRIBUTION.

  2. We don't normalise the dependent feature (of course).

No, again, you are getting confused between Normalisation and Normal Distribution. Normalisation is a scaling technique, correct, but it won't let your data be viewed as a normal distribution. Normalisation scales to the range 0-1. It has nothing to do with converting data to a normal distribution. It's just a scaling technique; like standardization, it brings every feature to the same scale.

See, when Sir taught the Standard Normal Distribution, he said that the distribution is standardized using the formula:
Xs = (X - mean) / std

where X is a normal distribution.

It's the same formula we are using while performing normalization.

Then how is the dataset so formed not a normal distribution?
I even verified, and the mean was very close to zero and the std was 1.
(The problem was HARDWORK PAYS OFF.)

Also, on googling I came across that for normalisation the formula used was:

X = (X - Xmin) / (Xmax - Xmin)

but in the videos, the formula used was:
X = (X - mean) / std

This is clear to me.

Any data distribution can be standardized, but it doesn't mean the data was, or will become, a normal distribution.
Check this out. Data from a uniform distribution, when standardized, still remains a uniform distribution. So, loosely speaking, it's not necessary that data must come from a normal distribution to perform standardization, and it won't miraculously form a normal distribution after scaling.
Also, if a distribution has 0 mean and 1 std, it doesn't mean it's a standard normal distribution.
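
A quick sketch of that uniform-distribution point (using numpy and scipy; the numbers are illustrative): standardizing gives mean 0 and std 1, but the excess kurtosis stays near -1.2, which is the signature of a uniform distribution, not a normal one (a normal distribution would give about 0).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
u = rng.uniform(0, 10, size=100_000)          # clearly not normal
u_std = (u - u.mean()) / u.std()              # standardize: mean 0, std 1

print(round(u_std.mean(), 4), round(u_std.std(), 4))         # ~0.0 and ~1.0
print("excess kurtosis:", round(stats.kurtosis(u_std), 3))   # ~ -1.2 for uniform data, ~0 for normal data
```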

It all comes down to our assumption. That's for another discussion. Anyway, the normalization and standardization formulas are exactly what you have found on the internet. But sometimes they are used interchangeably, so it's all fine.


Ok, got it finally !!!

But one last thing:

But this formula gives 0 mean and 1 std.
Therefore the scaled values range (mostly) from -1 to 1, and not from 0 to 1.

The formula X = (X - mean) / std is standardization, not normalization. That is why the values come out centred around 0 rather than squeezed into 0-1; the min-max formula X = (X - Xmin) / (Xmax - Xmin) is the one that gives the 0-1 range.
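
For reference, a tiny side-by-side sketch of the two formulas (the numbers are just made-up ages):

```python
import numpy as np

x = np.array([20., 25., 30., 45., 60.])                  # e.g. ages

normalized   = (x - x.min()) / (x.max() - x.min())       # min-max normalization -> range [0, 1]
standardized = (x - x.mean()) / x.std()                  # standardization -> mean 0, std 1

print(normalized)      # [0.    0.125 0.25  0.625 1.   ]
print(standardized)    # roughly -1.1 to 1.6 here, NOT confined to [0, 1]
```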