How do we do label encoding when we have missing values in text-based categorical data? In the case of numerical data we would simply replace NaN with the mean value of that column, but what do we do when the data is text?
House Price Prediction
Hey @raunaqsingh10, there are multiple ways to handle this if you strictly want to apply label encoding:
- Drop the column if more than 60% of its rows are empty.
- If the number of rows with missing values is very small, like 5 in 10K, then you can simply drop those rows.
- If a large share of rows have the same value, say 90% of rows have one value, 9% have another, and 1% are missing, then you could fill the missing entries with the majority class.
- If none of the above alternatives is possible, then the only way out is to treat the missing value as a category of its own: assign “UNK” to the empty cells and let the label encoder give it its own label.
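The options above can be sketched roughly as follows. This is only an illustration on a made-up DataFrame: the column names and values are hypothetical, and the 60% threshold is the one mentioned in the list. Dropping rows (the second option) would be a call like `df.dropna(subset=[...])` and is left out here because the toy data is so small.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column names are made up for illustration.
df = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, "a", np.nan],
    "dominant":     ["x", "x", "x", np.nan, "x"],
    "mixed":        ["p", "q", np.nan, "r", "p"],
})

# Option 1: drop columns where more than 60% of the rows are missing.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.6].index)

# Option 3: fill with the majority class when one value dominates.
df["dominant"] = df["dominant"].fillna(df["dominant"].mode()[0])

# Option 4: treat "missing" as its own category, then label encode.
df["mixed"] = df["mixed"].fillna("UNK")
encoder = LabelEncoder()
df["mixed_encoded"] = encoder.fit_transform(df["mixed"])

print(df)
print(encoder.classes_)  # "UNK" shows up as one of the classes
```

Here “UNK” simply becomes one more class for the encoder, so no row or column has to be thrown away.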
The other alternative, which I would prefer, is one-hot encoding: if the value is missing, all the columns for that feature’s classes would have 0 in them. The results would likely be preferable to those of the other approaches.
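As a small sketch of that behaviour, assuming pandas is used for the encoding: `pd.get_dummies` leaves a missing value as a row of all zeros by default (`dummy_na=False`). The `color` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with one missing value.
colors = pd.Series(["red", np.nan, "blue", "red"], name="color")

# By default NaN gets no column of its own, so its row is all zeros.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

The second row (the missing value) has 0 in both the `blue` and `red` columns.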
Hope this resolved your doubt.
Please mark the doubt as resolved in my doubts section.
Why would we prefer one-hot encoding over method 4? Wouldn’t one-hot encoding create more features?
Hey @raunaqsingh10, yes, the number of features would be larger, but label encoding assigns an integer to each class. This means the class with value ‘5’ is considered to be greater than the class with value ‘0’, when actually all classes are at the same level. For example, when the classes are ‘horse’, ‘human’, ‘elephant’, etc., we use one-hot encoding.
But if the classes are ordered, like mild, normal, high, then it’s better to assign them weights, i.e. 0, 1, 2, with 2 being the highest and belonging to high, and 0 belonging to mild.
So if it’s possible to process that number of features we would use one-hot encoding; otherwise we would switch to the label encoder.
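A minimal sketch of the two cases, using made-up data. One assumption worth flagging: the ordered classes are mapped with an explicit dictionary rather than `LabelEncoder`, because `LabelEncoder` numbers classes in sorted (alphabetical) order and would not respect mild < normal < high.

```python
import pandas as pd

# Hypothetical data: one ordered feature, one unordered feature.
df = pd.DataFrame({
    "severity": ["mild", "high", "normal", "mild"],
    "animal":   ["horse", "human", "elephant", "horse"],
})

# Ordered classes: assign weights that follow the real ordering.
severity_order = {"mild": 0, "normal": 1, "high": 2}
df["severity_encoded"] = df["severity"].map(severity_order)

# Unordered classes: one-hot encode so no class is "greater" than another.
animal_one_hot = pd.get_dummies(df["animal"], prefix="animal", dtype=int)

print(df)
print(animal_one_hot)
```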
I can’t seem to understand why my code isn’t working. I think there is a problem with the way I’ve done normalization, but I cannot pin it down. Apart from train, I’ve created a function train_validation that tells me how my model behaves on unseen data. My model handles that well too. I’m attaching the Google Drive link: https://drive.google.com/open?id=1wF484JPSRyipENVZJ810nry_phJ0VPWz
Hey @raunaqsingh10, first of all, build a Keras model if you want us to debug the code. Remove the columns ‘FireplaceQu’ and ‘LotFrontage’, as their count of missing values is very high. For each column, create a separate label_encoder object rather than reusing the same object.
Also, since the number of features won’t be large, try the same model after applying one-hot encoding.
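A rough sketch of the two suggestions, on a tiny made-up frame. Only ‘FireplaceQu’ and ‘LotFrontage’ come from the discussion above; the other column names, the values, and the `encoders` dictionary are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical slice of the data; values are made up.
df = pd.DataFrame({
    "FireplaceQu": [None, "Gd", None],
    "LotFrontage": [None, 80.0, None],
    "MSZoning":    ["RL", "RM", "RL"],
    "Street":      ["Pave", "Grvl", "Pave"],
})

# Drop the columns with too many missing values.
df = df.drop(columns=["FireplaceQu", "LotFrontage"])

# Fit a separate encoder per column and keep each one,
# so every column's mapping can be reused or inverted later.
encoders = {}
for col in ["MSZoning", "Street"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
print(encoders["Street"].inverse_transform([1]))  # back to the original label
```

Keeping one encoder per column matters because refitting a single shared object overwrites the mapping learned for the previous column.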
Can you please take a look at the part where I normalize the data? Am I normalizing the data correctly? I’m simply taking the mean (u) and the variance (v) and then applying the transformation x = (x-u)/v, and I am doing the same for the target values (‘SalePrice’) as well. I am applying normalization to both text and numerical features.
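Hey @raunaqsingh10, you are dividing by the variance, but you should divide by the standard deviation (std). The rest is fine:
import numpy as np
# x_raw is your feature matrix, shape (n_samples, n_features)
u = np.mean(x_raw, axis=0)    # per-column mean
std = np.std(x_raw, axis=0)   # per-column standard deviation, not variance
print(u.shape, std.shape)
x = (x_raw - u) / std
print(x.mean(axis=0))         # each column mean should be close to 0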
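To see why the divisor matters, here is a small check on made-up numbers (the `x_raw` values below are hypothetical). Dividing by the standard deviation gives every column a spread of 1, while dividing by the variance leaves each column with a spread of 1/std, so columns on different scales stay on different scales.

```python
import numpy as np

# Hypothetical features on two very different scales.
x_raw = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

u = x_raw.mean(axis=0)
std = x_raw.std(axis=0)
var = x_raw.var(axis=0)

x_std = (x_raw - u) / std   # standardization
x_var = (x_raw - u) / var   # what dividing by the variance does instead

print(x_std.std(axis=0))    # both columns end up with spread 1
print(x_var.std(axis=0))    # spread is 1/std, different for each column
```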