Feature Extraction in the dataset below

samalprathamesh123 · April 6, 2021, 10:03am

In this dataset, should I encode the columns that have a string datatype?

For example, in car_ownership, shall I replace “no” with 0 and “yes” with 1?

What would be an ideal approach to extract features from this dataset?

prashant_ml · April 6, 2021, 2:56pm

Hey @samalprathamesh123,
Yes, you can encode them that way.

Actually, there is no ideal or perfect way to do this; it depends on which model you are going to use afterwards.
Some things you can try are:

Label encode them (the way you described above)
One-hot encode
Target encode

After that, you can try other feature extraction techniques to generate more features.
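A minimal sketch of the three encodings listed above, using pandas. Only car_ownership comes from this thread; the profession and target columns and all the values are made up for illustration, and the target encoding here is a plain per-category mean rather than any particular library's implementation.

```python
import pandas as pd

# Toy data: only car_ownership is from the thread; everything else is hypothetical.
df = pd.DataFrame({
    "car_ownership": ["no", "yes", "no", "yes", "no", "no"],
    "profession": ["engineer", "doctor", "engineer", "artist", "doctor", "artist"],
    "target": [0, 1, 0, 0, 1, 1],
})

# 1) Label-style encoding of a binary column: "no" -> 0, "yes" -> 1.
df["car_ownership_enc"] = df["car_ownership"].map({"no": 0, "yes": 1})

# 2) One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["profession"], prefix="profession")

# 3) Target encoding: replace each category with the mean target of that category.
#    In practice, compute these means on the training data only to avoid leakage.
profession_means = df.groupby("profession")["target"].mean()
df["profession_te"] = df["profession"].map(profession_means)

print(df)
print(one_hot)
```

Which of the three works best depends on the model, as said above: tree models often cope with integer codes, while linear models usually do better with one-hot columns.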

samalprathamesh123 · April 6, 2021, 3:13pm

Got the logic, sir. Thank you. One more doubt: when I use a Decision Tree classifier it works fine, but when I use Random Forests the predictions contain only zeros. Is there any reason for this?

prashant_ml · April 6, 2021, 5:26pm

It might be because the model is overfitting.
Try using class weights if the classes are highly imbalanced.
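A rough sketch of the class-weight suggestion with scikit-learn's RandomForestClassifier. The data is synthetic (generated with make_classification at a 90/10 class split), so the numbers are only illustrative and not from the dataset discussed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 90% class 0 and 10% class 1 (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=12, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Check how imbalanced the training labels are.
print("train class counts:", np.bincount(y_train))

# class_weight="balanced" reweights classes inversely to their frequency,
# so mistakes on the rare class cost more during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("predicted class counts:", np.bincount(preds, minlength=2))
```

Comparing the predicted class counts with and without class_weight is a quick way to see whether the model has stopped predicting only zeros.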

samalprathamesh123 · April 6, 2021, 7:04pm

Got it. By the way, could you tell me just the steps I need to follow to work with this dataset? For example, should I take the top 5 features and build a model, or something else? I am a bit confused because categorical data is also present.

I tried encoding the categorical data as numbers, but it lowers the accuracy.

Thank you.

samalprathamesh123 · April 7, 2021, 7:59am

I am done with the encoding.

samalprathamesh123 · April 7, 2021, 8:02am

Now, using feature selection, I have selected the top 7 features out of 12, but I get this error when training the model: “ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 12”. I have done the feature selection, but I do not know how to proceed, since the number of features in train and test is different.

prashant_ml · April 7, 2021, 8:06am

Hey @samalprathamesh123,
You do not always need to select only a subset of the features.
Feature selection and feature engineering depend heavily on which model you choose.
For example, if you choose linear regression, it assumes the features are not linearly dependent on each other (no multicollinearity), and it usually helps to standardize them to mean 0 and standard deviation 1.

Similarly, tree models work differently and learn in a different manner,
so you need to be quite confident and careful when selecting those features.
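The scaling to mean 0 and standard deviation 1 mentioned above can be sketched with scikit-learn's StandardScaler. The arrays are random placeholders, not the dataset from this thread; the point is only that the scaler is fitted on the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder numeric features (hypothetical values).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Learn the mean and standard deviation from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...and apply the same transformation to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # close to 0 for every column
print(X_train_scaled.std(axis=0))   # close to 1 for every column
```

Tree models split on thresholds, so they generally do not need this step.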

prashant_ml · April 7, 2021, 8:07am

For this,
you may have the selected features, but you also need to select only those same features from the test data, so that your model can work.
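A small sketch of that fix, assuming the features were chosen with SelectKBest and chi-square. The column names f0 to f11, the random data, and the RandomForestClassifier are placeholders for illustration; the key step is reusing the same list of selected columns for both training and prediction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Placeholder data with 12 non-negative features (chi2 requires non-negative values).
rng = np.random.default_rng(0)
columns = [f"f{i}" for i in range(12)]
train_data = pd.DataFrame(rng.integers(0, 10, size=(200, 12)), columns=columns)
test_data = pd.DataFrame(rng.integers(0, 10, size=(50, 12)), columns=columns)
y_train = rng.integers(0, 2, size=200)

# Fit the selector on the training data only and keep the chosen column names.
selector = SelectKBest(score_func=chi2, k=7)
selector.fit(train_data, y_train)
seven_features = train_data.columns[selector.get_support()].tolist()

# Train on the 7 selected columns...
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_data[seven_features], y_train)

# ...and predict on the same 7 columns of the test data.
# Passing all 12 test columns here is what triggers the feature-count error.
predictions = model.predict(test_data[seven_features])
print(seven_features)
print(predictions[:10])
```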

samalprathamesh123 · April 7, 2021, 8:13am

Oh, got it. Actually, I have two different files: one for the training data and the other for the test data. So while predicting, I should use only those columns of the test data that are present in the training data?

samalprathamesh123 · April 7, 2021, 8:17am

Sir, is there any way I can share my notebook with you so that you could check whether my approach is correct? I would be grateful.

prashant_ml · April 7, 2021, 9:44am

Yeah, do one thing:
just upload your data and code to GitHub and share the link with me here.

samalprathamesh123 · April 7, 2021, 10:41am

Sure, I will do that ASAP.

samalprathamesh123 · April 7, 2021, 10:51am

prashant_ml · April 7, 2021, 4:14pm

Some tips:

Before using SelectKBest and chi-square, first understand how they work. How well they work depends a lot on the data, so sometimes they are a good fit and sometimes they are not.
Instead of factorize, use LabelEncoder (from sklearn.preprocessing).
For testing, you have those seven_features, so just use test_data[seven_features] and it will work.
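One possible reason for the LabelEncoder tip, sketched on a made-up two-row example: pd.factorize assigns codes in order of appearance, so calling it separately on the train and test files can give the same category different codes, whereas a LabelEncoder fitted once on the training column reuses the same mapping. Note that scikit-learn documents LabelEncoder as intended for target labels, with OrdinalEncoder as the counterpart for feature columns, so treat this as an illustration of the idea rather than the only way to do it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train and test files with the same categorical column.
train = pd.DataFrame({"car_ownership": ["no", "yes", "no"]})
test = pd.DataFrame({"car_ownership": ["yes", "no", "yes"]})

# factorize called separately: codes follow order of appearance in each file,
# so "no" is 0 in train but 1 in test.
train_codes, _ = pd.factorize(train["car_ownership"])
test_codes, _ = pd.factorize(test["car_ownership"])
print(train_codes, test_codes)

# LabelEncoder fitted once on train, then reused on test: the mapping stays the same.
encoder = LabelEncoder()
train["car_ownership_enc"] = encoder.fit_transform(train["car_ownership"])
test["car_ownership_enc"] = encoder.transform(test["car_ownership"])
print(list(encoder.classes_))
print(test["car_ownership_enc"].tolist())
```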

samalprathamesh123 · April 7, 2021, 4:32pm

Thank you for your tips. I have noted them. I just want to ask whether there is anything wrong with the overall approach to solving the problem (in the notebook), apart from the points you mentioned, since I will now correct those. Another doubt: you mentioned using test_data[seven_features], but I have already used it. Can you please specify the exact line where I need to change this?

prashant_ml · April 7, 2021, 5:05pm

No, your approach is quite good.
These are just some improvements.

That part is correct.
I had not checked it.

samalprathamesh123 · April 7, 2021, 5:43pm

Got it. Thank you so much for the help.

prashant_ml · April 7, 2021, 5:57pm

I hope I’ve cleared your doubt. Please rate your experience here.
Your feedback is very important. It helps us improve our platform and hence provide you
the learning experience you deserve.

On the off chance that you still have questions or do not find the answers satisfactory, you may reopen
the doubt.

samalprathamesh123 · April 12, 2021, 12:26pm

Good evening, sir. I have some more doubts about the same problem statement. My ROC AUC score is 0.74, and I want to improve it to 0.80+. Can you please help me out? I am attaching the GitHub link to the updated files.