Challenge - Movie Ratings (sentiment analysis)

Sir, the data here is actually very messy: there are `https` links, `br` tags, `title:xyz` fragments, promotions, and a lot of similar junk in between the sentences. I'm trying to get rid of it using regex and some other filtering methods, but I'm still not sure about it :expressionless: Can anyone please help me?
And one more thing: the data is also quite big, so when I break sentences into bigrams and trigrams, the dimensionality of my data increases as I move toward higher n-grams; for trigrams the feature count is around 48,000 -_-
How do I deal with this?

hey @snehill090 ,
the first step is data cleaning, which you are already working on correctly, so keep going with that.
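A minimal sketch of such a cleaning pass is below. The exact patterns (URLs, `<br>` tags, `title:xyz` fragments) are assumptions based on the junk described in the question, not the actual dataset's format:

```python
import re

def clean_review(text):
    """Strip the kinds of artifacts mentioned above from one review.
    The patterns here are illustrative guesses, tune them to your data."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<br\s*/?>", " ", text)      # drop HTML line breaks
    text = re.sub(r"title:\S+", " ", text)      # drop 'title:xyz' fragments
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # keep letters only
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = clean_review("Great movie!<br>see https://example.com title:xyz now")
print(cleaned)  # -> "great movie see now"
```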

what exactly is it that you don't know?

you might be converting the features to a dense array using the toarray function. Just don't do that; use the sparse matrix directly as it is and it will work.
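For example, scikit-learn estimators accept the sparse matrix returned by CountVectorizer as-is, so no `toarray()` call is needed (the tiny corpus below is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "bad movie", "great film", "terrible film"]
y = [1, 0, 1, 0]  # toy sentiment labels

vec = CountVectorizer()
X = vec.fit_transform(docs)          # scipy sparse matrix, kept sparse
model = MultinomialNB().fit(X, y)    # estimator consumes sparse input directly
pred = model.predict(vec.transform(["good film"]))
print(pred)
```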

what exactly is it that you don't know?

How to clean all that big data. As I already mentioned, there are some words in every sentence which I think are completely irrelevant.
Am I really supposed to read every sentence -_- so that I can filter out those irrelevant words? Is it worth it? I don't think so…
So how do I do this? How do I filter those words out?

hey @snehill090 ,
see, you can't be perfect at this, so there will be some such words in the corpus which you can't deal with. But their frequency of occurrence will be so low that they won't affect your modelling much.

Hence, cleaning to whatever extent you can, and understanding such differences, is way better than using the raw data.
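One practical way to drop those rare, irrelevant words without reading every sentence is CountVectorizer's `min_df` parameter, which ignores any token appearing in fewer than `min_df` documents (toy corpus below, not the actual data):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was good",
    "the movie was bad",
    "xqzt the movie rocked",   # 'xqzt' is junk that appears only once
]

# min_df=2 keeps only tokens that occur in at least 2 documents,
# so one-off junk like 'xqzt' never becomes a feature.
vec = CountVectorizer(min_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # -> ['movie', 'the', 'was']
```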

I hope this helps.


data_pre_poc.py: look, this is how I'm doing it…

so what is the problem with this? It's a correct approach to go with.

See, there are many other things to try, but you will come to know about them only while practicing, so search for them and then try them; in one way or another they will surely help you.


Processing the whole dataset is going to take a lot of computational power :confused: -_-
but thank you, sir :slight_smile: for clearing my doubts…

One last question…
Is there any way I can reduce the dimensionality? I don't understand how to do it :expressionless:

Yes, it will. There is no other option for that.

Yeah, so once you have generated the features using CountVectorizer or TF-IDF, you can apply PCA or t-SNE to reduce the dimensions of your data, and then use the reduced features for prediction.
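A rough sketch of that pipeline with PCA is below. Note that PCA needs a dense array, which is expensive on a big corpus (TruncatedSVD avoids that, as discussed later in this thread); the corpus here is a made-up toy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = ["good movie", "bad movie", "great film", "awful film"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF features

# PCA requires dense input; fine on a toy corpus, costly on 48,000 features.
X_2d = PCA(n_components=2).fit_transform(X.toarray())
print(X_2d.shape)  # -> (4, 2): four documents reduced to two components
```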


TSNE (I haven't tried it practically yet)

Today you have cleared many of my doubts…
so once again, thank you :slight_smile:

snehill signing off :confused: - till next time
:slight_smile:

I hope I've cleared your doubt. Please rate your experience here.
Your feedback is very important. It helps us improve our platform and provide you
the learning experience you deserve.

If you still have some questions or did not find the answers satisfactory, you may reopen
the doubt.


Because the CountVectorizer method returns a sparse matrix…
how do I apply PCA on a sparse matrix?
Actually I'm trying to apply PCA (otherwise it takes too much time to compute the accuracy -_-)

Okay, in that case you need TruncatedSVD.

It's similar to PCA, but it works on sparse matrices.
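A quick sketch of that, on a toy corpus, just to show that TruncatedSVD consumes the sparse matrix directly with no densification step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["good movie", "bad movie", "great film", "awful film",
        "fine story", "dull story"]                 # toy corpus
X = TfidfVectorizer().fit_transform(docs)           # stays sparse

# Unlike PCA, TruncatedSVD accepts the sparse matrix as-is.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # -> (6, 2)
```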

Somebody suggested that I use pca_v = PCA().fit_transform(sparse_matrix.A) and yes, it works!!
But now the main problem: whether I use PCA or SVD, both make some values in my dataset negative… and when I finally apply mnd.fit(pca_v, Y) it raises ValueError: Negative values in data passed to MultinomialNB (input X)

What should I do?

Didn't you use TruncatedSVD?

What is this? Which model?

What is this? Which model?

Sorry, it's a typo -_- it means Multinomial Naive Bayes.

Didn't you use TruncatedSVD?
Nah.

Try that once.

It should work, I guess; else, try Gaussian Naive Bayes once as well.
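For instance, with toy numbers standing in for PCA/SVD output: GaussianNB models continuous features and has no non-negativity requirement, so it fits reduced (possibly negative) features without the error MultinomialNB raises:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up 2-D features imitating PCA/TruncatedSVD output: note the
# negative values that MultinomialNB would reject.
X_reduced = np.array([[-0.5, 1.2], [0.3, -0.7], [-0.1, 0.9], [0.6, -1.1]])
y = [1, 0, 1, 0]

model = GaussianNB().fit(X_reduced, y)   # no ValueError on negative input
pred = model.predict([[-0.2, 1.0]])
print(pred)
```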