Movie Review Prediction - Using Multinomial Naive Bayes

Here you say that we should use bigrams to capture the word ‘not’. But ‘not’ is a stopword, so it gets dropped during stopword removal. How can we keep words like "not", "no", and "don’t" after removing stopwords?

Hello Vinay,

You can create your own stopword list by removing the desired words from the default stopword list.
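As a rough sketch of what I mean (the small `DEFAULT_STOPWORDS` set below is only a stand-in I made up for illustration; in practice you would start from whatever default list your library gives you, for example the nltk English list, and the helper names are hypothetical):

```python
# Stand-in for a library's default stopword list (assumption: in a real
# project this would come from nltk or a similar package).
DEFAULT_STOPWORDS = {
    "the", "is", "was", "a", "an", "it", "this", "that",
    "not", "no", "nor", "don't",
}

# Negation words we want to survive stopword removal (illustrative choice).
NEGATIONS = {"not", "no", "nor", "don't"}


def build_stopwords(default, keep):
    """Return the default stopwords minus the words we want to keep."""
    return set(default) - set(keep)


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


custom = build_stopwords(DEFAULT_STOPWORDS, NEGATIONS)
tokens = remove_stopwords("The movie was not good", custom)
print(tokens)  # ['movie', 'not', 'good']
```

The same subtraction works for any language, as long as you can name the negation words you want to keep.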

Thanks :slight_smile:

Hi @Manu-Pillai-1566551720093198
Thanks for the reply, but creating my own stopword list is not a proper solution. Suppose you work on a French or other-language dataset: how would you create your own stopword list then?
I am asking about all negative words.

Hello @Vinay-Mittal-2143588655728125

We remove stopwords from a text corpus because they occur in almost every sentence and so say nothing distinctive about the text they appear in. For example, counting the occurrences of “is” gives me no information, since it appears in almost every sentence. Once such words are removed, the words that remain carry information that we, or our algorithm, can exploit to reach a good hypothesis (convergence). Creating your own stopword list will not harm your algorithm. As you said, though, people often avoid it because there is a good chance of forgetting some words while building the list. But if, while attempting a challenge, I see that most negative-class samples contain “not”, or are likely to, then I can safely keep it in my corpus when running CountVectorizer or another feature extraction/vectorization method. Removing stopwords, and with the nltk list in particular, is neither necessary nor magic.
I have personally worked on competitions and projects where I built my own stopword list (by which I mean I added some words to the default list and removed others).

Also, “not” or “none” may have been given only as an example. It is also possible that “not” occurs in most sentences, so when classifying reviews as negative or positive we look for words that are less frequent but negative in sense, such as “disappointment” or “bored”, which give us the information we are looking for. But as I said, this is just a hypothesis, a prior belief that I am imposing on my model.
Hence, if you think “not” is a distinguishing feature of your classes, you can certainly include it. There is nothing wrong with that.

P.S. Whether to include “not”, “none”, or similar words is problem dependent. Consider the sentence “the movie was not that good” and the sentence “everybody said the movie was not good, but i enjoyed it”. If both exist in your dataset, you should remove “not” to be on the safe side. :wink: Hence the default stopwords.

Thanks :slight_smile:


I hope I’ve cleared your doubt. I ask you to please rate your experience here
Your feedback is very important. It helps us improve our platform and hence provide you
the learning experience you deserve.

On the off chance that you still have questions or do not find the answers satisfactory, you may reopen
the doubt.