Stop words in natural language processing with Python
I am just doing some research in NLP with Python and I found something strange.
When considering the following negative tweets:
neg_tweets = [('I do not like this car', 'negative'),
('This view is horrible', 'negative'),
('I feel tired this morning', 'negative'),
('I am not looking forward to the concert', 'negative'),  # <---
('He is my enemy', 'negative')]
Then I do some processing to remove the stop words:
from nltk.corpus import stopwords

clean_data = []
stop_words = set(stopwords.words("english"))
for (words, sentiment) in pos_tweets + neg_tweets:
    # The stop-word check runs before lowercasing, so "I" slips through as 'i'
    words_filtered = [e.lower() for e in words.split() if e not in stop_words]
    clean_data.append((words_filtered, sentiment))
Part of the output:
(['i', 'looking', 'forward', 'concert'], 'negative')
I'm trying to understand why the stop words include "not", which can affect the sentiment of a tweet.
My understanding is that stop words are irrelevant in terms of sentiment.
So my question is, why is "not" included in the stop word list?
Stop words are words that, "in general", carry little or no meaning. As the Stanford NLP group puts it:
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.
Why is "not" in the list? Simply because it appears very often in English and "usually" carries little or no meaning, for example in tasks where such words contribute almost nothing and everything is determined by the frequency distribution of words (e.g. tf-idf).
So what can you do? This is known as negation handling, a very broad topic with many different methods. One of my favorites is to simply attach each negation to the word before or after it, before removing stop words or computing word vectors. For example, you can convert not looking to not_looking, which is then treated as a different token when the text is converted to a vector space. You can find code to do something like this in the SO post here.
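As a rough illustration of the idea, here is a minimal sketch that joins each negation to the word that follows it before filtering stop words. The names join_negations, clean, NEGATIONS and STOP_WORDS are made up for this example, and the small hard-coded stop-word set only stands in for NLTK's English list so that the snippet runs without any downloads. It is not the code from the linked post.

```python
# Hypothetical helper names; the tiny stop-word set below is an assumption
# standing in for set(stopwords.words("english")) from NLTK.
NEGATIONS = {"not", "no", "never"}
STOP_WORDS = {"i", "am", "is", "to", "the", "this", "do", "my", "he", "not", "no"}


def join_negations(tokens):
    """Merge each negation word with the token that follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            # "not", "looking" -> "not_looking"
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def clean(text):
    """Lowercase, join negations, then drop stop words."""
    tokens = text.lower().split()
    tokens = join_negations(tokens)
    return [t for t in tokens if t not in STOP_WORDS]


print(clean("I am not looking forward to the concert"))
# ['not_looking', 'forward', 'concert']
```

Because the negation is merged before the stop-word filter runs, "not" is no longer a standalone token and survives as part of not_looking. Lowercasing before the filter also drops the stray 'i' seen in the question's output.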
Hope this helps!