Is there a way to improve the performance of the nltk.sentiment.vader sentiment analyzer?

My text comes from a social network, so you can imagine its nature. I think the text is about as clean and minimal as I can get it after the following sanitization steps (a rough sketch follows the list):

  • no urls, no usernames
  • no punctuation marks, no accents
  • no numbers
  • no stopwords (I think vader does it anyway)
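
For reference, a minimal sketch of these steps, assuming plain regex-based stripping; the exact patterns and the `sanitize` helper are illustrative, not my actual code:

    import re
    import unicodedata
    from nltk.corpus import stopwords  # requires nltk.download('stopwords')

    STOP = set(stopwords.words('english'))

    def sanitize(text):
        text = re.sub(r'https?://\S+|@\w+', '', text)                                  # urls, usernames
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode()  # accents
        text = re.sub(r'[^\w\s]', '', text)                                            # punctuation
        text = re.sub(r'\d+', '', text)                                                # numbers
        return ' '.join(w for w in text.split() if w.lower() not in STOP)              # stopwords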

I think the runtime is linear, and I am not going to parallelize anything myself, given the effort required to change the available code. For example, for about 1000 texts in the range of ~50 KB to ~150 KB each, the running time is about 10 minutes on my machine.

Is there a better way to speed up the time spent feeding texts to the algorithm? The code is as simple as SentimentIntensityAnalyzer is designed to be used; here is the main part:

    import gc
    import pandas as pd
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()

    # c / conn: existing database cursor and connection; s: the search substring
    c.execute("select body, creation_date, group_id from posts "
              "where (substring(lower(body) from (%s)) = (%s)) "
              "and language = 'en' order by creation_date DESC", (s, s))
    conn.commit()
    if c.rowcount > 0:
        dump_fetched = c.fetchall()

    textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
    del dump_fetched
    gc.collect()
    texts = textsSql['body'].values
    # here, some data manipulation: the sanitization steps listed above
    polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]

      


1 answer


1. You don't need to remove the stop words; nltk + VADER already handles them.

2. You don't need to remove punctuation either, since punctuation affects VADER's polarity calculations, apart from the extra preprocessing cost. So keep the punctuation:

    >>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
    >>> s = SentimentIntensityAnalyzer()
    >>> txt = "this is superb!"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
    >>> txt = "this is superb"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}


3. You should also introduce sentence tokenization, since it improves accuracy, and then compute the polarity of a paragraph as the average over its sentences. An example is here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517
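
A minimal sketch of that idea, assuming nltk's `sent_tokenize` and a plain average of the per-sentence compound scores (the `paragraph_polarity` helper is illustrative, not part of VADER):

    from nltk import tokenize
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()

    def paragraph_polarity(paragraph):
        # Score each sentence separately, then average the compound scores.
        # Requires nltk's punkt tokenizer data: nltk.download('punkt')
        sentences = tokenize.sent_tokenize(paragraph)
        if not sentences:
            return 0.0
        return sum(sid.polarity_scores(s)['compound'] for s in sentences) / len(sentences)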

4. The polarity calculations are completely independent of each other, so a multiprocessing pool of a small size, say 10, gives a nice speed boost. The line to parallelize is:

    polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
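
For example, a minimal sketch using `multiprocessing.Pool` (the pool size of 10 comes from the suggestion above; the `compound` helper and the sample texts are illustrative placeholders):

    from multiprocessing import Pool
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sid = SentimentIntensityAnalyzer()
    texts = ["this is superb!", "this is terrible"]  # replace with the real post bodies

    def compound(text):
        return sid.polarity_scores(text)['compound']

    if __name__ == '__main__':
        with Pool(10) as pool:
            polarity_ = pool.map(compound, texts)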
