Accelerate NLTK POS tagger on a pandas DataFrame
So I am trying to use the Stanford POS tagger from NLTK on a pandas DataFrame column. The column holds text strings, and I want to count the number of nouns in each row. Here's what I have:
```python
from nltk.tag.stanford import POSTagger
from collections import Counter
import os

java_path = "C:/Program Files (x86)/Java/jre1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

st = POSTagger('.../stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger',
               '.../stanford-postagger-2015-04-20/stanford-postagger.jar')

def noun_count(x):
    # tag() returns (word, tag) pairs; keep only the tags and count the noun ones
    tagged = st.tag(x.split())
    c = Counter(tag for _, tag in tagged)
    return c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
```
Then I use `apply` to run it over the column and create a new column:

```python
df['noun_count'] = df['string'].apply(noun_count)
```
And it works, eventually: it ran for at least an hour, maybe longer; I stopped watching it after that. Is there a way to speed up the process? What I notice is that the .jar file seems to be launched and shut down for every single entry, of which I have quite a few (the task manager keeps showing Java starting and stopping). Any ideas to make this run faster?
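One likely fix, sketched under the assumption that the NLTK Stanford wrapper in use exposes `tag_sents` (it does in recent NLTK releases): tag every row in a single call, so the JVM starts once for the whole column instead of once per row, then do the per-row noun counting in plain Python. The tagger `st` and DataFrame `df` below are carried over from the question; `noun_count_from_tags` is a hypothetical helper introduced here for illustration.

```python
from collections import Counter

def noun_count_from_tags(tagged_tokens):
    """Count noun tags in a list of (word, tag) pairs for one row."""
    c = Counter(tag for _, tag in tagged_tokens)
    return c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']

# Hypothetical usage with `st` and `df` from the question: tag_sents
# makes ONE call to the Stanford jar for all rows at once.
# tagged_rows = st.tag_sents([s.split() for s in df['string']])
# df['noun_count'] = [noun_count_from_tags(row) for row in tagged_rows]
```

The counting helper is pure Python, so the only remaining Java cost is the single batch call.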