Accelerate NLTK Stanford POS tagger on a pandas DataFrame column
I am trying to use the Stanford POS tagger (through NLTK) on a pandas DataFrame column. The column contains strings, and I want to count the number of nouns in each row. Here's what I have:
from nltk.tag.stanford import POSTagger
from collections import Counter
import os
java_path = "C:/Program Files (x86)/Java/jre1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path
st = POSTagger('.../stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger',
'.../stanford-postagger-2015-04-20/stanford-postagger.jar')
def noun_count(x):
    listoflists = st.tag(x.split())
    flat = [y for z in listoflists for y in z]
    POS = [row[1] for row in flat]
    c = Counter(POS)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
Then I use apply to run it over the DataFrame and create a new column:
df['noun_count'] = df['string'].apply(noun_count)
And it works ... after a while. As in, it ran for at least an hour, maybe longer; I stopped watching it after that. Is there a way to speed up the process? What I notice is that it seems to launch the .jar file and close it again for every single entry, of which I have quite a few (Task Manager keeps showing Java starting and stopping). Any ideas to make this run faster?
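One common workaround for exactly this symptom (not part of the original post) is to tag all rows in a single batch call via the tagger's `tag_sents` method, so the JVM is launched once instead of once per row. A minimal sketch, assuming `st` and `df` exist as defined above; the noun-counting step is factored into a small pure function:

```python
# Set of Penn Treebank noun tags, matching the tags summed in the question.
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def count_nouns(tagged_sentence):
    # tagged_sentence is a list of (word, tag) pairs as returned by the tagger.
    return sum(1 for _, tag in tagged_sentence if tag in NOUN_TAGS)

# Hypothetical usage with the tagger from the question: one batched call
# tags every row, so Java starts only once instead of once per row.
# sentences = [s.split() for s in df['string']]
# tagged = st.tag_sents(sentences)
# df['noun_count'] = [count_nouns(t) for t in tagged]
```

The per-row `apply` pays the JVM startup cost for every string, which is why Task Manager shows Java cycling; batching amortizes that cost over the whole column.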