Speeding up the NLTK Stanford POS tagger on a pandas DataFrame

So I am trying to use the Stanford POS tagger (via NLTK) on a pandas DataFrame column of strings, extracting the number of nouns in each row. Here's what I have:

from nltk.tag.stanford import POSTagger
from collections import Counter
import os

# Point NLTK at the Java runtime the Stanford tagger needs
java_path = "C:/Program Files (x86)/Java/jre1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

st = POSTagger('.../stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger',
               '.../stanford-postagger-2015-04-20/stanford-postagger.jar')

def noun_count(x):
    # tag() returns a list of lists of (word, tag) pairs here, so flatten it first
    listoflists = st.tag(x.split())
    flat = [y for z in listoflists for y in z]
    pos = [tag for _word, tag in flat]
    c = Counter(pos)
    return c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']

Then I use apply to run it over the DataFrame and create a new column:

df['noun_count'] = df['string'].apply(noun_count)

And it works ... eventually. It ran for at least an hour, maybe longer; I stopped watching after that. Is there a way to speed up the process? What I notice is that the .jar file seems to be launched and shut down for every single row, of which I have quite a few (Task Manager keeps showing Java starting and stopping). Any ideas to make this faster?
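Since each call to `tag` apparently spawns its own Java process, one sketch of a fix (assuming the NLTK Stanford tagger in use exposes `tag_sents`, and using the `st` and `df['string']` names from above) is to tokenize every row up front and tag them all in a single batched call, so the JVM starts once instead of once per row:

```python
# Tags counted as nouns, matching noun_count above
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def count_nouns(tagged_sentence):
    # tagged_sentence is a list of (word, tag) pairs for one row
    return sum(1 for _word, tag in tagged_sentence if tag in NOUN_TAGS)

# Hypothetical batched usage, assuming st and df as defined earlier:
# tokenized = [s.split() for s in df['string']]
# tagged = st.tag_sents(tokenized)   # one Java process for all rows
# df['noun_count'] = [count_nouns(sent) for sent in tagged]
```

The per-row `apply` pays the JVM startup cost on every element; batching amortizes it over the whole column.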
