Accelerate NLTK POS Tagger in pandas frame

Question

I am trying to use the Stanford POS tagger through NLTK on a pandas DataFrame column. Each row of the column holds a string, and I want to count the nouns in each row. Here's what I have:

from nltk.tag.stanford import POSTagger
from collections import Counter
import os
java_path = "C:/Program Files (x86)/Java/jre1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path
st = POSTagger('.../stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger', 
                '.../stanford-postagger-2015-04-20/stanford-postagger.jar')

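def noun_count(x):
    # In this NLTK version st.tag returns a nested list, so flatten it first
    listoflists = st.tag(x.split())
    flat = [y for z in listoflists for y in z]
    # Keep only the tag from each (token, tag) pair
    POS = [row[1] for row in flat]
    c = Counter(POS)
    # Sum the four Penn Treebank noun tags
    nouns = c['NN']+c['NNS']+c['NNP']+c['NNPS']
    return nouns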

Then I use apply to run it on the DataFrame column and create a new column.

df['noun_count'] = df['string'].apply(noun_count)

And it works ... eventually. It ran for at least an hour, maybe longer; I stopped watching after that. Is there a way to speed this up? It seems to launch the .jar file and shut it down again for every entry, and I have quite a few entries (Task Manager keeps showing Java starting and stopping). Any ideas for making this faster?
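
Since the slowdown appears to come from Java being launched once per row, one direction that may help is tagging all rows in a single batched call. The sketch below assumes the tagger object exposes a tag_sents method (NLTK's Stanford tagger wrapper has one, and as far as I can tell it runs Java once for the whole batch) and that it returns one list of (token, tag) pairs per input sentence. The names noun_counts_batched, NOUN_TAGS, and fake_tag_sents are hypothetical and only for illustration; the stand-in tagger exists so the example runs without Java.

```python
# Penn Treebank noun tags, the same four summed in noun_count above.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_counts_batched(texts, tag_sents):
    """Count nouns per text using one batched tagger call.

    texts: iterable of strings, e.g. df['string'].tolist().
    tag_sents: callable that takes a list of token lists and returns one
        list of (token, tag) pairs per input (assumed: st.tag_sents).
    """
    token_lists = [text.split() for text in texts]
    # Send only non-empty rows so the tagger output stays aligned with them.
    non_empty = [i for i, toks in enumerate(token_lists) if toks]
    tagged = tag_sents([token_lists[i] for i in non_empty]) if non_empty else []
    counts = [0] * len(token_lists)
    for i, sent in zip(non_empty, tagged):
        counts[i] = sum(1 for _, tag in sent if tag in NOUN_TAGS)
    return counts


# Hypothetical stand-in for st.tag_sents, used only to make this runnable.
_DEMO_TAGS = {"dogs": "NNS", "chase": "VBP", "cats": "NNS",
              "London": "NNP", "is": "VBZ", "big": "JJ"}


def fake_tag_sents(token_lists):
    return [[(tok, _DEMO_TAGS.get(tok, "X")) for tok in toks]
            for toks in token_lists]


print(noun_counts_batched(["dogs chase cats", "", "London is big"],
                          fake_tag_sents))
```

With the real tagger, the call would presumably look like df['noun_count'] = noun_counts_batched(df['string'].tolist(), st.tag_sents). The nested list that st.tag returns in the code above suggests the output shape varies between NLTK versions, so it is worth checking what st.tag_sents returns on a couple of rows first.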

python pandas nltk stanford-nlp pos-tagger

halycos May 09 '15 at 11:43

No one has answered this question yet
