Speed up the NLTK Stanford POS tagger on a pandas DataFrame column

I am trying to use NLTK's Stanford POS tagger on a pandas DataFrame column. Each row of the column holds a string, and I want to count the nouns in each row. Here's what I have:

from nltk.tag.stanford import POSTagger
from collections import Counter
import os
java_path = "C:/Program Files (x86)/Java/jre1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path
st = POSTagger('.../stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger', 
                '.../stanford-postagger-2015-04-20/stanford-postagger.jar')

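def noun_count(x):
    # st.tag returns a nested list here, so flatten it into (token, tag) pairs
    listoflists = st.tag(x.split())
    flat = [y for z in listoflists for y in z]
    # keep only the POS tag of each pair
    POS = [row[1] for row in flat]
    c = Counter(POS)
    # sum the four Penn Treebank noun tags
    nouns = c['NN']+c['NNS']+c['NNP']+c['NNPS']
    return nouns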

      

Then I use apply to run it on the DataFrame and create a new column:

df['noun_count'] = df['string'].apply(noun_count)

      

It works ... eventually. It ran for at least an hour, maybe longer; I stopped watching after that. Is there a way to speed this up? It seems to start the .jar file and shut it down again for every entry, and I have quite a few entries (the task manager keeps showing java starting and stopping). Any ideas for making this faster?
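
Since the slowdown described above appears to come from launching Java once per row, one possible direction is to hand every row to the tagger in a single call. The following is only a minimal sketch and has not been run against the real Stanford tagger. The helper name `noun_counts_batched` is made up for illustration, and it assumes the installed NLTK exposes a batch method on the tagger object (`st.tag_sents` in NLTK 3; older releases reportedly called it `batch_tag`) that takes a list of token lists and returns one list of (token, tag) pairs per input, in the same order.

```python
from collections import Counter

# Penn Treebank noun tags, the same four that noun_count() sums.
NOUN_TAGS = ("NN", "NNS", "NNP", "NNPS")


def noun_counts_batched(texts, tag_sents):
    """Return one noun count per text, calling the tagger a single time.

    `tag_sents` is assumed to behave like NLTK's `st.tag_sents`: it takes a
    list of token lists and returns, in the same order, one list of
    (token, tag) pairs per input.
    """
    tokenized = [text.split() for text in texts]
    # Skip empty rows so the tagger never sees an empty sentence.
    positions = [i for i, tokens in enumerate(tokenized) if tokens]
    tagged = tag_sents([tokenized[i] for i in positions])

    counts = [0] * len(tokenized)
    for i, sentence in zip(positions, tagged):
        tag_freq = Counter(tag for _token, tag in sentence)
        counts[i] = sum(tag_freq[t] for t in NOUN_TAGS)
    return counts
```

With the tagger from the question, usage would look something like df['noun_count'] = noun_counts_batched(df['string'], st.tag_sents). The sketch assumes every value in the column is a string; missing values would need to be filled or dropped first.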

python pandas nltk stanford-nlp pos-tagger