Efficient implementation of word counting across multiple lists using Python

I have a list of comments in the following format:

Comments=[['hello', 'world'], ['would', 'hard', 'press'],['find', 'place', 'less']]



I want a table or dataframe that has the wordset as its index and the individual counts for each comment in Comments as its columns.

I have been working with the following code, which produces the required dataframe, but I am looking for a more efficient implementation. Since the corpus is large, this has a huge impact on the performance of our ranking algorithm.

        for comment in Comments:
            for items in comment:
            result = pd.concat(frames)



Expected Result:

        0   1   2
hello   1   0   0
world   1   0   0
would   0   1   0
press   0   1   0
find    0   0   1
place   0   0   1
less    0   0   1
hard    0   1   0




2 answers

Try this approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

text = pd.Series(Comments).str.join(' ')
X = vect.fit_transform(text)

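# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())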



In [49]: r
   find  hard  hello  less  place  press  world  would
0     0     0      1     0      0      0      1      0
1     0     1      0     0      0      1      0      1
2     1     0      0     1      1      0      0      0

In [50]: r.T
       0  1  2
find   0  0  1
hard   0  1  0
hello  1  0  0
less   0  0  1
place  0  0  1
press  0  1  0
world  1  0  0
would  0  1  0


Pure Pandas solution:

In [61]: pd.get_dummies(text.str.split(expand=True), prefix_sep='', prefix='')
   find  hello  would  hard  place  world  less  press
0     0      1      0     0      0      1     0      0
1     0      0      1     1      0      0     0      1
2     1      0      0     0      1      0     1      0
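
Note that get_dummies encodes each word position separately and yields indicators rather than counts, so a word that occurs at more than one position may show up as several columns instead of one total. If real counts are needed, one possible pure-pandas sketch is to explode the comments into one row per word and cross-tabulate. The column names comment and word are illustrative, and the repeated 'hello' is added here only to show that repeats are counted.

```python
import pandas as pd

Comments = [['hello', 'world', 'hello'], ['would', 'hard', 'press'], ['find', 'place', 'less']]

# long format: one row per (comment number, word) pair
long = (
    pd.Series(Comments)
    .explode()
    .rename('word')
    .rename_axis('comment')
    .reset_index()
)

# rows = words, columns = comment numbers, values = occurrence counts
counts = pd.crosstab(long['word'], long['comment'])
```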




I think your nested loop is more complicated than it needs to be. The code below replaces the 2 for loops with a single map call. I have only written it up to the point where, for each comment in Comments, you get a count dictionary (for example, for "hello" and "world"). Please carry over the rest of your table-building code using pandas.

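from collections import Counter
from funcy import project

Comments=[['hello', 'world'], ['would', 'hard', 'press'],['find', 'place', 'less', 'excitingit', 'wors', 'watch', 'paint', 'dri']]

# wordset is not defined in this snippet; assumed here to be the set of all words
wordset = {word for comment in Comments for word in comment}

def fun(comment):
    # count the words of one comment and keep only those present in wordset
    temp_dict_comment = dict(Counter(comment))
    final_dict = project(temp_dict_comment, wordset)
    print(final_dict)
    return final_dict

# a single map replaces the two nested for loops
count_dicts = list(map(fun, Comments))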


This should help, as it uses a single map instead of 2 for loops.
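
To get from the per-comment count dictionaries to the table shown in the question, one option is to hand the Counter objects straight to pandas. This is a sketch rather than the answerer's own completion: it leaves out funcy, and the names counts and table are illustrative.

```python
from collections import Counter

import pandas as pd

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less']]

# one Counter per comment; map replaces the nested loops
counts = list(map(Counter, Comments))

# each Counter becomes one row; words absent from a comment are NaN, then 0
table = pd.DataFrame(counts).fillna(0).astype(int).T
```

After the transpose, the words form the index and each column holds the counts for one comment, which matches the layout of the expected result.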


