Efficient implementation of word counting across multiple lists using Python
I have a list of comments in the following format:
Comments=[['hello', 'world'], ['would', 'hard', 'press'],['find', 'place', 'less']]
wordset={'hello','world','hard','would','press','find','place','less'}
I want a table or dataframe that has wordset as its index and, for each comment in Comments, a column with the individual word counts.
The following code produces the required dataframe, but I am looking for a more efficient implementation. This matters a lot: the corpus is large, so this step has a huge impact on the performance of our ranking algorithm.
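import pandas as pd

result = pd.DataFrame()
for comment in Comments:
    # Start every word of the vocabulary at zero for this comment.
    worddict_terms = dict.fromkeys(wordset, 0)
    for items in comment:
        worddict_terms[items] += 1
    df_comment = pd.DataFrame.from_dict([worddict_terms])
    frames = [result, df_comment]
    # ignore_index numbers the rows 0, 1, 2, ... so they become the column labels after transposing.
    result = pd.concat(frames, ignore_index=True)
Comments_raw_terms = result.transpose()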
Expected Result:
       0  1  2
hello  1  0  0
world  1  0  0
would  0  1  0
press  0  1  0
find   0  0  1
place  0  0  1
less   0  0  1
hard   0  1  0
Try this approach:
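import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# Join each list of words back into one space-separated string per comment.
text = pd.Series(Comments).str.join(' ')
X = vect.fit_transform(text)
# On scikit-learn versions before 1.0, use vect.get_feature_names() instead.
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())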
Result:
In [49]: r
Out[49]:
   find  hard  hello  less  place  press  world  would
0     0     0      1     0      0      0      1      0
1     0     1      0     0      0      1      0      1
2     1     0      0     1      1      0      0      0
In [50]: r.T
Out[50]:
       0  1  2
find   0  0  1
hard   0  1  0
hello  1  0  0
less   0  0  1
place  0  0  1
press  0  1  0
world  1  0  0
would  0  1  0
Pure Pandas solution:
In [61]: pd.get_dummies(text.str.split(expand=True), prefix_sep='', prefix='')
Out[61]:
   find  hello  would  hard  place  world  less  press
0     0      1      0     0      0      1     0      0
1     0      0      1     1      0      0     0      1
2     1      0      0     0      1      0     1      0
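One caveat, as far as I can tell: get_dummies builds indicator columns per word position, so a word that shows up in different positions across comments may produce duplicate columns rather than a single count. If real counts are needed, a minimal sketch with explode and crosstab might look like the following. The names words and counts are only illustrative, and the data is the small example from the question, assuming each comment is already a list of single words.

import pandas as pd

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less']]
wordset = {'hello', 'world', 'hard', 'would', 'press', 'find', 'place', 'less'}

# One row per (comment id, word); the index keeps the comment id.
words = pd.Series(Comments).explode()

# Cross-tabulate words against comment ids to get per-comment counts.
counts = pd.crosstab(index=words.to_numpy(), columns=words.index.to_numpy(),
                     rownames=['word'], colnames=['comment'])

# Restrict (and order) the rows by the known vocabulary; absent words become 0.
counts = counts.reindex(sorted(wordset), fill_value=0)
print(counts)

This already has the words as the index and one column per comment, so no transpose is needed. I have not benchmarked it against CountVectorizer on a large corpus.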
I think your nested loop is more complicated than it needs to be. The code below replaces the two for loops with a single map call. I have only written the part that builds the count dictionary for each comment, restricted here to "hello" and "world". Please copy over the rest of the table-building code that uses pandas.
from collections import Counter
from funcy import project

def fun(comment):
    wordset = {'hello', 'world'}
    # Count every word in the comment, then keep only the words in wordset.
    temp_dict_comment = dict(Counter(comment))
    final_dict = project(temp_dict_comment, wordset)
    print(final_dict)
    return final_dict

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less', 'excitingit', 'wors', 'watch', 'paint', 'dri']]
# In Python 3, map is lazy, so wrap it in list() to actually run fun on each comment.
list(map(fun, Comments))
This should help, as it uses a single map call instead of two for loops.
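For the remaining step, here is a rough sketch of how the per-comment dictionaries could be pivoted into a word-by-comment table using only the standard library. The helper count_terms and the names per_comment and table are hypothetical, and the data is the small example from the question rather than a real corpus.

from collections import Counter

Comments = [['hello', 'world'], ['would', 'hard', 'press'], ['find', 'place', 'less']]
wordset = {'hello', 'world', 'hard', 'would', 'press', 'find', 'place', 'less'}

def count_terms(comment, vocabulary):
    """Return {word: count} for the words of comment that are in vocabulary."""
    return {word: n for word, n in Counter(comment).items() if word in vocabulary}

# One small dictionary per comment; only the words that actually occur are stored.
per_comment = [count_terms(comment, wordset) for comment in Comments]

# Pivot into {word: [count in comment 0, count in comment 1, ...]}.
table = {word: [counts.get(word, 0) for counts in per_comment]
         for word in sorted(wordset)}

for word, row in table.items():
    print(word, row)

If pandas is available, pd.DataFrame.from_dict(table, orient='index') should turn this into a dataframe with the words as index and one column per comment.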