Efficient plotting of sparse bijagency matrix in Numpy
I am trying to load this CSV file into a numpy sparse matrix that will represent the adjacency matrix of this bi-directional plot from user to subredd: http://figshare.com/articles/reddit_user_posting_behavior/874101
Here's an example:
603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww
There are 876,961 lines (one for each user) and 15,122 subreddits and a total of 8,495,957 associations from user to subreddit.
Here is the code I have now that takes 20 minutes to run on my MacBook Pro:
import numpy as np
from scipy.sparse import csr_matrix
row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
for x in f:
pieces = x.rstrip().split(",")
user = pieces[0]
reddits = pieces[1:]
entry_count += len(reddits)
for r in reddits: all_reddits.add(r)
row_list.append(np.array(reddits))
reddits_list = np.array(list(all_reddits))
# 5s to get this far
rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))
i=0
user_idx = 0
for row in row_list:
for reddit_idx in np.nonzero(np.in1d(reddits_list,row))[0]:
cols[i] = user_idx
rows[i] = reddit_idx
i+=1
user_idx+=1
adj = csr_matrix( (data,(rows,cols)), shape=(len(reddits_list), len(row_list)) )
It seems hard to believe this is as fast as it can happen ... Loading an 82MB file into a list of lists takes 5 seconds, but building a sparse matrix takes 200 times longer. What can I do to speed up this? Is there some file format that I can convert to this CSV in less than 20 minutes that will import faster? Is there some obviously expensive operation I am doing here that is not good? I tried to create a dense matrix and I tried to create lil_matrix
and dok_matrix
and assign 1
one at a time, not faster.
source to share
Couldn't sleep, tried one more thing ... I was able to bring it down to 10 seconds this way, at the end:
import numpy as np
from scipy.sparse import csr_matrix
user_ids = []
subreddit_ids = []
subreddits = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
for line in f:
for sr in line.rstrip().split(",")[1:]:
if sr not in subreddits:
subreddits[sr] = len(subreddits)
user_ids.append(i)
subreddit_ids.append(subreddits[sr])
i+=1
adj = csr_matrix(
( np.ones((len(userids),)), (np.array(subreddit_ids),np.array(user_ids)) ),
shape=(len(subreddits), i) )
source to share
To begin with, you can replace in the inner for
one with something like:
reddit_idx = np.nonzero(np.in1d(reddits_list,row))[0]
sl = slice(i,i+len(reddit_idx))
cols[sl] = user_idx
rows[sl] = reddit_idx
i = sl.stop
Using it nonzero(in1d())
to find matches looks good, but I haven't looked into alternatives. An alternative to slice assignment is a list extend
, but this is probably slower, especially with many lines.
Constructing strings, cols is by far the slowest part. The challenge csr_matrix
is small.
Since there are many more rows (users) than subredassets, it may be necessary to collect a list of user IDs for each subredd. You are already collecting subreddits in a set. Instead, you can collect them in the default dictionary and build a matrix from that. When tested on 3 lines replicated 100,000 times, it is noticeably faster.
from collections import defaultdict
red_dict = defaultdict(list)
user_idx = 0
with open("reddit_user_posting_behavior.csv", 'r') as f:
for x in f:
pieces = x.rstrip().split(",")
user = pieces[0]
reddits = pieces[1:]
for r in reddits:
red_dict[r] += [user_idx]
user_idx += 1
print 'done 2nd'
x = red_dict.values()
adj1 = sparse.lil_matrix((len(x), user_idx), dtype=int)
for i,j in enumerate(x):
adj1[i,j] = 1
source to share