Speeding up the summation of specific columns in a matrix
In short
Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to build a numpy matrix C such that C[i, j] = sum(A[k, j]) for every k where B[k] == i?
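For a dense array, the operation in question can be sketched in a few lines with np.add.at (an illustration of the target semantics only, not the sparse solution discussed below; the data is made up):

```python
import numpy as np

# Dense stand-in for the sparse matrix A; B[k] is the group row k belongs to.
A = np.array([[1, 0],
              [2, 3],
              [0, 4]])
B = np.array([0, 1, 0])

C = np.zeros((B.max() + 1, A.shape[1]), dtype=A.dtype)
np.add.at(C, B, A)  # C[i, j] += A[k, j] for every k with B[k] == i
print(C)
# [[1 4]
#  [2 3]]
```

np.add.at accumulates repeated indices, which plain fancy-index assignment would not do.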
Details of the question
I found a solution for this, but I'm not very happy with how long it takes. I'll explain the problem first, then my solution, then show my code, and then show my timings.
Problem
I am working on a clustering algorithm in Python that I would like to speed up. I have a sparse csr_matrix pam that records, per person and per article, how many items of that article the person bought. I also have a numpy array clustering that gives the cluster each person belongs to. Example:
pam                  pam.T                      clustering
        article               person
person [[1 0 0 0]    article [[1 0 1 0 0 0]     [0 0 0 0 1 1]
        [0 2 0 0]            [0 2 1 0 0 0]
        [1 1 0 0]            [0 0 0 1 0 1]
        [0 0 1 0]            [0 0 0 0 1 2]]
        [0 0 0 1]
        [0 0 1 2]]
What I want to calculate is acm, the number of items bought per article by all the people in each cluster. That is, for each column i of acm, add up the columns p of pam.T for which clustering[p] == i.
acm
         cluster
article [[2 0]
         [3 0]
         [1 1]
         [0 3]]
Solution
First, I create another sparse matrix pcm, in which element [i, j] is True if person i is in cluster j. Result (converted to a dense matrix):
pcm
         cluster
person [[ True False]
        [ True False]
        [ True False]
        [ True False]
        [False  True]
        [False  True]]
I then multiply pam.T by pcm to get the desired matrix.
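On the toy example above, the whole method fits in a few lines (a sketch only; the full timed version follows below):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy person-article matrix and clustering from the example above.
pam = csr_matrix(np.array([[1, 0, 0, 0],
                           [0, 2, 0, 0],
                           [1, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 1, 2]]))
clustering = np.array([0, 0, 0, 0, 1, 1])

# Person-cluster indicator matrix: pcm[i, j] is True iff person i is in cluster j.
pcm = csr_matrix(
    (np.ones(clustering.size, dtype=bool),
     (np.arange(clustering.size), clustering)),
    shape=(clustering.size, clustering.max() + 1),
)

acm = pam.T.dot(pcm).toarray()
print(acm)
# [[2 0]
#  [3 0]
#  [1 1]
#  [0 3]]
```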
Code
I wrote the following program to test how long this method takes in practice.
import numpy as np
from scipy.sparse import csr_matrix
from timeit import timeit


def _clustering2pcm(clustering):
    '''
    Converts a clustering (np array) into a person-cluster matrix (pcm).
    '''
    N_persons = clustering.size
    m_person = np.arange(N_persons)
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    m_data = [True] * N_persons
    pcm = csr_matrix((m_data, (m_person, clustering)), shape=(N_persons, N_clusters))
    return pcm


def pam_clustering2acm():
    '''
    Converts a person-article matrix and a given clustering into an
    article-cluster matrix.
    '''
    global clustering
    global pam
    pcm = _clustering2pcm(clustering)
    acm = csr_matrix.transpose(pam).dot(pcm).todense()
    return acm


if __name__ == '__main__':
    N_persons = 200000
    N_articles = 400
    N_shoppings = 400000
    N_clusters = 20

    m_person = np.random.choice(np.arange(N_persons), size=N_shoppings, replace=True)
    m_article = np.random.choice(np.arange(N_articles), size=N_shoppings, replace=True)
    m_data = np.random.choice([1, 2], p=[0.99, 0.01], size=N_shoppings, replace=True)
    pam = csr_matrix((m_data, (m_person, m_article)), shape=(N_persons, N_articles))
    clustering = np.random.choice(np.arange(N_clusters), size=N_persons, replace=True)

    print(timeit(pam_clustering2acm, number=100))
Timing
It turns out these 100 runs take 5.1 seconds, 3.6 of which are spent creating pcm. I have a feeling there might be a faster way to compute this matrix without creating a temporary sparse matrix, but I don't see one without a loop. Is there a faster way to build it?
EDIT
Following Martino's answer, I tried implementing a loop over clusters with a slicing algorithm, but it is even slower. Computing acm 100 times takes 12.5 seconds, of which 4.1 seconds remain if I delete the line acm[:,i] = pam[p,:].sum(axis = 0).
def pam_clustering2acm_loopoverclusters():
    global clustering
    global pam
    N_articles = pam.shape[1]
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    acm = np.zeros([N_articles, N_clusters])
    for i in clusters:
        p = np.where(clustering == i)[0]
        acm[:, i] = pam[p, :].sum(axis=0)
    return acm
Answer
This is about 50 times faster than your function _clustering2pcm:
def pcm(clustering):
    n = clustering.size
    data = np.ones((n,), dtype=bool)
    indptr = np.arange(n + 1)
    return csr_matrix((data, clustering, indptr))
I haven't looked at the source code, but when you pass the CSR constructor a (data, (rows, cols)) structure, it almost certainly builds a COO matrix first and then converts it to CSR. Because your matrix is so simple, it is very easy to assemble the CSR description arrays directly, as above, and skip all of that.
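As a quick sanity check (a sketch with made-up data), both constructions produce the same matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

clustering = np.array([0, 0, 0, 0, 1, 1])
n = clustering.size

# Generic constructor: goes through a COO intermediate.
via_coo = csr_matrix(
    (np.ones(n, dtype=bool), (np.arange(n), clustering)),
    shape=(n, clustering.max() + 1),
)

# Direct CSR description: row i owns exactly one entry, in column clustering[i].
data = np.ones(n, dtype=bool)
indptr = np.arange(n + 1)
direct = csr_matrix((data, clustering, indptr))

assert (via_coo != direct).nnz == 0  # no differing entries
```

The direct form works because each person sits in exactly one cluster, so the indptr array is simply 0, 1, 2, ... n.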
This cuts the execution time down to about a third:
In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop
In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop
In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True
Answer
I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), which say that row slicing is efficient (as opposed to column slicing), so it is probably best to stick with the non-transposed matrix. If you scroll down, there is also a sum method that takes an axis argument. It is generally best to use the methods that come with your matrix object, as they may use compiled code. You can do this by looping over the clusters (of which I assume there aren't many).
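A minimal sketch of that suggestion with made-up data (it is essentially the loop the question's edit already times):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy person-article matrix and clustering, invented for illustration.
pam = csr_matrix(np.array([[1, 0],
                           [0, 2],
                           [3, 0],
                           [0, 4]]))
clustering = np.array([0, 1, 0, 1])

clusters = np.unique(clustering)
acm = np.zeros((pam.shape[1], clusters.size), dtype=pam.dtype)
for i in clusters:
    rows = np.where(clustering == i)[0]
    # Row slicing is cheap on CSR, and .sum(axis=0) runs in compiled code.
    acm[:, i] = np.asarray(pam[rows, :].sum(axis=0)).ravel()
print(acm)
# [[4 0]
#  [0 6]]
```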