Speeding up the summation of specific columns in a matrix

In short

Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to build a numpy matrix C such that C[i,j] = sum(A[k,j]) for every k where B[k] == i?
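For concreteness, here is a direct (and deliberately slow) reference implementation of that definition. It is only a sketch to pin down the semantics, not the solution discussed below:

import numpy as np
from scipy.sparse import csr_matrix

def group_sum_reference(A, B):
    '''Reference: C[i, j] = sum of A[k, j] over all k with B[k] == i.'''
    C = np.zeros((B.max() + 1, A.shape[1]))
    for k in range(A.shape[0]):
        # Add row k of A to the row of C that its group B[k] points to.
        C[B[k]] += A[k].toarray().ravel()
    return C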

Details of the question

I found a solution for this, but I'm not very happy with how long it takes. I'll explain the problem first, then my solution, then show my code, and then show my timings.

Problem: I am working on a clustering algorithm in Python and I would like to speed it up. I have a sparse csr_matrix pam that stores, for each person and each article, how many items of that article the person bought. I also have a numpy array clustering that denotes the cluster each person belongs to. Example:

      pam                pam.T              clustering
    article              person
p [[1 0 0 0]      a 
e  [0 2 0 0]      r [[1 0 1 0 0 0]         [0 0 0 0 1 1]
r  [1 1 0 0]      t  [0 2 1 0 0 0]
s  [0 0 1 0]      i  [0 0 0 1 0 1]
o  [0 0 0 1]      c  [0 0 0 0 1 2]]
n  [0 0 1 2]]     l
                  e


What I want to calculate is acm, the number of items bought per article by all the people in each cluster. That is, column i of acm is the sum of those columns p of pam.T for which clustering[p] == i.

    acm
  cluster
a 
r [[2 0]
t  [3 0]
i  [1 1]
c  [0 3]]
l
e


Solution: First, I create another sparse matrix pcm, in which element [i,j] is True if person i is in cluster j. Result (shown as a dense matrix):

        pcm
      cluster
p [[ True False]
e  [ True False]
r  [ True False]
s  [ True False]
o  [False  True]
n  [False  True]]

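For intuition, the same indicator matrix can be written densely with a broadcast comparison (just a sketch of the idea, not the construction used in the code below):

import numpy as np

clustering = np.array([0, 0, 0, 0, 1, 1])
# Entry [i, j] is True exactly when person i belongs to cluster j.
pcm_dense = clustering[:, None] == np.arange(clustering.max() + 1)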

I then multiply pam.T by pcm to get the desired matrix.

Code: I wrote the following program to test how long this method takes in practice.

import numpy as np
from scipy.sparse import csr_matrix
from timeit import timeit

def _clustering2pcm(clustering):
    '''
    Converts a clustering (np array) into a person-cluster matrix (pcm)
    '''        
    N_persons = clustering.size
    m_person = np.arange(N_persons)
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    m_data = [True] * N_persons

    pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))

    return pcm

def pam_clustering2acm():
    '''
    Convert a person-article matrix and a given clustering into an 
    article-cluster matrix
    '''
    global clustering
    global pam

    pcm = _clustering2pcm(clustering)    
    acm = csr_matrix.transpose(pam).dot(pcm).todense()
    return acm

if __name__ == '__main__':
    global clustering
    global pam

    N_persons = 200000
    N_articles = 400
    N_shoppings = 400000
    N_clusters = 20

    m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
    m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
    m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
    pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))

    clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)
    print(timeit(pam_clustering2acm, number = 100))


Timing: It turns out these 100 runs take 5.1 seconds, of which 3.6 seconds are spent creating pcm. I have a feeling there might be a faster way to calculate this matrix without creating a temporary sparse matrix, but I can't see one without a loop. Is there a faster way to build it?
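For what it's worth, that breakdown can be measured directly by timing the pcm construction separately from the full pipeline; this is only a sketch, reusing the names from the script above:

from timeit import timeit

t_pcm = timeit(lambda: _clustering2pcm(clustering), number=100)  # pcm construction only
t_all = timeit(pam_clustering2acm, number=100)                   # full computation
print(t_pcm, t_all)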

EDIT

Following Martino's answer, I tried implementing a loop over the clusters with a slicing approach, but it is even slower: computing acm 100 times takes 12.5 seconds, of which 4.1 seconds remain if I delete the summation line acm[:, i] = np.asarray(pam[p, :].sum(axis=0)).ravel().

def pam_clustering2acm_loopoverclusters():
    global clustering
    global pam

    N_articles = pam.shape[1]
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    acm = np.zeros([N_articles, N_clusters])
    for i in clusters:
        # Select the rows (persons) in cluster i, then sum them per article.
        p = np.where(clustering == i)[0]
        acm[:, i] = np.asarray(pam[p, :].sum(axis=0)).ravel()

    return acm




2 answers


This is about 50 times faster than your _clustering2pcm function:

def pcm(clustering):
    n = clustering.size
    data = np.ones((n,), dtype=bool)
    # Row i has exactly one nonzero entry, in column clustering[i].
    indptr = np.arange(n + 1)
    return csr_matrix((data, clustering, indptr))

      

I haven't looked at the source code, but when you pass the CSR constructor a (data, (rows, cols)) structure, it almost certainly builds a COO matrix first and then converts it to CSR. Since your matrix is so simple, it is very easy to assemble the CSR description arrays directly, as above, and skip all of that.
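A quick way to convince yourself that the direct (data, indices, indptr) construction builds the same matrix as the (data, (rows, cols)) form (a small check, with the shape inferred from the data):

import numpy as np
from scipy.sparse import csr_matrix

clustering = np.array([0, 0, 0, 0, 1, 1])
n = clustering.size
data = np.ones(n, dtype=bool)

# COO-style constructor: goes through an intermediate conversion to CSR.
via_coo = csr_matrix((data, (np.arange(n), clustering)))

# Direct CSR constructor: row i has exactly one entry, in column clustering[i].
direct = csr_matrix((data, clustering, np.arange(n + 1)))

assert (via_coo != direct).nnz == 0  # no elements differ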



This cuts the total execution time to roughly a third:

In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop

In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop

In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True




I refer you to the scipy.sparse docs ( http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix ), which say that row slicing is efficient (as opposed to column slicing), so it is probably best to stick with the non-transposed matrix. If you scroll further down, there is a sum method where you can specify the axis. It is generally best to use the methods that come with your objects, as they are likely to use compiled code. You can do this by looping over the clusters (of which I assume there aren't many).
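As a minimal illustration of this suggestion (a sketch with made-up data, not the questioner's variables):

import numpy as np
from scipy.sparse import csr_matrix

pam = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
clustering = np.array([0, 1, 0])

# Row slicing on a CSR matrix is cheap; sum(axis=0) runs in compiled code.
rows = np.where(clustering == 0)[0]
col_totals = np.asarray(pam[rows, :].sum(axis=0)).ravel()  # -> array([4, 0])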




