Is sparse matrix-vector multiplication faster in Matlab than in Python?

Edit: See this question, where I learned how to parallelize sparse matrix-vector multiplication in Python using Numba, and was able to match Matlab's speed.
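
(Roughly, the idea is to write the CSR row loop yourself and let Numba parallelize it. The following is my own minimal sketch, not the exact code from that question; it assumes Numba is installed, and the function name is just something I made up for illustration.)

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def csr_matvec_parallel(n_row, indptr, indices, data, x):
    # same row-by-row loop as scipy's csr_matvec, but rows run in parallel
    y = np.zeros(n_row)
    for i in prange(n_row):
        s = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            s += data[jj] * x[indices[jj]]
        y[i] = s
    return y

# usage: y = csr_matvec_parallel(A.shape[0], A.indptr, A.indices, A.data, x.ravel())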


Original question:

I observe that sparse matrix-vector multiplication is about 4 or 5 times faster in Matlab than in Python (using scipy sparse matrices). Here are some details from the Matlab command line:

>> whos A
  Name          Size                    Bytes  Class     Attributes

  A         47166x113954            610732376  double    sparse    

>> whos ATrans
  Name             Size                   Bytes  Class     Attributes

  ATrans      113954x47166            610198072  double    sparse    

>> nnz(A)/numel(A)

ans =

    0.0071

>> whos x
  Name           Size             Bytes  Class     Attributes

  x         113954x1             911632  double              

>> myFun = @() A*x; timeit(myFun)

ans =

    0.0601

>> myFun = @() ATrans'*x; timeit(myFun)

ans =

    0.0120

      

The matrix ATrans is the transpose of A. Notice that computing A times x in Matlab takes about 0.06 seconds, but if I use the weird "transpose trick" and compute ATrans' times x (which gives the same result as A times x), the computation takes 0.012 seconds. (I don't understand why this trick works.)

Below are some timing results from the Python command line:

In[50]: type(A)
Out[50]: 
scipy.sparse.csc.csc_matrix
In[51]: A.shape
Out[51]: 
(47166, 113954)
In[52]: type(x)
Out[52]: 
numpy.ndarray
In[53]: x.shape
Out[53]: 
(113954L, 1L)
In[54]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
   ...: 
Out[54]: 
0.054835451461831324

      

So the Python runtime for computing A times x is about 4.5 times longer than the Matlab runtime, for the same matrix A. If I store A in csr format, the Python runtime is somewhat worse:

In[63]: A = sp.sparse.csr_matrix(A)
In[64]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
   ...: 
Out[64]: 
0.0722580226496575

      

Here is some information on which version of Python and which Anaconda distribution I am using:

In[2]: import sys; print('Python %s on %s' % (sys.version, sys.platform))
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32

      

Question: Why is sparse matrix-vector multiplication faster in Matlab than in Python, and how can I make it equally fast in Python?


Edit 1: Here's a clue. In Python, if I set the number of threads to 1, the runtime for dense matrix-vector multiplication suffers a lot, but the runtime for sparse matrix-vector multiplication hardly changes.

In[48]: M = np.random.rand(1000,1000)
In[49]: y = np.random.rand(1000,1)
In[50]: import mkl
In[51]: mkl.get_max_threads()
Out[51]: 
20
In[52]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[52]: 
7.232593519574948e-05
In[53]: mkl.set_num_threads(1)
In[54]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[54]: 
0.00044465965093536396
In[56]: type(A)
Out[56]: 
scipy.sparse.csc.csc_matrix
In[57]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[57]: 
0.055780856886028685
In[58]: mkl.set_num_threads(20)
In[59]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[59]: 
0.05550840215802509

      

Thus, for the dense matrix-vector product, setting the number of threads to 1 increased the runtime by a factor of about 6. But for the sparse matrix-vector product, reducing the number of threads to 1 did not change the runtime.

I think this suggests that in Python, sparse matrix-vector multiplication is not performed in parallel, whereas dense matrix-vector multiplication uses all available cores. Do you agree with this conclusion? And if so, is there a way to use all available cores for sparse matrix-vector multiplication in Python?

Note that my Anaconda installation should be using the Intel MKL optimizations by default: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations


Edit 2: I read here that, in Intel MKL, "for sparse matrices, all level 2 [BLAS] operations except for the sparse triangular solvers are threaded." This suggests that scipy is not using Intel MKL for sparse matrix-vector multiplication. It seems that @hpaulj (in an answer posted below) confirms this conclusion by inspecting the source of the csr_matvec function. So, could I just call the Intel MKL sparse matrix-vector multiplication routine directly? How would I do that?
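
My first guess at what that might look like is sketched below. The routine name mkl_cspblas_dcsrgemv comes from the MKL documentation, but it only handles square, zero-based CSR matrices (a rectangular A like mine would apparently need mkl_dcsrmv or the newer inspector-executor API instead), and the library name passed to CDLL is platform-dependent. I have not verified that this works.

import ctypes
import numpy as np
import scipy.sparse as sp

# Untested sketch: load the MKL runtime and call its square CSR mat-vec kernel.
mkl = ctypes.CDLL("mkl_rt")  # library name is an assumption; e.g. "mkl_rt.dll" on Windows, "libmkl_rt.so" on Linux

def mkl_csr_matvec(A, x):
    # y = A*x for a square csr_matrix with 0-based 32-bit indices
    A = sp.csr_matrix(A)
    m = ctypes.c_int(A.shape[0])
    data = A.data.astype(np.float64, copy=False)
    indptr = A.indptr.astype(np.int32, copy=False)
    indices = A.indices.astype(np.int32, copy=False)
    xx = np.ascontiguousarray(x, dtype=np.float64).ravel()
    y = np.empty(A.shape[0], dtype=np.float64)
    mkl.mkl_cspblas_dcsrgemv(
        ctypes.c_char_p(b"N"), ctypes.byref(m),
        data.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        indptr.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
        indices.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
        xx.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))
    return y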


Edit 3: Here's another data point. The Matlab sparse matrix-vector multiply runtime does not change when I set the maximum number of threads to 1.

>> maxNumCompThreads

ans =

    20

>> myFun = @() ATrans'*x; timeit(myFun)

ans =

   0.012545604076342

>> maxNumCompThreads(1) % set number of threads to 1

ans =

    20

>> maxNumCompThreads % Check that max number of threads is 1

ans =

     1

>> myFun = @() ATrans'*x; timeit(myFun)

ans =

   0.012164191957568

      

This leads me to question my previous theory that Matlab's advantage is due to multithreading.


2 answers


We get timing variations depending on the mix of sparse and dense:

In [40]: A = sparse.random(1000,1000,.1, format='csr')
In [41]: x = np.random.random((1000,1))
In [42]: Ad = A.A
In [43]: xs = sparse.csr_matrix(x)

      

sparse with dense produces a dense result, but sparse with sparse produces a sparse one:

In [47]: A.dot(xs)
Out[47]: 
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>

In [48]: np.allclose(A.dot(x), Ad.dot(x))
Out[48]: True
In [49]: np.allclose(A.dot(x), A.dot(xs).A)
Out[49]: True

      

Compared to the alternatives, sparse with dense looks pretty good:

In [50]: timeit A.dot(x)    # sparse with dense
137 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [51]: timeit Ad.dot(x)    # dense with dense
1.03 ms ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [52]: timeit A.dot(xs)   # sparse with sparse
1.44 ms ± 644 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

      

With much larger sizes, the relative times can be different. Ditto for different sparsity.
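
Something along these lines can be used to repeat the comparison at other sizes and densities (a rough sketch; the numbers will obviously depend on the machine):

import timeit
import numpy as np
from scipy import sparse

n = 5000
x = np.random.random((n, 1))
for density in (0.001, 0.01, 0.1):
    A = sparse.random(n, n, density, format='csr')
    Ad = A.toarray()
    t_sparse = timeit.timeit(lambda: A.dot(x), number=100) / 100   # sparse with dense
    t_dense = timeit.timeit(lambda: Ad.dot(x), number=100) / 100   # dense with dense
    print(density, t_sparse, t_dense)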

I don't have access to MATLAB, so I can't do an equivalent test, although I could try it on Octave.

This is on a basic Linux (Ubuntu) machine with up-to-date numpy and scipy versions.

My earlier exploration, Directly use Intel mkl library on Scipy sparse matrix to calculate A dot A.T with less memory, dealt with sparse matrix with matrix calculations. Sparse with dense must use different code. We would have to trace it to see whether it delegates anything special to the underlying math libraries.




The csc format is a bit slower for this calculation - probably because the basic iteration is across rows.

In [80]: Ac = A.tocsc()
In [81]: timeit Ac.dot(x)
216 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

      

A.dot(x) can be traced down to this call into compiled code:

In [70]: result = np.zeros((1000,1))
In [71]: sparse._sparsetools.csr_matvec(1000,1000,A.indptr, A.indices, A.data, x, result)

      

_sparsetools is a .so file, probably compiled from Cython (.pyx) code. In sparsetools/csr.h, the (template) code for matvec is:

void csr_matvec(const I n_row,
                const I n_col,
                const I Ap[],
                const I Aj[],
                const T Ax[],
                const T Xx[],
                      T Yx[])
{
    for(I i = 0; i < n_row; i++){
        T sum = Yx[i];
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            sum += Ax[jj] * Xx[Aj[jj]];
        }
        Yx[i] = sum;
    }
}

      

It's straightforward C++ iteration over the csr attributes (indptr, indices), multiplying and summing. There is no attempt to use optimized math libraries or parallel cores.

https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h



Well it depends on what you are comparing.

MATLAB uses Intel MKL for its linear algebra computations (at least most of them).

In Python, the backend for linear algebra computations depends on the package you are using (e.g. Numpy) and the distribution (e.g. Anaconda).



If you are using Anaconda or the Intel Distribution for Python, Numpy uses Intel MKL, which means you should expect similar performance.

In other cases, it really depends on what you have.
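
One quick way to check what your NumPy build is actually linked against (the exact output format varies by NumPy version):

import numpy as np
np.__config__.show()   # look for mkl entries in the BLAS/LAPACK sections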
