Calling BLAS / LAPACK directly using the SciPy interface and Cython

There was a post here: https://gist.github.com/JonathanRaiman/f2ce5331750da7b2d4e9 which shows a significant speed improvement just by calling the Fortran libraries directly (BLAS / LAPACK / Intel MKL / OpenBLAS / whatever you installed with NumPy). After many hours of work (the gist targets outdated SciPy versions) I finally got it to compile without errors, and it was about 2x faster than NumPy. Unfortunately, as another user pointed out, the Fortran routine always adds the existing output matrix to the newly calculated result, i.e. it computes A := alpha*x*y.T + A, so it only matches NumPy on the first run. A quick fix for this has yet to be found.

[UPDATE: if you are looking to use the SciPy Cython interface only, look here https://github.com/scipy/scipy/blob/master/scipy/linalg/cython_blas.pyx to see how the optimized BLAS / LAPACK calls are wrapped and cimported in your own Cython script; the section marked "# Python-accessible wrappers for testing:" shows working calls, and a sketch using that interface is shown below.

Also at the link above cython_lapack.pyx is available, but it does not come with Cython test scripts.]
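
For the cython_blas route mentioned in the update, here is a minimal sketch (the module and function names are mine, not from the original post) that cimports dger straight from scipy.linalg.cython_blas instead of pulling the function pointer out of a capsule. It assumes SciPy 0.16+ and uses typed memoryviews:

# sketch_cython_blas.pyx -- a minimal sketch, not the original poster's code
import numpy as np
cimport numpy as np
from scipy.linalg.cython_blas cimport dger

cpdef outer_prod_cyblas(double[::1] x, double[::1] y, double[:, ::1] out):
    # BLAS is column-major, so the C-contiguous (len(x), len(y)) output array
    # is handed to dger as its own transpose: M rows = len(y), N cols = len(x).
    cdef int M = y.shape[0]
    cdef int N = x.shape[0]
    cdef int inc = 1
    cdef double alpha = 1.0
    with nogil:
        # A := alpha*X*Y**T + A, so `out` must already be zeroed by the caller
        dger(&M, &N, &alpha, &y[0], &inc, &x[0], &inc, &out[0, 0], &M)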

# TEST SCRIPT

import numpy as np
from cyblas import outer_prod
a = np.random.randint(0, 100, 1000)
b = np.random.randint(0, 100, 1000)
a = a.astype(np.float64)
b = b.astype(np.float64)
cy_outer = np.zeros((a.shape[0], b.shape[0]))
np_outer = np.zeros((a.shape[0], b.shape[0]))

%timeit outer_prod(a, b, cy_outer)
#%timeit outer_prod(a, b)  # use with the fixed version instead of the line above; it will update cy_outer automatically
%timeit np.outer(a, b, np_outer)
100 loops, best of 3: 2.83 ms per loop
100 loops, best of 3: 6.58 ms per loop


# END TEST SCRIPT
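
A quick correctness check (my own addition, assuming the script above was just run) makes the accumulation problem visible:

print(np.allclose(cy_outer, np_outer))   # False: %timeit called outer_prod many times, so cy_outer accumulated

cy_outer[:] = 0.0
outer_prod(a, b, cy_outer)
print(np.allclose(cy_outer, np_outer))   # True: a single call on a zeroed output matches np.outer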

PYX file to compile: cyblas.pyx (mostly the np.ndarray version)

import cython
import numpy as np
cimport numpy as np

from cpython cimport PyCapsule_GetPointer 
cimport scipy.linalg.cython_blas
cimport scipy.linalg.cython_lapack
import scipy.linalg as LA

REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t  INT_t

cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0

ctypedef void (*dger_ptr) (const int *M, const int *N, const double *alpha, const double *X, const int *incX, double *Y, const int *incY, double *A, const int * LDA) nogil
cdef dger_ptr dger=<dger_ptr>PyCapsule_GetPointer(LA.blas.dger._cpointer, NULL)  # A := alpha*x*y.T + A

cpdef outer_prod(_x, _y, _output):
#cpdef outer_prod(_x, _y): # comment out the line above & use this signature if you want the output matrix reset to zeros on each call
    cdef REAL_t *x = <REAL_t *>(np.PyArray_DATA(_x))
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    #cdef np.ndarray[np.float64_t, ndim=2, order='c'] _output = np.zeros((M,N))  # slow fix: uncomment to allocate a zeroed output matrix on every call
    cdef REAL_t *y = <REAL_t *>(np.PyArray_DATA(_y))
    cdef REAL_t *output = <REAL_t *>(np.PyArray_DATA(_output))
    with nogil:
        dger(&M, &N, &ONEF, y, &ONE, x, &ONE, output, &M)
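
The post does not show the build step; a minimal setup.py sketch that should compile cyblas.pyx (assuming Cython and the NumPy headers are installed) is:

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("cyblas.pyx"),
    include_dirs=[np.get_include()],  # needed for "cimport numpy"
)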


MAJOR CAVEAT (hope this saves other people some time): the code above mostly works. As noted in the comments, it matches NumPy on the first call only; every subsequent call adds onto the existing result matrix again. If I reset the output matrix to zeros and re-run, it matches NumPy again. Strangely, if you uncomment the lines marked above it also works, but then only at NumPy speed. An alternative using memset was found and will go in another post (a sketch is below); I just did not know how to call it yet.
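
Since the memset alternative is only hinted at above, here is a sketch of what such a fixed wrapper could look like (the name outer_prod_fixed is mine); it reuses REAL_t, ONE, ONEF and dger from the cyblas.pyx listing and zeroes the caller's output buffer before every call:

from libc.string cimport memset

cpdef outer_prod_fixed(_x, _y, _output):
    cdef REAL_t *x = <REAL_t *>(np.PyArray_DATA(_x))
    cdef REAL_t *y = <REAL_t *>(np.PyArray_DATA(_y))
    cdef REAL_t *output = <REAL_t *>(np.PyArray_DATA(_output))
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    with nogil:
        # zero all M*N entries so the "+ A" part of dger contributes nothing
        memset(output, 0, M * N * sizeof(REAL_t))
        dger(&M, &N, &ONEF, y, &ONE, x, &ONE, output, &M)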





1 answer


According to netlib, dger(M, N, ALPHA, X, INCX, Y, INCY, A, LDA) performs

A := alpha*x*y**T + A

Thus, A must be all zeros to obtain the outer product of X and Y.
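
The same behaviour can be demonstrated from plain Python with SciPy's Python-level wrapper scipy.linalg.blas.dger (a small illustration of this answer, not code from the question):

import numpy as np
from scipy.linalg.blas import dger

x = np.arange(3, dtype=np.float64)
y = np.arange(4, dtype=np.float64)
A = np.zeros((3, 4), order='F')            # Fortran order, as BLAS expects

A = dger(1.0, x, y, a=A, overwrite_a=1)    # A started at zero, so A == outer(x, y)
print(np.allclose(A, np.outer(x, y)))      # True

A = dger(1.0, x, y, a=A, overwrite_a=1)    # second call adds onto the previous result
print(np.allclose(A, 2 * np.outer(x, y)))  # True: the output has doubled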











