Sgemm not multithreaded where dgemm is - Intel MKL

I am using Intel MKL GEMM functions for matrix multiplication. Consider the following two matrix multiplications:

    cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                1.0,
                Matrix1, k,
                Matrix2, n,
                0.0,
                A, n);

where m = 1E5, n = 1E4, and k = 5. Both pca_dgemm and pca_sgemm use all 12 cores and run fine here.

However, when I do the following matrix multiplication:

    cblas_?gemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, l, n,
                1.0,
                A, n,
                Ran, l,
                0.0,
                Q, l);

where m = 1E5, n = 1E5, and l = 7 (note that the order of the passed parameters is different, but the product is still (m, n) * (n, l)). pca_dgemm uses all 12 cores and works great.

However, pca_sgemm does not: it uses only one core and, of course, takes much longer. (For sgemm I use float arrays; for dgemm, double arrays.)

Why might this be? Both give correct results, but sgemm multithreads only the first multiplication, whereas dgemm multithreads both! How can simply changing the datatype make such a difference?

Note that all arrays were allocated with mkl_malloc using 64-byte alignment.
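For reference, here is a minimal, self-contained reconstruction of the setup described above (the fill step is elided and only the sgemm variant is shown; the array names follow the question, but this is my sketch, not the asker's actual code):

    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        /* Dimensions from the question: (m x k) * (k x n) = (m x n). */
        const int m = 100000, n = 10000, k = 5;

        /* 64-byte aligned allocations via mkl_malloc, as in the question. */
        float *Matrix1 = (float *) mkl_malloc((size_t) m * k * sizeof(float), 64);
        float *Matrix2 = (float *) mkl_malloc((size_t) k * n * sizeof(float), 64);
        float *A       = (float *) mkl_malloc((size_t) m * n * sizeof(float), 64);
        if (!Matrix1 || !Matrix2 || !A) {
            fprintf(stderr, "mkl_malloc failed\n");
            return 1;
        }

        /* ... fill Matrix1 and Matrix2 here ... */

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                    1.0f, Matrix1, k, Matrix2, n, 0.0f, A, n);

        mkl_free(Matrix1);
        mkl_free(Matrix2);
        mkl_free(A);
        return 0;
    }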

Edit 2: Please also note that when l = 12, i.e. with a larger matrix, sgemm does run multithreaded. In other words, the sgemm version evidently requires larger matrices before it parallelizes, while dgemm has no such requirement. Why is this?
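One thing worth checking in a case like this is MKL's dynamic thread adjustment, which lets the library use fewer threads than the maximum when its heuristics judge a problem too small. A sketch using MKL's standard service functions (this requests 12 threads but does not guarantee them, since the heuristics can still override the request):

    #include <stdio.h>
    #include <mkl.h>

    /* Report and pin the MKL thread count before the gemm calls. */
    void configure_mkl_threads(void)
    {
        printf("MKL max threads: %d\n", mkl_get_max_threads());

        mkl_set_dynamic(0);       /* ask MKL not to reduce the thread count */
        mkl_set_num_threads(12);  /* request all 12 cores */
    }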


1 answer


The MKL functions do quite a bit of work to guess which will be the fastest way to perform the operation, so it should come as no surprise that they arrive at different strategies for doubles and singles.

When deciding which strategy to take, MKL must weigh the cost of performing the operation in a single thread against the overhead of spawning threads to run it in parallel. One factor here is that SSE instructions can operate on twice as many single-precision numbers as double-precision numbers per instruction, so the heuristic may well decide that the single-precision operation is best performed as SSE SIMD operations on a single core rather than spinning up twelve threads to do it in parallel. Exactly how much can be done in parallel depends on the details of your processor architecture; for example, SSE2 can operate on four single-precision operands or two double-precision operands at a time, whereas newer SSE instruction sets support wider data. And because the single-threaded cost per element is roughly halved for singles, the point at which threading overhead pays off shifts to larger matrices, which would explain the l = 12 observation in the question's edit.
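To make the width difference concrete, here is a small illustrative sketch using SSE2 intrinsics (this is not MKL's internal code; the function and variable names are mine):

    #include <emmintrin.h>  /* SSE2 */

    /* A 128-bit SSE register holds 4 floats but only 2 doubles, so one
       instruction performs twice as many single-precision operations. */
    void width_demo(const float *fa, const float *fb, float *fc,
                    const double *da, const double *db, double *dc)
    {
        __m128 vf = _mm_add_ps(_mm_loadu_ps(fa), _mm_loadu_ps(fb));  /* 4 floats */
        _mm_storeu_ps(fc, vf);

        __m128d vd = _mm_add_pd(_mm_loadu_pd(da), _mm_loadu_pd(db)); /* 2 doubles */
        _mm_storeu_pd(dc, vd);
    }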



I have found in the past that for small matrices/vectors it is often faster to roll your own functions than to use MKL. For example, if all your operations are on 3-vectors and 3x3 matrices, it is often faster to simply write your own BLAS-style functions in plain C, and faster still to optimize them with SSE (if you can live with the alignment constraints). For a mix of 3- and 6-vectors, a hand-written SSE version can be faster still. This is because the cost of MKL deciding which strategy to use becomes a significant overhead on small operations; a sketch of what such a hand-rolled routine might look like follows below.
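For illustration, here is the kind of hand-rolled routine meant above, in plain C (the names are mine; an SSE-optimized variant would follow the same shape):

    /* Hand-rolled 3x3 matrix times 3-vector: no strategy selection,
       no threading overhead, and trivially inlinable. */
    static inline void mat3_mul_vec3(const float M[9], const float v[3],
                                     float out[3])
    {
        out[0] = M[0] * v[0] + M[1] * v[1] + M[2] * v[2];
        out[1] = M[3] * v[0] + M[4] * v[1] + M[5] * v[2];
        out[2] = M[6] * v[0] + M[7] * v[1] + M[8] * v[2];
    }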
