How does machine architecture affect the performance of numpy operations?

I have Ubuntu 14.04 with the "Anaconda" Python distribution and the Intel Math Kernel Library (MKL) installed. My processor is an Intel Xeon with 8 cores and no Hyper-Threading (so only 8 hardware threads).

For me, numpy's tensordot consistently outperforms einsum for large arrays. However, other people have found very little difference between the two, or even that einsum can outperform tensordot for some operations.

For people whose numpy distribution is built with a fast BLAS library, I'm wondering why this might be happening. Does MKL run slowly on non-Intel processors? Or does einsum run faster on newer Intel processors with better streaming capabilities?

Here is some sample code to compare performance on my machine (an IPython session with the numpy namespace imported, e.g. via %pylab):

In  [27]: a = rand(100,1000,2000)

In  [28]: b = rand(50,1000,2000)

In  [29]: time cten = tensordot(a, b, axes=[(1,2),(1,2)])
CPU times: user 7.85 s, sys: 29.4 ms, total: 7.88 s
Wall time: 1.08 s

In  [30]: "FLOPS TENSORDOT: {}.".format(cten.size * 1000 * 2000 / 1.08)
Out [30]: 'FLOPS TENSORDOT: 9259259259.26.'

In  [31]: time cein = einsum('ijk,ljk->il', a, b)
CPU times: user 42.3 s, sys: 7.58 ms, total: 42.3 s
Wall time: 42.4 s

In  [32]: "FLOPS EINSUM: {}.".format(cein.size * 1000 * 2000 / 42.4)
Out [32]: 'FLOPS EINSUM: 235849056.604.'

      

The tensordot computations consistently run in the 5-20 GFLOPS range, while I only get about 0.2 GFLOPS with einsum.
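To make those figures explicit (counting one floating-point operation per multiply, as the formulas in the session above do):

flops = 100 * 50 * 1000 * 2000      # output elements x length of each dot product = 1e10
print(flops / 1.08 / 1e9)           # ~9.3 GFLOPS for tensordot
print(flops / 42.4 / 1e9)           # ~0.24 GFLOPS for einsum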


1 answer


You are essentially comparing two different things:

  • np.einsum computes the tensor product with nested for loops in C. It has some SIMD optimizations, but it is not multithreaded and does not use MKL.

  • np.tensordot works by reshaping/broadcasting the input arrays and then calling BLAS (MKL, OpenBLAS, etc.) for a matrix multiplication. The reshape/broadcast phase introduces some extra overhead, but the matrix multiplication itself is extremely well optimized with SIMD, hand-tuned assembly, and multithreading. (A sketch of this reshape-then-GEMM equivalence follows this list.)



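To make the second point concrete, here is a minimal sketch (with smaller, made-up shapes) of the reshape-then-GEMM form that tensordot reduces to for the contraction in the question:

import numpy as np

a = np.random.rand(10, 100, 200)
b = np.random.rand(5, 100, 200)

# Contract over axes (1, 2) of both arrays, as in the question ...
cten = np.tensordot(a, b, axes=[(1, 2), (1, 2)])

# ... which is equivalent to flattening the contracted axes and making a
# single BLAS matrix-multiplication call (roughly what tensordot does internally):
cgemm = np.dot(a.reshape(a.shape[0], -1), b.reshape(b.shape[0], -1).T)

print(np.allclose(cten, cgemm))     # True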
As a result, tensordot will usually be faster than einsum even in single-core execution, unless small array sizes are used (in which case the reshape/broadcast overhead is no longer negligible). This is even more true because the former approach is multithreaded while the latter is not.
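One way to isolate the threading contribution is to pin BLAS to a single thread and re-time tensordot. A rough sketch, assuming the third-party threadpoolctl package is installed (the shapes here are made up):

import time
import numpy as np
from threadpoolctl import threadpool_limits    # pip install threadpoolctl

a = np.random.rand(50, 500, 500)
b = np.random.rand(40, 500, 500)

def bench():
    t0 = time.perf_counter()
    np.tensordot(a, b, axes=[(1, 2), (1, 2)])
    return time.perf_counter() - t0

print("all BLAS threads:", bench())
with threadpool_limits(limits=1, user_api="blas"):
    # Any remaining gap vs. einsum here comes from SIMD/blocking in the
    # BLAS kernel rather than from multithreading.
    print("one BLAS thread: ", bench())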

In conclusion, your results are perfectly normal and are likely to hold in general (Intel or non-Intel processor, recent or not, multi-core or not, MKL or OpenBLAS, etc.).
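For reference, you can check which BLAS backend your numpy build is linked against with:

import numpy as np
np.show_config()    # prints the BLAS/LAPACK libraries (MKL, OpenBLAS, ...) numpy was built with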
