How does architecture affect numpy array performance?
I have Ubuntu 14.04 with the Anaconda Python distribution and the Intel Math Kernel Library (MKL) installed. My processor is an Intel Xeon with 8 cores and no Hyperthreading (so only 8 threads).
For me, numpy's tensordot consistently outperforms einsum for large arrays. However, others have found very little difference between the two, or even that einsum can outperform numpy for some operations.
For people with a numpy distribution built against a fast BLAS library, I'm wondering why this might be happening. Is MKL slower on non-Intel processors? Or is einsum faster on newer Intel processors with better streaming capabilities?
Here is some sample code to compare performance on my machine:
In [27]: a = rand(100, 1000, 2000)

In [28]: b = rand(50, 1000, 2000)

In [29]: time cten = tensordot(a, b, axes=[(1,2),(1,2)])
CPU times: user 7.85 s, sys: 29.4 ms, total: 7.88 s
Wall time: 1.08 s

In [30]: "FLOPS TENSORDOT: {}.".format(cten.size * 1000 * 2000 / 1.08)
Out[30]: 'FLOPS TENSORDOT: 9259259259.26.'

In [31]: time cein = einsum('ijk,ljk->il', a, b)
CPU times: user 42.3 s, sys: 7.58 ms, total: 42.3 s
Wall time: 42.4 s

In [32]: "FLOPS EINSUM: {}.".format(cein.size * 1000 * 2000 / 42.4)
Out[32]: 'FLOPS EINSUM: 235849056.604.'
Tensor operations with tensordot consistently run in the 5-20 GFLOPS range. I only get about 0.2 GFLOPS with einsum.
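For reference, one way to confirm which BLAS implementation a numpy build is actually linked against is np.show_config(); on an Anaconda + MKL install it should mention mkl in the BLAS/LAPACK sections. A minimal check:

import numpy as np

# Show which BLAS/LAPACK libraries this numpy build is linked against.
# An MKL build should list "mkl" here.
np.show_config()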
You are essentially comparing two different things:
- np.einsum computes the tensor product with for loops in C. It has some SIMD optimizations, but it is not multithreaded and does not use MKL.

- np.tensordot reshapes/broadcasts the input arrays and then calls BLAS (MKL, OpenBLAS, etc.) for the matrix multiplication (see the sketch after this list). The reshape/broadcast phase introduces some additional overhead, but the matrix multiplication itself is extremely well optimized with SIMD, hand-tuned assembly, and multithreading.
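To make the second point concrete, here is a minimal sketch (using the shapes from the question) of what tensordot effectively reduces to for this contraction: since the contracted axes are the trailing ones, it amounts to a reshape into 2-D followed by a single BLAS matrix multiplication.

import numpy as np

a = np.random.rand(100, 1000, 2000)
b = np.random.rand(50, 1000, 2000)

# tensordot contracts axes (1, 2) of both arrays...
cten = np.tensordot(a, b, axes=[(1, 2), (1, 2)])

# ...which here amounts to flattening the contracted axes and doing one
# (100, 2000000) x (2000000, 50) matrix product, dispatched to BLAS.
cdot = a.reshape(100, -1).dot(b.reshape(50, -1).T)

print(np.allclose(cten, cdot))  # True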
As a result, tensordot will usually be faster than einsum even for single-core execution, unless small arrays are used (in which case the reshape/broadcast overhead is no longer negligible). This is even more true because the former approach is multithreaded while the latter is not.
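If you want to verify the multithreading part on your own machine, a sketch (assuming the third-party threadpoolctl package, which can cap the BLAS thread pool at runtime) is to pin BLAS to one thread and re-time tensordot; the remaining gap to einsum is then the single-core BLAS advantage alone:

import numpy as np
from timeit import default_timer as timer
from threadpoolctl import threadpool_limits  # third-party: pip install threadpoolctl

a = np.random.rand(100, 1000, 2000)
b = np.random.rand(50, 1000, 2000)

# Cap the BLAS library (MKL, OpenBLAS, ...) at a single thread.
with threadpool_limits(limits=1, user_api='blas'):
    t0 = timer()
    np.tensordot(a, b, axes=[(1, 2), (1, 2)])
    print("tensordot, 1 BLAS thread: {:.2f} s".format(timer() - t0))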
In conclusion, your results are perfectly normal and are likely to hold in general (Intel or non-Intel processor, recent or not, multicore or not, using MKL or OpenBLAS, etc.).