Numba function is slower than C++, and reordering the loops slows it down 10x
The following code simulates extracting binary words from different locations within a set of images.
The Numba-jitted function, wordcalc in the code below, has two problems:
- It is 3x slower than a similar C++ implementation.
- The strangest thing is that if you swap the order of the "ibase" and "ibit" for loops, the speed drops by a factor of 10 (!). This does not happen in the C++ implementation, whose speed is unaffected.
I am using Numba 0.18.2 from WinPython 2.7.
What could be causing this?
import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = 10**4 // 4
bitsNum = 13

Xs = np.random.rand(numInsts, imDim**2)
iInstInds = np.array(range(numInsts)[::4])
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                # Compare two pixels at offsets from the base location
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                # Set bit number ibit of the word for this base location
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]), bitsNum,
               np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))
I get a 4x speedup by changing np.uint16(u * (2**ibit))
to np.uint16(u << ibit)
; i.e. replacing the power of 2 with a bit shift, which should be equivalent (for integers).
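As a quick sanity check of that equivalence, here is a minimal pure-Python/NumPy sketch (no Numba involved) that packs the same bits into a 16-bit word both ways. The helper names pack_with_power and pack_with_shift are made up for illustration and are not part of the original code.

```python
import numpy as np

def pack_with_power(bits):
    # Hypothetical helper: mirrors the original expression u * (2**ibit)
    word = np.uint16(0)
    for ibit, u in enumerate(bits):
        word = word | np.uint16(u * (2 ** ibit))
    return word

def pack_with_shift(bits):
    # Hypothetical helper: mirrors the suggested expression u << ibit
    word = np.uint16(0)
    for ibit, u in enumerate(bits):
        word = word | np.uint16(u << ibit)
    return word

# Python bools, standing in for the results of the pixel comparisons
bits = [True, False, True]
power_word = pack_with_power(bits)
shift_word = pack_with_shift(bits)
```

Both helpers should yield the same word for any sequence of up to 16 boolean values, since multiplying by 2**ibit and shifting left by ibit are the same operation on non-negative integers.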
It seems reasonable that your C++ compiler makes this replacement itself.
Swapping the order of the two loops makes only a small difference for me, both for your original version (5%) and for my optimized version (15%), so I don't think I can make a helpful comment on that.
If you really want to compare Numba and C++, you can look at the assembly of the compiled Numba function by executing os.environ['NUMBA_DUMP_ASSEMBLY']='1'
before importing Numba. (The output is quite involved, though.)
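A minimal sketch of that workflow, assuming the variable is read when Numba is imported: set it first, then import Numba and call a jitted function once so that it gets compiled. The toy function shift_bit is hypothetical, and the ImportError fallback is only there so the snippet also runs where Numba is not installed.

```python
import os

# Set this before Numba is imported so the setting is picked up
os.environ["NUMBA_DUMP_ASSEMBLY"] = "1"

try:
    import numba as nb
    jit = nb.jit(nopython=True)
except ImportError:
    # Fallback (assumption for illustration): run as plain Python
    def jit(func):
        return func

@jit
def shift_bit(u, ibit):
    # Hypothetical toy function, small enough to keep the dump short
    return u << ibit

# The first call triggers compilation, which is when the assembly is printed
result = shift_bit(1, 3)
```

Starting with a tiny function like this keeps the dump readable before trying it on wordcalc.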
For reference, I am using Numba 0.19.1.