Numba function is slower than C++, and reordering the loops slows it down 10x

Question

The following code simulates extracting binary words from different locations within a set of images.

The Numba-compiled function, wordcalc in the code below, has 2 problems:

1. It is 3x slower than a similar C++ implementation.
2. Even stranger, if you switch the order of the "ibase" and "ibit" for loops, the speed drops by a factor of 10 (!). This does not happen in the C++ implementation, which is unaffected by the loop order.

I am using Numba 0.18.2 from WinPython 2.7

What could be causing this?

import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = numInsts // 4
bitsNum = 13

# One flattened imDim x imDim image per row
Xs = np.random.rand(numInsts, imDim**2)
# Use every 4th image
iInstInds = np.arange(0, numInsts, 4)
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):
                # Compare two pixels at offsets from the base index
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                # Store the comparison result in bit number ibit of the word
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]), bitsNum,
               np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))

performance python numba

Leo June 19 15 at 14:19

1 answer

DavidW · Accepted Answer · 2015-06-21T09:00:25+0000

I get a 4x speed-up by changing np.uint16(u * (2**ibit)) to np.uint16(u << ibit), i.e. replacing the power of 2 with a bit shift, which should be equivalent (for integers).

It seems reasonable that your C++ compiler makes this replacement by itself.
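To make the equivalence concrete, here is a minimal pure-Python sketch (no Numba involved) that packs a list of comparison results into one word both ways. The helper names pack_bits_pow and pack_bits_shift and the sample bits are made up for illustration; they are not part of the question's code.

```python
def pack_bits_pow(bits):
    """Pack booleans into an integer using multiplication by a power of 2."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= int(u) * (2 ** ibit)
    return word


def pack_bits_shift(bits):
    """Pack booleans into an integer using a left shift instead."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= int(u) << ibit
    return word


# 13 hypothetical comparison results, matching bitsNum = 13 in the question
bits = [True, False, True, True, False, False, True,
        False, False, True, False, True, True]

word_pow = pack_bits_pow(bits)
word_shift = pack_bits_shift(bits)
print(word_pow == word_shift)
```

Both functions set bit number ibit exactly when the corresponding comparison is true, so the results agree for any input. With 13 bits the word always fits in a uint16.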

Swapping the order of the two loops makes only a small difference for me, both for your original version (5%) and my optimized version (15%), so I don't think I can make a helpful comment on that.

If you really want to compare Numba and C++, you can look at the compiled Numba function by setting os.environ['NUMBA_DUMP_ASSEMBLY']='1' before importing Numba. (The output is clearly quite involved, though.)
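A minimal sketch of that ordering, assuming Numba is installed: the environment variable is set first, and only then is Numba imported. The function shift_bit is a made-up example, and the ImportError fallback is only there so the snippet still runs where Numba is missing (in that case no assembly is printed).

```python
import os

# Must happen before Numba is imported, because Numba reads its
# configuration from the environment at import time.
os.environ["NUMBA_DUMP_ASSEMBLY"] = "1"

try:
    import numba as nb
    jit = nb.jit(nopython=True)
except ImportError:
    # Fallback for illustration only: run as plain Python without Numba.
    def jit(func):
        return func


@jit
def shift_bit(u, ibit):
    # Hypothetical tiny function; its assembly is dumped on first call.
    return u << ibit


# The first call triggers compilation, which prints the generated assembly.
result = shift_bit(1, 3)
print(result)
```

Using a very small function like this keeps the dump short enough to read before trying it on wordcalc.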

For reference, I am using Numba 0.19.1.
