Slow speed when using a return statement

I have this Cython code for testing:

cimport cython

cpdef loop(int k):
    return real_loop(k)

@cython.cdivision(True)
cdef real_loop(int k):
    cdef int i
    cdef float a = 0
    for i in xrange(k):
        a = i
        a = a**2 / (a + 1)
    return a

I am measuring the speed difference between this Cython code and the same code in pure Python with a script like this:

import mymodule

print(mymodule.loop(100000))
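For reference, a pure-Python counterpart of the Cython code and a simple timing harness might look like the sketch below. It assumes the compiled extension is importable as `mymodule` (as in the script above) and only times the Python version if it is not; the names `real_loop_py`, `loop_py` and `best_time` are hypothetical. The Cython version stores `a` as a 32-bit C `float`, while Python floats are 64-bit, so the returned values can differ slightly.

```python
import timeit


def real_loop_py(k):
    """Pure-Python counterpart of the Cython real_loop() from the question."""
    a = 0.0  # initialised so that k == 0 returns 0.0 instead of raising
    for i in range(k):
        a = float(i)
        a = a ** 2 / (a + 1)
    return a


def loop_py(k):
    return real_loop_py(k)


def best_time(func, k, number=10, repeat=3):
    """Best time per call in seconds, similar to IPython's "best of 3"."""
    total = min(timeit.repeat(lambda: func(k), number=number, repeat=repeat))
    return total / number


if __name__ == "__main__":
    t_py = best_time(loop_py, 100000)
    print(f"pure Python: {t_py * 1e3:.2f} ms per call")
    try:
        import mymodule  # the compiled Cython extension, if it has been built
    except ImportError:
        mymodule = None
    if mymodule is not None:
        t_cy = best_time(mymodule.loop, 100000)
        print(f"Cython: {t_cy * 1e6:.1f} us per call, {t_py / t_cy:.0f}x faster")
```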

      

The Cython version is about 80x faster. But if I remove the two return statements from the Cython code, it becomes 800-900x faster. Why?

Another thing: if I run this code (the version with the return statements) on my old ACER Aspire ONE laptop, it is 700 times faster, but on my new i7 desktop PC at home it is only 80 times faster.

Does anyone know why?



I tested your problem with the following code:

#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
#cython: profile=False

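def loop(int k):
    # value returned from a cdef function
    return real_loop(k)

def loop2(int k):
    # result written through a pointer (out-parameter)
    cdef float a
    real_loop2(k, &a)
    return a

def loop3(int k):
    # result computed but never used
    real_loop3(k)
    return None

def loop4(int k):
    # like loop(), but with a*a instead of a**2
    return real_loop4(k)

def loop5(int k):
    # like loop2(), but with a*a instead of a**2
    cdef float a
    real_loop5(k, &a)
    return a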

cdef float real_loop(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
    return a

cdef void real_loop2(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]**2 / (a[0] + 1)

cdef void real_loop3(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)

cdef float real_loop4(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a*a / (a + 1)
    return a

cdef void real_loop5(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]*a[0] / (a[0] + 1)

where real_loop() is close to your function, with a modified formula for a, since the original one seems strange.

The function real_loop2() does not return any value; it just updates a by reference, through a pointer.

The function real_loop3() does not return any value, and the a it computes is not stored anywhere.
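The three calling styles compared here (returning the value, writing it through a pointer, and discarding it) can be mimicked in plain Python to make the difference concrete. In this sketch a one-element list stands in for the `float *a` out-parameter; the function names and the `start` argument are hypothetical, added so the results are not trivially zero. With the answer's starting value `a = 0.`, every iteration adds `0**2 / (0 + 1) = 0`, so all variants return `0.0`.

```python
def loop_return(k, start=0.0):
    """Like real_loop(): the result is handed back as a return value."""
    a = start
    for _ in range(k):
        a += a ** 2 / (a + 1)
    return a


def loop_out(k, out, start=0.0):
    """Like real_loop2(): the result is written into a caller-owned cell.

    `out` is a one-element list standing in for the C `float *a` pointer.
    """
    out[0] = start
    for _ in range(k):
        out[0] += out[0] ** 2 / (out[0] + 1)


def loop_discard(k, start=0.0):
    """Like real_loop3(): the result is computed and then thrown away."""
    a = start
    for _ in range(k):
        a += a ** 2 / (a + 1)
    # nothing is returned, so the caller cannot observe `a`


cell = [0.0]
loop_out(10, cell, start=1.0)
print(loop_return(10, start=1.0), cell[0])
```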



By checking the generated C code for real_loop3(), one can see that the loop exists and the code is being called... but I came to the same conclusion as @dmytro: changing k does not change the timing significantly... so there must be a point I am missing here.
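One way to check whether the run time depends on k is to time the same function at k and at 10*k and compare. This is a sketch: `scaling_ratio` is a hypothetical helper, and `python_loop` is only a stand-in for a compiled function such as `_stack.loop3`. For work that really runs once per iteration the ratio should be close to 10; a ratio near 1 means the run time does not depend on k.

```python
import timeit


def scaling_ratio(func, k, factor=10, number=5, repeat=5):
    """Best time of func(factor * k) divided by best time of func(k)."""
    def best(n):
        return min(timeit.repeat(lambda: func(n), number=number, repeat=repeat))
    return best(factor * k) / best(k)


def python_loop(k):
    # stand-in workload whose cost grows linearly with k
    a = 0.0
    for i in range(k):
        a += i * i / (i + 1.0)
    return a


if __name__ == "__main__":
    print(f"python_loop: t(10k) / t(k) = {scaling_ratio(python_loop, 2000):.1f}")
```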

From the timings below, we can say that return is not the bottleneck: real_loop2() and real_loop5() do not return any value, yet their performance is the same as that of real_loop() and real_loop4(), respectively.

In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop

In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop

In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop

In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop

In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
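Dividing each timing by the 100000 iterations gives the cost per iteration, which makes the loop3 result stand out. The figures below are simply the measurements listed above; the dictionary layout is for illustration only. Even if all 78.5 ns of loop3 were spent inside the loop, that would be under 0.001 ns per iteration, far less than one clock cycle of any current CPU.

```python
# Timings reported above for k = 100000 iterations, in seconds.
timings = {
    "loop": 1.71e-3,
    "loop2": 1.69e-3,
    "loop3": 78.5e-9,
    "loop4": 913e-6,
    "loop5": 979e-6,
}
K = 100000

for name, seconds in timings.items():
    ns_per_iteration = seconds / K * 1e9
    print(f"{name}: {ns_per_iteration:.6f} ns per iteration")
```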

      

Note the roughly 2x speed-up from replacing a**2 with a*a, since a**2 requires a call to powf() inside the loop.
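The same kind of comparison can be made from Python with `timeit`. This sketch only measures the two spellings in CPython, where `a ** 2` on a float is also a general power operation rather than a single multiplication; it says nothing about the exact ratio in the compiled Cython code, and the numbers will vary by machine. The value of `a` is an arbitrary assumption.

```python
import math
import timeit

setup = "a = 12345.678"
t_pow = min(timeit.repeat("a ** 2 / (a + 1)", setup=setup, number=200000, repeat=3))
t_mul = min(timeit.repeat("a * a / (a + 1)", setup=setup, number=200000, repeat=3))
print(f"a ** 2: {t_pow:.4f} s   a * a: {t_mul:.4f} s   ratio: {t_pow / t_mul:.2f}")

# Both spellings compute the same value (up to rounding).
a = 12345.678
assert math.isclose(a ** 2 / (a + 1), a * a / (a + 1))
```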
