Cython code is much slower when the functions return a value
I have this Cython code for testing:
cimport cython

cpdef loop(int k):
    return real_loop(k)

@cython.cdivision(True)
cdef real_loop(int k):
    cdef int i
    cdef float a
    for i in xrange(k):
        a = i
        a = a**2 / (a + 1)
    return a
I am comparing the speed of this Cython code against the same code in pure Python, using a script like this:
import mymodule
print(mymodule.loop(100000))
The Cython version is about 80 times faster. But if I remove the two return statements from the Cython code, it is 800-900 times faster. Why?
Another thing: if I run this code (with the return statements) on my old Acer Aspire One laptop, it is about 700 times faster, whereas on my new i7 desktop PC at home it is only 80 times faster.
Does anyone know why?
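For reference, the pure-Python side of such a comparison could look like the sketch below. The name py_loop and the use of timeit are assumptions for illustration; the question does not show the Python version or how it was timed. Note also that a Python float is a C double, while the Cython code uses a single-precision float, so the two results can differ in the last digits.

```python
import timeit


def py_loop(k):
    """Pure-Python counterpart of real_loop() (hypothetical name)."""
    a = 0.0  # assumption: start at 0.0 so that k == 0 still returns a value
    for i in range(k):
        a = float(i)
        a = a**2 / (a + 1)
    return a


if __name__ == "__main__":
    k = 100000
    # Best of 5 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: py_loop(k), number=10, repeat=5)) / 10
    print(f"py_loop({k}) = {py_loop(k):.3f}, {best * 1e3:.2f} ms per call")
```

Dividing this per-call time by the per-call time of mymodule.loop(k) gives the speedup factor quoted in the question.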
I tested your problem with the following code:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
#cython: profile=False

def loop(int k):
    return real_loop(k)

def loop2(int k):
    cdef float a
    real_loop2(k, &a)
    return a

def loop3(int k):
    real_loop3(k)
    return None

def loop4(int k):
    return real_loop4(k)

def loop5(int k):
    cdef float a
    real_loop5(k, &a)
    return a

cdef float real_loop(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
    return a

cdef void real_loop2(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]**2 / (a[0] + 1)

cdef void real_loop3(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)

cdef float real_loop4(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a*a / (a + 1)
    return a

cdef void real_loop5(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]*a[0] / (a[0] + 1)
where real_loop() is close to your function, with a modified formula for a, since the original one seems strange.
The function real_loop2() does not return any value; it just updates a through the pointer argument.
The function real_loop3() does not return any value.
By checking the generated C code for real_loop3(), one can see that the loop exists and the code is being called ... but I came to the same conclusion as @dmytro: changing k does not change the timing significantly ... so there must be a point I am missing here.
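One way to check whether the work really scales with k is to time the same call for several values of k. The sketch below is only an illustration: time_for_sizes is a hypothetical helper, and the pure-Python stand_in would be replaced by the compiled function under test, for example _stack.loop3. If the reported time stays flat while k grows tenfold, the loop is probably not being executed; one plausible explanation, which this thread does not confirm, is that the C compiler removed it because its result is never used.

```python
import timeit


def time_for_sizes(func, sizes, number=5, repeat=3):
    """Return {k: best seconds per call} for func(k) over the given sizes."""
    results = {}
    for k in sizes:
        timer = timeit.Timer(lambda: func(k))
        results[k] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results


def stand_in(k):
    """Pure-Python stand-in; swap in the compiled function under test."""
    a = 0.0
    for _ in range(k):
        a += a * a / (a + 1)
    return a


if __name__ == "__main__":
    for k, seconds in time_for_sizes(stand_in, [1000, 10000, 100000]).items():
        print(f"k={k:>7}: {seconds * 1e6:10.1f} us per call")
```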
From the timings below, we can say that return is not the bottleneck, since real_loop2() and real_loop5() do not return any value, and their performance is the same as that of real_loop() and real_loop4(), respectively.
In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop
In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop
In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop
In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop
In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
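A quick sanity check on these numbers is to divide each time by the 100000 iterations. The sketch below uses only the timings printed above; the dictionary name and the 3 GHz clock used for comparison are assumptions for illustration.

```python
# Timings reported above, in seconds per call, for k = 100000.
ITERATIONS = 100000
timings = {
    "loop": 1.71e-3,
    "loop2": 1.69e-3,
    "loop3": 78.5e-9,
    "loop4": 913e-6,
    "loop5": 979e-6,
}

# Assumed clock rate, only to put the per-iteration cost in perspective.
ASSUMED_CLOCK_HZ = 3e9
cycle_ns = 1e9 / ASSUMED_CLOCK_HZ

per_iteration_ns = {name: t / ITERATIONS * 1e9 for name, t in timings.items()}

for name, ns in per_iteration_ns.items():
    print(f"{name:>5}: {ns:.6f} ns per iteration ({ns / cycle_ns:.4f} cycles)")
```

loop3 comes out at roughly 0.0008 ns per iteration, far below a single clock cycle, which suggests that its 78.5 ns is call overhead rather than 100000 executed iterations.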
Note the ~2x change in speed when a**2 is replaced by a*a, since a**2 requires a call to powf() inside the loop.
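The size of that change can be read directly off the timings above. A small sketch using only the reported numbers (the variable names are illustrative):

```python
# Reported timings in seconds per call (k = 100000).
loop_pow, loop_mul = 1.71e-3, 913e-6      # real_loop (a**2) vs real_loop4 (a*a)
loop2_pow, loop5_mul = 1.69e-3, 979e-6    # real_loop2 (a**2) vs real_loop5 (a*a)

speedup_return = loop_pow / loop_mul
speedup_pointer = loop2_pow / loop5_mul

print(f"a**2 -> a*a, returning variant: {speedup_return:.2f}x")
print(f"a**2 -> a*a, pointer variant:   {speedup_pointer:.2f}x")
```

This gives about 1.87x for the returning variant and 1.73x for the pointer variant, consistent with the rough factor of two.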