Slow speed when using a return statement
I have this Cython code for testing:
cimport cython
cpdef loop(int k):
    return real_loop(k)
@cython.cdivision(True)
cdef real_loop(int k):
    cdef int i
    cdef float a
    for i in xrange(k):
        a = i
        a = a**2 / (a + 1)
    return a
I am comparing the speed of this Cython code against the same code in pure Python, using a script like this:
import mymodule
print(mymodule.loop(100000))
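For reference, the pure-Python side of such a comparison could look like the sketch below. The name py_loop and the use of timeit are assumptions for illustration, not the script that was actually run. Note also that Python floats are doubles, while the Cython version uses a C float.

```python
import timeit


def py_loop(k):
    """Pure-Python counterpart of real_loop (hypothetical name)."""
    a = 0.0  # initialised so that k == 0 still returns a value
    for i in range(k):
        a = float(i)
        a = a ** 2 / (a + 1)
    return a


if __name__ == "__main__":
    n = 100000
    runs = 5
    # Take the best of 3 repeats, each averaging `runs` calls, to reduce noise.
    best = min(timeit.repeat(lambda: py_loop(n), number=runs, repeat=3)) / runs
    print("pure Python: %.3f ms per call" % (best * 1e3))
```

Timing mymodule.loop the same way gives the two numbers whose ratio is the speedup quoted here.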
The Cython version is about 80 times faster. But if I remove the two return statements from the Cython code, it becomes 800-900 times faster. Why?
Another thing: if I run this code (with the return statements) on my old Acer Aspire One laptop, I get a 700x speedup, but on my new i7 desktop PC at home it is only 80 times faster.
Does anyone know why?
I tested your problem with the following code:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
#cython: profile=False
def loop(int k):
    return real_loop(k)
def loop2(int k):
    cdef float a
    real_loop2(k, &a)
    return a
def loop3(int k):
    real_loop3(k)
    return None
def loop4(int k):
    return real_loop4(k)
def loop5(int k):
    cdef float a
    real_loop5(k, &a)
    return a
cdef float real_loop(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
    return a
cdef void real_loop2(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]**2 / (a[0] + 1)
cdef void real_loop3(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a**2 / (a + 1)
cdef float real_loop4(int k):
    cdef int i
    cdef float a
    a = 0.
    for i in range(k):
        a += a*a / (a + 1)
    return a
cdef void real_loop5(int k, float *a):
    cdef int i
    a[0] = 0.
    for i in range(k):
        a[0] += a[0]*a[0] / (a[0] + 1)
where real_loop() is close to your function, with a modified formula for a, since the original one seems strange.
The function real_loop2() does not return any value; it just updates a by reference.
The function real_loop3() does not return any value.
By checking the generated C code for real_loop3(), one can see that the loop exists and the code is being called... but I reached the same conclusion as @dmytro: changing k does not change the timing significantly, so there must be a point I am missing here. (A likely explanation is that the C compiler removes the loop as dead code, because its result is never used.)
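One way to check whether a function's cost really depends on k is to time it for several values of k and compare. The sketch below does that with timeit. The helper names time_per_call and scaling_table are hypothetical, and py_loop is only a pure-Python stand-in so the snippet runs without the compiled module.

```python
import timeit


def time_per_call(func, k, number=5, repeat=3):
    """Return the best observed time in seconds for one call of func(k)."""
    return min(timeit.repeat(lambda: func(k), number=number, repeat=repeat)) / number


def scaling_table(func, ks):
    """Map each k to the time of one call, to see whether cost grows with k."""
    return {k: time_per_call(func, k) for k in ks}


def py_loop(k):
    # Pure-Python stand-in; swap in e.g. _stack.loop3 once the module is built.
    a = 0.0
    for _ in range(k):
        a += a * a / (a + 1)
    return a


if __name__ == "__main__":
    for k, t in scaling_table(py_loop, [1000, 10000, 100000]).items():
        print("k=%7d  %.3e s per call" % (k, t))
```

If the time per call grows roughly in proportion to k, the loop is really executed. If it stays flat, as observed for loop3(), the loop body is probably not being run at all.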
From the timings below, we can say that return is not the bottleneck, since real_loop2() and real_loop5() do not return any value, and their performance is the same as that of real_loop() and real_loop4(), respectively.
In [2]: timeit _stack.loop(100000)
1000 loops, best of 3: 1.71 ms per loop
In [3]: timeit _stack.loop2(100000)
1000 loops, best of 3: 1.69 ms per loop
In [4]: timeit _stack.loop3(100000)
10000000 loops, best of 3: 78.5 ns per loop
In [5]: timeit _stack.loop4(100000)
1000 loops, best of 3: 913 µs per loop
In [6]: timeit _stack.loop5(100000)
1000 loops, best of 3: 979 µs per loop
Note the ~2x speedup from changing a**2 to a*a, since a**2 requires a function call to powf() inside the loop.
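The ratios behind these statements can be worked out directly from the timings reported above. The snippet below is only that arithmetic; the dictionary and the ratio helper are illustrative names, and the values are copied from the timeit output.

```python
# Timings reported above, converted to seconds.
timings = {
    "loop": 1.71e-3,   # returns a value, uses a**2
    "loop2": 1.69e-3,  # writes through a pointer, uses a**2
    "loop3": 78.5e-9,  # result never used
    "loop4": 913e-6,   # returns a value, uses a*a
    "loop5": 979e-6,   # writes through a pointer, uses a*a
}


def ratio(slow, fast):
    """How many times longer `slow` takes than `fast`."""
    return timings[slow] / timings[fast]


if __name__ == "__main__":
    # Returning a value vs. writing through a pointer: close to 1.
    print("loop  / loop2: %.2f" % ratio("loop", "loop2"))
    print("loop4 / loop5: %.2f" % ratio("loop4", "loop5"))
    # a**2 vs. a*a: a bit under 2.
    print("loop  / loop4: %.2f" % ratio("loop", "loop4"))
    print("loop2 / loop5: %.2f" % ratio("loop2", "loop5"))
```

The first two ratios stay near 1, which matches the claim that return is not the bottleneck, while the last two show the gain from avoiding powf().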