# Vectorizing a Python loop over a numpy array

I need to speed up this loop, as it is very slow. But I don't know how to vectorize it, since the result at each step depends on the result of the previous step. Any suggestions?

```
import numpy as np

sig = np.random.randn(44100)
alpha = .9887
beta = .999
out = np.zeros_like(sig)
for n in range(1, len(sig)):
    if np.abs(sig[n]) >= out[n-1]:
        out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
    else:
        out[n] = beta * out[n-1]
```


## Low potential for vectorization in code "with feedback"

Most of the potential "vectorization" parallelism is out of the picture once dependency analysis is done: each `out[n]` needs the finished value of `out[n-1]`, so the values must be produced strictly in order. (A JIT compiler cannot vectorize "across" such a dependence barrier either.)

You can pre-compute some reusable values in vectorized form, but there is no plain-Python syntax (short of an external JIT compiler) to turn such a loop-carried computation into CPU-register-aligned, vector-parallel computation:

```
from zmq import Stopwatch       # ok to use pyzmq 2.11 for [usec] .Stopwatch()

aStopWATCH = Stopwatch()        # a performance-measurement .Stopwatch() instance
sig    = np.abs( sig )          # in-place calc/assign avoids memalloc-OPs in the loop
aConst = ( 1 - alpha )          # avoids many repetitive SUB(s) in the loop

for thisPtr in range( 1, len( sig ) ):    # FORWARD-SHIFTING-DEPENDENCE LOOP:
    prevPtr = thisPtr - 1                 # prevPtr->"previous" TimeSlice in out[] ( re-used 2 x len(sig) times )
    if sig[thisPtr] < out[prevPtr]:       # 1st re-use
        out[thisPtr] = out[prevPtr] * beta                                  # 2nd
    else:
        out[thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )     # 2nd
```
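A further Python-level speedup is possible by taking the loop off the numpy array entirely. This is a sketch, not part of the answer above; it relies on two facts: `np.abs()` on the whole array is a single vectorized kernel, and scalar indexing of a plain Python `list` (and carrying the previous value in a local float) is much cheaper than repeated scalar indexing of a numpy array:

```python
import numpy as np

sig = np.random.randn(44100)
alpha = 0.9887
beta = 0.999

abs_sig = np.abs(sig).tolist()      # vectorized abs, done once, then plain floats
out = [0.0] * len(abs_sig)
one_minus_alpha = 1.0 - alpha       # hoisted out of the loop

prev = 0.0                          # carries out[n-1] as a local float
for n in range(1, len(abs_sig)):
    s = abs_sig[n]
    if s >= prev:
        prev = alpha * prev + one_minus_alpha * s
    else:
        prev = beta * prev
    out[n] = prev

out = np.asarray(out)               # back to a numpy array if needed
```

The sequential dependence is untouched; only the per-iteration interpreter overhead shrinks.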

A good example of vectorized acceleration can be seen in cases where the calculation can be spread along the 1D, 2D, or even 3D structure of a native numpy array. For an about 100x acceleration, see the vectorized processing of an RGBA 2D matrix in vectorized code for processing PNG images (an OpenGL shader pipeline).
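By contrast, a computation with no loop-carried dependence vectorizes directly; a minimal illustration (an arbitrary elementwise formula, chosen only for the contrast):

```python
import numpy as np

x = np.random.randn(1_000_000)

# Every element of y is independent of every other element, so numpy
# evaluates the whole expression as a few vectorized array kernels:
y = 0.5 * np.abs(x) + 0.25 * x ** 2
```

Here no Python-level loop runs at all, which is exactly what the feedback term in the original code rules out.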

## Performance increased by another 3x

Even this simple `python` version of the code has already increased the speed more than 2.8x (as it stands, i.e. without installing a dedicated JIT-optimizing compiler):

```
>>> def aForwardShiftingDependenceLOOP():   # proposed code-revision
...     aStopWATCH.start()                  # ||||||||||||||||||.start
...     for thisPtr in range( 1, len( sig ) ):
...         # |vvvvvvv|------------------- # FORWARD-SHIFTING-LOOP DEPENDENCE
...         prevPtr = thisPtr - 1          #|vvvvvvv|--STEP-SHIFTING avoids Numpy syntax
...         if ( sig[thisPtr] < out[prevPtr] ):
...             out[thisPtr] = out[prevPtr] * beta
...         else:
...             out[thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )
...     usec = aStopWATCH.stop()            # ||||||||||||||||||.stop
...     print usec, " [usec]"
>>> aForwardShiftingDependenceLOOP()
57593 [usec]
57879 [usec]
58085 [usec]
>>> def anOriginalForLOOP():
...     aStopWATCH.start()
...     for n in range( 1, len( sig ) ):
...         if ( np.abs( sig[n] ) >= out[n-1] ):
...             out[n] = out[n-1] * alpha + ( 1 - alpha ) * np.abs( sig[n] )
...         else:
...             out[n] = out[n-1] * beta
...     usec = aStopWATCH.stop()
...     print usec, " [usec]"
>>> anOriginalForLOOP()
164907 [usec]
165674 [usec]
165154 [usec]
```


The Numba just-in-time compiler deals quite well with the indexing overhead you are fighting, by compiling the function to native code at its first execution:

```
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:
:sig = np.random.randn(44100)
:alpha = .9887
:beta = .999
:
:def nonvectorized(sig):
:    out = np.zeros_like(sig)
:
:    for n in range(1, len(sig)):
:        if np.abs(sig[n]) >= out[n-1]:
:            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
:        else:
:            out[n] = beta * out[n-1]
:    return out
:--
In [2]: nonvectorized(sig)
Out[2]:
array([ 0. , 0.01862503, 0.04124917, ..., 1.2979579 ,
1.304247 , 1.30294275])
In [3]: %timeit nonvectorized(sig)
10 loops, best of 3: 80.2 ms per loop
In [4]: from numba import jit
In [5]: vectorized = jit(nonvectorized)
In [6]: np.allclose(vectorized(sig), nonvectorized(sig))
Out[6]: True
In [7]: %timeit vectorized(sig)
1000 loops, best of 3: 249 Β΅s per loop
```

EDIT: As pointed out in the comments, adding `jit` timings. `jit(nonvectorized)` only creates a lightweight wrapper and is hence a cheap operation.

```
In [8]: %timeit jit(nonvectorized)
10000 loops, best of 3: 45.3 Β΅s per loop
```

The function itself is compiled on its first execution (hence "just in time"); the compilation takes a while, but probably not prohibitively long:

```
In [9]: %timeit jit(nonvectorized)(sig)
10 loops, best of 3: 169 ms per loop
```
