# Vectorizing python loop over numpy array

I need to speed up this loop as it is very slow. But I don't know how to vectorize it as the result of one value depends on the result of the previous value. Any suggestions?

```
import numpy as np

sig = np.random.randn(44100)
alpha = .9887
beta = .999

out = np.zeros_like(sig)

for n in range(1, len(sig)):
    if np.abs(sig[n]) >= out[n-1]:
        out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
    else:
        out[n] = beta * out[n-1]
```


## Low potential for vectorization in code "with feedback"

Most of the parallelism that "vectorization" needs is out of the picture once a dependency analysis is done: each `out[n]` depends on `out[n-1]`. (A JIT compiler cannot vectorize "across" such a dependency barrier either.)

You can pre-compute some reusable values in vectorized form, but there is no direct Python syntax (short of an external JIT compiler) to map such a feedback computation onto CPU-register-aligned, vector-parallel execution:

```
from zmq import Stopwatch    # ok to use pyzmq 2.11 for [usec] .Stopwatch()
aStopWATCH = Stopwatch()     # a performance-measurement .Stopwatch() instance

sig    = np.abs(sig)         # in-place calc/assign avoids extra memalloc-OPs
aConst = ( 1 - alpha )       # avoids many repetitive SUB(s) in the loop

for thisPtr in range( 1, len( sig ) ):  # FORWARD-SHIFTING-DEPENDENCE LOOP:
    prevPtr = thisPtr - 1               # prevPtr -> "previous" time-slice in out[] ( re-used 2 x len(sig) times )
    if sig[thisPtr] < out[prevPtr]:                                     # 1st re-use
        out[thisPtr] = out[prevPtr] * beta                              # 2nd
    else:
        out[thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] ) # 2nd
```


A good example of vectorized speedup can be seen in cases where the calculation can be carried out in parallel along the 1D, 2D, or even 3D structure of a native numpy array. For an approximately 100x speedup, see the vectorized RGBA-2D matrix processing in "Accelerate processing of RGBA-2D matrix to Vectorized code for processing PNG images (OpenGL shader pipeline)".
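To make the dependency barrier concrete: if the data-dependent branch were removed, the recurrence `out[n] = alpha * out[n-1] + (1 - alpha) * |sig[n]|` would be a plain first-order IIR filter, which compiled library code can evaluate directly. A minimal sketch of that branch-free special case, assuming SciPy's `scipy.signal.lfilter` is available (the original answers never mention it); it is the conditional `beta` branch that rules this route out for the question as posed:

```python
import numpy as np
from scipy.signal import lfilter

sig = np.random.randn(44100)
alpha = .9887

x = np.abs(sig)
x[0] = 0.                      # the original loop leaves out[0] == 0

# branch-free recurrence out[n] = alpha*out[n-1] + (1 - alpha)*x[n]
# written as an IIR filter:  a[0]*y[n] = b[0]*x[n] - a[1]*y[n-1]
out_filt = lfilter([1 - alpha], [1, -alpha], x)

# reference pure-python loop for the same branch-free recurrence
out_loop = np.zeros_like(sig)
for n in range(1, len(sig)):
    out_loop[n] = alpha * out_loop[n-1] + (1 - alpha) * x[n]

assert np.allclose(out_filt, out_loop)
```

The moment the `else: out[n] = beta * out[n-1]` branch re-enters, the filter coefficients become data-dependent and this shortcut no longer applies.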

## Performance increased by another 3x

Even this simple pure-`python` revision of the code has increased the speed by more than 2.8x (as it stands, i.e. without installing a dedicated JIT-optimizing compiler):

```
>>> def aForwardShiftingDependenceLOOP(): # proposed code-revision
...     aStopWATCH.start()                # ||||||||||||||||||.start
...     for thisPtr in range( 1, len( sig ) ):
...         #        |vvvvvvv|------------# FORWARD-SHIFTING-LOOP DEPENDENCE
...         prevPtr = thisPtr - 1  #|vvvvvvv|--STEP-SHIFTING avoids Numpy syntax
...         if ( sig[ thisPtr] < out[prevPtr] ):
...             out[  thisPtr] = out[prevPtr] * beta
...         else:
...             out[  thisPtr] = out[prevPtr] * alpha + ( aConst * sig[thisPtr] )
...     usec = aStopWATCH.stop()          # ||||||||||||||||||.stop
...     print usec, " [usec]"

>>> aForwardShiftingDependenceLOOP()
57593  [usec]
57879  [usec]
58085  [usec]

>>> def anOriginalForLOOP():
...     aStopWATCH.start()
...     for n in range( 1, len( sig ) ):
...         if ( np.abs( sig[n] ) >= out[n-1] ):
...             out[n] = out[n-1] * alpha + ( 1 - alpha ) * np.abs( sig[n] )
...         else:
...             out[n] = out[n-1] * beta
...     usec = aStopWATCH.stop()
...     print usec, " [usec]"

>>> anOriginalForLOOP()
164907  [usec]
165674  [usec]
165154  [usec]
```


A Numba just-in-time compiler deals quite well with the indexing overhead, by compiling the function to native code at its first execution:

```
In : %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:
:sig = np.random.randn(44100)
:alpha = .9887
:beta = .999
:
:def nonvectorized(sig):
:    out = np.zeros_like(sig)
:
:    for n in range(1, len(sig)):
:        if np.abs(sig[n]) >= out[n-1]:
:            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs( sig[n] )
:        else:
:            out[n] = beta * out[n-1]
:    return out
:--

In : nonvectorized(sig)
Out:
array([ 0.        ,  0.01862503,  0.04124917, ...,  1.2979579 ,
1.304247  ,  1.30294275])

In : %timeit nonvectorized(sig)
10 loops, best of 3: 80.2 ms per loop

In : from numba import jit

In : vectorized = jit(nonvectorized)

In : np.allclose(vectorized(sig), nonvectorized(sig))
Out: True

In : %timeit vectorized(sig)
1000 loops, best of 3: 249 µs per loop
```

EDIT: as pointed out in a comment, adding `jit` timings. `jit(nonvectorized)` only creates a lightweight wrapper and is hence a cheap operation:

```
In : %timeit jit(nonvectorized)
10000 loops, best of 3: 45.3 µs per loop
```


The function itself is compiled on first execution (hence just in time), which takes a while, but probably not that much:

```
In : %timeit jit(nonvectorized)(sig)
10 loops, best of 3: 169 ms per loop
```
