Vectorizing python loop over numpy array
I need to speed up this loop as it is very slow. But I don't know how to vectorize it, since each output value depends on the previous one. Any suggestions?
```python
import numpy as np

sig   = np.random.randn(44100)
alpha = .9887
beta  = .999
out   = np.zeros_like(sig)

for n in range(1, len(sig)):
    if np.abs(sig[n]) >= out[n-1]:
        out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
    else:
        out[n] = beta * out[n-1]
```
Code "with feedback" has low vectorization potential
Most of the potential for "vectorization" parallelism disappears after dependency analysis: each `out[n]` depends on `out[n-1]`, so the iterations cannot run in parallel. (A JIT compiler cannot vectorize "across" such a dependency barrier either.)
You can pre-compute some reusable values in vectorized form, but plain Python offers no syntax (short of bypassing the interpreter with an external JIT compiler) to map a loop-carried computation onto CPU-register-aligned vector parallelism:
```python
from zmq import Stopwatch

aStopWATCH = Stopwatch()

sig    = np.abs(sig)     # pre-compute the absolute values once, in vectorized form
aConst = (1 - alpha)     # pre-compute the constant factor

for thisPtr in range(1, len(sig)):          # FORWARD-SHIFTING-DEPENDENCE LOOP:
    prevPtr = thisPtr - 1
    if sig[thisPtr] < out[prevPtr]:
        out[thisPtr] = out[prevPtr] * beta
    else:
        out[thisPtr] = out[prevPtr] * alpha + aConst * sig[thisPtr]
```
Vectorized acceleration shines where the calculation strategy can be parallelized along the 1D, 2D, or even 3D structure of the numpy array itself. For a roughly 100x speedup of that kind, see Accelerate processing of RGBA-2D matrix to Vectorized code for processing PNG images (OpenGL shader pipeline).
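To make the contrast concrete, here is a minimal sketch (the names `out_loop` / `out_vec` are illustrative, not from the original post) of a loop *without* a loop-carried dependency: because each element is independent, the whole loop collapses into a single numpy expression that runs in C.

```python
import numpy as np

sig = np.random.randn(44100)
alpha = .9887

# loop version: each out_loop[n] depends only on sig[n], not on out_loop[n-1]
out_loop = np.zeros_like(sig)
for n in range(len(sig)):
    out_loop[n] = (1 - alpha) * abs(sig[n])

# vectorized version: one fused pass over the array
out_vec = (1 - alpha) * np.abs(sig)

print(np.allclose(out_loop, out_vec))
```

The original question's loop cannot be rewritten this way precisely because `out[n-1]` appears on the right-hand side.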
Performance increased by another 3x
Even this simple revision of the code
has increased the speed by more than 2.8x (measured as-is, i.e. without installing a dedicated JIT-optimizing compiler):
```python
def aForwardShiftingDependenceLOOP():           # proposed code revision
    aStopWATCH.start()                          # |||||||||||||||||| .start
    for thisPtr in range(1, len(sig)):          # FORWARD-SHIFTING-LOOP DEPENDENCE
        prevPtr = thisPtr - 1                   # step-shifting avoids repeated numpy indexing
        if sig[thisPtr] < out[prevPtr]:
            out[thisPtr] = out[prevPtr] * beta
        else:
            out[thisPtr] = out[prevPtr] * alpha + aConst * sig[thisPtr]
    usec = aStopWATCH.stop()                    # |||||||||||||||||| .stop
    print(usec, "[usec]")

aForwardShiftingDependenceLOOP()
# 57593 [usec]
# 57879 [usec]
# 58085 [usec]

def anOriginalForLOOP():
    aStopWATCH.start()
    for n in range(1, len(sig)):
        if np.abs(sig[n]) >= out[n-1]:
            out[n] = out[n-1] * alpha + (1 - alpha) * np.abs(sig[n])
        else:
            out[n] = out[n-1] * beta
    usec = aStopWATCH.stop()
    print(usec, "[usec]")

anOriginalForLOOP()
# 164907 [usec]
# 165674 [usec]
# 165154 [usec]
```
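If you don't have pyzmq's `Stopwatch` available, the same measurement can be sketched with the standard library's `time.perf_counter` (the function name `shifted_loop` is mine, not from the answer; absolute timings will differ by machine):

```python
import time
import numpy as np

sig    = np.abs(np.random.randn(44100))   # pre-computed absolute values
alpha  = .9887
beta   = .999
aConst = 1 - alpha
out    = np.zeros_like(sig)

def shifted_loop(sig, out):
    # same forward-shifting-dependence loop as above
    for thisPtr in range(1, len(sig)):
        prevPtr = thisPtr - 1
        if sig[thisPtr] < out[prevPtr]:
            out[thisPtr] = out[prevPtr] * beta
        else:
            out[thisPtr] = out[prevPtr] * alpha + aConst * sig[thisPtr]

t0 = time.perf_counter()
shifted_loop(sig, out)
usec = (time.perf_counter() - t0) * 1e6
print(round(usec), "[usec]")
```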
Numba's just-in-time compiler handles the per-element indexing overhead quite well, by compiling the function to native code at first execution:
```python
In : %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import numpy as np
:
:sig = np.random.randn(44100)
:alpha = .9887
:beta = .999
:
:def nonvectorized(sig):
:    out = np.zeros_like(sig)
:
:    for n in range(1, len(sig)):
:        if np.abs(sig[n]) >= out[n-1]:
:            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
:        else:
:            out[n] = beta * out[n-1]
:    return out
:--

In : nonvectorized(sig)
Out: array([ 0.        ,  0.01862503,  0.04124917, ...,  1.2979579 ,
             1.304247  ,  1.30294275])

In : %timeit nonvectorized(sig)
10 loops, best of 3: 80.2 ms per loop

In : from numba import jit

In : vectorized = jit(nonvectorized)

In : np.allclose(vectorized(sig), nonvectorized(sig))
Out: True

In : %timeit vectorized(sig)
1000 loops, best of 3: 249 µs per loop
```
EDIT: As pointed out in the comments, adding timings for the `jit` call itself. Calling `jit(nonvectorized)` only
creates a lightweight wrapper and is hence a cheap operation:
```python
In : %timeit jit(nonvectorized)
10000 loops, best of 3: 45.3 µs per loop
```
The function itself is compiled on its first execution (hence "just in time"), which takes a while, but probably not prohibitively long:
```python
In : %timeit jit(nonvectorized)(sig)
10 loops, best of 3: 169 ms per loop
```
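In a script (outside IPython) the usual idiom is the `@jit` decorator. Here is a hedged sketch: the function name `envelope` is mine, and the `try`/`except` fallback to a no-op decorator is an assumption so the code still runs where numba is not installed.

```python
import numpy as np

try:
    from numba import jit          # compiles to native code on first call
except ImportError:
    def jit(func):                 # fallback: plain Python, same results, no speedup
        return func

alpha = .9887
beta  = .999

@jit
def envelope(sig):
    out = np.zeros_like(sig)
    for n in range(1, len(sig)):
        if np.abs(sig[n]) >= out[n-1]:
            out[n] = alpha * out[n-1] + (1 - alpha) * np.abs(sig[n])
        else:
            out[n] = beta * out[n-1]
    return out

sig = np.random.randn(44100)
out = envelope(sig)    # first call pays the one-time compilation cost (if numba is present)
print(out.shape)
```

Since `alpha, beta` are in (0, 1) and the input is taken in absolute value, every element of `out` is non-negative.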