Pandas apply() with min/max runs slowly over a varying lookback window
I am calculating values over a time series (represented by myvalues). The code below finds the locations where an event occurs (cross_indices) and then counts back over the last 8 events (n_crosses). For each row, the index of the 8th most recent cross is stored in max_lookback.
It takes ~0.5 seconds to set max_lookback for all rows. However, when I run Series.apply() to get the minimum and maximum of myvalues between max_lookback and the current index, the code takes ~22 seconds to run.
I thought that apply() was supposed to iterate over rows much faster than a for loop. Why is the code taking so long, and how can I speed it up?
Program output
total time of minmax is 22.469 seconds.
total runtime is 22.93 seconds.
import pandas as pd
import numpy as np
import timeit
complete_start = timeit.default_timer()
indices = pd.Series( range(20000), name='Index')
sample_from = np.append(np.zeros(9), 1) #10% odds of selecting 1
cross = pd.Series( np.random.choice( sample_from, size=len(indices) ), name='Cross' )
#cross = pd.Series(
cross_indices = np.flatnonzero( cross )
n_crosses = 8
def set_max_lookback(index):
    sub = cross_indices[ cross_indices <= index ]
    #get integer index where crosses occurred
    if len( sub ) < n_crosses:
        return int( 0 )
    return int( sub[ len(sub) - n_crosses ] )
max_lookback = pd.Series( indices.apply( set_max_lookback ), name='MaxLookback' )
start = timeit.default_timer()
myvalues = pd.Series( np.random.randint(-100,high=100, size=len(indices) ), name='Random' )
def minmax_of_zero_crosses(index):
    sub = myvalues.iloc[ range( max_lookback[index], index+1 ) ]
    return ( sub.min(), sub.max() )
minmax_as_tuple_series = pd.Series( indices.apply( minmax_of_zero_crosses ), name='Min' )
minmax_df = pd.DataFrame( minmax_as_tuple_series.tolist() )
minmax_df.columns = [ 'Min', 'Max' ]
maxz = minmax_df['Max']
minz = minmax_df['Min']
end = timeit.default_timer()
print('total time of minmax is ' + str(end-start) + ' seconds.')
complete_end = timeit.default_timer()
print('total runtime is ' + str(complete_end-complete_start) + ' seconds.')
Edit 1
Based on Mitch's comment, I double-checked the max_lookback setting. Using n_crosses = 3, you can see that the correct index 19981 is selected for row 19995. The column labels that are not visible in the picture are index, myvalues, cross, max_lookback.
df = pd.DataFrame([myvalues, cross, max_lookback, maxz, minz ] ).transpose()
print(df.tail(n=60))
Using the image as an example, for row 19999 I would like to find the min/max of myvalues between row 19981 (the max_lookback column) and row 19999, which are -95 and +97.
apply is not really a very efficient solution at all, as it is effectively just a for loop under the hood.
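To illustrate the point, here is a minimal sketch (the counter `call_count` and the function `tracked_square` are hypothetical names made up for this example). It counts how often the Python function is invoked by Series.apply and compares the result with the equivalent vectorized expression.

```python
import pandas as pd

# Hypothetical counter, used only to show how often the function is called.
call_count = {"n": 0}

def tracked_square(x):
    """Square a single element and record that a Python-level call happened."""
    call_count["n"] += 1
    return x * x

s = pd.Series(range(1000))

# apply() invokes the Python function once for every element of the Series.
via_apply = s.apply(tracked_square)

# The vectorized form does the same arithmetic in one array operation.
vectorized = s * s

print(call_count["n"], via_apply.equals(vectorized))
```

The per-element Python call is the overhead that a vectorized formulation avoids.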
A vectorized approach:
indices = pd.Series(range(20000))
sample_from = np.append(np.zeros(9), 1) #10% odds of selecting 1
cross = pd.Series(np.random.choice(sample_from, size=indices.size))
myvalues = pd.DataFrame(dict(Random=np.random.randint(-100,
                                                      100,
                                                      size=indices.size)))
n_crosses = 8
nonzeros = np.flatnonzero(cross)  # Series.nonzero() was removed in pandas 1.0
diffs = (nonzeros-np.roll(nonzeros, n_crosses-1)).clip(0)
myvalues['lower'] = np.nan
myvalues.loc[nonzeros, 'lower'] = diffs
myvalues.lower = ((myvalues.index.to_series() - myvalues.lower)
                  .ffill()
                  .fillna(0).astype(int))  # np.int was removed in NumPy 1.24
myvalues.loc[:(cross.cumsum() < n_crosses).sum()+1, 'lower'] = 0
reducer = np.empty((myvalues.shape[0]*2,), dtype=myvalues.lower.dtype)
reducer[::2] = myvalues.lower.values
reducer[1::2] = myvalues.index.values + 1
myvalues.loc[myvalues.shape[0]] = [0,0]
minmax_df = pd.DataFrame(
    {'min': np.minimum.reduceat(myvalues.Random.values, reducer)[::2],
     'max': np.maximum.reduceat(myvalues.Random.values, reducer)[::2]}
)
This gives the same min/max DataFrame as your current solution. The basic idea is to generate the bounds of the min/max window for each index in myvalues, and then use ufunc.reduceat to calculate those min/max values.
On my machine, your current solution takes ~8.1 s per loop, whereas the solution above takes ~7.9 ms per loop, a speedup of roughly 1025x.
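To make the reduceat step easier to follow, here is a small worked sketch on six values. The `values` and `lower` arrays are made-up illustration data, not output from the code above. Start and end positions are interleaved as [lower0, end0, lower1, end1, ...], and every second result is kept.

```python
import numpy as np

# Made-up data: row i should be reduced over values[lower[i] : i + 1].
values = np.array([5, -3, 8, 1, -7, 4])
lower = np.array([0, 0, 1, 1, 3, 3])
upper = np.arange(len(values)) + 1  # exclusive end of each window

# Interleave the bounds: [lower0, upper0, lower1, upper1, ...].
reducer = np.empty(2 * len(values), dtype=np.intp)
reducer[::2] = lower
reducer[1::2] = upper

# Every reduceat index must be smaller than the array length, so pad one
# dummy element to make the final end index (len(values)) valid.
padded = np.append(values, 0)

# Even output positions hold the reduction over [lower, upper);
# odd positions are by-products and are discarded with [::2].
window_min = np.minimum.reduceat(padded, reducer)[::2]
window_max = np.maximum.reduceat(padded, reducer)[::2]

print(window_min)
print(window_max)
```

The padding element plays the same role as the extra row appended to myvalues before the reduceat calls.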
This answer is based on Mitch's excellent work. I added comments to the code as it took me a significant amount of time to figure out the solution. I also found some minor issues.
The solution depends on NumPy's ufunc.reduceat method.
import pandas as pd
import numpy as np
indices = pd.Series(range(20000))
sample_from = np.append(np.zeros(2), 1) #1-in-3 odds of selecting 1
cross = pd.Series(np.random.choice(sample_from, size=indices.size))
myvalues = pd.DataFrame(dict(Random=np.random.randint(-100,
                                                      100,
                                                      size=indices.size)))
n_crosses = 3
#get the integer positions where a cross occurred
nonzeros = np.flatnonzero(cross)  # Series.nonzero() was removed in pandas 1.0
#find the number of rows between each cross and the cross n_crosses-1 before it
diffs = (nonzeros-np.roll(nonzeros, n_crosses-1)).clip(0)
myvalues['lower'] = np.nan
myvalues.loc[nonzeros, 'lower'] = diffs
#set the index where a cross occurred
myvalues.lower = myvalues.index.to_series() - myvalues.lower
#fill the NA values with the previous cross index
myvalues.lower = myvalues.lower.ffill()
#fill the NaN values at the top of the series with 0
myvalues.lower = myvalues.lower.fillna(0).astype(int)  # np.int was removed in NumPy 1.24
#set lower to 0 where crosses < n_crosses at the head of the Series
myvalues.loc[:(cross.cumsum() < n_crosses).sum()+1, 'lower'] = 0
#create a numpy array that lists the start and end index of events for each
# row in alternating order
reducer = np.empty((myvalues.shape[0]*2,), dtype=myvalues.lower.dtype)
reducer[::2] = myvalues.lower
reducer[1::2] = indices+1
reducer[len(reducer)-1] = indices[len(indices)-1]
myvalues['Cross'] = cross
#use reduceat to dramatically lower total execution time
myvalues['MinZ'] = np.minimum.reduceat( myvalues.iloc[:,0], reducer )[::2]
myvalues['MaxZ'] = np.maximum.reduceat( myvalues.iloc[:,0], reducer )[::2]
lastRow = len(myvalues)-1
#the last end index was clamped to the final row above, so reduceat leaves that
#row's own value out of its window. This is a manual override
#(.ix was removed in pandas 1.0, so .loc is used instead)
if myvalues.loc[lastRow,'MinZ'] >= myvalues.iloc[lastRow, 0]:
    myvalues.loc[lastRow,'MinZ'] = myvalues.iloc[lastRow, 0]
if myvalues.loc[lastRow,'MaxZ'] <= myvalues.iloc[lastRow, 0]:
    myvalues.loc[lastRow,'MaxZ'] = myvalues.iloc[lastRow, 0]
print( myvalues.tail(n=60) )
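As a possible alternative to the manual override on the last row, the values can be padded with one extra element so that every end index i + 1 is valid for reduceat. The sketch below is a hedged variant, not the code above. The function name `rolling_cross_minmax` is hypothetical, and the lower bounds are computed with np.searchsorted, a swapped-in technique that mirrors set_max_lookback from the question. It assumes a non-empty input.

```python
import numpy as np

def rolling_cross_minmax(values, cross, n_crosses):
    """Return (lower, mins, maxs) where row i is reduced over
    values[lower[i] : i + 1] and lower[i] is the position of the
    n_crosses-th most recent cross at or before row i (0 if fewer)."""
    values = np.asarray(values)
    n = len(values)
    rows = np.arange(n)
    cross_idx = np.flatnonzero(cross)

    # Number of crosses seen at or before each row.
    seen = np.searchsorted(cross_idx, rows, side="right")

    # Position in cross_idx of the n_crosses-th most recent cross.
    pos = seen - n_crosses
    enough = pos >= 0
    lower = np.zeros(n, dtype=np.intp)
    lower[enough] = cross_idx[pos[enough]]

    # Interleave [lower, end) pairs for reduceat.
    reducer = np.empty(2 * n, dtype=np.intp)
    reducer[::2] = lower
    reducer[1::2] = rows + 1

    # Pad one element so the final end index n is valid; the padded value
    # never falls inside a kept window, so its content does not matter.
    padded = np.append(values, values[-1])
    mins = np.minimum.reduceat(padded, reducer)[::2]
    maxs = np.maximum.reduceat(padded, reducer)[::2]
    return lower, mins, maxs

# Small made-up example with crosses at rows 1, 3 and 6.
demo_values = [4, -2, 7, 0, -9, 3, 5, 1]
demo_cross = [0, 1, 0, 1, 0, 0, 1, 0]
demo_lower, demo_min, demo_max = rolling_cross_minmax(demo_values, demo_cross, 2)
print(demo_lower, demo_min, demo_max)
```

With the padding in place, the last row is handled by the same reduceat call as every other row, so no special case is needed.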