Efficient pandas / numpy function: observations since last change
Given a Series, I would like to efficiently calculate how many observations have passed since the last change in value. Here's a simple example:
ser = pd.Series([1.2,1.2,1.2,1.2,2,2,2,4,3])
print(ser)
0    1.2
1    1.2
2    1.2
3    1.2
4    2.0
5    2.0
6    2.0
7    4.0
8    3.0
I would like to apply a function to ser which will result in:
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    0
8    0
Since I am dealing with large series, I would prefer a quick solution that does not involve a loop. Thanks!
Edit: If possible, I would like the function to also work for series where all values are the same (which would result in a series of integers incrementing by 1).
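For reference, here is a naive loop-based baseline (my sketch, not part of the question; far too slow for large series, but it pins down the expected output, including the all-equal-values case from the edit):

```python
import pandas as pd

def cumcount_since_change(ser):
    # Slow reference implementation: count observations since the last
    # change in value, resetting the counter to 0 at each change.
    out = []
    count = 0
    prev = None
    for i, v in enumerate(ser):
        if i > 0 and v == prev:
            count += 1
        else:
            count = 0
        out.append(count)
        prev = v
    return pd.Series(out, index=ser.index)

print(cumcount_since_change(pd.Series([1.2, 1.2, 1.2, 1.2, 2, 2, 2, 4, 3])).tolist())
# [0, 1, 2, 3, 0, 1, 2, 0, 0]
```

A constant series such as `[5, 5, 5, 5]` yields `[0, 1, 2, 3]`, matching the requirement added in the edit.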
+3 
splinter 
2 answers
Here's one NumPy approach -
import numpy as np

def array_cumcount(a):
    # indices where the value differs from the previous element
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    # start with a step of +1 everywhere...
    shift_arr = np.ones(a.size, dtype=int)
    shift_arr[0] = 0
    if len(idx) >= 1:
        # ...then at each change point, step down just enough
        # so that the running cumsum resets to 0
        shift_arr[idx[0]] = -idx[0] + 1
        shift_arr[idx[1:]] = -idx[1:] + idx[:-1] + 1
    return shift_arr.cumsum()
Example run -
In [583]: ser = pd.Series([1.2,1.2,1.2,1.2,2,2,2,4,3,3,3,3])
In [584]: array_cumcount(ser.values)
Out[584]: array([0, 1, 2, 3, 0, 1, 2, 0, 0, 1, 2, 3])
Runtime test -
In [601]: ser = pd.Series(np.random.randint(0,3,(10000)))
# @Psidom soln
In [602]: %timeit ser.groupby(ser).cumcount()
1000 loops, best of 3: 729 µs per loop
In [603]: %timeit array_cumcount(ser.values)
10000 loops, best of 3: 85.3 µs per loop
In [604]: ser = pd.Series(np.random.randint(0,3,(1000000)))
# @Psidom soln
In [605]: %timeit ser.groupby(ser).cumcount()
10 loops, best of 3: 30.1 ms per loop
In [606]: %timeit array_cumcount(ser.values)
100 loops, best of 3: 11.7 ms per loop
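An alternative vectorized formulation (my own sketch, not from the answers above) avoids the scatter-assignment bookkeeping by carrying forward the index of the most recent change with `np.maximum.accumulate`:

```python
import numpy as np

def array_cumcount_acc(a):
    n = a.size
    # mark positions where a new run starts (position 0 always starts a run)
    change = np.ones(n, dtype=bool)
    change[1:] = a[1:] != a[:-1]
    # index of the most recent run start, propagated forward
    last_start = np.maximum.accumulate(np.where(change, np.arange(n), 0))
    # distance from the current position to its run start
    return np.arange(n) - last_start

print(array_cumcount_acc(np.array([1.2, 1.2, 1.2, 1.2, 2, 2, 2, 4, 3])))
# [0 1 2 3 0 1 2 0 0]
```

Both versions make a small constant number of passes over the array, so their runtimes should be in the same ballpark; I have not benchmarked this variant against `array_cumcount`.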
+2 
Divakar 
You can use groupby.cumcount:
ser.groupby(ser).cumcount()
#0    0
#1    1
#2    2
#3    3
#4    0
#5    1
#6    2
#7    0
#8    0
#dtype: int64
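One caveat worth noting (my observation, not part of the original answer): `ser.groupby(ser)` groups by value globally, not by consecutive runs. If the same value can reappear after a change, as in `[1, 1, 2, 1, 1]`, the counter will not reset. Grouping on a run id built from `shift` and `cumsum` counts within consecutive runs instead:

```python
import pandas as pd

ser = pd.Series([1, 1, 2, 1, 1])

# label each consecutive run of equal values with its own id
run_id = (ser != ser.shift()).cumsum()

print(ser.groupby(run_id).cumcount().tolist())  # [0, 1, 0, 0, 1]
print(ser.groupby(ser).cumcount().tolist())     # [0, 1, 0, 2, 3]
```

For the example in the question the two agree, because no value recurs after a change.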
+2 
Psidom 