Numpy: cumulative multiplicity rate

I have a sorted array of ints that can have repetitions. I would like to count consecutive equal values, restarting from zero when the value is different from the previous one. This is the expected output, implemented with a simple python loop:

import numpy as np

def count_multiplicities(a):
    r = np.zeros(a.shape, dtype=a.dtype)
    for i in range(1, len(a)):
        if a[i] == a[i-1]:
            r[i] = r[i-1]+1
        else:
            r[i] = 0
    return r

a = (np.random.rand(20)*5).astype(dtype=int)
a.sort()

print "given sorted array: ", a
print "multiplicity count: ", count_multiplicities(a)

      

Output:

given sorted array:  [0 0 0 0 0 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4]
multiplicity count:  [0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 0 1 2 3]

      

How can I get the same result in an efficient way using numpy? The array is very long, but there are only a few repetitions (say, no more than ten).

In my special case, I also know that values ​​start at zero and that the difference between successive values ​​is 0 or 1 (no spaces in the values).

+3


source to share


2 answers


Here's one cumsum

based on a vectorized approach -

def count_multiplicities_cumsum_vectorized(a):      
    out = np.ones(a.size,dtype=int)
    idx = np.flatnonzero(a[1:] != a[:-1])+1
    out[idx[0]] = -idx[0] + 1
    out[0] = 0
    out[idx[1:]] = idx[:-1] - idx[1:] + 1
    np.cumsum(out, out=out)
    return out

      

Example run -

In [58]: a
Out[58]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4])

In [59]: count_multiplicities(a) # Original approach
Out[59]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])

In [60]: count_multiplicities_cumsum_vectorized(a)
Out[60]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])

      



Runtime test -

In [66]: a = (np.random.rand(200000)*1000).astype(dtype=int)
    ...: a.sort()
    ...: 

In [67]: a
Out[67]: array([  0,   0,   0, ..., 999, 999, 999])

In [68]: %timeit count_multiplicities(a)
10 loops, best of 3: 87.2 ms per loop

In [69]: %timeit count_multiplicities_cumsum_vectorized(a)
1000 loops, best of 3: 739 Β΅s per loop

      

Related post

...

+3


source


I would use numba for problems like this

import numba
nb_count_multiplicities = numba.njit("int32[:](int32[:])")(count_multiplicities)
X=nb_count_multiplicities(a)

      



Without rewriting your code at all, it's about 50 percent faster than Divakar's vectorial solution.

Veclisation is useful many times if it results in shorter and possibly more comprehensible codes, but if you are determined to vectorize code, which can also be a problem for a fairly experienced numba programmer, this is the way to go.

+1


source







All Articles