Numpy: cumulative multiplicity rate
I have a sorted array of ints that can have repetitions. I would like to count consecutive equal values, restarting from zero when the value is different from the previous one. This is the expected output, implemented with a simple python loop:
import numpy as np
def count_multiplicities(a):
r = np.zeros(a.shape, dtype=a.dtype)
for i in range(1, len(a)):
if a[i] == a[i-1]:
r[i] = r[i-1]+1
else:
r[i] = 0
return r
a = (np.random.rand(20)*5).astype(dtype=int)
a.sort()
print "given sorted array: ", a
print "multiplicity count: ", count_multiplicities(a)
Output:
given sorted array: [0 0 0 0 0 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4]
multiplicity count: [0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 0 1 2 3]
How can I get the same result in an efficient way using numpy? The array is very long, but there are only a few repetitions (say, no more than ten).
In my special case, I also know that values ββstart at zero and that the difference between successive values ββis 0 or 1 (no spaces in the values).
source to share
Here's one cumsum
based on a vectorized approach -
def count_multiplicities_cumsum_vectorized(a):
out = np.ones(a.size,dtype=int)
idx = np.flatnonzero(a[1:] != a[:-1])+1
out[idx[0]] = -idx[0] + 1
out[0] = 0
out[idx[1:]] = idx[:-1] - idx[1:] + 1
np.cumsum(out, out=out)
return out
Example run -
In [58]: a
Out[58]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4])
In [59]: count_multiplicities(a) # Original approach
Out[59]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])
In [60]: count_multiplicities_cumsum_vectorized(a)
Out[60]: array([0, 1, 2, 3, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2])
Runtime test -
In [66]: a = (np.random.rand(200000)*1000).astype(dtype=int)
...: a.sort()
...:
In [67]: a
Out[67]: array([ 0, 0, 0, ..., 999, 999, 999])
In [68]: %timeit count_multiplicities(a)
10 loops, best of 3: 87.2 ms per loop
In [69]: %timeit count_multiplicities_cumsum_vectorized(a)
1000 loops, best of 3: 739 Β΅s per loop
Related post
...
source to share
I would use numba for problems like this
import numba
nb_count_multiplicities = numba.njit("int32[:](int32[:])")(count_multiplicities)
X=nb_count_multiplicities(a)
Without rewriting your code at all, it's about 50 percent faster than Divakar's vectorial solution.
Veclisation is useful many times if it results in shorter and possibly more comprehensible codes, but if you are determined to vectorize code, which can also be a problem for a fairly experienced numba programmer, this is the way to go.
source to share