Efficient use of numpy_indexed output
>>> import numpy_indexed as npi
>>> import numpy as np
>>> a = np.array([[0,0,1,1,2,2], [4,4,8,8,10,10]]).T
>>> a
array([[ 0, 4],
[ 0, 4],
[ 1, 8],
[ 1, 8],
[ 2, 10],
[ 2, 10]])
>>> npi.group_by(a[:, 0]).sum(a[:,1])
(array([0, 1, 2]), array([ 8, 16, 20], dtype=int32))
I want to perform calculations on subsets of the second column grouped by the first column on large sets (~ 1m rows). Is there an effective (and / or vector) a method of using the output group_by
on numpy_indexed
to add a new column with the results of these calculations? In the example sum
as above, I would like to give the result below.
If there is an efficient way to do it without using numpy_indexed
in the first place, this is also very helpful.
array([[ 0, 4, 8],
[ 0, 4, 8],
[ 1, 8, 16],
[ 1, 8, 16],
[ 2, 10, 20],
[ 2, 10, 20]])
source to share
One approach is with np.unique
for creating these unique tags and spacing change indices and then np.add.reduceat
for intervaled-summing
-
_,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Another way that avoids usage np.unique
and can be beneficial for performance would be like this:
idx = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1]
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
To further improve performance, we must use np.bincount
. So it np.add.reduceat(a[:,1],idx)
can be replaced with np.bincount(tags, a[:,1])
.
Example run -
In [271]: a # Using a more generic sample
Out[271]:
array([[11, 4],
[11, 4],
[14, 8],
[14, 8],
[16, 10],
[16, 10]])
In [272]: _,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
In [273]: np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Out[273]:
array([[11, 4, 8],
[11, 4, 8],
[14, 8, 16],
[14, 8, 16],
[16, 10, 20],
[16, 10, 20]])]
Now the listed approaches assume that the first column is already sorted. If this is not the case, we need to sort the array by the first column argsort
and then use the suggested method. Thus, for the unordered case, we need the following preprocessing -
a = a[a[:,0].argsort()]
Battle against np.unique
Allow time to use your own method flatnonzero
+ cumsum
for embedded np.unique
to create a shift indexes: idx
and identifiers / tags based on the uniqueness of: tags
. For a case like this, where we know beforehand that the label column is already sorted, we avoid any sorting, as is done with np.unique
. This gives us a performance advantage. So, let's check it out.
Approaches -
def nonzero_cumsum_based(A):
idx = np.concatenate(( [0] ,np.flatnonzero(A[1:] > A[:-1])+1 ))
tags = np.zeros(len(A), dtype=int)
tags[idx[1:]] = 1
np.cumsum(tags, out = tags)
return idx, tags
def unique_based(A):
_,idx,tags = np.unique(A, return_index=1, return_inverse=1)
return idx, tags
An example of a run with a custom function -
In [438]: a
Out[438]:
array([[11, 4],
[11, 4],
[14, 8],
[14, 8],
[16, 10],
[16, 10]])
In [439]: idx, tags = nonzero_cumsum_based(a[:,0])
In [440]: idx
Out[440]: array([0, 2, 4])
In [441]: tags
Out[441]: array([0, 0, 1, 1, 2, 2])
Timing -
In [444]: a = np.c_[np.sort(randi(10,10000,(100000))), randi(0,10000,(100000))]
In [445]: %timeit unique_based(a[:,0])
100 loops, best of 3: 4.3 ms per loop
In [446]: %timeit nonzero_cumsum_based(a[:,0])
1000 loops, best of 3: 486 Β΅s per loop
In [447]: a = np.c_[np.sort(randi(10,10000,(1000000))), randi(0,10000,(1000000))]
In [448]: %timeit unique_based(a[:,0])
10 loops, best of 3: 50.2 ms per loop
In [449]: %timeit nonzero_cumsum_based(a[:,0])
100 loops, best of 3: 3.98 ms per loop
source to share
Each index object has an inverse property that maps the cast values ββback to their original range; for illustration, we could write:
index = npi.as_index(keys)
unique_keys = index.unique
unique_keys[index.inverse] == keys # <- should be all true
And this property is also displayed on the GroupBy object; since indeed matching grouped values ββback to their input range is usually a useful operation:
groups = npi.group_by(a[:, 0]) unique, sums = groups.sum(a[:, 1]) new_column = sums[groups.inverse]
In general, the numpy_indexed source can be a source of inspiration for how to do such general operations; For example, group_by.var runs into the same problem, passing funds for each group back to each member of the group it was formed from to calculate errors in each group. But the best tutorials certainly don't hurt.
Could you please provide a higher level description of the problem you are trying to solve? Chances are you can simplify your code even further from a higher level when you get a more comfortable thinking in terms of the design that npi makes usable.
source to share