Efficient use of numpy_indexed output

>>> import numpy_indexed as npi
>>> import numpy as np
>>> a = np.array([[0,0,1,1,2,2], [4,4,8,8,10,10]]).T
>>> a
array([[ 0,  4],
       [ 0,  4],
       [ 1,  8],
       [ 1,  8],
       [ 2, 10],
       [ 2, 10]])
>>> npi.group_by(a[:, 0]).sum(a[:,1])
(array([0, 1, 2]), array([ 8, 16, 20], dtype=int32))

      

I want to perform calculations on subsets of the second column, grouped by the first column, on large arrays (~1m rows). Is there an efficient (and/or vectorised) way of using the output of group_by from numpy_indexed to add a new column holding the results of those calculations? For the sum example above, I would like to get the result below.

If there is an efficient way to do this without using numpy_indexed in the first place, that would also be very helpful.

array([[ 0,  4,  8],
       [ 0,  4,  8],
       [ 1,  8, 16],
       [ 1,  8, 16],
       [ 2, 10, 20],
       [ 2, 10, 20]])

      



2 answers


One approach is with np.unique to create the unique tags and the shifting indices, and then np.add.reduceat for the intervaled summing -

# idx: first-row index of each group; tags: per-row group ids
_,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
# sum each group's interval, then broadcast the sums back to every row
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]

      

Another way that avoids np.unique and could be beneficial for performance would be like so:

# group start indices (first column assumed sorted)
idx = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1]
# per-row group ids: mark each group start with 1, then cumulative-sum
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()
out = np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]

      

To further boost performance, we could use np.bincount: the np.add.reduceat(a[:,1], idx) step can be replaced with np.bincount(tags, a[:,1]).
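
A minimal sketch of that substitution, reusing the tags array from the previous snippet (the astype cast is only needed because np.bincount returns a float array):

# per-group sums via weighted bincount, broadcast back to each row;
# bincount returns floats, so cast back for an integer result column
out = np.c_[a, np.bincount(tags, a[:,1])[tags].astype(a.dtype)]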

Example run -

In [271]: a    # Using a more generic sample
Out[271]: 
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])

In [272]: _,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)

In [273]: np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Out[273]: 
array([[11,  4,  8],
       [11,  4,  8],
       [14,  8, 16],
       [14,  8, 16],
       [16, 10, 20],
       [16, 10, 20]])

      

The approaches listed so far assume that the first column is already sorted. If that's not the case, we need to sort the array by the first column with argsort and then use the proposed methods. So, for the unsorted case, we need this preprocessing step -

a = a[a[:,0].argsort()]
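
Putting it together for the unsorted case, here is a minimal sketch (the helper name add_group_sums is just for illustration; note the rows come back in key-sorted order):

import numpy as np

def add_group_sums(a):
    # sort rows by the key column so equal keys become contiguous
    a = a[a[:, 0].argsort()]
    # group start indices and per-row group ids
    idx = np.r_[0, np.flatnonzero(a[1:, 0] > a[:-1, 0]) + 1]
    tags = np.zeros(a.shape[0], dtype=int)
    tags[idx[1:]] = 1
    tags = tags.cumsum()
    # per-group sums broadcast back to every row
    return np.c_[a, np.add.reduceat(a[:, 1], idx)[tags]]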

      




Benchmarking against np.unique

Let's time the custom flatnonzero + cumsum based method against the built-in np.unique for creating the shifting indices idx and the uniqueness-based ids tags. For a case like this, where we know beforehand that the label column is already sorted, we avoid any sorting, as is done internally by np.unique. That should give us a performance edge; let's verify it.

Approaches -

def nonzero_cumsum_based(A):
    # group start indices from the change points in a sorted key array
    idx = np.concatenate(( [0] ,np.flatnonzero(A[1:] > A[:-1])+1 ))
    # per-row group ids: mark each group start with 1, then cumulative-sum
    tags = np.zeros(len(A), dtype=int)
    tags[idx[1:]] = 1
    np.cumsum(tags, out = tags)
    return idx, tags

def unique_based(A):
    # same idx/tags via np.unique, which sorts internally
    _,idx,tags = np.unique(A, return_index=1, return_inverse=1)
    return idx, tags

      

Sample run with the custom function -

In [438]: a
Out[438]: 
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])

In [439]: idx, tags = nonzero_cumsum_based(a[:,0])

In [440]: idx
Out[440]: array([0, 2, 4])

In [441]: tags
Out[441]: array([0, 0, 1, 1, 2, 2])

      

Timing (randi here appears to be a shorthand for np.random.randint) -

In [444]: a = np.c_[np.sort(randi(10,10000,(100000))), randi(0,10000,(100000))]

In [445]: %timeit unique_based(a[:,0])
100 loops, best of 3: 4.3 ms per loop

In [446]: %timeit nonzero_cumsum_based(a[:,0])
1000 loops, best of 3: 486 µs per loop

In [447]: a = np.c_[np.sort(randi(10,10000,(1000000))), randi(0,10000,(1000000))]

In [448]: %timeit unique_based(a[:,0])
10 loops, best of 3: 50.2 ms per loop

In [449]: %timeit nonzero_cumsum_based(a[:,0])
100 loops, best of 3: 3.98 ms per loop

      



Each index object has an inverse property that maps the reduced values back onto their original range; for illustration, we could write:

index = npi.as_index(keys)
unique_keys = index.unique
unique_keys[index.inverse] == keys  # <- should be all true

      

And this property is also exposed on the GroupBy object, since mapping grouped values back onto their input range is indeed a commonly useful operation:



groups = npi.group_by(a[:, 0])
unique, sums = groups.sum(a[:, 1])   # one sum per unique key
new_column = sums[groups.inverse]    # broadcast each group's sum back to its rows
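
Stacking that broadcast column next to the original array then gives the three-column result from the question:

result = np.c_[a, new_column]
# array([[ 0,  4,  8],
#        [ 0,  4,  8],
#        [ 1,  8, 16],
#        [ 1,  8, 16],
#        [ 2, 10, 20],
#        [ 2, 10, 20]])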

      

In general, the numpy_indexed source can serve as inspiration for how to perform such common operations; for instance, group_by.var runs into the same problem, broadcasting the mean of each group back to every element of the group it was formed from, in order to compute the errors within each group. But better tutorials certainly wouldn't hurt.
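
As a rough sketch of that pattern (not the library's actual implementation, and assuming GroupBy.mean behaves like GroupBy.sum above), the per-group means can be broadcast back with the same inverse mapping:

groups = npi.group_by(a[:, 0])
unique_keys, means = groups.mean(a[:, 1])
# deviation of every row's value from its own group's mean
deviations = a[:, 1] - means[groups.inverse]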

Could you give a higher-level description of the problem you are trying to solve? Chances are you can simplify your code even further once you get more comfortable thinking in terms of the design patterns that npi makes available.
