Using NumPy reduceat to calculate group averages

import numpy as np
import pandas as pd
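# dtype=float keeps the dummies numeric; newer pandas returns booleans by default,
# and np.add / np.multiply on booleans act as logical or / and.
dummies = np.array(pd.get_dummies(list('abdccadab'), dtype=float))  # categorical IV
groupIDs = np.array([10,10,10,10,20,20,30,30,30])  # groups (strata), already sorted
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)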

      

I know we can compute sums, products, etc. per group and per column, for example:

np.multiply.reduceat(dummies,idx)[tags]

      

But is there a way to calculate the means of these groups? np.mean.reduceat and np.average.reduceat don't work; both fail with:

AttributeError: 'function' object has no attribute 'reduceat'

      



2 answers


Use np.add.reduceat to get the column sums of the data array dummies over the intervals that start at the offsets idx, and then divide them by the interval lengths computed with np.bincount:

np.add.reduceat(dummies, idx, axis=0)/np.bincount(tags)[:,None]
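As a quick sanity check, here is a minimal self-contained sketch of the same computation, compared against a plain per-group loop. The `values` array is an illustrative stand-in for dummies, not data from the question.

```python
import numpy as np

# Illustrative stand-in for `dummies`: 9 rows, 2 columns (assumed data).
values = np.arange(18, dtype=float).reshape(9, 2)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # sorted groups

# idx: first row of each group; tags: group number (0, 1, 2) of every row.
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# Per-group column sums divided by per-group row counts.
group_means = np.add.reduceat(values, idx, axis=0) / np.bincount(tags)[:, None]

# Reference result from an explicit loop over the groups.
expected = np.array([values[groupIDs == g].mean(axis=0) for g in np.unique(groupIDs)])

# Broadcast the group means back to one row per observation.
per_row_means = group_means[tags]
```

Indexing with `[tags]` at the end mirrors the `[tags]` used in the question, so every row ends up paired with the mean of its own group.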

      

Another way to calculate the interval lengths is to use idx directly:



np.diff(np.r_[idx,dummies.shape[0]])
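The two ways of getting the interval lengths can be checked against each other. This is a small sketch using only the group IDs from the question; `n_rows` is a name introduced here and plays the role of dummies.shape[0].

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

n_rows = groupIDs.shape[0]  # stands in for dummies.shape[0]

# Append the total row count as a closing boundary, then take differences
# between consecutive group start offsets.
lengths_from_idx = np.diff(np.r_[idx, n_rows])

# Count how many rows carry each group tag.
lengths_from_tags = np.bincount(tags)
```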

      

Also, we can avoid using np.unique to get idx when groupIDs is already sorted, for example:

idx = np.r_[0,np.flatnonzero(groupIDs[1:] > groupIDs[:-1])+1]
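If the per-row group numbers are needed as well, they can be derived from the same comparison. The sketch below assumes sorted group IDs; the `change` mask and the cumulative-sum construction of `tags` are additions for illustration, not part of the answer above.

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # must be sorted

# True wherever the ID increases from one row to the next.
change = groupIDs[1:] > groupIDs[:-1]

# A group starts at row 0 and right after every change.
idx = np.r_[0, np.flatnonzero(change) + 1]

# Running count of changes gives each row's group number (0, 1, 2, ...).
tags = np.r_[0, np.cumsum(change)]

# For comparison: the same arrays as returned by np.unique.
_, idx_ref, tags_ref = np.unique(groupIDs, return_index=True, return_inverse=True)
```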

      



The numpy_indexed package (disclaimer: I am its author) offers this type of functionality as a one-liner:

import numpy_indexed as npi
unique_groups, means = npi.group_by(groupIDs).mean(dummies)

      



In this case (already sorted keys) it offers linear, vectorized performance, albeit with some additional overhead compared to the custom solution posted by Divakar, which has this assumption baked in. But depending on how much you value portability, self-containedness, and generality, this might be the preferred alternative.
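To illustrate the sorted-keys assumption: with plain NumPy, one way to handle unsorted keys is to sort the rows by group first and then apply the reduceat approach. This is only a sketch on made-up data, and it does not describe how numpy_indexed works internally.

```python
import numpy as np

# Unsorted group IDs and a single-column value array (assumed data).
groupIDs = np.array([30, 10, 20, 10, 30, 10, 20, 30, 10])
values = np.arange(9, dtype=float)[:, None]

# A stable sort makes each group contiguous while keeping row order within it.
order = np.argsort(groupIDs, kind="stable")
sorted_ids = groupIDs[order]
sorted_values = values[order]

# Start offset and size of each contiguous group.
unique_groups, idx, counts = np.unique(sorted_ids, return_index=True, return_counts=True)

# Per-group sums divided by per-group sizes.
group_means = np.add.reduceat(sorted_values, idx, axis=0) / counts[:, None]
```

The sort costs O(n log n), so the linear-time behaviour discussed above applies only when the keys arrive already sorted.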
