Using NumPy reduceat to calculate group averages
import numpy as np
import pandas as pd
dummies = np.array(pd.get_dummies(list('abdccadab')))  # categorical IV (one-hot)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # groups (/strata)
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)
I know we can compute sums, products, etc. per group, per column, for example:
np.multiply.reduceat(dummies,idx)[tags]
but is there a way to calculate the means of these bins?
np.mean.reduceat
and np.average.reduceat
don't work, because np.mean and np.average are plain functions rather than ufuncs (reduceat is only defined on ufuncs):
AttributeError: 'function' object has no attribute 'reduceat'
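For concreteness, here is a small runnable sketch of the per-group reduction that does work, mirroring the setup above (np.add is used so the per-group result is a count; the one-hot matrix is built without pandas purely to keep the sketch self-contained):

```python
import numpy as np

# rebuild the question's data without pandas: one-hot encode the letters
letters = np.array(list('abdccadab'))
dummies = (letters[:, None] == np.unique(letters)).astype(float)  # shape (9, 4)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# per-group column sums, then broadcast back to one row per observation
sums = np.add.reduceat(dummies, idx, axis=0)  # shape (n_groups, n_cols)
per_row = sums[tags]                          # shape (n_obs, n_cols)
```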
Use np.add.reduceat
to get the column-wise sums of the data array dummies
based on the interval offsets idx
, and then divide them by the interval lengths computed with np.bincount
:
np.add.reduceat(dummies, idx, axis=0)/np.bincount(tags)[:,None]
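A quick runnable check of this formula against an explicit per-group loop (the one-hot matrix is rebuilt without pandas so the sketch is self-contained):

```python
import numpy as np

letters = np.array(list('abdccadab'))
dummies = (letters[:, None] == np.unique(letters)).astype(float)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# group means: segment sums divided by segment lengths
means = np.add.reduceat(dummies, idx, axis=0) / np.bincount(tags)[:, None]

# cross-check against an explicit per-group mean
expected = np.vstack([dummies[groupIDs == g].mean(axis=0)
                      for g in np.unique(groupIDs)])
assert np.allclose(means, expected)
```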
Another way to compute the interval lengths is to use idx
directly:
np.diff(np.r_[idx,dummies.shape[0]])
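Both length computations agree; a minimal sketch confirming that on the question's data:

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# interval lengths via diff of offsets (appending the total length)...
counts_diff = np.diff(np.r_[idx, groupIDs.shape[0]])
# ...versus counting occurrences of each group tag
counts_binc = np.bincount(tags)
```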
Also, since the group IDs are already sorted, we can avoid np.unique
when computing idx
, for example:
idx = np.r_[0,np.flatnonzero(groupIDs[1:] > groupIDs[:-1])+1]
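Note that this shortcut assumes the group IDs are in non-decreasing order; under that assumption it reproduces the offsets that np.unique would return:

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # must be sorted

# start offsets: position 0 plus every index where the group ID increases
idx2 = np.r_[0, np.flatnonzero(groupIDs[1:] > groupIDs[:-1]) + 1]

# same result as np.unique's return_index on sorted input
_, idx, _ = np.unique(groupIDs, return_index=True, return_inverse=True)
```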
The numpy_indexed package (disclaimer: I am its author) offers this type of functionality as a one-liner:
import numpy_indexed as npi
unique_groups, means = npi.group_by(groupIDs).mean(dummies)
In this case (keys already sorted) it offers linear, vectorized performance, albeit with somewhat more constant overhead than the custom solution posted by Divakar, which has that assumption baked in. But depending on how much you value conciseness, self-documentation, and generality, this might be the preferred alternative.