Using NumPy reduceat to calculate group averages
import numpy as np
import pandas as pd
dummies = np.array(pd.get_dummies(list('abdccadab')))  # categorical IV (one-hot)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # groups (/strata)
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)
I know we can compute sums, products, etc. per group, per column, for example:
np.multiply.reduceat(dummies,idx)[tags]
but is there a way to calculate the means of these bins?
np.mean.reduceat
and np.average.reduceat
don't work, because np.mean and np.average are plain functions rather than ufuncs (reduceat is only defined on ufuncs):
AttributeError: 'function' object has no attribute 'reduceat'
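For concreteness, here is a small runnable sketch of the per-group reduction that does work, mirroring the setup above (np.add is used so the per-group result is a count; the one-hot matrix is built without pandas purely to keep the sketch self-contained):

```python
import numpy as np

# rebuild the question's data without pandas: one-hot encode the letters
letters = np.array(list('abdccadab'))
dummies = (letters[:, None] == np.unique(letters)).astype(float)  # shape (9, 4)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# per-group column sums, then broadcast back to one row per observation
sums = np.add.reduceat(dummies, idx, axis=0)  # shape (n_groups, n_cols)
per_row = sums[tags]                          # shape (n_obs, n_cols)
```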
Use np.add.reduceat
to get the column-wise sums of the data array dummies
based on the interval offsets idx
, and then divide them by the interval lengths computed with np.bincount
:
np.add.reduceat(dummies, idx, axis=0)/np.bincount(tags)[:,None]
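A quick runnable check of this formula against an explicit per-group loop (the one-hot matrix is rebuilt without pandas so the sketch is self-contained):

```python
import numpy as np

letters = np.array(list('abdccadab'))
dummies = (letters[:, None] == np.unique(letters)).astype(float)
groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# group means: segment sums divided by segment lengths
means = np.add.reduceat(dummies, idx, axis=0) / np.bincount(tags)[:, None]

# cross-check against an explicit per-group mean
expected = np.vstack([dummies[groupIDs == g].mean(axis=0)
                      for g in np.unique(groupIDs)])
assert np.allclose(means, expected)
```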
Another way to compute the interval lengths is to use idx
directly:
np.diff(np.r_[idx,dummies.shape[0]])
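Both length computations agree; a minimal sketch confirming that on the question's data:

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])
_, idx, tags = np.unique(groupIDs, return_index=True, return_inverse=True)

# interval lengths via diff of offsets (appending the total length)...
counts_diff = np.diff(np.r_[idx, groupIDs.shape[0]])
# ...versus counting occurrences of each group tag
counts_binc = np.bincount(tags)
```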
Also, since the group IDs are already sorted, we can avoid np.unique
when computing idx
, for example:
idx = np.r_[0,np.flatnonzero(groupIDs[1:] > groupIDs[:-1])+1]
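Note that this shortcut assumes the group IDs are in non-decreasing order; under that assumption it reproduces the offsets that np.unique would return:

```python
import numpy as np

groupIDs = np.array([10, 10, 10, 10, 20, 20, 30, 30, 30])  # must be sorted

# start offsets: position 0 plus every index where the group ID increases
idx2 = np.r_[0, np.flatnonzero(groupIDs[1:] > groupIDs[:-1]) + 1]

# same result as np.unique's return_index on sorted input
_, idx, _ = np.unique(groupIDs, return_index=True, return_inverse=True)
```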
The numpy_indexed package (disclaimer: I am its author) offers this type of functionality as a one-liner:
import numpy_indexed as npi
unique_groups, means = npi.group_by(groupIDs).mean(dummies)
In this case (keys already sorted) it offers linear, vectorized performance, albeit with somewhat more constant overhead than the custom solution posted by Divakar, which has that assumption baked in. But depending on how much you value conciseness, self-documentation, and generality, this might be the preferred alternative.