Weighted zscore within groups

Consider the following DataFrame df:

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
        S=np.random.rand(20),
        W=np.random.rand(20),
        G=np.random.choice(list('ABCD'), 20)
    ))

print(df)

    G         S         W
0   B  0.444939  0.278735
1   D  0.407554  0.609862
2   C  0.460148  0.085823
3   B  0.465239  0.836997
4   A  0.462691  0.739635
5   A  0.016545  0.866059
6   D  0.850445  0.691271
7   C  0.817744  0.377185
8   B  0.777962  0.225146
9   C  0.757983  0.435280
10  C  0.934829  0.700900
11  A  0.831104  0.700946
12  C  0.879891  0.796487
13  A  0.926879  0.018688
14  D  0.721535  0.700566
15  D  0.117642  0.900749
16  D  0.145906  0.764869
17  C  0.199844  0.253200
18  B  0.437564  0.548054
19  A  0.100702  0.778883


I want to compute a weighted z-score of column 'S', using weights 'W', within each group defined by 'G'.

To pin down the definition of a weighted z-score used here, this is how it is calculated across the entire set:

(df.S - (df.S * df.W).mean()) / df.S.std()
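As a sanity check of that definition: the numerator recenters S by the mean of the products S*W (note this is not a W-normalized weighted average), and the denominator is the ordinary unweighted sample standard deviation of S. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))

# Whole-set weighted z-score: subtract the mean of S*W,
# divide by the unweighted sample std of S.
z = (df.S - (df.S * df.W).mean()) / df.S.std()
print(z.head())
```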


Question(s)

What's the most elegant way to calculate this? And what's the most time-efficient way?

I worked out the answer to be:

0     1.291729
1     0.288806
2     0.394302
3     1.414926
4     0.619677
5    -0.461462
6     1.625974
7     1.645083
8     3.312825
9     1.436054
10    2.054617
11    1.512449
12    1.862456
13    1.744537
14    1.236770
15   -0.586493
16   -0.501159
17   -0.516180
18    1.246969
19   -0.257527
dtype: float64


2 answers


Here you go:

>>> df.groupby('G').apply(lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())
G    
A  4     0.619677
   5    -0.461462
   11    1.512449
   13    1.744537
   19   -0.257527
B  0     1.291729
   3     1.414926
   8     3.312825
   18    1.246969
C  2     0.394302
   7     1.645083
   9     1.436054
   10    2.054617
   12    1.862456
   17   -0.516180
D  1     0.288806
   6     1.625974
   14    1.236770
   15   -0.586493
   16   -0.501159
Name: S, dtype: float64




First, we split the data frame into groups by G, then apply the weighted z-score calculation to each group's data frame.
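The apply call is equivalent to iterating over the groups by hand, which makes the per-group computation explicit; a sketch of what it does under the hood:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))

# Compute the weighted z-score within each group,
# then stitch the pieces back into original row order.
pieces = [(g.S - (g.S * g.W).mean()) / g.S.std()
          for _, g in df.groupby('G')]
z = pd.concat(pieces).sort_index()
print(z)
```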


transform

transform broadcasts each per-group statistic back to the original row positions, so no join or Python-level per-group function call is needed:

P = df.S * df.W
m = P.groupby(df.G).transform('mean')
z = df.groupby('G').S.transform('std')
(df.S - m) / z

0     1.291729
1     0.288806
2     0.394302
3     1.414926
4     0.619677
5    -0.461462
6     1.625974
7     1.645083
8     3.312825
9     1.436054
10    2.054617
11    1.512449
12    1.862456
13    1.744537
14    1.236770
15   -0.586493
16   -0.501159
17   -0.516180
18    1.246969
19   -0.257527
dtype: float64

agg + join + eval
# Nested-dict renaming in agg was removed in pandas 1.0;
# named aggregation expresses the same thing.
stats = (df.assign(P=df.S * df.W)
           .groupby('G')
           .agg(Mean=('P', 'mean'), Std=('S', 'std')))
df.join(stats, on='G').eval('(S - Mean) / Std')

0     1.291729
1     0.288806
2     0.394302
3     1.414926
4     0.619677
5    -0.461462
6     1.625974
7     1.645083
8     3.312825
9     1.436054
10    2.054617
11    1.512449
12    1.862456
13    1.744537
14    1.236770
15   -0.586493
16   -0.501159
17   -0.516180
18    1.246969
19   -0.257527
dtype: float64

      




naive time

(timing plot not shown)
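The plotted timings can be reproduced roughly with the stdlib timeit module; a minimal sketch (the function names via_apply and via_transform are my own):

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))

def via_apply():
    return df.groupby('G').apply(
        lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())

def via_transform():
    m = (df.S * df.W).groupby(df.G).transform('mean')
    s = df.groupby('G').S.transform('std')
    return (df.S - m) / s

# transform uses cythonized group reductions and avoids calling a
# Python lambda per group, which is why it tends to win as the
# number of groups grows.
for f in (via_apply, via_transform):
    print(f.__name__, timeit.timeit(f, number=200))
```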
