Weighted z-score within groups
Consider the following DataFrame df:
np.random.seed([3,1415])
df = pd.DataFrame(dict(
S=np.random.rand(20),
W=np.random.rand(20),
G=np.random.choice(list('ABCD'), 20)
))
print(df)
G S W
0 B 0.444939 0.278735
1 D 0.407554 0.609862
2 C 0.460148 0.085823
3 B 0.465239 0.836997
4 A 0.462691 0.739635
5 A 0.016545 0.866059
6 D 0.850445 0.691271
7 C 0.817744 0.377185
8 B 0.777962 0.225146
9 C 0.757983 0.435280
10 C 0.934829 0.700900
11 A 0.831104 0.700946
12 C 0.879891 0.796487
13 A 0.926879 0.018688
14 D 0.721535 0.700566
15 D 0.117642 0.900749
16 D 0.145906 0.764869
17 C 0.199844 0.253200
18 B 0.437564 0.548054
19 A 0.100702 0.778883
I want to compute a weighted z-score of column 'S', using weights 'W', within each group defined by 'G'.
To pin down the definition of a weighted z-score, this is how you would calculate it across the entire set:
(df.S - (df.S * df.W).mean()) / df.S.std()
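Reproducing the setup, the whole-set version runs directly as below. Note that, per the question's own definition, the centering term is the mean of the S*W products (not the conventional sum(S*W) / sum(W)), and the scale is the unweighted standard deviation of S:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))

# Mean of the S*W products is subtracted, unweighted std of S scales.
z_all = (df.S - (df.S * df.W).mean()) / df.S.std()
print(z_all.head())
```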
Question(s)
What's the most elegant way to calculate this? And what's the most time-efficient way?
I've worked out the answer to be:
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
Here you go:
>>> df.groupby('G').apply(lambda x: (x.S - (x.S * x.W).mean()) / x.S.std())
G
A 4 0.619677
5 -0.461462
11 1.512449
13 1.744537
19 -0.257527
B 0 1.291729
3 1.414926
8 3.312825
18 1.246969
C 2 0.394302
7 1.645083
9 1.436054
10 2.054617
12 1.862456
17 -0.516180
D 1 0.288806
6 1.625974
14 1.236770
15 -0.586493
16 -0.501159
Name: S, dtype: float64
First, we group the rows by G, then apply the weighted z-score to each group's sub-frame.
transform
P = df.S * df.W
m = P.groupby(df.G).transform('mean')
z = df.groupby('G').S.transform('std')
(df.S - m) / z
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
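The three transform steps package naturally into a reusable helper (weighted_zscore is a hypothetical name, not part of pandas):

```python
import numpy as np
import pandas as pd

def weighted_zscore(frame, val='S', wt='W', grp='G'):
    """Group-wise weighted z-score, per the question's definition."""
    p = frame[val] * frame[wt]
    m = p.groupby(frame[grp]).transform('mean')   # group mean of val*wt
    s = frame.groupby(grp)[val].transform('std')  # group std of val
    return (frame[val] - m) / s

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))
z = weighted_zscore(df)
```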
agg + join + eval
f = dict(S=dict(Std='std'), P=dict(Mean='mean'))
stats = df.assign(P=df.S * df.W).groupby('G').agg(f)
stats.columns = stats.columns.droplevel()
df.join(stats, on='G').eval('(S - Mean) / Std')
0 1.291729
1 0.288806
2 0.394302
3 1.414926
4 0.619677
5 -0.461462
6 1.625974
7 1.645083
8 3.312825
9 1.436054
10 2.054617
11 1.512449
12 1.862456
13 1.744537
14 1.236770
15 -0.586493
16 -0.501159
17 -0.516180
18 1.246969
19 -0.257527
dtype: float64
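A caveat: the nested-dict form of agg with renaming used above was deprecated and later removed from pandas (current versions raise a SpecificationError). A sketch of the same stats table using named aggregation, available since pandas 0.25, keeping the same Mean/Std column names:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    S=np.random.rand(20),
    W=np.random.rand(20),
    G=np.random.choice(list('ABCD'), 20),
))

# Named aggregation: each output column is (source_column, aggfunc).
stats = (df.assign(P=df.S * df.W)
           .groupby('G')
           .agg(Mean=('P', 'mean'), Std=('S', 'std')))

out = df.join(stats, on='G').eval('(S - Mean) / Std')
```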
Naive timing