How to compute a weighted sum when using groupby in pandas
I put together an example because the context and details of my dataset would be too much, and unnecessary, to explain in order to ask my question. The example may be silly, but it illustrates what I hope to achieve (albeit on a much larger scale) and maps closely onto the problem at hand. Imagine we have different users (each represented by a letter, in alphabetical order). Each user has multiple posts, and different users often share the same post. We then come up with an importance score (either 0 or 1) and a reliability score (on a scale of 1 to 10). How these metrics are calculated is irrelevant to this question, but imagine that importance is determined by, say, analyzing content and context / current events, and that reliability takes into account the previous performance of that source / user. It is unclear whether there is a relationship between importance and reliability.
User  Share                                 Importance  Reliability
A     Carrots are good for eyesight         0           3
B     Apple Cider Vinegar is good for pain  1           4
C     Garlic is good for breath             0           7
A     Garlic is good for breath             1           6
B     Carrots are good for eyesight         1           9
The numbers may not make sense, apologies. Regardless, I want to compute some kind of weighted sum for each text that takes both reliability and importance into account. To do this, I want to find each unique text (given by the Share column) and sum the importance and reliability scores for all users who shared that text. So I get something like:
A 6
B 13
C 0
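For reference, the expected numbers above are the per-user sums of Importance * Reliability. A minimal plain-Python sketch of that arithmetic, with the rows copied from the table and the names `rows` and `totals` chosen purely for illustration:

```python
from collections import defaultdict

# (User, Importance, Reliability) rows copied from the example table
rows = [
    ("A", 0, 3),
    ("B", 1, 4),
    ("C", 0, 7),
    ("A", 1, 6),
    ("B", 1, 9),
]

# Accumulate Importance * Reliability for each user
totals = defaultdict(int)
for user, importance, reliability in rows:
    totals[user] += importance * reliability

print(dict(totals))  # {'A': 6, 'B': 13, 'C': 0}
```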
I would love sample code and suggestions on how to solve this! Thanks in advance.
First multiply the columns with mul, then use groupby + sum.
The advantage is that groupby is given the column as a Series, so no temporary column is needed.
import pandas as pd

df = pd.DataFrame({'User':['A','B','C','A','B'],
                   'Importance':[0,1,0,1,1],
                   'Reliability':[3,4,7,6,9]})
print(df)
   Importance  Reliability User
0           0            3    A
1           1            4    B
2           0            7    C
3           1            6    A
4           1            9    B
df1 = df.Importance.mul(df.Reliability).groupby(df['User']).sum().reset_index(name='col')
print(df1)
  User  col
0    A    6
1    B   13
2    C    0
This one is straight from
PROJECT
-------
KILL
Project Overkill ... just raise your hand if you didn't get it.
And please don't accept this answer! I wrote it just for fun. Yes, I believe it can be useful to many others. No, I don't think it is necessary here. @Jezrael's answer is what you want.
Using numba to over-optimize a very simple task
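from numba import njit
import pandas as pd
import numpy as np

# Pull the columns out as NumPy arrays so numba can work on them
u = df.User.values
i = df.Importance.values
r = df.Reliability.values

# f: integer code per row, q: unique users in order of first appearance
f, q = pd.factorize(u)

@njit
def wghtd_sum(i, r, f):
    # One accumulator slot per unique user
    o = np.zeros(f.max() + 1, dtype=np.int64)
    for j in range(r.size):
        o[f[j]] += r[j] * i[j]
    return o

pd.DataFrame(dict(User=q, col=wghtd_sum(i, r, f)))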
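If numba is not available, the same factorize-then-accumulate idea can be sketched with NumPy alone. This is a swapped-in alternative, not the numba version this answer times: it assumes np.bincount with its weights argument as the accumulator, and it rebuilds the small example frame so the block runs on its own. The names `codes`, `users`, `weights`, `sums` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Small example frame from the question (rebuilt so this block is self-contained)
df = pd.DataFrame({'User': ['A', 'B', 'C', 'A', 'B'],
                   'Importance': [0, 1, 0, 1, 1],
                   'Reliability': [3, 4, 7, 6, 9]})

# codes: integer label per row; users: unique users in order of first appearance
codes, users = pd.factorize(df['User'].to_numpy())

# Sum Importance * Reliability into one bin per user. bincount with weights
# returns float64, so cast back to integers afterwards.
weights = df['Importance'].to_numpy() * df['Reliability'].to_numpy()
sums = np.bincount(codes, weights=weights, minlength=len(users)).astype(np.int64)

result = pd.DataFrame({'User': users, 'col': sums})
print(result)
```

Whether this beats the numba loop on large data is not something measured here; it is only meant to show the same grouping logic without the extra dependency.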
Timing
tiny data
%%timeit
u = df.User.values
i = df.Importance.values
r = df.Reliability.values
f, q = pd.factorize(u)
pd.DataFrame(dict(User=q, col=wghtd_sum(i, r, f)))
1000 loops, best of 3: 446 µs per loop
%timeit df.groupby('User').apply(lambda g: (g.Importance*g.Reliability).sum()).reset_index(name='col')
100 loops, best of 3: 2.51 ms per loop
%timeit df.Importance.mul(df.Reliability).groupby(df['User']).sum().reset_index(name='col')
1000 loops, best of 3: 1.19 ms per loop
big data
from string import ascii_uppercase
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    User=np.random.choice(list(ascii_uppercase), 100000),
    Importance=np.random.randint(2, size=100000),
    Reliability=np.random.randint(10, size=100000)
))
%%timeit
u = df.User.values
i = df.Importance.values
r = df.Reliability.values
f, q = pd.factorize(u)
pd.DataFrame(dict(User=q, col=wghtd_sum(i, r, f)))
100 loops, best of 3: 2.45 ms per loop
%timeit df.groupby('User').apply(lambda g: (g.Importance*g.Reliability).sum()).reset_index(name='col')
100 loops, best of 3: 14.1 ms per loop
%timeit df.Importance.mul(df.Reliability).groupby(df['User']).sum().reset_index(name='col')
100 loops, best of 3: 4.45 ms per loop
Just do:
df.groupby('User').apply(lambda g: (g.Importance*g.Reliability).sum())
Or you can create a product column and just sum it up:
df['Score'] = df.Importance * df.Reliability
df.groupby('User').Score.sum()
(Both assume that the same user does not share the same post more than once.)
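A variant of the second approach, sketched under the assumption that you would rather not add a permanent Score column to df: DataFrame.assign returns a copy with the extra column, so the original frame stays untouched. The frame is the small example from the question, and the name `scores` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'User': ['A', 'B', 'C', 'A', 'B'],
                   'Importance': [0, 1, 0, 1, 1],
                   'Reliability': [3, 4, 7, 6, 9]})

# assign() builds the product column on a copy; then group by user and sum it
scores = (df.assign(Score=df['Importance'] * df['Reliability'])
            .groupby('User')['Score']
            .sum())

print(scores.to_dict())       # {'A': 6, 'B': 13, 'C': 0}
print('Score' in df.columns)  # False: df itself was not modified
```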
Going by the wording of your question, I think you want to sum the product of Importance and Reliability for each unique post and each unique user.
Here's a sample dataframe similar to yours -
df = pd.DataFrame({'User':['A','B','C','A','B'],
                   'Share':['Random Post 1','Random post 2','Random Post 3','Random Post 3','Random Post 1'],
                   'Importance':[0,1,0,1,1],
                   'Reliability':[3,4,7,6,9]})
=>
   Importance  Reliability          Share User
0           0            3  Random Post 1    A
1           1            4  Random post 2    B
2           0            7  Random Post 3    C
3           1            6  Random Post 3    A
4           1            9  Random Post 1    B
First, create a new column Product:
df['Product'] = df.Importance.mul(df.Reliability)
=>
   Importance  Reliability          Share User  Product
0           0            3  Random Post 1    A        0
1           1            4  Random post 2    B        4
2           0            7  Random Post 3    C        0
3           1            6  Random Post 3    A        6
4           1            9  Random Post 1    B        9
Now just group by Share and User and sum Product to get the desired result:
df.groupby(['Share','User'])['Product'].sum().reset_index(name='Score')
=>
           Share User  Score
0  Random Post 1    A      0
1  Random Post 1    B      9
2  Random Post 3    A      6
3  Random Post 3    C      0
4  Random post 2    B      4
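If the goal is instead one score per unique post, summed over every user who shared it (one possible reading of the question, not something the asker confirmed), the same product can be grouped by Share alone. A minimal sketch on the sample frame from this answer; the names `per_post` and Score are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'User': ['A', 'B', 'C', 'A', 'B'],
                   'Share': ['Random Post 1', 'Random post 2', 'Random Post 3',
                             'Random Post 3', 'Random Post 1'],
                   'Importance': [0, 1, 0, 1, 1],
                   'Reliability': [3, 4, 7, 6, 9]})

# Multiply first, then group the resulting Series by the Share column
per_post = (df['Importance'].mul(df['Reliability'])
              .groupby(df['Share'])
              .sum()
              .reset_index(name='Score'))

print(per_post)
```

Note that 'Random post 2' and 'Random Post 1' differ in case, so posts are only merged when their text matches exactly.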