Grouping subsets of rows in a dataframe in Python using Pandas
I have the following dataframe from a dataset containing 0.3 million rows:
CustomerID Revenue
0 17850.0 15.30
1 17850.0 11.10
2 13047.0 17.85
3 13047.0 17.85
4 17850.0 20.34
5 13047.0 12.60
6 13047.0 12.60
7 13047.0 31.80
8 17850.0 20.34
9 17850.0 15.30
10 13047.0 9.90
11 13047.0 30.00
12 13047.0 31.80
13 12583.0 40.80
14 12583.0 39.60
15 13047.0 14.85
16 13047.0 14.85
17 12583.0 15.60
18 12583.0 45.00
19 12583.0 70.80
CustomerID values are repeated in batches. For example, the CustomerID value 17850 in the first two rows may appear again later in the dataset. I am trying to group each batch of consecutive rows that share a CustomerID and sum the revenue for that batch. The transformed dataframe should look like this:
CustomerID TotalRevenue
0 17850.0 26.40
1 13047.0 35.70
2 17850.0 20.34
3 13047.0 57.0
4 17850.0 35.64
5 13047.0 71.7
6 12583.0 80.4
7 13047.0 29.7
8 12583.0 131.4
The problem is that if I use the groupby method, it groups all rows with the same CustomerID value. Thus, it combines all rows with CustomerID 17850 across the entire dataframe, not just the batch of the first two rows, followed by the subsequent batches of other CustomerID values.
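To make the problem concrete, here is a minimal sketch, assuming the sample rows above are loaded into a dataframe named df (the hand-built df and the name plain are only for illustration). A plain groupby on CustomerID returns one row per customer rather than one row per batch:

```python
import pandas as pd

# Sample rows from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 17850.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 12583.0, 12583.0,
                   13047.0, 13047.0, 12583.0, 12583.0, 12583.0],
    "Revenue": [15.30, 11.10, 17.85, 17.85, 20.34,
                12.60, 12.60, 31.80, 20.34, 15.30,
                9.90, 30.00, 31.80, 40.80, 39.60,
                14.85, 14.85, 15.60, 45.00, 70.80],
})

# Plain groupby merges every occurrence of a CustomerID, regardless of position.
plain = df.groupby("CustomerID", sort=False)["Revenue"].sum().reset_index()
print(plain)  # one row per customer, not one row per batch
```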
I would truly appreciate help with how to do this using Pandas. Thanks.
2 answers
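# Give each run of consecutive identical CustomerID values its own group ID (GID), then sum Revenue per run.
df.groupby(['CustomerID', df.CustomerID.diff().ne(0).cumsum()], sort=False)['Revenue'].sum().rename_axis(['CustomerID', 'GID']).reset_index().drop('GID', axis=1)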
Output:
CustomerID Revenue
0 17850.0 26.40
1 13047.0 35.70
2 17850.0 20.34
3 13047.0 57.00
4 17850.0 35.64
5 13047.0 71.70
6 12583.0 80.40
7 13047.0 29.70
8 12583.0 131.40
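A note on how the one-liner works, offered as a sketch rather than a definitive implementation: diff() is non-zero wherever CustomerID changes from the previous row (and NaN on the first row, which ne(0) also counts as a change), so the cumulative sum gives every run of consecutive identical IDs its own number. Grouping by that number keeps separate batches of the same customer apart. The version below spells this out step by step on the sample data. The names starts, run_id and result and the hand-built df are assumptions for illustration. It compares each row with shift() instead of using diff(), which marks the same boundaries here and should also work if the IDs were strings rather than numbers.

```python
import pandas as pd

# Sample rows from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 17850.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 12583.0, 12583.0,
                   13047.0, 13047.0, 12583.0, 12583.0, 12583.0],
    "Revenue": [15.30, 11.10, 17.85, 17.85, 20.34,
                12.60, 12.60, 31.80, 20.34, 15.30,
                9.90, 30.00, 31.80, 40.80, 39.60,
                14.85, 14.85, 15.60, 45.00, 70.80],
})

# True on the first row of every batch: the ID differs from the row above.
starts = df["CustomerID"].ne(df["CustomerID"].shift())

# Running count of batch starts gives one label per consecutive batch (1, 2, 3, ...).
run_id = starts.cumsum().rename("run_id")

# Sum Revenue within each batch, keeping the batch's CustomerID alongside.
result = (
    df.groupby(run_id, sort=False)
      .agg(CustomerID=("CustomerID", "first"), TotalRevenue=("Revenue", "sum"))
      .reset_index(drop=True)
)
print(result)
```

On the sample rows this gives the same nine batches as the output above, with the sum column named TotalRevenue as in the question.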