Grouping subsets of rows in a dataframe in Python using Pandas

I have the following sample from a dataset containing about 0.3 million rows:

    CustomerID  Revenue
0   17850.0     15.30
1   17850.0     11.10
2   13047.0     17.85
3   13047.0     17.85
4   17850.0     20.34
5   13047.0     12.60
6   13047.0     12.60
7   13047.0     31.80
8   17850.0     20.34
9   17850.0     15.30
10  13047.0     9.90
11  13047.0     30.00
12  13047.0     31.80
13  12583.0     40.80
14  12583.0     39.60
15  13047.0     14.85
16  13047.0     14.85
17  12583.0     15.60
18  12583.0     45.00
19  12583.0     70.80

      

CustomerID values are repeated in batches: a value such as 17850, which appears in the first two rows, may appear again later in the dataset. I want to group each consecutive run of rows that share a CustomerID and sum the revenue for that run. The transformed dataframe should look like this:

    CustomerID  TotalRevenue
0   17850.0     26.40
1   13047.0     35.70
2   17850.0     20.34
3   13047.0     57.00
4   17850.0     35.64
5   13047.0     71.70
6   12583.0     80.40
7   13047.0     29.70
8   12583.0     131.40

      

The problem is that the groupby method groups all rows with the same CustomerID value. It therefore combines every 17850 row in the entire dataframe into one group, rather than treating the first two rows as one group and each later run of a CustomerID as its own group.

I would truly appreciate help with how to do this using Pandas. Thanks.

+3




2 answers


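# Number each consecutive run of equal CustomerID values, then sum Revenue per run.
run_id = df.CustomerID.diff().ne(0).cumsum()
(df.groupby(['CustomerID', run_id], sort=False)['Revenue'].sum()
   .rename_axis(['CustomerID', 'GID']).reset_index().drop('GID', axis=1))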

      

Output:



   CustomerID  Revenue
0     17850.0    26.40
1     13047.0    35.70
2     17850.0    20.34
3     13047.0    57.00
4     17850.0    35.64
5     13047.0    71.70
6     12583.0    80.40
7     13047.0    29.70
8     12583.0   131.40
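
To see why this works, it can help to split the one-liner into steps. The sketch below rebuilds only the first eight rows of the question's sample in a small DataFrame and prints the intermediate run labels. The names `run_id` and `totals`, and the use of named aggregation to produce a `TotalRevenue` column, are illustrative choices rather than part of the answer above.

```python
import pandas as pd

# First eight rows of the sample data from the question.
df = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0,
                   17850.0, 13047.0, 13047.0, 13047.0],
    "Revenue": [15.30, 11.10, 17.85, 17.85, 20.34, 12.60, 12.60, 31.80],
})

# diff() is non-zero (or NaN on the first row) wherever the ID changes,
# so ne(0) flags the start of each run and cumsum() numbers the runs.
run_id = df["CustomerID"].diff().ne(0).cumsum().rename("run")
print(run_id.tolist())  # [1, 1, 2, 2, 3, 4, 4, 4]

# Group by the run label, keep the ID of each run, and sum its revenue.
totals = (
    df.groupby(run_id)
      .agg(CustomerID=("CustomerID", "first"),
           TotalRevenue=("Revenue", "sum"))
      .reset_index(drop=True)
)
print(totals)
```

Because the grouping key is the run number rather than the ID itself, a customer that reappears later gets a new group instead of being merged with its earlier rows.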

      

+3




import pandas as pd

# df is assumed to contain your data

# Note: this sums Revenue over all rows for each CustomerID, not per consecutive run
result = df.groupby('CustomerID').sum().rename(columns={'Revenue': 'TotalRevenue'})
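
For comparison, a quick sketch on the first five rows of the question's sample shows how grouping on CustomerID alone differs from grouping on consecutive runs. The `shift`-based comparison used here is one alternative way to label runs, and `per_customer`, `per_run` and `run` are illustrative names only.

```python
import pandas as pd

# First five rows of the sample data from the question.
df = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0, 17850.0],
    "Revenue": [15.30, 11.10, 17.85, 17.85, 20.34],
})

# Plain groupby: one row per customer, all occurrences merged.
per_customer = (
    df.groupby("CustomerID", sort=False)["Revenue"].sum()
      .rename("TotalRevenue").reset_index()
)

# Run-aware groupby: a new group starts whenever the ID differs
# from the ID on the previous row.
run = df["CustomerID"].ne(df["CustomerID"].shift()).cumsum().rename("run")
per_run = (
    df.groupby(["CustomerID", run], sort=False)["Revenue"].sum()
      .rename("TotalRevenue").reset_index().drop(columns="run")
)

print(len(per_customer))  # 2 rows: 17850 and 13047
print(len(per_run))       # 3 rows: 17850, 13047, 17850
```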

      



0

