Using custom groupby aggregate function to output binary result in pandas python
I have a dataset of a trader's transactions where the variable of interest, Buy/Sell, is binary and takes the value 1 if the transaction was a buy and 0 if it was a sell. An example looks like this:
Trader Buy/Sell
A 1
A 0
B 1
B 1
B 0
C 1
C 0
C 0
I would like to calculate a net Buy/Sell for each trader: if more than 50% of a trader's transactions were buys, he would have a Buy/Sell of 1; if less than 50% were buys, he would have a Buy/Sell of 0; and if exactly 50% were buys, it would be NA (and would not be taken into account in later calculations).
So, for trader A, the buy share is (number of buys) / (total number of transactions) = 1/2 = 0.5, which gives NA.
For trader B, this is 2/3 = 0.67, which gives 1.
For trader C, this is 1/3 = 0.33, which gives 0.
The table should look like this:
Trader Buy/Sell
A NA
B 1
C 0
Ultimately, I want to calculate the total cumulative number of buys, which in this case is 1, and the cumulative total number of transactions (excluding the NAs), which in this case is 2. I'm not interested in the second table itself; I only want the aggregated number of buys and the cumulative total (count) of Buy/Sell.
How do I do this in Pandas?
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()
result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                               choicelist=[1, 0], default=np.nan)
print(result)
gives
        Buy/Sell  sum  count
Trader
A            NaN    1      2
B              1    2      3
C              0    1      3
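To get the two final numbers the question asks for (the total number of net buys and the count of traders excluding the NAs), one way is to drop the NA rows from the net Buy/Sell column and aggregate what remains; the snippet below rebuilds result so it is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()
result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                               choicelist=[1, 0], default=np.nan)

# Drop the NA trader (A), then aggregate the net Buy/Sell column.
net = result['Buy/Sell'].dropna()
total_buys = net.sum()     # 1.0 -- trader B's net buy
total_count = net.count()  # 2   -- traders B and C
```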
My original answer used a custom aggregator, categorize:
def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
result = result.rename(columns={'categorize': 'Buy/Sell'})
While calling a custom function can be handy, performance is often significantly slower with a custom function than with the built-in aggregators (e.g. groupby/agg/mean). The built-in aggregators are Cythonized, whereas UDFs reduce performance to that of a plain Python for-loop.
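A minimal illustration of this point: the string alias dispatches to the Cythonized implementation, while a lambda is called once per group at Python speed, even though both return the same values:

```python
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
g = df.groupby('Trader')['Buy/Sell']

# 'mean' uses the fast Cythonized path; the lambda is a Python-level UDF.
fast = g.agg('mean')
slow = g.agg(lambda x: x.mean())
assert fast.equals(slow)
```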
The difference in speed is especially significant when the number of groups is large. For example, with a 10000-row DataFrame containing 1000 groups,
import numpy as np
import pandas as pd

np.random.seed(2017)
N = 10000
df = pd.DataFrame({
    'Buy/Sell': np.random.randint(2, size=N),
    'Trader': np.random.randint(1000, size=N)})

def using_select(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count'])
    means = grouped['Buy/Sell'].mean()
    result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                                   choicelist=[1, 0], default=np.nan)
    return result

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

def using_custom_function(df):
    result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
    result = result.rename(columns={'categorize': 'Buy/Sell'})
    return result
using_select is over 50 times faster than using_custom_function:
In [69]: %timeit using_custom_function(df)
10 loops, best of 3: 132 ms per loop
In [70]: %timeit using_select(df)
100 loops, best of 3: 2.46 ms per loop
In [71]: 132/2.46
Out[71]: 53.65853658536585
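As a side note, on pandas 0.25 or newer the same select-based approach can be written with named aggregation in a single agg call, still using only the Cythonized built-ins (the column names below are my own choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# One pass over the groups with built-in aggregators only.
result = df.groupby('Trader')['Buy/Sell'].agg(
    sum='sum', count='count', mean='mean')
result['Buy/Sell'] = np.select([result['mean'] > 0.5, result['mean'] < 0.5],
                               [1, 0], default=np.nan)
result = result.drop(columns='mean')
```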