Using custom groupby aggregate function to output binary result in pandas python

I have a dataset of a trader's transactions where the variable of interest, Buy/Sell, is binary: it takes the value 1 if the transaction was a buy and 0 if it was a sell. An example looks like this:

Trader     Buy/Sell
  A           1
  A           0
  B           1
  B           1
  B           0
  C           1
  C           0
  C           0


I would like to calculate a net Buy/Sell value for each trader: if more than 50% of a trader's transactions were buys, his Buy/Sell is 1; if less than 50% were buys, it is 0; and if exactly 50% were buys, it is NA (and that trader is excluded from later calculations).

So for trader A, the buy share is (number of buys) / (total number of transactions) = 1/2 = 0.5, which gives NA.

For trader B, it is 2/3 = 0.67, which gives 1.

For trader C, it is 1/3 = 0.33, which gives 0.
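For reference, these buy shares are just the per-trader mean of the 0/1 column:

```python
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# share of buys per trader = mean of the binary column
shares = df.groupby('Trader')['Buy/Sell'].mean()
# A: 0.5, B: 0.666..., C: 0.333...
```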

The table should look like this:

Trader     Buy/Sell
  A           NA
  B           1
  C           0 


Ultimately, I want to calculate the aggregate number of buys, which in this case is 1, and the aggregate total number of transactions (excluding the NAs), which in this case is 2. I'm not actually interested in the second table itself; I only need the aggregate number of buys and the aggregate total (count) of Buy/Sell.

How do I do this in Pandas?



1 answer


import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()
result['Buy/Sell'] = np.select(condlist=[means>0.5, means<0.5], choicelist=[1, 0], 
    default=np.nan)
print(result)


gives

        Buy/Sell  sum  count
Trader                      
A            NaN    1      2
B              1    2      3
C              0    1      3
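This result table also answers the final part of the question directly: pandas' sum and count skip NaN by default, so the two aggregate numbers fall straight out of the Buy/Sell column (a sketch reusing the code above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

grouped = df.groupby(['Trader'])
result = grouped['Buy/Sell'].agg(['sum', 'count'])
means = grouped['Buy/Sell'].mean()
result['Buy/Sell'] = np.select(condlist=[means > 0.5, means < 0.5],
                               choicelist=[1, 0], default=np.nan)

# NaN rows (trader A) are skipped by both sum and count,
# giving the aggregate number of buys and the non-NA total
total_buys = result['Buy/Sell'].sum()     # 1.0
total_count = result['Buy/Sell'].count()  # 2
```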



My original answer used a custom aggregator, categorize:

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan
result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
result = result.rename(columns={'categorize' : 'Buy/Sell'})
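On newer pandas versions (0.25+, an assumption about your environment), named aggregation can replace the rename step by naming the output columns directly in agg — a sketch of the same idea:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Buy/Sell': [1, 0, 1, 1, 0, 1, 0, 0],
                   'Trader': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

# keyword-style (named) aggregation: output column -> aggregation
result = df.groupby('Trader')['Buy/Sell'].agg(
    buy_sell=categorize, total='sum', count='count')
```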




While calling a custom function can be handy, performance is often significantly slower when you use a custom function instead of the built-in aggregators (e.g. groupby/agg/mean). The built-in aggregators are Cythonized, while a custom function drops performance down to plain Python for-loop speed.

The difference in speed is especially significant when the number of groups is large. For example, with a 10,000-row DataFrame and 1000 groups:

import numpy as np
import pandas as pd
np.random.seed(2017)
N = 10000
df = pd.DataFrame({
    'Buy/Sell': np.random.randint(2, size=N),
    'Trader': np.random.randint(1000, size=N)})

def using_select(df):
    grouped = df.groupby(['Trader'])
    result = grouped['Buy/Sell'].agg(['sum', 'count'])
    means = grouped['Buy/Sell'].mean()
    result['Buy/Sell'] = np.select(condlist=[means>0.5, means<0.5], choicelist=[1, 0], 
        default=np.nan)
    return result

def categorize(x):
    m = x.mean()
    return 1 if m > 0.5 else 0 if m < 0.5 else np.nan

def using_custom_function(df):
    result = df.groupby(['Trader'])['Buy/Sell'].agg([categorize, 'sum', 'count'])
    result = result.rename(columns={'categorize' : 'Buy/Sell'})
    return result


using_select is over 50 times faster than using_custom_function:

In [69]: %timeit using_custom_function(df)
10 loops, best of 3: 132 ms per loop

In [70]: %timeit using_select(df)
100 loops, best of 3: 2.46 ms per loop

In [71]: 132/2.46
Out[71]: 53.65853658536585

