Grouping by multiple criteria in pandas

I have a pandas data structure like:

>>> df
        Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000 

      

I would like to find the average age and salary for several different groups, where each group is a subset of the columns and the groups can overlap. For example, given this dictionary:

{'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
 'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
 'Senior Management': ['Doris', 'Zoe', 'Jack']}

      

How can I create a groupby function that will do this?


2 answers


This is how I set up the problem:

from io import StringIO
import pandas as pd

df = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000"""
# Parse the whitespace-separated table and use the first column as the index
df = pd.read_csv(StringIO(df), sep=r"\s+").set_index('index')
d = {'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],
     'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],
     'Senior Management': ['Doris', 'Zoe', 'Jack']}

For a solution, just use .loc and iterate through the dictionary:

# For each group, select its columns and average across them (row-wise)
averages = {k: df.loc[:, v].mean(axis=1) for k, v in d.items()}
print(pd.DataFrame(averages).T)  # gives the nice printout...

index                    Age  Salary
Coffee Drinkers    50.000000    6000
Maintainance       51.500000    7000
Parrot lovers      85.000000   51000
Senior Management  48.666667   44000
Tea Drinkers       59.000000   60000
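
As a self-contained sketch of the same .loc idea for current pandas on Python 3, the dictionary comprehension can be wrapped in a small helper. The name group_means is hypothetical (it is not a pandas function), and the frame is built directly from the numbers in the question rather than parsed from text:

```python
import pandas as pd

# Data from the question: people as columns, attributes as rows
df = pd.DataFrame(
    {"Benny": [75, 2000], "Daniel": [30, 9000], "Doris": [95, 100000],
     "Eric": [25, 10000], "Jack": [28, 12000], "Zoe": [23, 20000]},
    index=["Age", "Salary"],
)

d = {"Parrot lovers": ["Doris", "Benny"], "Tea Drinkers": ["Doris", "Zoe"],
     "Maintainance": ["Benny", "Jack"], "Coffee Drinkers": ["Benny", "Eric"],
     "Senior Management": ["Doris", "Zoe", "Jack"]}


def group_means(frame, groups):
    """Hypothetical helper: one row per group, one column per row label of `frame`.

    Each group is averaged independently, so overlapping groups are fine.
    """
    return pd.DataFrame(
        {name: frame.loc[:, members].mean(axis=1)
         for name, members in groups.items()}
    ).T


result = group_means(df, d)
print(result)
```

Because every group selects its own columns, a person who belongs to several groups (Doris, for instance) is simply counted once in each of them.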

      


There are probably several ways to do this; here is one.

Transpose your data and add a True/False column for each category:

In [20]: group_map = {'Parrot lovers': ['Doris', 'Benny'], 
                      'Tea Drinkers': ['Doris', 'Zoe'],
                      'Maintainance': ['Benny', 'Jack'], 
                      'Coffee Drinkers': ['Benny', 'Eric'], 
                      'Senior Management': ['Doris', 'Zoe', 'Jack']}
In [22]: df = df.T
In [23]: for k in group_map:
    ...:     df[k] = df.index.isin(group_map[k])

      

Now you can group by any category to get:



In [24]: df.groupby('Parrot lovers')['Salary'].mean()
Out[24]: 
Parrot lovers
False            12750
True             51000
Name: Salary, dtype: int64

      

Or, iterate over the columns to get the average for each category.

In [24]: means = {}
    ...: for k in group_map:
    ...:     means[k] = df.groupby(k)['Salary'].mean()[True]
    ...: means
    ...: 
Out[24]: 
{'Coffee Drinkers': 6000,
 'Maintainance': 7000,
 'Parrot lovers': 51000,
 'Senior Management': 44000,
 'Tea Drinkers': 60000}
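
If a single groupby call over all categories is preferred, one possible variation (not part of the answer above) is to turn the dictionary into a long membership table with one row per (group, person) pair, merge it with the transposed data, and group on the group name. The names membership, people, and joined are chosen here for illustration only:

```python
import pandas as pd

# Data from the question: people as columns, attributes as rows
df = pd.DataFrame(
    {"Benny": [75, 2000], "Daniel": [30, 9000], "Doris": [95, 100000],
     "Eric": [25, 10000], "Jack": [28, 12000], "Zoe": [23, 20000]},
    index=["Age", "Salary"],
)

group_map = {"Parrot lovers": ["Doris", "Benny"],
             "Tea Drinkers": ["Doris", "Zoe"],
             "Maintainance": ["Benny", "Jack"],
             "Coffee Drinkers": ["Benny", "Eric"],
             "Senior Management": ["Doris", "Zoe", "Jack"]}

# Long-format membership table: one row per (group, person) pair,
# so a person in several groups appears on several rows.
membership = pd.DataFrame(
    [(group, person) for group, members in group_map.items() for person in members],
    columns=["group", "person"],
)

# People as rows (index: names; columns: Age, Salary)
people = df.T

# Attach each person's attributes to every group they belong to.
# People in no group (Daniel here) drop out of the inner join.
joined = membership.merge(people, left_on="person", right_index=True)

# One ordinary groupby now averages every attribute for every group.
means = joined.groupby("group")[["Age", "Salary"]].mean()
print(means)
```

This avoids looping over the categories, at the cost of duplicating a person's row once per group they belong to.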

      
