Grouping by multiple criteria in pandas

Question

I have a pandas data structure like:

>>> df
        Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000

I would like to find the average age and salary for several different groups. Each group is a subset of the columns, and the groups can overlap, as in this dictionary:

{'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],\
 'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],\
 'Senior Management': ['Doris', 'Zoe', 'Jack']}

How can I create a groupby function that will do this?

python pandas data-analysis

Zeevi 25 Aug 14 at 16:33

2 answers

There are probably several ways to do this; here is one.

Transpose your data and add a True/False column for each category:

In [20]: group_map = {'Parrot lovers': ['Doris', 'Benny'], 
                      'Tea Drinkers': ['Doris', 'Zoe'],
                      'Maintainance': ['Benny', 'Jack'], 
                      'Coffee Drinkers': ['Benny', 'Eric'], 
                      'Senior Management': ['Doris', 'Zoe', 'Jack']}
In [22]: df = df.T
In [23]: for k in group_map:
    ...:     df[k] = df.index.isin(group_map[k])

Now you can group by any category column to get the mean for members (True) and non-members (False):

In [24]: df.groupby('Parrot lovers')['Salary'].mean()
Out[24]: 
Parrot lovers
False            12750
True             51000
Name: Salary, dtype: int64

Or iterate over the categories to get the average salary of each one's members:

In [25]: means = {}
    ...: for k in group_map:
    ...:     means[k] = df.groupby(k)['Salary'].mean()[True]
    ...: means
    ...: 
Out[25]: 
{'Coffee Drinkers': 6000,
 'Maintainance': 7000,
 'Parrot lovers': 51000,
 'Senior Management': 44000,
 'Tea Drinkers': 60000}
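
The loop above averages only Salary, while the question asks for both Age and Salary. The following is a minimal sketch of one way to extend it, assuming Python 3 and a recent pandas, with the frame from the question rebuilt by hand. The names `people` and `group_means` are illustrative and do not come from the answer. For each category it keeps the rows flagged True and averages both columns at once:

```python
import pandas as pd

# Rebuild the frame from the question (people as columns), then transpose
# so that each person becomes a row, as in the answer above.
df = pd.DataFrame(
    {"Benny": [75, 2000], "Daniel": [30, 9000], "Doris": [95, 100000],
     "Eric": [25, 10000], "Jack": [28, 12000], "Zoe": [23, 20000]},
    index=["Age", "Salary"],
)
people = df.T.copy()

group_map = {"Parrot lovers": ["Doris", "Benny"],
             "Tea Drinkers": ["Doris", "Zoe"],
             "Maintainance": ["Benny", "Jack"],
             "Coffee Drinkers": ["Benny", "Eric"],
             "Senior Management": ["Doris", "Zoe", "Jack"]}

# One boolean membership column per category.
for name, members in group_map.items():
    people[name] = people.index.isin(members)

# For each category, keep the rows flagged True and average Age and Salary.
group_means = pd.DataFrame(
    {name: people.loc[people[name], ["Age", "Salary"]].mean()
     for name in group_map}
).T
print(group_means)
```

Selecting with a boolean mask skips the non-member (False) half of each groupby result, which is not needed here.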

chrisb 25 Aug 14 at 17:01

ZJS · Accepted Answer · 2014-08-25T17:05:49+0000

This is how I solved the problem:

from io import StringIO
import pandas as pd

# Rebuild the frame from the question out of its printed form.
data = """index  Benny  Daniel   Doris   Eric   Jack    Zoe
Age        75      30      95     25     28     23
Salary   2000    9000  100000  10000  12000  20000"""
df = pd.read_csv(StringIO(data), sep=r"\s+").set_index('index')
d = {'Parrot lovers': ['Doris', 'Benny'], 'Tea Drinkers': ['Doris', 'Zoe'],
     'Maintainance': ['Benny', 'Jack'], 'Coffee Drinkers': ['Benny', 'Eric'],
     'Senior Management': ['Doris', 'Zoe', 'Jack']}

For the solution, just use .loc and iterate through the dictionary:

# For each group, select its members' columns and average across them.
averages = {k: df.loc[:, v].mean(axis=1) for k, v in d.items()}
print(pd.DataFrame(averages).T)  # gives the nice printout

index                    Age  Salary
Coffee Drinkers    50.000000    6000
Maintainance       51.500000    7000
Parrot lovers      85.000000   51000
Senior Management  48.666667   44000
Tea Drinkers       59.000000   60000
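
Since the question asks specifically for a groupby, here is one possible variant that is not taken from either answer. It expands the dictionary into one row per (group, person) pair, so that a person who belongs to several groups appears once per group, and then calls an ordinary groupby. This is a sketch assuming Python 3 and a recent pandas; `pairs`, `membership` and `long_form` are illustrative names.

```python
import pandas as pd

# Frame from the question: people as columns, Age and Salary as rows.
df = pd.DataFrame(
    {"Benny": [75, 2000], "Daniel": [30, 9000], "Doris": [95, 100000],
     "Eric": [25, 10000], "Jack": [28, 12000], "Zoe": [23, 20000]},
    index=["Age", "Salary"],
)
d = {"Parrot lovers": ["Doris", "Benny"], "Tea Drinkers": ["Doris", "Zoe"],
     "Maintainance": ["Benny", "Jack"], "Coffee Drinkers": ["Benny", "Eric"],
     "Senior Management": ["Doris", "Zoe", "Jack"]}

# One row per (group, person) pair; overlapping members simply repeat.
pairs = [(group, person) for group, members in d.items() for person in members]
membership = pd.DataFrame(pairs, columns=["group", "person"])

# Look up each listed person's Age and Salary (rows repeat for people in
# several groups) and attach the matching group label to every row.
long_form = df.T.loc[membership["person"]].assign(
    group=membership["group"].to_numpy()
)

# An ordinary groupby on the label now handles the overlapping groups.
averages = long_form.groupby("group")[["Age", "Salary"]].mean()
print(averages)
```

The repeated rows are what make overlapping groups work: a plain groupby assigns each row to exactly one group, so each membership has to become its own row first.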
