Flatten hierarchically indexed pandas.DataFrame from groupby and multiple aggregation
I am grouping a dataframe by multiple columns and aggregating to get multiple statistics. How do I get a completely flat structure with every possible combination of group keys listed as rows and each statistic presented as columns?
import numpy as np
import pandas as pd
cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']
data = pd.DataFrame({
'city': np.random.choice(cities, 12),
'day': np.random.choice(days, 12),
'people': np.random.normal(loc=10, size=12),
'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg([np.mean, np.std])
This way I get:
cats people
mean std mean std
city day
Berlin Friday 6.146924 0.721263 10.445606 0.730992
Monday 5.239267 NaN 9.022811 NaN
Oslo Friday 6.322276 0.866899 11.579813 0.114341
Monday 5.028919 0.815674 10.458439 1.182689
I need it to be flat:
city day cats_mean cats_std people_mean people_std
Berlin Friday 6.146924 0.721263 10.445606 0.730992
Berlin Monday 5.239267 NaN 9.022811 NaN
Oslo Friday 6.322276 0.866899 11.579813 0.114341
Oslo Monday 5.028919 0.815674 10.458439 1.182689
+3
source to share
3 answers
In [36]: grouped.columns = grouped.columns.map('_'.join)
In [37]: grouped = grouped.reset_index()
In [38]: grouped
Out[38]:
city day cats_mean cats_std people_mean people_std
0 Berlin Friday 5.852991 1.085163 11.078541 0.839688
1 Berlin Monday 6.978343 0.630983 9.876106 1.846204
2 Oslo Friday 6.096773 1.278176 9.710216 0.691672
+5
source to share
You can do a list comprehension at the column levels and join the underscore character, then call reset_index
:
In [39]:
grouped.columns= ['_'.join(x) for x in list(zip(grouped.columns.get_level_values(0), grouped.columns.get_level_values(1)))]
grouped = grouped.reset_index()
grouped
Out[39]:
city day cats_mean cats_std people_mean people_std
0 Berlin Friday 6.140710 0.555981 10.187634 0.359724
1 Berlin Monday 6.420175 0.986568 10.134376 0.963938
2 Oslo Friday 6.978572 0.573297 11.345484 1.454762
3 Oslo Monday 4.594814 NaN 10.842988 NaN
+2
source to share
You can use a dictionary in .agg to rename your columns, then uncheck the column level and reset_index (): See SOA post
import numpy as np
import pandas as pd
cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']
data = pd.DataFrame({
'city': np.random.choice(cities, 12),
'day': np.random.choice(days, 12),
'people': np.random.normal(loc=10, size=12),
'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg({'cats':{'cats_mean':np.mean,'cats_std':np.std},'people':{'people_mean':np.mean,'people_std':np.std}})
grouped.columns = grouped.columns.droplevel()
grouped.reset_index()
city day people_mean people_std cats_std cats_mean
0 Berlin Friday 9.645190 0.699684 0.973866 6.478510
1 Berlin Monday 9.556898 0.126810 0.336654 6.624288
2 Oslo Friday 11.593491 NaN NaN 6.206595
3 Oslo Monday 10.202183 1.058651 0.657939 6.019748
+1
source to share