Pandas: join key grouping after aggregation
What do I have
I have two pandas DataFrames like these:
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0]
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7]
})
What I want
1) I want to sum both metric1 and metric2 for each combination of date and tag,
2) calculate the percentage of elements equal to 1 in metric2, and
3) combine the grouped df1 with df2 so that I also have metric3 for each date and tag:
date       | tag | metric1_sum | metric2_sum | metric2_percentage | metric3
-----------|-----|-------------|-------------|--------------------|--------
31-05-2017 | A   | 0           | 0           | 0                  | 25
31-05-2017 | B   | 1           | 2           | 0.667              | 3
01-06-2017 | A   | 2           | 1           | 0.5                | 7
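For reference, all three steps can be sketched in one chain with pandas named aggregation (available since pandas 0.25); the output column names are just the ones from the desired table:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7],
})

# Named aggregation gives flat, readable output columns directly.
summary = df1.groupby(['date', 'tag'], as_index=False).agg(
    metric1_sum=('metric1', 'sum'),
    metric2_sum=('metric2', 'sum'),
    metric2_percentage=('metric2', 'mean'),  # fraction of 1s, since metric2 is 0/1
)
result = summary.merge(df2, on=['date', 'tag'], how='left')
```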
Attempts
(1) grouping and summing works
>>> g = df1.groupby(['date', 'tag']).agg('sum')
>>> g
metric1 metric2
date tag
01-06-2017 A 2 1
31-05-2017 A 0 0
B 1 2
(2) calculating the percentage works, but adding it as a column doesn't
I used the method shown here to calculate the percentage.
>>> g2 = df1.groupby(['date', 'tag']).agg({'metric2': 'sum'})
>>> g2.groupby(level=0).apply(lambda x: x/float(x.sum()))
metric2
date tag
01-06-2017 A 1.0
31-05-2017 A 0.0
B 1.0
But how do I now assign this grouped metric2 as a metric2_percentage column in my grouped frame g or in my original df1?
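As an aside on that sub-question: broadcasting a per-group statistic back onto the original rows is what transform is for. A sketch, assuming metric2_percentage means the fraction of 1s per (date, tag) group:

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})

# transform returns a result aligned to df1's original index,
# so it can be assigned directly as a new column.
df1['metric2_percentage'] = df1.groupby(['date', 'tag'])['metric2'].transform('mean')
```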
(3) failed merge
Group merge doesn't seem to work:
>>> pd.merge(g, df2, how='left', on=['date', 'tag'])
KeyError: 'date'
How can I then reduce df1 to one row per group so that I can combine it with df2?
g has date and tag as its index, while merge expects columns, so you need to call reset_index on g:
pd.merge(g.reset_index(), df2, how='left', on=['date', 'tag'])
Or specify left_index=True:
pd.merge(g, df2, how='left', left_index=True, right_on=['date', 'tag'])
Both give the same result (with slightly different column order):
# date tag metric1 metric2 metric3
#0 01-06-2017 A 2 1 7
#1 31-05-2017 A 0 0 25
#2 31-05-2017 B 1 2 3
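A runnable version of the reset_index variant, using the question's frames (with the string 'sum' rather than the builtin sum, which current pandas no longer accepts):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '31-05-2017', '31-05-2017', '01-06-2017', '01-06-2017'],
    'tag': ['A', 'B', 'B', 'B', 'A', 'A'],
    'metric1': [0, 0, 0, 1, 1, 1],
    'metric2': [0, 1, 1, 0, 1, 0],
})
df2 = pd.DataFrame({
    'date': ['31-05-2017', '31-05-2017', '01-06-2017'],
    'tag': ['A', 'B', 'A'],
    'metric3': [25, 3, 7],
})

# Aggregate, move the (date, tag) index back into columns, then merge.
g = df1.groupby(['date', 'tag']).agg('sum')
merged = pd.merge(g.reset_index(), df2, how='left', on=['date', 'tag'])
```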
Here's an alternative that does the whole job in a single chain:
(df1.groupby(['date', 'tag']).apply(
lambda g: pd.Series({'metric1_sum': g.metric1.sum(),
'metric2_sum': g.metric2.sum(),
'metric2_percentage': g.metric2.mean()})
    # assumes metric2 contains only 0s and 1s; otherwise supply your own lambda
).reset_index().merge(df2, how='left', on=['date', 'tag']))
# date tag metric1_sum metric2_percentage metric2_sum metric3
#0 01-06-2017 A 2.0 0.500000 1.0 7
#1 31-05-2017 A 0.0 0.000000 0.0 25
#2 31-05-2017 B 1.0 0.666667 2.0 3
Use agg. The mean of ones and zeros is the same as the percentage.
cols = ['date', 'tag']
d1 = df1.groupby(cols).agg(
dict(metric1='sum', metric2=['sum', 'mean'])
)
d1.columns = d1.columns.map('_'.join)
d1.join(df2.set_index(cols)).reset_index()
date tag metric1_sum metric2_sum metric2_mean metric3
0 01-06-2017 A 2 1 0.500000 7
1 31-05-2017 A 0 0 0.000000 25
2 31-05-2017 B 1 2 0.666667 3
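The mean-as-percentage trick works because for a column of 0s and 1s, sum divided by count is exactly the fraction of ones:

```python
import pandas as pd

# The (31-05-2017, B) group from the question: two 1s out of three values.
df = pd.DataFrame({'tag': ['B', 'B', 'B'], 'metric2': [1, 1, 0]})
pct = df.groupby('tag')['metric2'].mean()  # 2 / 3
```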
An over-engineered one-liner:
from collections import OrderedDict
df1.groupby(['date', 'tag']).agg(
dict(metric1='sum', metric2=['sum', 'mean'])
).pipe(
    lambda d: pd.DataFrame(OrderedDict({'_'.join(k): v for k, v in d.items()}))
).join(df2.set_index(['date', 'tag'])).reset_index()
date tag metric1_sum metric2_sum metric2_mean metric3
0 01-06-2017 A 2 1 0.500000 7
1 31-05-2017 A 0 0 0.000000 25
2 31-05-2017 B 1 2 0.666667 3