Superman level - reshaping a Pandas DataFrame with duplicates
Do you like riddles that only superhumans can solve? This is the final test to prove this ability.
One company may receive several rounds of funding (seed, a) from several banks, possibly at different times.
Look at the data first, then at the backstory below for a better picture.
import pandas as pd
data = {'id':[1,2,2,3,4],'company':['alpha','beta','beta','alpha','alpha'],'bank':['z', 'x', 'y', 'z', 'j'],
'rd': ['seed', 'seed', 'seed', 'a', 'a'], 'funding': [100, 200, 200, 300, 50],
'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns = ['id','company', 'rd', 'bank', 'funding', 'date'])
df
Output:
id company rd bank funding date
0 1 alpha seed z 100 2006-12-01
1 2 beta seed x 200 2004-09-01
2 2 beta seed y 200 2004-09-01
3 3 alpha a z 300 2007-05-01
4 4 alpha a j 50 2007-09-01
Desired output:
company bank_seed funding_seed date_seed bank_a funding_a date_a
0 alpha z 100 2006-12-01 [z,j] 350 2007-09-01
1 beta [x,y] 200 2004-09-01 None None None
As you can see, I am not superhuman, but I will try to explain my thought process.
Look at alpha
Alpha first received its $100 of seed money from bank z in late 2006. A few months later, its investors were very happy with the progress, so bank z gave it more money (another $300!). However, alpha needed a little more, and had to go to some random Swiss bank j to stay alive. Bank j reluctantly gave another $50. Hooray! Alpha now has $350 for round a, as of September 2007.
Company beta is fairly new. It received funding totaling $200 from two different banks. But wait ... there is nothing about its round "a" here. That's fine: we will set it to None for now and come back to it later.
The problem is that alpha went and got money from the Swiss bank ... Here is my code, which worked on a subset of my data but does not work here.
import itertools
unique_company = df.company.unique()
df_indexed = df.set_index(['company', 'rd'])
index = pd.MultiIndex.from_tuples(list(itertools.product(unique_company, list(df.rd.unique()))))
reindexed = df_indexed.reindex(index, fill_value=0)
reindexed = reindexed.unstack().applymap(lambda cell: 0 if '1970-01-01' in str(cell) else cell)
working_df = pd.DataFrame(reindexed.iloc[:,
reindexed.columns.get_level_values(0).isin(['company', 'funding'])].to_records())
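One likely reason the reindex step fails on this data (an assumption, since no traceback is given here) is that after set_index the (company, rd) labels are no longer unique, and reindex refuses to work on an axis with duplicate labels. A minimal check that rebuilds the same frame; the names index_is_unique and dupes are only for illustration:

```python
import pandas as pd

data = {
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
}
df = pd.DataFrame(data)

df_indexed = df.set_index(['company', 'rd'])

# Two banks share (beta, seed) and two share (alpha, a),
# so the index is not unique.
index_is_unique = df_indexed.index.is_unique

# Collect the (company, rd) pairs that occur more than once.
dupes = df_indexed.index[df_indexed.index.duplicated(keep=False)].unique()

print(index_is_unique)
print(list(dupes))
```

If that is the cause, the duplicates have to be aggregated (for example with groupby) before any reindex or unstack.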
If you know how to solve even part of this problem, go ahead and post it below. Thanks in advance for taking the time to look at this! :)
Finally, if you want to see my code run, do the following first, but you will lose valuable information ...
df = df.drop_duplicates(subset='id')
df = df.drop_duplicates(subset='rd')
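To make the loss concrete, here is a small sketch of what those two drop_duplicates calls keep on the sample data. The names step1 and step2 are only for illustration:

```python
import pandas as pd

data = {
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
}
df = pd.DataFrame(data)

# First call: keeps one row per id, so beta's second bank (y) is dropped.
step1 = df.drop_duplicates(subset='id')

# Second call: keeps only the first row of each round,
# regardless of which company it belongs to.
step2 = step1.drop_duplicates(subset='rd')

print(step2)
```

Only one row per round survives, and both belong to alpha, so beta and bank j disappear entirely.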
Add a preprocessing step that splits the funding evenly across the records that share the same 'id' and 'date':
df['funding'] = df['funding'] / df.groupby(['id', 'date'])['funding'].transform('count')
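As a quick sanity check of this step, the sketch below writes the split amounts to a separate column (funding_split, a name used here only for illustration) and sums them per company and round:

```python
import pandas as pd

data = {
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
}
df = pd.DataFrame(data)

# Number of rows sharing each (id, date) pair, aligned to the original rows.
counts = df.groupby(['id', 'date'])['funding'].transform('count')

# Each duplicated record now carries an equal share of the amount.
df['funding_split'] = df['funding'] / counts

# Summing the shares no longer double counts beta's seed round.
totals = df.groupby(['company', 'rd'])['funding_split'].sum()
print(totals)
```

The point of the split is that a later 'sum' aggregation counts beta's seed money once instead of once per bank.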
Then group, aggregate and unstack:
d1 = df.groupby(['company', 'rd']).agg(
dict(bank=lambda x: tuple(x), funding='sum', date='last')
).unstack().sort_index(axis=1, level=1)
d1
           bank funding       date    bank funding       date
rd            a       a          a    seed    seed       seed
company
alpha    (z, j)   350.0 2007-09-01    (z,)   100.0 2006-12-01
beta       None     NaN        NaT  (x, y)   200.0 2004-09-01
Finally, flatten the two column levels into names such as bank_a and funding_seed:
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
Grouping, aggregating and unstacking will get you close to what you want:
res = df.groupby(['company', 'rd']).agg({'bank': lambda x: ','.join(x), 'funding': 'sum', 'date': 'max'}).unstack().reset_index()
res.columns = ['_'.join(col).strip() for col in res.columns.values]
You get
company_ bank_a bank_seed funding_a funding_seed date_a date_seed
0 alpha z,j z 350.0 100.0 2007-09-01 2006-12-01
1 beta None x,y NaN 400.0 None 2004-09-01
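Putting the pieces together, here is a sketch that gets closer to the column layout asked for in the question. It assumes funding repeated under the same id and date should be counted once, stores banks as lists, and orders the columns seed first, then a. The names wide and order are only for illustration:

```python
import pandas as pd

data = {
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

# Split amounts that are repeated once per bank under the same id and date.
df['funding'] = df['funding'] / df.groupby(['id', 'date'])['funding'].transform('count')

# One row per (company, rd): list of banks, total funding, latest date.
wide = (
    df.groupby(['company', 'rd'])
      .agg(bank=('bank', lambda s: list(s)),
           funding=('funding', 'sum'),
           date=('date', 'max'))
      .unstack('rd')
)

# Put the seed columns first, then round a, as in the desired output.
order = [(stat, rnd) for rnd in ['seed', 'a'] for stat in ['bank', 'funding', 'date']]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))

# Flatten ('bank', 'seed') into 'bank_seed' and restore company as a column.
wide.columns = [f'{stat}_{rnd}' for stat, rnd in wide.columns]
wide = wide.reset_index()
print(wide)
```

Rounds a company never had come out as missing values (NaN or NaT) rather than None, and a single bank is still wrapped in a list, so some post-processing is needed if the exact desired output matters.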