Superhuman level - reshaping a pandas DataFrame with duplicates

Do you like riddles that only superhumans can solve? This is the final test to prove this ability.

A company may receive several rounds of funding (seed, a) from several banks, possibly at different times.

Look at the data, then at the backstory for a better picture.

import pandas as pd
data = {'id':[1,2,2,3,4],'company':['alpha','beta','beta','alpha','alpha'],'bank':['z', 'x', 'y', 'z', 'j'], 
    'rd': ['seed', 'seed', 'seed', 'a', 'a'], 'funding': [100, 200, 200, 300, 50],
   'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns = ['id','company', 'rd', 'bank', 'funding', 'date'])
df

      

Output:

   id  company        rd   bank    funding        date
0   1    alpha      seed      z        100  2006-12-01
1   2     beta      seed      x        200  2004-09-01
2   2     beta      seed      y        200  2004-09-01
3   3    alpha         a      z        300  2007-05-01
4   4    alpha         a      j         50  2007-09-01

      

Desired output:

   company     bank_seed   funding_seed      date_seed    bank_a  funding_a      date_a 
0    alpha             z            100     2006-12-01     [z,j]        350  2007-09-01
1     beta         [x,y]            200     2004-09-01      None       None        None

      

As you can see, I am not superhuman, but I will try to explain my thought process.

Look at alpha

Alpha first received its $100 seed money from bank z in late 2006. A few months later, its investors were very happy with the progress, so bank z gave it more money (another $300!). However, alpha needed a bit more and had to go to some random Swiss bank j to stay alive. Bank j reluctantly gave another $50. Hooray! Alpha now has $350 for its round a, dated September 2007.

Company beta is fairly new. It received funding totaling $200 from two different banks. But wait ... there is nothing here about its round "a". That's fine: we set it to None for now and come back to it later.

Problem is, company alpha sucks and took money from the Swiss bank ... Here is my code; it worked on a subset of my data, but it won't work here.

import itertools

unique_company = df.company.unique()
df_indexed = df.set_index(['company', 'rd'])
index = pd.MultiIndex.from_tuples(list(itertools.product(unique_company, list(df.rd.unique()))))
reindexed = df_indexed.reindex(index, fill_value=0)

reindexed = reindexed.unstack().applymap(lambda cell: 0 if '1970-01-01' in str(cell) else cell)

working_df = pd.DataFrame(reindexed.iloc[:, 
reindexed.columns.get_level_values(0).isin(['company', 'funding'])].to_records())
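A quick way to see why this breaks on the full data: `reindex` generally needs unique index labels, and here some (company, rd) pairs occur more than once. The sketch below rebuilds the sample frame and lists those pairs; the names `dupes` and `pair_counts` are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
})

# Rows whose (company, rd) pair appears more than once
dupes = df[df.duplicated(['company', 'rd'], keep=False)]
print(dupes)

# Number of rows per (company, rd) pair
pair_counts = df.groupby(['company', 'rd']).size()
print(pair_counts)
```

Any pair with a count above 1 would become a repeated label after `set_index(['company', 'rd'])`, so those rows probably need to be aggregated before reshaping.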

      

If you know how to solve even part of this problem, please post it below. Thanks in advance for taking the time to look at this! :)

Finally, if you want to see my code work, run this first, but you will lose valuable information ...

 df = df.drop_duplicates(subset='id')
 df = df.drop_duplicates(subset='rd')

      

2 answers


Add a preprocessing step that splits the funding across the records sharing the same 'id' and 'date':

df.funding /= df.groupby(['id', 'date']).funding.transform('count')
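To make the effect of that division visible, here is a small sketch on the question's sample data. It writes the result to a separate, hypothetical column `funding_share` instead of overwriting `funding`, purely so both values can be compared side by side.

```python
import pandas as pd

# Only the columns needed for this step, taken from the question's sample data
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
})

# How many rows share each (id, date) pair, aligned with the original rows
n = df.groupby(['id', 'date'])['funding'].transform('count')

# Each row keeps an equal share of the funding reported for its (id, date) pair
df['funding_share'] = df['funding'] / n
print(df)
```

The two rows with id 2 each end up with 100, so summing them later gives 200 instead of 400.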

      



Then process:

d1 = df.groupby(['company', 'rd']).agg(
    dict(bank=lambda x: tuple(x), funding='sum', date='last')
).unstack().sort_index(axis=1, level=1)

d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)

d1


           bank funding       date    bank funding       date
rd            a       a          a    seed    seed       seed
company                                                      
alpha    (z, j)   350.0 2007-09-01    (z,)   100.0 2006-12-01
beta       None     NaN        NaT  (x, y)   200.0 2004-09-01

      



Groupby, aggregation, and unstack will get you close to what you want

out = df.groupby(['company', 'rd']).agg({'bank': lambda x: ','.join(x), 'funding': 'sum', 'date': 'max'}).unstack().reset_index()

out.columns = ['_'.join(col).strip() for col in out.columns.values]

      



You get

    company_    bank_a  bank_seed   funding_a   funding_seed  date_a    date_seed
0   alpha       z,j     z           350.0       100.0         2007-09-01 2006-12-01
1   beta        None    x,y         NaN         400.0         None        2004-09-01
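For reference, here is a sketch that combines the funding split with the groupby/unstack idea and orders the columns as in the desired output. It assumes the round column is called 'rd' as in the question's data, converts 'date' to datetime, and keeps banks as tuples; `wide` and `order` are illustrative names only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4],
    'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
    'bank': ['z', 'x', 'y', 'z', 'j'],
    'rd': ['seed', 'seed', 'seed', 'a', 'a'],
    'funding': [100, 200, 200, 300, 50],
    'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01'],
})
df['date'] = pd.to_datetime(df['date'])

# Split funding across rows that share the same (id, date)
df['funding'] = df['funding'] / df.groupby(['id', 'date'])['funding'].transform('count')

# One row per (company, rd), then pivot the rounds into columns
wide = (
    df.groupby(['company', 'rd'])
      .agg(bank=('bank', lambda s: tuple(s)),
           funding=('funding', 'sum'),
           date=('date', 'max'))
      .unstack('rd')
)

# Put seed columns first, then round a, each as bank / funding / date
order = [(field, rd) for rd in ['seed', 'a'] for field in ['bank', 'funding', 'date']]
wide = wide.reindex(columns=pd.MultiIndex.from_tuples(order))

# Flatten the two column levels into names like 'bank_seed'
wide.columns = [f'{field}_{rd}' for field, rd in wide.columns]
wide = wide.reset_index()
print(wide)
```

With the split applied first, beta's seed funding sums to 200 rather than 400, and its missing round a shows up as missing values.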

      
