Pandas: extending DataFrame by number of observations in a column

Stata has an expand command that duplicates each observation according to the value in a specified column. For example:

I have:

df = pd.DataFrame({"A":[1, 2, 3], 
                   "B":[3,4,5]})

   A  B
0  1  3
1  2  4
2  3  5

      

What I need:

df2 = pd.DataFrame({"A":[1, 2, 3, 2, 3, 3], 
                    "B":[3,4,5, 4, 5, 5]})

   A  B
0  1  3
1  2  4
2  3  5
3  2  4
4  3  5
5  3  5

      

The value in df.loc[0, 'A'] is 1, so no extra row is added to the end of the DataFrame: the row with B = 3 should appear only once.

The value in df.loc[1, 'A'] is 2, so one copy is appended to the end of the DataFrame, bringing the total count of rows with B = 4 to 2.

The value in df.loc[2, 'A'] is 3, so two copies are appended to the end of the DataFrame, bringing the total count of rows with B = 5 to 3.

I have looked through the previous questions for something to get me started, but no luck. Any help is appreciated.


1 answer


There are many possibilities built around np.repeat:

def using_reindex(df):
    return df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)

def using_dictcomp(df):
    return  pd.DataFrame({col:np.repeat(df[col].values, df['A'], axis=0) 
                          for col in df})

def using_df_values(df):
    return pd.DataFrame(np.repeat(df.values, df['A'], axis=0), columns=df.columns)

def using_loc(df):
    return df.loc[np.repeat(df.index.values, df['A'])].reset_index(drop=True)
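As a quick sanity check (this snippet is mine, not part of the original answer; the frame is the one from the question), the reindex-based and dict-comprehension-based variants agree on the small input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# np.repeat(df.index, df['A']) -> [0, 1, 1, 2, 2, 2]: each index label is
# repeated as many times as that row's value in column A.
expanded = df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
via_dict = pd.DataFrame({col: np.repeat(df[col].values, df['A'])
                         for col in df})

assert expanded.equals(via_dict)
assert len(expanded) == df['A'].sum()   # 1 + 2 + 3 = 6 rows
```

The only visible difference between the variants is ordering: these repeat each row in place, whereas the question's expected output appends the copies at the end. If the exact Stata-style ordering matters, a sort or concat step would be needed afterwards.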

      


For example,

In [219]: df = pd.DataFrame({"A":[1, 2, 3], "B":[3,4,5]})
In [220]: df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
Out[220]: 
   A  B
0  1  3
1  2  4
2  2  4
3  3  5
4  3  5
5  3  5

      


Here is a benchmark on a 1000-row DataFrame; the result is a DataFrame of about 500K rows:

In [208]: df = make_dataframe(1000)

In [210]: %timeit using_dictcomp(df)
10 loops, best of 3: 23.6 ms per loop

In [218]: %timeit using_reindex(df)
10 loops, best of 3: 35.8 ms per loop

In [211]: %timeit using_df_values(df)
10 loops, best of 3: 31.3 ms per loop

In [212]: %timeit using_loc(df)
1 loop, best of 3: 275 ms per loop

      



This is the code I used to create df:

import numpy as np
import pandas as pd

def make_dataframe(nrows=100):
    df = pd.DataFrame(
        {'A': np.arange(nrows),
         'float': np.random.randn(nrows),
         'str': np.random.choice('Lorem ipsum dolor sit'.split(), size=nrows),
         'datetime64': pd.date_range('20000101', periods=nrows)},
        index=pd.date_range('20000101', periods=nrows))
    return df

df = make_dataframe(1000)

      


If there are only a few columns, using_dictcomp is the fastest. But note that using_dictcomp assumes df has unique column names: since the keys of a dict are unique, the dict comprehension cannot reproduce duplicated column names. The other alternatives, however, do work with duplicate column names.

Both using_reindex and using_loc assume that df has a unique index.
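To illustrate why the unique-index assumption matters, here is a sketch (my addition; the exact error message varies by pandas version, but recent versions raise a ValueError when reindexing against duplicate labels):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated index label.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 0])

# reindex refuses to resolve duplicate labels...
try:
    df.reindex(np.repeat(df.index, df['A']))
    raise AssertionError("expected reindex to fail on a duplicate index")
except ValueError:
    pass

# ...and loc silently over-selects: each repeated label matches BOTH rows.
out = df.loc[np.repeat(df.index.values, df['A'])]
assert len(out) == 6   # not the 3 rows the expansion intended
```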


using_reindex came from cᴏʟᴅsᴘᴇᴇᴅ's using_loc in a post that is (unfortunately) now deleted. cᴏʟᴅsᴘᴇᴇᴅ showed that there is no need to manually iterate over all the values: you only need to repeat the index and then let df.loc (or df.reindex) repeat all the rows for you. This also avoids accessing df.values, which can generate an intermediate NumPy array of object dtype if df contains columns of multiple dtypes.
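The object-dtype point can be seen directly (my sketch, using a hypothetical mixed-dtype frame):

```python
import numpy as np
import pandas as pd

# Mixing an int column with a string column...
df = pd.DataFrame({"A": [1, 2], "s": ["x", "y"]})

# ...forces .values into a single object-dtype array,
assert df.values.dtype == object

# while the reindex route repeats rows column by column and keeps dtypes.
out = df.reindex(np.repeat(df.index, df['A'])).reset_index(drop=True)
assert out['A'].dtype == df['A'].dtype
```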
