Pandas - aggregate column part to new value in new column

Question

I have a large pandas dataframe df with inventory data showing the number of items received.

The relevant part of the structure looks like this:

Date         SKU    received
2017-05-29   sku1   0
2017-05-30   sku1   0
2017-05-31   sku1   0
2017-06-01   sku1   0
2017-06-02   sku1   6
2017-06-03   sku1   2
2017-05-29   sku2   4
2017-05-30   sku2   4
2017-05-31   sku2   0
2017-06-01   sku2   0
2017-06-02   sku2   0
2017-06-03   sku2   24
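
To reproduce the sample, the table above could be built as a DataFrame along these lines. This is only a minimal sketch: it assumes Date is stored as datetime64 and received as integers, which may not match the real data.

```python
import pandas as pd

# Sample data copied from the table above: one row per SKU per day.
dates = pd.date_range("2017-05-29", "2017-06-03", freq="D")
df = pd.DataFrame({
    "Date": list(dates) * 2,                      # same six days for each SKU
    "SKU": ["sku1"] * 6 + ["sku2"] * 6,
    "received": [0, 0, 0, 0, 6, 2, 4, 4, 0, 0, 0, 24],
})
print(df.dtypes)
```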

From this I would like to reconstruct the ordering process. I know that the stock level is checked on Mondays and that new orders are placed based on it. Orders arrive at the warehouse about a week later, sometimes split into several shipments.

I was thinking of creating an extra column for weekdays (df["Weekday"]) and one for placed orders (df["Order"]). Depending on the weekday, I would like to sum the "received" column over the window from 4 to 11 days later, restricted to the corresponding SKU.

The result might look like this:

Date         SKU    received    Weekday    Order
2017-05-29   sku1   0           0          8
2017-05-30   sku1   0           1          0
2017-05-31   sku1   0           2          0  
2017-06-01   sku1   0           3          0
2017-06-02   sku1   6           4          0
2017-06-03   sku1   2           5          0
2017-05-29   sku2   4           0          24
2017-05-30   sku2   4           1          0
2017-05-31   sku2   0           2          0
2017-06-01   sku2   0           3          0
2017-06-02   sku2   0           4          0
2017-06-03   sku2   24          5          0

Here is the code I tried:

import pandas as pd

# 0 is Monday, 1 is Tuesday, etc
df["Weekday"] = df["Date"].dt.dayofweek

# create new column for the orders
df["Order"] = 0

min_days = 4
max_days = min_days + 7

for i in range(len(df)):
    if df.loc[i, "Weekday"] == 0:
        df.loc[i, "Order"] = df.loc[(df.Date >= df.loc[i, "Date"] + pd.to_timedelta(min_days, unit="D")) &
                                    (df.Date < df.loc[i, "Date"] + pd.to_timedelta(max_days, unit="D")) &
                                    (df.SKU == df.loc[i, "SKU"]), "received"].sum()

It seems to do the job, but slowly. Maybe someone can help me find a more pythonic / pandas approach to save some computation time.

Thank you for your help.

python pandas

Axel, June 20, 2017 at 8:58

1 answer

FLab · Accepted Answer · 2017-06-20T12:14:19+0000

Here is a possible solution that uses pandas groupby and transform.

The key idea is that you can get the sum between two day offsets by taking the difference of two rolling sums. Also note the trick of reversing the order ([::-1]) twice, so that the rolling window looks forward in time instead of backward.

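def count_between(ts, min_days, max_days):
    # Forward sum over max_days rows minus forward sum over min_days rows: offsets min_days..max_days-1, as in the question's loop.
    fwd = lambda n: ts[::-1].rolling(n, min_periods=1).sum()[::-1]
    return fwd(max_days) - fwd(min_days)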
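
To see why the double reversal works, here is a small illustration on a toy series (the values and the names backward, forward, wide and narrow are made up for this sketch). Note that it counts rows, not calendar days, so it only matches a date window when each SKU has exactly one row per day.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Ordinary rolling sum looks backward: each row sums itself and the previous row.
backward = s.rolling(2, min_periods=1).sum()

# Reverse, roll, reverse again: each row now sums itself and the next row.
forward = s[::-1].rolling(2, min_periods=1).sum()[::-1]

print(backward.tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0]
print(forward.tolist())   # [3.0, 5.0, 7.0, 9.0, 5.0]

# Difference of two forward sums: rows at offsets 2 and 3 from the current row.
wide = s[::-1].rolling(4, min_periods=1).sum()[::-1]
narrow = forward
print((wide - narrow).tolist())  # [7.0, 9.0, 5.0, 0.0, 0.0]
```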

This function returns a result for every day, so you restrict the results to Mondays by setting all other rows to 0 (using .where).
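
As a quick illustration of the masking step, here is a sketch on made-up values with a daily DatetimeIndex (the names idx, values and mondays_only are just for this example):

```python
import pandas as pd

idx = pd.date_range("2017-05-29", periods=8, freq="D")  # 2017-05-29 is a Monday
values = pd.Series(range(1, 9), index=idx)

# Keep values where the condition is True, replace everything else with 0.
mondays_only = values.where(values.index.dayofweek == 0, other=0)
print(mondays_only.tolist())  # [1, 0, 0, 0, 0, 0, 0, 8]
```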

Once Date is set as the index, you can do the following:

order = df.groupby('SKU')\
          .transform(lambda x: count_between(x, min_days, max_days)\
                               .where(lambda y: y.index.dayofweek==0, other = 0))
order.columns = ['Order']

This gives the expected output:

pd.concat([df, order], axis = 1)
Out[319]: 
             SKU  received  Order
Date                             
2017-05-29  sku1         0    8.0
2017-05-30  sku1         0    0.0
2017-05-31  sku1         0    0.0
2017-06-01  sku1         0    0.0
2017-06-02  sku1         6    0.0
2017-06-03  sku1         2    0.0
2017-05-29  sku2         4   24.0
2017-05-30  sku2         4    0.0
2017-05-31  sku2         0    0.0
2017-06-01  sku2         0    0.0
2017-06-02  sku2         0    0.0
2017-06-03  sku2        24    0.0
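
If you want to convince yourself that the vectorized version agrees with the loop from the question, something along these lines could be used. It is only a sketch on made-up data: check_df, forward_sum, orders_vectorized and orders_loop are hypothetical names, the narrower window is taken as min_days rows so that the result covers offsets min_days through max_days - 1 as in the question's loop, and it assumes one row per SKU per day with no gaps.

```python
import pandas as pd

min_days, max_days = 4, 11

# Three weeks of made-up daily data for two SKUs (illustrative values only).
dates = pd.date_range("2017-05-29", periods=21, freq="D")
check_df = pd.DataFrame({
    "Date": list(dates) * 2,
    "SKU": ["sku1"] * 21 + ["sku2"] * 21,
    "received": list(range(21)) + list(range(0, 42, 2)),
})

def forward_sum(ts, n):
    # Sum of the current row and the next n - 1 rows.
    return ts[::-1].rolling(n, min_periods=1).sum()[::-1]

def orders_vectorized(frame):
    indexed = frame.set_index("Date")
    window = indexed.groupby("SKU")["received"].transform(
        lambda x: forward_sum(x, max_days) - forward_sum(x, min_days)
    )
    # Zero out every row that is not a Monday.
    return window.where(indexed.index.dayofweek == 0, other=0).to_numpy()

def orders_loop(frame):
    # Same logic as the question's loop, written row by row.
    out = []
    for _, row in frame.iterrows():
        if row["Date"].dayofweek != 0:
            out.append(0)
            continue
        start = row["Date"] + pd.Timedelta(days=min_days)
        end = row["Date"] + pd.Timedelta(days=max_days)
        mask = (
            (frame["Date"] >= start)
            & (frame["Date"] < end)
            & (frame["SKU"] == row["SKU"])
        )
        out.append(frame.loc[mask, "received"].sum())
    return out

vectorized = list(orders_vectorized(check_df))
looped = orders_loop(check_df)
print(vectorized == looped)
```

On real data with missing days the row-based windows would drift from the calendar windows, so reindexing each SKU to a complete daily range first would probably be needed.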
