Pandas - aggregate part of a column into a new value in a new column
I have a large pandas DataFrame df with inventory data showing the number of items received.
The relevant part of its structure looks like this:
Date SKU received
2017-05-29 sku1 0
2017-05-30 sku1 0
2017-05-31 sku1 0
2017-06-01 sku1 0
2017-06-02 sku1 6
2017-06-03 sku1 2
2017-05-29 sku2 4
2017-05-30 sku2 4
2017-05-31 sku2 0
2017-06-01 sku2 0
2017-06-02 sku2 0
2017-06-03 sku2 24
From here I would like to reconstruct the ordering process. I know that the stock level is checked on Mondays and that new orders are placed based on it. Orders arrive at the warehouse about a week later, sometimes split into several shipments.
I was thinking of creating an extra column for weekdays (df["Weekday"]) and one for placed orders (df["Order"]). Depending on the weekday, I would like to aggregate the "received" column over the following 4 to 11 days, restricted to the corresponding SKU.
The result might look like this:
Date SKU received Weekday Order
2017-05-29 sku1 0 0 8
2017-05-30 sku1 0 1 0
2017-05-31 sku1 0 2 0
2017-06-01 sku1 0 3 0
2017-06-02 sku1 6 4 0
2017-06-03 sku1 2 5 0
2017-05-29 sku2 4 0 24
2017-05-30 sku2 4 1 0
2017-05-31 sku2 0 2 0
2017-06-01 sku2 0 3 0
2017-06-02 sku2 0 4 0
2017-06-03 sku2 24 5 0
Here is the code I tried:
import pandas as pd
# 0 is Monday, 1 is Tuesday, etc
df["Weekday"] = df["Date"].dt.dayofweek
# create new column for the orders
df["Order"] = 0
min_days = 4
max_days = min_days + 7
for i in range(len(df)):
    if df.loc[i, "Weekday"] == 0:
        df.loc[i, "Order"] = df.loc[(df.Date >= df.loc[i, "Date"] + pd.to_timedelta(min_days, unit="D")) &
                                    (df.Date < df.loc[i, "Date"] + pd.to_timedelta(max_days, unit="D")) &
                                    (df.SKU == df.loc[i, "SKU"]), "received"].sum()
It seems to do the job, but slowly. Maybe someone can help me find a more Pythonic, pandas-style approach to save some computation time.
Thank you for your help.
Here is a possible solution that uses pandas groupby and transform.
The first idea is that you can get the sum over a window between two day offsets by taking the difference between two rolling sums. Also note the trick of reversing the order ([::-1]) twice, so that the rolling window looks forward into the future instead of back into the past.
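def count_between(ts, min_days, max_days):
    # Reversed rolling sums look forward: offsets 0..max_days-1 minus offsets
    # 0..min_days-1 leaves min_days..max_days-1 (assumes one row per day per SKU).
    return ts[::-1].pipe(lambda y: y.rolling(max_days, 1).sum() - y.rolling(min_days, 1).sum())[::-1]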
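To see why the difference of two reversed rolling sums gives a forward-looking window, here is a minimal sketch on a toy series. The series values and the helper name forward_sum are made up for illustration, and the sketch assumes one row per consecutive day.

```python
import pandas as pd

# Hypothetical toy data: one value per consecutive day.
s = pd.Series([1, 2, 3, 4, 5],
              index=pd.date_range("2017-05-29", periods=5, freq="D"))

def forward_sum(series, n):
    """Sum of the current row and the next n-1 rows (fewer near the end)."""
    # Reverse, take a normal backward-looking rolling sum, reverse back.
    return series.iloc[::-1].rolling(n, min_periods=1).sum().iloc[::-1]

# Offsets 0..2 minus offsets 0..0 leaves the rows 1 and 2 days ahead.
window = forward_sum(s, 3) - forward_sum(s, 1)
print(window.tolist())  # [5.0, 7.0, 9.0, 5.0, 0.0]
```

The first row gets 2 + 3 = 5, and the last row gets 0 because nothing lies ahead of it.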
This function returns a result for every day, so restrict the results to Mondays by setting all other rows to 0 (using .where).
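As a small illustration of the masking step in isolation, the sketch below keeps values on Mondays and replaces everything else with 0. The series daily is hypothetical; only the dates follow the question's sample.

```python
import pandas as pd

# Hypothetical daily values; 2017-05-29 is a Monday, as in the question.
idx = pd.date_range("2017-05-29", periods=8, freq="D")
daily = pd.Series(range(1, 9), index=idx)

# Keep rows where the condition is True, replace the rest with 0.
mondays_only = daily.where(daily.index.dayofweek == 0, other=0)
print(mondays_only.tolist())  # [1, 0, 0, 0, 0, 0, 0, 8]
```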
Once Date is set as the index, you can do the following:
order = df.groupby('SKU')\
          .transform(lambda x: count_between(x, min_days, max_days)\
                                   .where(lambda y: y.index.dayofweek==0, other = 0))
order.columns = ['Order']
This gives the expected output:
pd.concat([df, order], axis = 1)
Out[319]:
SKU received Order
Date
2017-05-29 sku1 0 8.0
2017-05-30 sku1 0 0.0
2017-05-31 sku1 0 0.0
2017-06-01 sku1 0 0.0
2017-06-02 sku1 6 0.0
2017-06-03 sku1 2 0.0
2017-05-29 sku2 4 24.0
2017-05-30 sku2 4 0.0
2017-05-31 sku2 0 0.0
2017-06-01 sku2 0 0.0
2017-06-02 sku2 0 0.0
2017-06-03 sku2 24 0.0
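For anyone who wants to check the approach end to end, here is a self-contained sketch that rebuilds the sample data from the question and compares a row-by-row loop with a groupby/transform version. The helper names loop_orders and forward_window_sum are mine, the window covers the rows min_days to max_days - 1 days ahead (the bounds of the loop in the question), and the vectorized part assumes exactly one row per day per SKU with no gaps.

```python
import pandas as pd

dates = pd.date_range("2017-05-29", "2017-06-03", freq="D")
df = pd.DataFrame({
    "Date": list(dates) * 2,
    "SKU": ["sku1"] * 6 + ["sku2"] * 6,
    "received": [0, 0, 0, 0, 6, 2, 4, 4, 0, 0, 0, 24],
})
min_days, max_days = 4, 11

def loop_orders(frame):
    """Row-by-row reference version, following the loop in the question."""
    out = pd.Series(0, index=frame.index)
    for i in frame.index:
        day = frame.loc[i, "Date"]
        if day.dayofweek == 0:  # Monday
            mask = ((frame["Date"] >= day + pd.Timedelta(days=min_days))
                    & (frame["Date"] < day + pd.Timedelta(days=max_days))
                    & (frame["SKU"] == frame.loc[i, "SKU"]))
            out.loc[i] = frame.loc[mask, "received"].sum()
    return out

def forward_window_sum(s, lo, hi):
    """Sum of the rows lo..hi-1 positions ahead (one row per day assumed)."""
    rev = s.iloc[::-1]
    diff = rev.rolling(hi, min_periods=1).sum() - rev.rolling(lo, min_periods=1).sum()
    return diff.iloc[::-1]

indexed = df.set_index("Date")
vectorized = (indexed.groupby("SKU")["received"]
              .transform(lambda s: forward_window_sum(s, min_days, max_days))
              .where(indexed.index.dayofweek == 0, other=0))

print(vectorized.tolist())
print(loop_orders(df).tolist())
```

On this sample both versions give 8 for sku1 and 24 for sku2 on the Monday rows and 0 elsewhere. If some days are missing for a SKU, a row-based rolling window no longer corresponds to calendar days, so the data would need to be reindexed to a daily frequency first.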