Pandas: count occurrences in the last N days that meet certain conditions
Suppose I have the following dataframe and I want to count, for each row, the number of True occurrences in the same category over the previous 14 days. How can I do that? For example, the dataframe below should produce a new column with the values: 0, 1, 1, 0, 2, 0, 1, 0
Date Category has_egg
2017-01-01 Lunch True
2017-01-02 Lunch True
2017-01-02 Lunch False
2017-01-02 Dinner True
2017-01-12 Lunch False
2017-01-13 Breakfast False
2017-01-13 Dinner False
2017-02-04 Lunch True
I tried to use groupby but couldn't figure out the exact command. Something like:
df.groupby("Category").has_egg.count_number_of_True(time_delta(-14d)) ?
I think you can get a pretty general solution just by combining resample and rolling with groupby. (Note that the code below assumes your index is a proper python/pandas datetime. If not, you need to convert it first with pd.to_datetime.)
df.groupby('Category').resample('d').sum().fillna(0).\
groupby('Category').rolling(14,min_periods=1).sum()
The resample step just corrects for the fact that you can have more or fewer than one row for a given date/category. After that, rolling can be applied in a very simple way.
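A self-contained sketch of the above, assuming the Date column still needs to be parsed and set as the index (the column selection and level-based groupby are small restylings of the same chain):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-02',
             '2017-01-12', '2017-01-13', '2017-01-13', '2017-02-04'],
    'Category': ['Lunch', 'Lunch', 'Lunch', 'Dinner',
                 'Lunch', 'Breakfast', 'Dinner', 'Lunch'],
    'has_egg': [True, True, False, True, False, False, False, True],
})

# parse the dates and move them into the index, which resample requires
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# one row per category/day (True sums as 1), then a 14-day rolling sum
daily = df.groupby('Category')['has_egg'].resample('d').sum()
rolled = daily.groupby(level='Category').rolling(14, min_periods=1).sum()

print(rolled.loc['Lunch'])
```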
Here's part of the result:
Lunch Lunch 2017-01-01 1.0
2017-01-02 2.0
. . .
2017-01-14 2.0
2017-01-15 1.0
2017-01-16 0.0
Alternatively, for brevity, here's what it looks like at the weekly level:
df.groupby('Category').resample('w').sum().fillna(0).\
groupby('Category').rolling(2,min_periods=1).sum()
has_egg
Category Category Date
Breakfast Breakfast 2017-01-15 0.0
Dinner Dinner 2017-01-08 1.0
2017-01-15 1.0
Lunch Lunch 2017-01-01 1.0
2017-01-08 2.0
2017-01-15 1.0
2017-01-22 0.0
2017-01-29 0.0
2017-02-05 1.0
I think this approach should be pretty fast, although not memory efficient, since it expands your data to one row for each date/category combination. If memory becomes an issue, you would need to look at alternative approaches (which are likely to be somewhat slower and less elegant), but I wouldn't bother with that unless your data is quite large.
Also note, I believe this code should work fine even if you have more than one True value for a given date/category, although your example data doesn't include that case. You can extend the sample data if handling that case is important to you.
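For instance, a small sketch of that case (the duplicate row here is hypothetical, not in the original sample): two True rows on the same date/category simply sum to 2 in the daily resample, and the rolling sum picks that up.

```python
import pandas as pd

# hypothetical sample with two True 'Lunch' rows on the same date
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02']),
    'Category': ['Lunch', 'Lunch', 'Lunch'],
    'has_egg': [True, True, True],
}).set_index('Date')

daily = df.groupby('Category')['has_egg'].resample('d').sum()
print(daily.loc[('Lunch', pd.Timestamp('2017-01-02'))])  # 2 on the duplicate day
```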
Well, it may be an inefficient way, but you can do it: iterate over each row, build a mask over the dataframe selecting the rows that fit the requirements, and count them to fill a new column.
import pandas as pd

# parse the Date column (.dt.date leaves plain python date objects)
df['Date'] = pd.to_datetime(df['Date']).dt.date
print(df)
Resulting df:
Date Category has_egg
0 2017-01-01 Lunch True
1 2017-01-02 Lunch True
2 2017-01-02 Lunch False
3 2017-01-02 Dinner True
4 2017-01-12 Lunch False
5 2017-01-13 Breakfast False
6 2017-01-13 Dinner False
7 2017-02-04 Lunch True
Now iterate over each row, find the rows that meet all the requirements, and sum them:
for index, row in df.iterrows():
    mask = ((df.Category == row.Category)
            & (df.Date > row.Date - pd.Timedelta(days=14))
            & (df.Date < row.Date)
            & (df.has_egg == True))
    df.loc[index, 'values'] = sum(mask)  # insert into the new column
print(df)
Output:
Date Category has_egg values
0 2017-01-01 Lunch True 0.0
1 2017-01-02 Lunch True 1.0
2 2017-01-02 Lunch False 1.0
3 2017-01-02 Dinner True 0.0
4 2017-01-12 Lunch False 2.0
5 2017-01-13 Breakfast False 0.0
6 2017-01-13 Dinner False 1.0
7 2017-02-04 Lunch True 0.0
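Put end to end, the approach above looks like this (keeping Date as pandas Timestamps rather than python dates, which works just as well for the comparisons):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-02',
                            '2017-01-12', '2017-01-13', '2017-01-13', '2017-02-04']),
    'Category': ['Lunch', 'Lunch', 'Lunch', 'Dinner',
                 'Lunch', 'Breakfast', 'Dinner', 'Lunch'],
    'has_egg': [True, True, False, True, False, False, False, True],
})

# for each row, count earlier True rows of the same category within 14 days
for index, row in df.iterrows():
    mask = ((df.Category == row.Category)
            & (df.Date > row.Date - pd.Timedelta(days=14))
            & (df.Date < row.Date)
            & df.has_egg)
    df.loc[index, 'values'] = mask.sum()

print(df['values'].tolist())  # [0.0, 1.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0]
```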