Pandas: count occurrences in the last N days that meet certain conditions
Suppose I have the following dataframe and I want to count, for each row, the number of True occurrences in the same category over the previous 14 days. How can I do that? For example, the dataframe below should produce a new column with the values: 0, 1, 1, 0, 2, 0, 1, 0
Date Category has_egg
2017-01-01 Lunch True
2017-01-02 Lunch True
2017-01-02 Lunch False
2017-01-02 Dinner True
2017-01-12 Lunch False
2017-01-13 Breakfast False
2017-01-13 Dinner False
2017-02-04 Lunch True
I tried to use groupby but couldn't figure out the exact command. Something like:
df.groupby("Category").has_egg.count_number_of_True(time_delta(-14d)) ?
I think you can get a pretty general solution just by combining resample and rolling with groupby. (Note that the code below assumes your index is a proper python/pandas datetime. If not, you need to convert it first with pd.to_datetime.)
df.groupby('Category').resample('d').sum().fillna(0).\
groupby('Category').rolling(14,min_periods=1).sum()
The resample step just corrects for the fact that you can have more or fewer than one row for a given date/category. After that, rolling can be applied in a very simple way.
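A self-contained sketch of the above, assuming the Date column still needs to be parsed and set as the index (the column selection and level-based groupby are small restylings of the same chain):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-02',
             '2017-01-12', '2017-01-13', '2017-01-13', '2017-02-04'],
    'Category': ['Lunch', 'Lunch', 'Lunch', 'Dinner',
                 'Lunch', 'Breakfast', 'Dinner', 'Lunch'],
    'has_egg': [True, True, False, True, False, False, False, True],
})

# parse the dates and move them into the index, which resample requires
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# one row per category/day (True sums as 1), then a 14-day rolling sum
daily = df.groupby('Category')['has_egg'].resample('d').sum()
rolled = daily.groupby(level='Category').rolling(14, min_periods=1).sum()

print(rolled.loc['Lunch'])
```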
Here's part of the result:
Lunch Lunch 2017-01-01 1.0
2017-01-02 2.0
. . .
2017-01-14 2.0
2017-01-15 1.0
2017-01-16 0.0
Alternatively, for brevity, here's what it looks like at the weekly level:
df.groupby('Category').resample('w').sum().fillna(0).\
groupby('Category').rolling(2,min_periods=1).sum()
has_egg
Category Category Date
Breakfast Breakfast 2017-01-15 0.0
Dinner Dinner 2017-01-08 1.0
2017-01-15 1.0
Lunch Lunch 2017-01-01 1.0
2017-01-08 2.0
2017-01-15 1.0
2017-01-22 0.0
2017-01-29 0.0
2017-02-05 1.0
I think this approach should be pretty fast, although not memory efficient, since it expands your data to one row for each date/category combination. If memory becomes an issue, you would need to look at alternative approaches (which are likely to be somewhat slower and less elegant), but I wouldn't bother with that unless your data is quite large.
Also note, I believe this code should work fine even if you have more than one True value for a given date/category, although your example data doesn't include that case. You can extend the sample data if handling that case is important to you.
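For instance, a small sketch of that case (the duplicate row here is hypothetical, not in the original sample): two True rows on the same date/category simply sum to 2 in the daily resample, and the rolling sum picks that up.

```python
import pandas as pd

# hypothetical sample with two True 'Lunch' rows on the same date
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02']),
    'Category': ['Lunch', 'Lunch', 'Lunch'],
    'has_egg': [True, True, True],
}).set_index('Date')

daily = df.groupby('Category')['has_egg'].resample('d').sum()
print(daily.loc[('Lunch', pd.Timestamp('2017-01-02'))])  # 2 on the duplicate day
```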
Well, it may be an inefficient way, but you can do it: iterate over each row, build a mask over the dataframe selecting the rows that fit the requirements, and count them to fill a new column.
import pandas as pd

# parse the Date column (.dt.date leaves plain python date objects)
df['Date'] = pd.to_datetime(df['Date']).dt.date
print(df)
Resulting df:
Date Category has_egg
0 2017-01-01 Lunch True
1 2017-01-02 Lunch True
2 2017-01-02 Lunch False
3 2017-01-02 Dinner True
4 2017-01-12 Lunch False
5 2017-01-13 Breakfast False
6 2017-01-13 Dinner False
7 2017-02-04 Lunch True
Now iterate over each row, find the rows that meet all the requirements, and sum them:
for index, row in df.iterrows():
    mask = ((df.Category == row.Category)
            & (df.Date > row.Date - pd.Timedelta(days=14))
            & (df.Date < row.Date)
            & (df.has_egg == True))
    df.loc[index, 'values'] = sum(mask)  # insert into the new column
print(df)
Output:
Date Category has_egg values
0 2017-01-01 Lunch True 0.0
1 2017-01-02 Lunch True 1.0
2 2017-01-02 Lunch False 1.0
3 2017-01-02 Dinner True 0.0
4 2017-01-12 Lunch False 2.0
5 2017-01-13 Breakfast False 0.0
6 2017-01-13 Dinner False 1.0
7 2017-02-04 Lunch True 0.0
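Put end to end, the approach above looks like this (keeping Date as pandas Timestamps rather than python dates, which works just as well for the comparisons):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02', '2017-01-02',
                            '2017-01-12', '2017-01-13', '2017-01-13', '2017-02-04']),
    'Category': ['Lunch', 'Lunch', 'Lunch', 'Dinner',
                 'Lunch', 'Breakfast', 'Dinner', 'Lunch'],
    'has_egg': [True, True, False, True, False, False, False, True],
})

# for each row, count earlier True rows of the same category within 14 days
for index, row in df.iterrows():
    mask = ((df.Category == row.Category)
            & (df.Date > row.Date - pd.Timedelta(days=14))
            & (df.Date < row.Date)
            & df.has_egg)
    df.loc[index, 'values'] = mask.sum()

print(df['values'].tolist())  # [0.0, 1.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0]
```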