Count number of rows between two dates by ID in a pandas DataFrame with groupby

I have the following test DataFrame:

import random
from datetime import timedelta
import pandas as pd
import datetime

# create a test range of dates
rng = pd.date_range(datetime.date(2015, 1, 1), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)
# create the test dataframe
# note: random.randint's upper bound is inclusive, so use len(rng) - 1
# to avoid an occasional IndexError on the last position
d = {'jid': [i for i in testpts],
     'cid': [random.randint(1, 2) for _ in testpts],
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)
# a single randint here gives every row the same duration (2-32 days)
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 32))


This gives a DataFrame like the one below, with a company id column 'cid', a unique id column 'jid', a start date column 'stdt', and an end date column 'enddt'.

   cid  jid       stdt      enddt
0    1  100 2015-07-06 2015-07-13
1    1  101 2015-07-15 2015-07-22
2    2  102 2015-07-12 2015-07-19
3    2  103 2015-07-07 2015-07-14
4    2  104 2015-07-14 2015-07-21
5    1  105 2015-07-11 2015-07-18
6    1  106 2015-07-12 2015-07-19
7    2  107 2015-07-01 2015-07-08
8    2  108 2015-07-10 2015-07-17
9    2  109 2015-07-09 2015-07-16


What I need to do is this: for each cid, count the number of jids that are active on each date (newdate) between that cid's min(stdt) and max(enddt), where a jid is active on newdate if stdt <= newdate <= enddt.
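
For instance, the count for a single (cid, newdate) pair boils down to a boolean mask (a minimal sketch using the columns above; the cid and date values here are arbitrary):

# number of jids for cid 1 whose stdt/enddt interval contains 2015-07-12
newdate = pd.Timestamp('2015-07-12')
cnt = ((df['cid'] == 1) & (df['stdt'] <= newdate) & (df['enddt'] >= newdate)).sum()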

The resulting dataset must be a DataFrame that has, for each cid, every date (newdate) in the range between that cid's min(stdt) and max(enddt), together with a count (cnt) of the jids whose stdt/enddt interval contains that newdate. The final DataFrame should look like this (showing just one cid from the data above):

cid newdate cnt
1   2015-07-06  1
1   2015-07-07  1
1   2015-07-08  1
1   2015-07-09  1
1   2015-07-10  1
1   2015-07-11  2
1   2015-07-12  3
1   2015-07-13  3
1   2015-07-14  2
1   2015-07-15  3
1   2015-07-16  3
1   2015-07-17  3
1   2015-07-18  3
1   2015-07-19  2
1   2015-07-20  1
1   2015-07-21  1
1   2015-07-22  1


I believe there must be a way to use pandas groupby (grouping by cid), perhaps with some form of lambda(?), to create the desired new DataFrame.

I am currently running a loop that, for each cid, slices that cid's rows out of the master df, determines the relevant date range for the slice (min stdt to max enddt), and then, for each date (newdate) in that range, counts the number of jids whose stdt/enddt interval contains it. I then append each result to a new DataFrame that looks like the one above.
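
Roughly, the loop looks like this (a sketch reconstructed from the description above, not my exact code):

results = []
for cid in df['cid'].unique():
    sub = df[df['cid'] == cid]
    for newdate in pd.date_range(sub['stdt'].min(), sub['enddt'].max()):
        # count jids of this cid active on newdate
        cnt = ((sub['stdt'] <= newdate) & (sub['enddt'] >= newdate)).sum()
        results.append({'cid': cid, 'newdate': newdate, 'cnt': cnt})
result_df = pd.DataFrame(results)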

But this is very expensive in terms of time and resources. Doing it on millions of jids across thousands of cids literally takes an entire day. I am hoping there is a simpler pandas solution here.



2 answers


My usual approach to these problems is to think in terms of the events that change the running count. Every stdt we see adds +1 to the count; every enddt adds -1 (on the following day, at least under your interpretation of "between"; some days I think that word should be banned for being too ambiguous).
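
As a toy illustration of the event idea (a hypothetical mini-example, separate from the actual solution below), two overlapping intervals produce four events, and their running sum is the number of open intervals on each event date:

import pandas as pd

# +1 when an interval opens, -1 the day after it closes
events = pd.Series([1, 1, -1, -1],
                   index=pd.to_datetime(['2015-01-01', '2015-01-03',
                                         '2015-01-05', '2015-01-08']))
events.cumsum()  # 1, 2, 1, 0 open intervals at each event date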

IOW if we turn your frame into something like

>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01


then we just want the cumulative sum of change (after an appropriate rearrangement). For example, something like:



df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])

df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()

new_time = pd.date_range(df.date.min(), df.date.max())

df_parts = []
for cid, group in df.groupby("cid"):
    full_count = group[["date", "count"]].set_index("date")
    full_count = full_count.reindex(new_time)
    full_count = full_count.ffill().fillna(0)
    full_count["cid"] = cid
    df_parts.append(full_count)

df_new = pd.concat(df_parts)


which gives me something like

>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1


There may be some differences depending on your expectations; for example, you may have different ideas about how to handle multiple overlapping intervals for the same jid in the same time window (they will count as 2 here); but the basic idea of working with events should be useful even if you need to tweak the details.
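
If you want the output in exactly the cid/newdate/cnt shape from the question, a small reshape of df_new should get you there (a sketch assuming the column names produced above):

# turn the date index into a 'newdate' column and rename 'count' to 'cnt'
df_new.index.name = "newdate"
df_new = df_new.reset_index().rename(columns={"count": "cnt"})[["cid", "newdate", "cnt"]]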



Here is the solution I came up with (it loops over the Cartesian product of the unique cids and the full date range, computing your counts):

from itertools import product

# one row per (cid, newdate) pair, counting jids whose interval contains newdate
df_new_date = pd.DataFrame(list(product(df.cid.unique(), pd.date_range(df.stdt.min(), df.enddt.max()))), columns=['cid', 'newdate'])
df_new_date['cnt'] = df_new_date.apply(
    lambda row: df[(df['cid'] == row['cid']) & (df['stdt'] <= row['newdate'])
                   & (df['enddt'] >= row['newdate'])]['jid'].count(), axis=1)

>>> df_new_date.head(20) 
    cid    newdate  cnt
0     1 2015-07-01    0
1     1 2015-07-02    0
2     1 2015-07-03    0
3     1 2015-07-04    0
4     1 2015-07-05    0
5     1 2015-07-06    1
6     1 2015-07-07    1
7     1 2015-07-08    1
8     1 2015-07-09    1
9     1 2015-07-10    1
10    1 2015-07-11    2
11    1 2015-07-12    3
12    1 2015-07-13    3
13    1 2015-07-14    2
14    1 2015-07-15    3
15    1 2015-07-16    3
16    1 2015-07-17    3
17    1 2015-07-18    3
18    1 2015-07-19    2
19    1 2015-07-20    1


Then you can drop the zero rows if you don't want them. However, I don't think this will perform much better than your original solution.
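
Dropping the zero rows is a one-liner (using the df_new_date frame built above):

df_new_date = df_new_date[df_new_date['cnt'] > 0]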



I would also suggest the following improvement to the loop in @DSM's solution:

# assumes df has already been through the melt/cumsum steps above,
# so it has 'cid', 'date' and 'count' columns
df_parts = []
for cid in df.cid.unique():
    # asfreq("D", ...) fills in this cid's missing days, forward-filling the count
    full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                  .set_index("date")
                  .asfreq("D", method='ffill')[['cid', 'count']]
                  .reset_index())
    df_parts.append(full_count[full_count['count'] != 0])

df_new = pd.concat(df_parts)

>>> df_new
         date  cid  count
0  2015-07-06    1      1
1  2015-07-07    1      1
2  2015-07-08    1      1
3  2015-07-09    1      1
4  2015-07-10    1      1
5  2015-07-11    1      2
6  2015-07-12    1      3
7  2015-07-13    1      3
8  2015-07-14    1      2
9  2015-07-15    1      3
10 2015-07-16    1      3
11 2015-07-17    1      3
12 2015-07-18    1      3
13 2015-07-19    1      2
14 2015-07-20    1      1
15 2015-07-21    1      1
16 2015-07-22    1      1
0  2015-07-01    2      1
1  2015-07-02    2      1
2  2015-07-03    2      1
3  2015-07-04    2      1
4  2015-07-05    2      1
5  2015-07-06    2      1
6  2015-07-07    2      2
7  2015-07-08    2      2
8  2015-07-09    2      2
9  2015-07-10    2      3
10 2015-07-11    2      3
11 2015-07-12    2      4
12 2015-07-13    2      4
13 2015-07-14    2      5
14 2015-07-15    2      4
15 2015-07-16    2      4
16 2015-07-17    2      3
17 2015-07-18    2      2
18 2015-07-19    2      2
19 2015-07-20    2      1
20 2015-07-21    2      1


The only real improvement over @DSM's version is that this avoids creating a groupby object inside the loop, and it also restricts each cid's output to its own min stdt through max enddt range by dropping the zero-count rows.







