Create data timeframes in pandas using start / end date
I am trying to complete a cohort analysis in pandas
. I would like to be able to view the user's activity in Months 1, Month 2, Month n and group them when the user signed up. You can use groupby
to accomplish this, but if the user is inactive for a certain month, then they won't actually show up in the monthly data.
Ideally, here dataframe
's the one I'm trying to build:
| UserID | SignUpDate | VisitMonth | Visits |
|--------|------------|------------|--------|
| 1 | 2014-03 | 2014-03 | 2 |
| 1 | 2014-03 | 2014-04 | 1 |
| 1 | 2014-03 | 2014-05 | 0 |
| 1 | 2014-03 | 2014-06 | 0 |
| 1 | 2014-03 | 2014-07 | 0 |
| 1 | 2014-03 | 2014-08 | 0 |
| 1 | 2014-03 | 2014-09 | 0 |
| 1 | 2014-03 | 2014-10 | 1 |
| 1 | 2014-03 | 2014-11 | 0 |
| 1 | 2014-03 | 2014-12 | 0 |
| 1 | 2014-03 | 2015-01 | 2 |
| 1 | 2014-03 | 2015-02 | 1 |
| 1 | 2014-03 | 2015-03 | 0 |
| 2 | 2015-01 | 2015-01 | 2 |
| 2 | 2015-01 | 2015-02 | 0 |
| 2 | 2015-01 | 2015-03 | 2 |
Here are some sample data:
#Sample of users
users = [{'UserID': 'U1', 'SignUpDate': '2014-03-15'}, {'UserID': 'U2', 'SignUpDate': '2015-01-10'}]
#Create dataframe with users
usersDF = pd.DataFrame(data=users)
#Convert SignUpDate from Object to date
usersDF['SignUpDate'] = pd.to_datetime(usersDF['SignUpDate'])
#Add new column to build the month-only
usersDF['SignUpMonth'] = usersDF['SignUpDate'].dt.to_period('M')
#Sample of visits
visits =[{'UserID': 'U1', 'VisitDate': '2014-03-15'}, {'UserID': 'U1', 'VisitDate': '2014-03-28'}, {'UserID': 'U1', 'VisitDate': '2014-04-10'}, {'UserID': 'U1', 'VisitDate': '2014-10-28'}, {'UserID': 'U1', 'VisitDate': '2015-01-02'}, {'UserID': 'U2', 'VisitDate': '2015-01-10'}, {'UserID': 'U2', 'VisitDate': '2015-01-11'}, {'UserID': 'U1', 'VisitDate': '2014-01-20'}, {'UserID': 'U1', 'VisitDate': '2014-02-15'}, {'UserID': 'U2', 'VisitDate': '2015-03-01'}, {'UserID': 'U2', 'VisitDate': '2015-03-18'}]
#Create dataframe of visits
visitDF = pd.DataFrame(data=visits)
#Convert VisitDate from object to date
visitDF['VisitDate'] = pd.to_datetime(visitDF['VisitDate'])
#Add new column to build the month-only
visitDF['VisitMonth'] = visitDF['VisitDate'].dt.to_period('M')
#Group by to get the visits by user by month
visits_grouped = visitDF.groupby(['UserID', 'VisitMonth']).agg(['count'])
#Remove the index on UserID so can `merge`
visits_grouped.reset_index(inplace=True)
#Grouped by dataframe
cohortDF = visits_grouped.merge(usersDF[['UserID', 'SignUpMonth']], on='UserID', how='left')
This groupby / merge command only creates a dataframe where there are visits where I want to be able to aggregate other fields whether there is a visit or not.
DF grouped output:
UserID VisitMonth VisitDate
count
0 U1 2014-01 1
1 U1 2014-02 1
2 U1 2014-03 2
3 U1 2014-04 1
4 U1 2014-10 1
5 U1 2015-01 1
6 U2 2015-01 2
7 U2 2015-03 2
How I tried to solve this problem
I figured it for loop
would work for this, but I am unable to paginate the DF / datatypes DF. Here's some rough (doesn't work at all) code of how I thought about it:
Note: THIS CODE DOES NOT WORK
for user in range(0, len(usersDF)):
TheUserID = usersDF['UserID'][user]
TheSignUpMonth = usersDF['SignUpMonth'][user]
AnalysisMonth = pd.to_datetime(datetime.datetime(2015,03,31)).to_period('M')
numperiods = AnalysisMonth - TheSignUpMonth
months = pd.date_range(TheSignUpMonth, periods=numperiods, freq='M')
for month in months:
df.append([TheUserID, TheSignUpMonth, month])
There are several problems with this approach.
- Using a for loop doesn't seem very "panda -ish" to me.
- the numperiods part doesn't work. I know there are ways with
datetime
determining the number of days / months / etc between two dates, but trying to calculate the difference between monthly series doesn't work. - Building the dataframe in stages (per user id per month) gave me a bit of a problem since we're dealing with strings and panda timesery objects.
How can I build the dataframe I am looking for?
source to share
Here's a usage example dataframe
:
add a helper column for aggregation (necessary because all other columns are used in pivot table indexes)
visitDF['counter'] = 1
then create a pivot table so that the column is VisitDate
in the index on its own and the other dimensions are in the other.
df =pd.pivot_table(visitDF,index = 'VisitDate', columns=['VisitMonth','UserID'],\
values='counter', aggfunc='sum').
Then reformat the index for a day. This creates values NAN
for days without values
df =df.resample('D',how='sum')
Then unbind the frame, reset the index and fill with NAN
0 to get the frame you want.
df.unstack().reset_index().fillna(0)
source to share