Create data timeframes in pandas using start / end date

I am trying to complete a cohort analysis in pandas

. I would like to be able to view the user's activity in Months 1, Month 2, Month n and group them when the user signed up. You can use groupby

to accomplish this, but if the user is inactive for a certain month, then they won't actually show up in the monthly data.

Ideally, here dataframe

's the one I'm trying to build:

| UserID | SignUpDate | VisitMonth | Visits |
|--------|------------|------------|--------|
| 1      | 2014-03    | 2014-03    | 2      |
| 1      | 2014-03    | 2014-04    | 1      |
| 1      | 2014-03    | 2014-05    | 0      |
| 1      | 2014-03    | 2014-06    | 0      |
| 1      | 2014-03    | 2014-07    | 0      |
| 1      | 2014-03    | 2014-08    | 0      |
| 1      | 2014-03    | 2014-09    | 0      |
| 1      | 2014-03    | 2014-10    | 1      |
| 1      | 2014-03    | 2014-11    | 0      |
| 1      | 2014-03    | 2014-12    | 0      |
| 1      | 2014-03    | 2015-01    | 2      |
| 1      | 2014-03    | 2015-02    | 1      |
| 1      | 2014-03    | 2015-03    | 0      |
| 2      | 2015-01    | 2015-01    | 2      |
| 2      | 2015-01    | 2015-02    | 0      |
| 2      | 2015-01    | 2015-03    | 2      |

      

Here are some sample data:

#Sample of users
users = [{'UserID': 'U1', 'SignUpDate': '2014-03-15'}, {'UserID': 'U2', 'SignUpDate': '2015-01-10'}]

#Create dataframe with users
usersDF = pd.DataFrame(data=users)

#Convert SignUpDate from Object to date
usersDF['SignUpDate'] = pd.to_datetime(usersDF['SignUpDate'])

#Add new column to build the month-only 
usersDF['SignUpMonth'] = usersDF['SignUpDate'].dt.to_period('M')

#Sample of visits
visits =[{'UserID': 'U1', 'VisitDate': '2014-03-15'}, {'UserID': 'U1', 'VisitDate': '2014-03-28'}, {'UserID': 'U1', 'VisitDate': '2014-04-10'}, {'UserID': 'U1', 'VisitDate': '2014-10-28'}, {'UserID': 'U1', 'VisitDate': '2015-01-02'}, {'UserID': 'U2', 'VisitDate': '2015-01-10'}, {'UserID': 'U2', 'VisitDate': '2015-01-11'}, {'UserID': 'U1', 'VisitDate': '2014-01-20'}, {'UserID': 'U1', 'VisitDate': '2014-02-15'}, {'UserID': 'U2', 'VisitDate': '2015-03-01'}, {'UserID': 'U2', 'VisitDate': '2015-03-18'}]

#Create dataframe of visits
visitDF = pd.DataFrame(data=visits)

#Convert VisitDate from object to date
visitDF['VisitDate'] = pd.to_datetime(visitDF['VisitDate'])

#Add new column to build the month-only 
visitDF['VisitMonth'] = visitDF['VisitDate'].dt.to_period('M')

#Group by to get the visits by user by month
visits_grouped = visitDF.groupby(['UserID', 'VisitMonth']).agg(['count'])

#Remove the index on UserID so can `merge`
visits_grouped.reset_index(inplace=True)

#Grouped by dataframe
cohortDF = visits_grouped.merge(usersDF[['UserID', 'SignUpMonth']], on='UserID', how='left')

      

This groupby / merge command only creates a dataframe where there are visits where I want to be able to aggregate other fields whether there is a visit or not.

DF grouped output:

  UserID VisitMonth VisitDate
                        count
0     U1    2014-01         1
1     U1    2014-02         1
2     U1    2014-03         2
3     U1    2014-04         1
4     U1    2014-10         1
5     U1    2015-01         1
6     U2    2015-01         2
7     U2    2015-03         2

      

How I tried to solve this problem

I figured it for loop

would work for this, but I am unable to paginate the DF / datatypes DF. Here's some rough (doesn't work at all) code of how I thought about it:

Note: THIS CODE DOES NOT WORK

for user in range(0, len(usersDF)):
    TheUserID = usersDF['UserID'][user]
    TheSignUpMonth = usersDF['SignUpMonth'][user]
    AnalysisMonth = pd.to_datetime(datetime.datetime(2015,03,31)).to_period('M')
    numperiods = AnalysisMonth - TheSignUpMonth
    months = pd.date_range(TheSignUpMonth, periods=numperiods, freq='M')
    for month in months:
        df.append([TheUserID, TheSignUpMonth, month])

      

There are several problems with this approach.

  • Using a for loop doesn't seem very "panda -ish" to me.
  • the numperiods part doesn't work. I know there are ways with datetime

    determining the number of days / months / etc between two dates, but trying to calculate the difference between monthly series doesn't work.
  • Building the dataframe in stages (per user id per month) gave me a bit of a problem since we're dealing with strings and panda timesery objects.

How can I build the dataframe I am looking for?

+3


source to share


1 answer


Here's a usage example dataframe

:

add a helper column for aggregation (necessary because all other columns are used in pivot table indexes)

visitDF['counter'] = 1

      

then create a pivot table so that the column is VisitDate

in the index on its own and the other dimensions are in the other.

df =pd.pivot_table(visitDF,index = 'VisitDate', columns=['VisitMonth','UserID'],\
               values='counter', aggfunc='sum').

      



Then reformat the index for a day. This creates values NAN

for days without values

df =df.resample('D',how='sum')

      

Then unbind the frame, reset the index and fill with NAN

0 to get the frame you want.

df.unstack().reset_index().fillna(0)

      

0


source







All Articles