Rolling Unique Sum for 3 previous months in python
Below is the dataset I am looking at.
Input:-
Date Name
01/01/2017 A
01/03/2017 B
02/05/2017 A
03/17/2017 C
04/08/2017 D
05/10/2017 B
06/12/2017 D
Output:-
Date Unique Count
Jan 2017 2
Feb 2017 2
Mar 2017 3
Apr 2017 3
May 2017 3
Jun 2017 2
I want to get the unique "Name" count over the previous 3 months on a rolling basis. For example, as of 06/12/2017 the previous 3 months are April, May, and June. April had a D, May had a B, and June had a D, so the unique count for June is 2. The same logic applies to all other months.
I am looking for a pandas function that could help me with this. Or any custom code that could implement this.
Any help is appreciated.
Let's start by creating a DataFrame and setting dates as an index:
import pandas as pd

df = pd.DataFrame({'Date': ['01-01-2017', '01-03-2017', '02-05-2017', '03-17-2017', '04-08-2017', '05-10-2017', '06-12-2017'],
                   'Name': ['A', 'B', 'A', 'C', 'D', 'B', 'D']})
# The dates are month-first, so state the format explicitly
df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%Y')
df = df.set_index('Date')
First, we group by month, so that later we can do rolling counts per month:
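# pd.TimeGrouper was deprecated and later removed; pd.Grouper replaces it.
# Recent pandas versions spell the month-end alias 'ME' instead of 'M'.
groups = df.groupby(pd.Grouper(freq='M'))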
Now we need a way to keep all the names that we saw each month. We can put them in a list.
all_names_per_month = groups['Name'].apply(list)
It looks like this:
Date
2017-01-31 [A, B]
2017-02-28 [A]
2017-03-31 [C]
2017-04-30 [D]
2017-05-31 [B]
2017-06-30 [D]
Freq: M, Name: Name, dtype: object
Next, ideally, we would like to use all_names_per_month.rolling(3).apply(...), but unfortunately rolling apply does not work with non-numeric values, so we can instead write a custom rolling function to get the values we want:
import itertools

def get_values(window_len, names_per_month):
    values = []
    for i in range(1, len(names_per_month) + 1):
        # Clamp the start at 0 so the first months use a shorter window
        start = max(0, i - window_len)
        window = names_per_month.iloc[start:i]
        # Flatten the per-month lists and count the distinct names
        values.append(len(set(itertools.chain.from_iterable(window))))
    return values
values = get_values(3, all_names_per_month)
This gives us:
[2, 2, 3, 3, 3, 2]
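As a sanity check, the same numbers can be recomputed straight from the raw rows without pandas. This is only a sketch: the names rows, names_by_month, and check are made up for illustration, and it assumes each month is identified by a simple ordinal (year * 12 + month) so that a 3-month window is just a range of integers.

```python
from datetime import datetime

# The rows from the question, dates in month/day/year order
rows = [
    ("01/01/2017", "A"), ("01/03/2017", "B"), ("02/05/2017", "A"),
    ("03/17/2017", "C"), ("04/08/2017", "D"), ("05/10/2017", "B"),
    ("06/12/2017", "D"),
]

# Collect the set of names seen in each month, keyed by a month ordinal
names_by_month = {}
for date_str, name in rows:
    d = datetime.strptime(date_str, "%m/%d/%Y")
    names_by_month.setdefault(d.year * 12 + d.month, set()).add(name)

first, last = min(names_by_month), max(names_by_month)
check = []
for m in range(first, last + 1):
    # Union of the current month and up to two months before it
    window = set()
    for k in range(max(first, m - 2), m + 1):
        window |= names_by_month.get(k, set())
    check.append(len(window))

print(check)
```

If this list matches the output of get_values, the windowing logic agrees with a brute-force count.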
Finally, we can put these values in a DataFrame at the appropriate index, which we then reformat to look like the output you indicated above:
result = pd.DataFrame(data=values, columns=['Unique Count'], index=all_names_per_month.index)
result.index = result.index.strftime('%B %Y')
result
Unique Count
January 2017 2
February 2017 2
March 2017 3
April 2017 3
May 2017 3
June 2017 2
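For reference, the whole approach can also be condensed into one function. This is a sketch under a few assumptions, not the only way to do it: rolling_unique_count is a hypothetical helper name, months are grouped with dt.to_period('M') instead of a Grouper, names are collected as sets rather than lists, and months with no rows are treated as empty.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01/01/2017", "01/03/2017", "02/05/2017", "03/17/2017",
             "04/08/2017", "05/10/2017", "06/12/2017"],
    "Name": ["A", "B", "A", "C", "D", "B", "D"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

def rolling_unique_count(frame, window=3):
    # Hypothetical helper: distinct names over a trailing window of months
    months = frame["Date"].dt.to_period("M")
    lookup = frame.groupby(months)["Name"].apply(set).to_dict()
    # Walk every month in the range so that gaps count as empty months
    full = pd.period_range(min(lookup), max(lookup), freq="M")
    per_month = [lookup.get(p, set()) for p in full]
    counts = [
        len(set().union(*per_month[max(0, i - window + 1): i + 1]))
        for i in range(len(per_month))
    ]
    return pd.Series(counts, index=full.strftime("%b %Y"), name="Unique Count")

result = rolling_unique_count(df)
print(result)
```

The '%b %Y' format yields abbreviated month names such as "Jan 2017", as in the question's desired output. Changing the window argument gives a different look-back length.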