Count the number of days in multiple ranges
I have rows that each represent a date range (from → to). Here is a subset of the data.
import pandas as pd
from pandas import DataFrame

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
         from          to
0  2015-08-24  2015-08-26
1  2015-08-24  2015-08-31
I want to count the number of working days for each day in the ranges. Here is my code.
# Creating a business time index by taking min an max values from the ranges
b_range = pd.bdate_range(start=min(df['from']), end=max(df['to']))
# Init of a new DataFrame with this index and the count at 0
result = DataFrame(0, index=b_range, columns=['count'])
# Iterating over the range to select the index in the result and update the count column
for index, row in df.iterrows():
    result.loc[pd.bdate_range(row['from'], row['to']), 'count'] += 1
print(result)
            count
2015-08-24      2
2015-08-25      2
2015-08-26      2
2015-08-27      1
2015-08-28      1
2015-08-31      1
This works, but does anyone know of a more pythonic way to do this (i.e. without a for loop)?
I was not completely satisfied with these solutions, so I kept looking, and I think I found a fairly elegant and fast one. It was inspired by the section "Pivoting 'Long' to 'Wide' Format" in Wes McKinney's book Python for Data Analysis.
I have put a lot of comments in my code, but I find it preferable to print out each intermediate step to understand it.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.tseries.offsets import Day

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
# Convert the boundaries to datetime
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d')
df['to'] = pd.to_datetime(df['to'], format='%Y-%m-%d')
# Resetting the index to create a row id column named 'index'
df = df.reset_index(level=0)
# Pivoting the data to obtain 'from' as row index and the row id ('index') as column,
# each cell containing the 'to' date.
# As a consequence, each range (from - to pair) gets its own column.
pivoted = df.pivot(index='from', columns='index', values='to')
# Reindexing the table with a range of business dates (i.e. working days)
pivoted = pivoted.reindex(index=pd.bdate_range(start=min(df['from']),
                                               end=max(df['to'])))
# Filling the NA values forward to copy the 'to' date down:
# now each row of each column contains the corresponding 'to' date
pivoted = pivoted.ffill()
# Computing 'to' minus the index date for each cell and converting the result to days,
# i.e. the number of days between the date in the index and the 'to' date.
# Note: one day is added to include the right side of the interval
pivoted = pivoted.apply(lambda x: (x + Day() - x.index) / np.timedelta64(1, 'D'),
                        axis=0)
# Clipping values lower than 0 (not in the range) to 0
# and values greater than 0 to 1 (at most one per day and per id)
pivoted = pivoted.clip(0, 1)
# Summing along the columns, and that's it
pivoted.sum(axis=1)
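As a quick sanity check, the whole pipeline condensed into one snippet reproduces the per-day counts from the loop version in the question (written against pandas ≥ 2.0, hence .ffill() and .clip() instead of the older deprecated spellings; the variable name counts is mine):

```python
import pandas as pd
from pandas import DataFrame
from pandas.tseries.offsets import Day

df = DataFrame({'from': ['2015-08-24', '2015-08-24'],
                'to': ['2015-08-26', '2015-08-31']})
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
df = df.reset_index(level=0)

# pivot, reindex on business days, forward-fill the 'to' dates
pivoted = df.pivot(index='from', columns='index', values='to')
pivoted = pivoted.reindex(pd.bdate_range(df['from'].min(), df['to'].max())).ffill()

# days between index date and 'to' (+1 day to close the interval),
# clipped to {0, 1}, then summed across ranges
counts = (pivoted.apply(lambda x: (x + Day() - x.index) / pd.Timedelta('1D'))
                 .clip(0, 1)
                 .sum(axis=1))
print(counts)  # 2, 2, 2 for Aug 24-26, then 1, 1, 1 for Aug 27, 28, 31
```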
Caveat: I kind of hate this answer, but on this tiny frame it is over 2x faster, so I'll post it as a workable, if not neat, alternative.
df2 = df.apply(lambda x: [pd.bdate_range(x['from'], x['to'])], axis=1)
arr = np.unique(np.hstack(df2.values), return_counts=True)
result = pd.DataFrame(arr[1], index=arr[0])
Basically all I do here is build one array containing every date from every range, and then use numpy's unique (the analog of pandas' value_counts) to count them all. I was hoping to come up with something more elegant and readable, but this is what I have at the moment.
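To make the np.unique step concrete, here is a minimal sketch of how return_counts=True plays the role of value_counts (the toy data is mine, not from the answer):

```python
import numpy as np

# np.unique with return_counts=True returns the sorted unique values
# together with how many times each one occurs - the numpy counterpart
# of pandas' value_counts
dates = np.array(['2015-08-24', '2015-08-24', '2015-08-25',
                  '2015-08-26', '2015-08-26', '2015-08-26'])
values, counts = np.unique(dates, return_counts=True)
print(values)  # ['2015-08-24' '2015-08-25' '2015-08-26']
print(counts)  # [2 1 3]
```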
Here is a method that uses cumsum(). It should be faster than the for-loop if you have a large range:
import pandas as pd

df = pd.DataFrame({
    'from': ['2015-08-24', '2015-08-24'],
    'to': ['2015-08-26', '2015-08-31']})
df = df.apply(pd.to_datetime)

from_date = min(df['from'])
to_date = max(df['to'])
b_range = pd.bdate_range(start=from_date, end=to_date)
d_range = pd.date_range(start=from_date, end=to_date)

s = pd.Series(0, index=d_range)
from_count = df["from"].value_counts()
to_count = df["to"].value_counts()
s.add(from_count, fill_value=0).sub(to_count.shift(freq="D"), fill_value=0).cumsum().reindex(b_range)
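The idea behind the one-liner: add 1 on each range's start date, subtract 1 on the day after each range's end date, and the running total (cumsum) is then the number of ranges covering each day. A toy illustration of just that trick, with a single range 2015-08-24 to 2015-08-26 (variable names are mine, not from the answer):

```python
import pandas as pd

days = pd.date_range('2015-08-24', '2015-08-27')
deltas = pd.Series(0, index=days)
deltas['2015-08-24'] += 1  # one range starts here...
deltas['2015-08-27'] -= 1  # ...and ended on 08-26, so -1 the day after
coverage = deltas.cumsum()
print(coverage.tolist())  # [1, 1, 1, 0]
```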