Check if the date column contains all hours in each year

I often have to download hourly historical data from a website in the following format:

                date      A     B     C
    2011/01/01 00:00    100   200   300
    2011/01/01 01:00    105   210   330
    .....
    2012/12/31 23:00    200   400   500

      

The problem I keep running into is that the online data is often missing a few hours or days in a year, sometimes in several places. I need to check how many timestamps are missing, and which ones, in order to decide whether the data is usable.

I usually do df.groupby(df['date'].dt.year)['date'].count()

and see whether each year has 8760 rows (8784 for leap years), then check manually which days are missing. Has anyone had a similar problem, and does anyone know how to write code that tells me which years are short of hours and exactly which hours are missing?



1 answer


Use asfreq and difference. This assumes the date column has been parsed to datetimes and set as the DataFrame's index:



df.asfreq('H').index.difference(df.index)

DatetimeIndex(['2011-01-01 02:00:00', '2011-01-01 03:00:00',
               '2011-01-01 04:00:00', '2011-01-01 05:00:00',
               '2011-01-01 06:00:00', '2011-01-01 07:00:00',
               '2011-01-01 08:00:00', '2011-01-01 09:00:00',
               '2011-01-01 10:00:00', '2011-01-01 11:00:00',
               ...
               '2012-12-31 13:00:00', '2012-12-31 14:00:00',
               '2012-12-31 15:00:00', '2012-12-31 16:00:00',
               '2012-12-31 17:00:00', '2012-12-31 18:00:00',
               '2012-12-31 19:00:00', '2012-12-31 20:00:00',
               '2012-12-31 21:00:00', '2012-12-31 22:00:00'],
              dtype='datetime64[ns]', name='date', length=17541, freq='H')
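
Note that asfreq only spans from the first to the last timestamp actually present, so hours missing at the very start or end of a year would not show up. As a minimal sketch of one way around that, assuming the column is named date as in the question, the following builds the expected hourly grid over whole calendar years with pd.date_range (a swap for asfreq, not part of the answer above) and then counts the gaps per year. The helper names find_missing_hours and missing_per_year and the sample data are hypothetical. The frequency is passed as a Timedelta rather than a string alias, since newer pandas versions deprecate 'H' in favor of 'h'.

```python
import pandas as pd


def find_missing_hours(df, date_col="date"):
    """Return the hourly timestamps absent from df[date_col], over whole calendar years."""
    stamps = pd.DatetimeIndex(pd.to_datetime(df[date_col]))
    first_year, last_year = stamps.min().year, stamps.max().year
    # Every hour from Jan 1 00:00 of the first year to Dec 31 23:00 of the last year.
    expected = pd.date_range(
        start=pd.Timestamp(year=first_year, month=1, day=1),
        end=pd.Timestamp(year=last_year, month=12, day=31, hour=23),
        freq=pd.Timedelta(hours=1),
    )
    return expected.difference(stamps)


def missing_per_year(missing):
    """Count the missing hours in each calendar year."""
    return pd.Series(missing.year).value_counts().sort_index()


# Hypothetical data: two complete years with three hours removed on purpose.
full = pd.date_range("2011-01-01 00:00", "2012-12-31 23:00", freq=pd.Timedelta(hours=1))
gaps = pd.DatetimeIndex(["2011-01-01 00:00", "2011-03-05 07:00", "2012-02-29 12:00"])
kept = full.difference(gaps)
df = pd.DataFrame({"date": kept.strftime("%Y/%m/%d %H:%M"), "A": range(len(kept))})

missing = find_missing_hours(df)
print(missing)                    # the removed timestamps
print(missing_per_year(missing))  # number of missing hours for each year
```

Years with no gaps simply do not appear in the per-year counts, so an empty result means every hour is present.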

      
