Check if the date column contains all hours in each year
I often have to download hourly historical data from a website in the following format:
date              A    B    C
2011/01/01 00:00  100  200  300
2011/01/01 01:00  105  210  330
.....
2012/12/31 23:00  200  400  500
The problem I'm running into is that the online data is often missing a few hours or days, several times a year. I need to check how many timestamps are missing, and which ones, in order to decide whether the data is usable.
I usually do df.groupby(by = df['date'].dt.year)['date'].count()
and check that each year has 8760 rows (8784 for leap years), then look for the missing days manually. Has anyone had a similar problem and knows how to write a piece of code that tells me which years are short of hours and which hours are missing?
Use asfreq and difference (this assumes date is set as the DataFrame's DatetimeIndex):
df.asfreq('H').index.difference(df.index)
DatetimeIndex(['2011-01-01 02:00:00', '2011-01-01 03:00:00',
'2011-01-01 04:00:00', '2011-01-01 05:00:00',
'2011-01-01 06:00:00', '2011-01-01 07:00:00',
'2011-01-01 08:00:00', '2011-01-01 09:00:00',
'2011-01-01 10:00:00', '2011-01-01 11:00:00',
...
'2012-12-31 13:00:00', '2012-12-31 14:00:00',
'2012-12-31 15:00:00', '2012-12-31 16:00:00',
'2012-12-31 17:00:00', '2012-12-31 18:00:00',
'2012-12-31 19:00:00', '2012-12-31 20:00:00',
'2012-12-31 21:00:00', '2012-12-31 22:00:00'],
dtype='datetime64[ns]', name='date', length=17541, freq='H')
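To also answer the "which year is short" part of the question, the missing timestamps can be counted per year. The following is a minimal sketch, not a definitive implementation: the helper name missing_hours, its start/end parameters, and the four-row sample frame are all made up for illustration, and it assumes the dates are already parsed into a DatetimeIndex. It builds the full hourly range with pd.date_range instead of asfreq, which gives the same set difference.

```python
import pandas as pd


def missing_hours(index, start=None, end=None):
    """Return the hourly timestamps in [start, end] that are absent from index.

    start/end default to the first and last timestamp present in the data
    (hypothetical helper, written for illustration only).
    """
    start = index.min() if start is None else pd.Timestamp(start)
    end = index.max() if end is None else pd.Timestamp(end)
    # Complete hourly range the data is expected to cover
    full = pd.date_range(start, end, freq=pd.Timedelta(hours=1))
    return full.difference(index)


# Made-up sample: five expected hours, with 02:00 absent
df = pd.DataFrame(
    {"A": [100, 105, 110, 120]},
    index=pd.DatetimeIndex(
        [
            "2011-01-01 00:00",
            "2011-01-01 01:00",
            "2011-01-01 03:00",
            "2011-01-01 04:00",
        ],
        name="date",
    ),
)

# Which hours are missing
missing = missing_hours(df.index)

# How many hours are missing in each calendar year
missing_per_year = pd.Series(missing.year).value_counts().sort_index()

print(missing)
print(missing_per_year)
```

By default only gaps between the first and last timestamp in the data are detected. If the file might start late or end early, pass explicit bounds, e.g. missing_hours(df.index, "2011-01-01 00:00", "2012-12-31 23:00"), so that hours missing at the edges of a year are counted too.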