Pandas read_hdf query by date and time

I have a question about how to filter results in the pd.read_hdf function. Here's the setup: I have a pandas DataFrame (indexed by np.datetime64) that I store in an HDF5 file. Nothing interesting here, so no need for hierarchy or anything else (though perhaps I could make use of it?). Here's an example:

                              Foo          Bar
TIME                                         
2014-07-14 12:02:00            0            0
2014-07-14 12:03:00            0            0
2014-07-14 12:04:00            0            0
2014-07-14 12:05:00            0            0
2014-07-14 12:06:00            0            0
2014-07-15 12:02:00            0            0
2014-07-15 12:03:00            0            0
2014-07-15 12:04:00            0            0
2014-07-15 12:05:00            0            0
2014-07-15 12:06:00            0            0
2014-07-16 12:02:00            0            0
2014-07-16 12:03:00            0            0
2014-07-16 12:04:00            0            0
2014-07-16 12:05:00            0            0
2014-07-16 12:06:00            0            0
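For reference, a frame shaped like the one above might be generated like this (a minimal sketch; the zero values are placeholders, as in the question):

```python
import pandas as pd

# Recreate the example frame: minutes 12:02-12:06 on three consecutive days
times = [pd.Timestamp('2014-07-%02d 12:%02d:00' % (day, minute))
         for day in (14, 15, 16) for minute in range(2, 7)]
df = pd.DataFrame({'Foo': 0, 'Bar': 0},
                  index=pd.Index(times, name='TIME'))
```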

      

I now store this in .h5 using the following command:

store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()

Then I will have another process that accesses this data, and I would like to filter it by date and time. So let's say I want dates between 2014-07-14 and 2014-07-15, and only times between 12:02:00 and 12:04:00. I am currently using the following command to get this:

pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))

As far as I know (someone please correct me if I'm wrong), the whole original dataset is not read into memory when I use where. In other words, this:

pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')

is not the same as this:

pd.read_hdf('qux.h5', 'data')['20140714':'20140715']

While the end result may be exactly the same, what is being done in the background is not. So my question is: is there a way to include this time-interval filter (i.e. between_time()) in the where clause? Or, if not, is there another way I should structure my HDF5 file? Perhaps store a table for every day?

Thanks!

EDIT:

Regarding using hierarchy: I am aware that the structure should depend heavily on how I will use the data. However, suppose I define one table per date (e.g. "df/date_20140714", "df/date_20140715", ...). Again, I may be wrong here, but using my date/time range query as an example, I would probably incur a performance penalty, since I would have to read every table and concatenate them to get the consolidated result, correct?
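The per-day layout described in the edit can be sketched like this (hypothetical key names and made-up data; querying a date range then means one read per daily table plus a concat, which is the penalty the edit worries about):

```python
import pandas as pd

# Hypothetical per-day layout: one table per date, keys like 'df/date_YYYYMMDD'
frames = {
    '20140714': pd.DataFrame({'Foo': 0, 'Bar': 0},
                             index=pd.date_range('2014-07-14 12:02',
                                                 periods=5, freq='min')),
    '20140715': pd.DataFrame({'Foo': 0, 'Bar': 0},
                             index=pd.date_range('2014-07-15 12:02',
                                                 periods=5, freq='min')),
}
with pd.HDFStore('qux_daily.h5', mode='w') as store:
    for day, frame in frames.items():
        store.append('df/date_%s' % day, frame)

# Querying a date range: one read per daily table, then concatenate
with pd.HDFStore('qux_daily.h5') as store:
    parts = [store.select('df/date_%s' % day)
             for day in ('20140714', '20140715')]
result = pd.concat(parts).between_time('12:02', '12:04')
```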



2 answers


See the docs on selecting using a where mask.

Here's an example:

In [50]: pd.set_option('max_rows',10)

In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))

In [52]: df
Out[52]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 11:00:00  0.554420  0.777484
2013-02-11 12:00:00 -0.558041  1.833465
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[1000 rows x 2 columns]

In [53]: store = pd.HDFStore('test.h5',mode='w')

In [54]: store.append('df',df)

In [55]: c = store.select_column('df','index')

In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')

In [57]: store.select('df',where=where)
Out[57]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 03:00:00  0.902023  1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[664 rows x 2 columns]

In [58]: store.close()



A few points to note. This reads in the entire index to begin with. Usually this is not a burden; if it is, you can just chunk-read it (provide start/stop, though it's a bit manual to do this at the moment). The current select_column cannot accept a query, as far as I know.

You could potentially iterate over the days (and do individual queries) if you have a gigantic amount of data (tens of millions of rows that are wide), which might be more efficient.

Recombining the data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance down).
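Putting the question's two filters together with this technique: both the date window and the time-of-day window can be turned into row positions and intersected before a single select (a sketch with made-up data; the file and key names are placeholders):

```python
import datetime

import numpy as np
import pandas as pd

# Build a small store shaped like the one in the question (made-up data)
idx = pd.date_range('2014-07-14 12:00', periods=3000, freq='min')
df = pd.DataFrame({'Foo': 0, 'Bar': 0}, index=idx)
with pd.HDFStore('qux_demo.h5', mode='w') as store:
    store.append('data', df)

with pd.HDFStore('qux_demo.h5') as store:
    # Read only the index column, not the whole table
    c = store.select_column('data', 'index')
    full_idx = pd.DatetimeIndex(c)
    # Row positions inside the time-of-day window
    time_locs = full_idx.indexer_between_time(datetime.time(12, 2),
                                              datetime.time(12, 4))
    # Row positions inside the date window (14th and 15th inclusive)
    date_locs = np.flatnonzero((full_idx >= '2014-07-14') &
                               (full_idx < '2014-07-16'))
    # Intersect the two and pass the coordinates as the where argument
    where = np.intersect1d(time_locs, date_locs)
    result = store.select('data', where=where)
```

Only the rows satisfying both conditions are read back, so the time-of-day filter effectively ends up inside the where clause.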



UNDER CONSTRUCTION, CHECK BACK IN 10 MINUTES:

Let's step through the source code (because I don't know the answer yet, and it's a good way to find out) to see what happens when we use the where argument of the pandas.read_hdf function.

The signature (arguments) of read_hdf:

def read_hdf(path_or_buf, key=None, mode='r', **kwargs):

So where ends up in the kwargs dictionary.
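A tiny illustration of that mechanism (not pandas code, just a stand-in with the same signature):

```python
def read_hdf_like(path_or_buf, key=None, mode='r', **kwargs):
    # Any argument not named explicitly in the signature lands in kwargs
    return kwargs

captured = read_hdf_like('qux.h5', 'data',
                         where='index >= 20140714 and index <= 20140715')
```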


Spoiler: this part turns out to be irrelevant, because nothing is actually done with where here.

This dictionary is first passed to the constructor (__init__) of the HDFStore class:

store = HDFStore(path_or_buf, mode=mode, **kwargs)

and kwargs is passed on to HDFStore's open method:

self.open(mode=mode, **kwargs)

which in turn passes it to:

self._handle = tables.open_file(self._path, self._mode, **kwargs)

where tables is the PyTables library, and open_file is the function that creates a pytables File object, merging kwargs into the File's parameter list:

kwargs = dict([(k.upper(), v) for k, v in six.iteritems(kwargs)])
params.update(kwargs)

However, this has no effect on where: the parameter list is only consulted for PyTables performance-tuning settings, so the uppercased WHERE entry is never actually read. In other words, nothing really happens with where here.


After the HDFStore object named store has been created, kwargs (still containing where) is then passed to:

return store.select(key, auto_close=auto_close, **kwargs)
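The trace runs out here, but the upshot can be checked directly: where has no effect at file-open time and is consumed later by HDFStore.select (a small sketch with made-up data and a placeholder file name):

```python
import pandas as pd

df = pd.DataFrame({'A': range(10)},
                  index=pd.date_range('2013-01-01', periods=10))
with pd.HDFStore('demo.h5', mode='w') as store:
    store.append('df', df)  # table format, so it is queryable
with pd.HDFStore('demo.h5') as store:
    # where is handled here, inside select(), not at file-open time
    subset = store.select('df', where="index >= '2013-01-05'")
```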