Iterator and chunksize in HDFStore.select: "Memory error"
As far as I understand it, HDFStore.select
is the tool to use for selecting from large datasets. However, when iterating over chunks using chunksize
and iterator=True
, the iterator itself becomes a very large object once the underlying dataset is large enough, and I don't understand why the iterator is so large or what information it holds that makes it grow like that.
I have a very large HDFStore table
(7 billion rows, 420 GB on disk) that I would like to iterate over in chunks:
iterator = HDFStore.select('df', iterator=True, chunksize=chunksize)
for i, chunk in enumerate(iterator):
# some code to apply to each chunk
When I run this code on a relatively small file, everything works fine. However, when I try to apply it to the 7 billion row table, I get a MemoryError
as soon as the iterator is created. I have 32 GB of RAM.
I would like a generator that creates the chunks on the fly and doesn't keep so much in RAM, for example:
iteratorGenerator = lambda: HDFStore.select('df', iterator=True, chunksize=chunksize)
for i, chunk in enumerate(iteratorGenerator):
# some code to apply to each chunk
but iteratorGenerator
is not iterable, so this doesn't work either.
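(The only way to use the lambda would be to call it, which just runs the same select again and presumably hits the same MemoryError when the iterator is built; a sketch of that, not a fix:)

for i, chunk in enumerate(iteratorGenerator()):
    # some code to apply to each chunk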
I could loop over HDFStore.select
manually using start
and stop
, but I thought there must be a more elegant way to iterate.
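For illustration, the start/stop version I have in mind would look roughly like this (a sketch, assuming the row count can be read from the table storer via get_storer):

nrows = HDFStore.get_storer('df').nrows   # total number of rows in the table
for start in range(0, nrows, chunksize):
    chunk = HDFStore.select('df', start=start, stop=start + chunksize)
    # some code to apply to each chunk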
I had the same problem with (only) a 30 GB file, and apparently you can fix it by making the garbage collector do its job... collect! :P

PS: You don't need a lambda for that; the select call returns an iterator, just iterate over it like you did in your first code block.
import gc
import pandas as pd

with pd.HDFStore(file_path, mode='a') as store:
    # All you need is the chunksize, not iterator=True
    iterator = store.select('df', chunksize=chunksize)
    for i, chunk in enumerate(iterator):
        # some code to apply to each chunk

        # magic line that solved my memory problem:
        # free the previous chunk before loading the next one
        gc.collect()