Iterator and chunksize in HDFStore.select: "Memory error"

As far as I understand it, HDFStore.select is the tool to use for selecting from large datasets. However, when iterating over chunks using chunksize and iterator=True, the iterator itself becomes a very large object once the underlying dataset is large enough, and I don't understand why the iterator object is so large or what information it contains that makes it grow like that.

I have a very large HDFStore (7 billion rows of strings, 420 GB on disk) that I would like to iterate over in chunks:

iterator = HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iterator):
    # some code to apply to each chunk

When I run this code on a relatively small file, everything works fine. However, when I try to apply it to the 7-billion-row database, I get a MemoryError when the iterator is created. I have 32 GB of RAM.

I would like a generator that creates the chunks on the fly and doesn't keep so much in RAM, for example:

iteratorGenerator = lambda: HDFStore.select('df', iterator=True, chunksize=chunksize)

for i, chunk in enumerate(iteratorGenerator):
    # some code to apply to each chunk

but iteratorGenerator is not iterable, so this doesn't work either.

I could call HDFStore.select in a loop, passing start and stop by hand, but I thought there must be a more elegant way to iterate.


1 answer


I had the same problem with (only) a 30 GB file, and apparently you can fix it by making the garbage collector do its job... collect! :P

P.S.: You don't need a lambda for this; the select call returns an iterator. Just loop over it as you did in your first block of code.



import gc

import pandas as pd

with pd.HDFStore(file_path, mode='a') as store:
    # All you need is the chunksize,
    # not iterator=True
    iterator = store.select('df', chunksize=chunksize)

    for i, chunk in enumerate(iterator):

        # some code to apply to each chunk

        # magic line that solved my memory problem
        gc.collect()
