H2O: Iterating over data larger than memory without loading all of it into memory
Is there a way to use H2O to iterate over data that exceeds the cluster's total memory? I have a large data set that I need to process in batches and feed to TensorFlow for gradient descent. At any given moment I only need one batch (or a few) in memory. Can H2O be configured to do this kind of iteration without loading the entire dataset into memory?
Here is a related question that was answered over a year ago but does not solve my problem: Loading data is larger than memory size in h2o
The short answer is that this is not what H2O was designed to do. So, unfortunately, the answer today is no.
The longer answer ... (assuming the question is about model training in H2O-3.x ...)
I can think of at least two ways one might want to use H2O like this: single-pass streaming and swapping.
Think of single-pass streaming as a continuous stream of data coming in, with the data constantly being acted on and then discarded (or passed along).
Think of swapping as the equivalent of operating-system swap, where there is fast storage (memory) and slow storage (disk), and the algorithms continuously sweep over the data, faulting (swapping) it in from disk to memory.
Swapping only gets worse in terms of performance as the data grows. H2O is never tested this way, and you are on your own. Maybe you can figure out how to enable an unsupported swap mode from the hints in the other Stack Overflow question mentioned above (or in the source code), but nobody ever runs that path. H2O was designed to be fast for machine learning by keeping data in memory. Machine learning algorithms sweep over the data again and again. If every touch of the data hits disk, that is simply not the experience the H2O-3 in-memory platform was designed for.
The streaming use case, especially for some algorithms like Deep Learning and DRF, definitely makes more sense for H2O. H2O algorithms support checkpoints, so you can imagine a scenario where you read some data, train a model, then clear that data and read new data, and continue training from the checkpoint. With Deep Learning, you would be updating the neural network weights with the new data. With DRF, you would be adding new trees based on the new data.
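A minimal sketch of that checkpoint loop with the H2O Python client, assuming the data is already split into files that each fit in memory. The function names, the `hidden` layer sizes, and the `epochs_per_batch` value are illustrative, not something H2O prescribes. The sketch also assumes every batch has the same columns (and the same categorical levels), since a checkpoint restart expects a compatible model, and it relies on `epochs` being a running total when restarting from a checkpoint, so each restart asks for more epochs than the previous model has already trained.

```python
from typing import Iterable, List


def cumulative_targets(n_batches: int, per_batch: int) -> List[int]:
    """Total epochs the model should have reached after each batch.

    A checkpoint restart continues an existing model, so the value passed
    on each restart is the running total, not the per-batch increment.
    """
    return [per_batch * (i + 1) for i in range(n_batches)]


def train_in_batches(batch_paths: Iterable[str], predictors: List[str],
                     response: str, epochs_per_batch: int = 5):
    """Train one Deep Learning model over several files, one file at a time.

    Hypothetical helper: only one batch frame is held in the cluster at once.
    """
    # Imported here so the scheduling helper above works without a cluster.
    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()
    paths = list(batch_paths)
    targets = cumulative_targets(len(paths), epochs_per_batch)

    model = None
    for path, total_epochs in zip(paths, targets):
        frame = h2o.import_file(path)  # load just this batch
        params = dict(hidden=[64, 64], epochs=total_epochs)
        if model is not None:
            # Continue from the previous model instead of starting over.
            params["checkpoint"] = model.model_id
        model = H2ODeepLearningEstimator(**params)
        model.train(x=predictors, y=response, training_frame=frame)
        h2o.remove(frame)  # free the batch before loading the next one
    return model
```

The same pattern should carry over to DRF by swapping in `H2ORandomForestEstimator` and raising `ntrees` on each restart instead of `epochs`. Note that this is training inside H2O, not feeding batches to TensorFlow, and the result is not guaranteed to match a model trained on all the data at once.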