Memory management in the TensorFlow Dataset API

I have a training dataset that is too large to fit into memory, so my code only reads 1000 records from disk at a time. Now I would like to use TensorFlow's new Dataset API. Does the Dataset API let me specify the number of records to keep in memory, or does TensorFlow automatically manage memory so that I don't need to?

+3




3 answers


Yes. Example from the official guide (Using the Dataset API for TensorFlow Input Pipelines, https://www.tensorflow.org/programmers_guide/datasets):



filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(...) ## Parsing data with a user specified function
dataset = dataset.shuffle(buffer_size=10000) ## 10000: size of sample/record pool for random selection
dataset = dataset.batch(32) ## 32: number of samples/records per batch (to be read into memory)
dataset = dataset.repeat() ## None: keep repeating


+2




You specify the number of records with batch_size: TF only pulls batch_size items from the file at a time. You can also call shuffle, which keeps at most buffer_size elements in memory at any given time.
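
As a toy illustration of that memory behaviour (a sketch using tf.data.Dataset.range rather than real record files): only buffer_size elements ever sit in the shuffle buffer, and only batch_size of them are produced per run.

import tensorflow as tf

## 1,000,000 toy "records"; they are never materialized all at once
dataset = tf.data.Dataset.range(1000000)
dataset = dataset.shuffle(buffer_size=1000)  ## at most ~1000 elements buffered in memory
dataset = dataset.batch(100)                 ## each run yields 100 of them

next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(next_batch))  ## one shuffled batch of 100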



I checked this with TFRecord files. I have 100 TFRecord files, each of which is ~10 GB (which is more than the memory on my laptop), and everything works fine.

+1




I guess dataset = dataset.prefetch(buffer_size) will do this? If buffer_size is set large enough, are all the TFRecords kept in memory? See also: Buffer_size value in Dataset.map, Dataset.prefetch and Dataset.shuffle.
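
For what it's worth, a minimal sketch of where prefetch fits (assuming it is applied after batch, so each buffered element is a whole batch): with a small buffer_size it only keeps a few extra batches around, while a very large buffer_size would indeed pull correspondingly more data into RAM.

import tensorflow as tf

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.batch(32)
## prefetch overlaps input reading with training; buffer_size=1 here means only
## one extra batch (32 records) is buffered ahead of time, so memory use stays bounded
dataset = dataset.prefetch(buffer_size=1)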

0








