TensorFlow GPU training slow: input pipeline can't keep up

I need help optimizing a custom TensorFlow model. I have a 40 GB ZLIB-compressed TFRecord file containing my training data. Each sample consists of two 384x512x3 images and a 384x512x2 vector field. I am loading my data like this:

    num_threads = 16
    # Read the ZLIB-compressed TFRecords with several parallel readers.
    reader_kwargs = {'options': tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB)}
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
                        dataset,
                        num_readers=num_threads,
                        reader_kwargs=reader_kwargs)
    image_a, image_b, flow = data_provider.get(['image_a', 'image_b', 'flow'])

    # Assemble individual samples into training batches.
    image_as, image_bs, flows = tf.train.batch(
        [image_a, image_b, flow],
        batch_size=dataset_config['BATCH_SIZE'],  # 8
        capacity=dataset_config['BATCH_SIZE'] * 10,
        num_threads=num_threads,
        allow_smaller_final_batch=False)

However, I am only getting 0.25 to 0.30 global steps per second, which is very slow (at batch size 8, that is barely 2 to 2.5 samples per second).

Here is my TensorBoard summary for the parallel reader queue; it sits at 99%-100% full.

[Screenshot: TensorBoard parallel reader summary]

I also plotted GPU utilization over time (percent used per second). It looks like the GPU is starving for data, but I'm not sure how to fix this. I've tried increasing and decreasing the number of threads, but it doesn't seem to make any difference. I am training on an NVIDIA K80 GPU on a machine with 4 vCPUs and 61 GB of RAM.

[Screenshot: GPU usage over time]
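
To confirm that the input pipeline, rather than the model, is the limit, one quick check is to time the batched input ops on their own, with no training step in the loop. A minimal sketch of that check, reusing the `image_as`, `image_bs`, and `flows` tensors defined above:

    import time
    import tensorflow as tf

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        sess.run([image_as, image_bs, flows])  # warm up the queues
        num_batches = 50
        start = time.time()
        for _ in range(num_batches):
            sess.run([image_as, image_bs, flows])
        elapsed = time.time() - start
        print('input pipeline alone: %.2f batches/sec'
              % (num_batches / elapsed))
        coord.request_stop()
        coord.join(threads)

If this number is close to the 0.25-0.30 steps per second seen in training, the GPU really is waiting on input.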

How can I make this train faster?



1 answer


If your examples are small, using the DatasetDataProvider will not give satisfying results: it only reads one example at a time, which can be the bottleneck. I have already filed a feature request on GitHub.

In the meantime, you will have to roll your own input queue that uses read_up_to:



    import tensorflow as tf

    batch_size = 10000
    num_tfrecords_at_once = 1024
    num_threads = 16

    # tfrecord_filenames: list of your .tfrecords paths.
    filename_queue = tf.train.string_input_producer(tfrecord_filenames)
    # Match the ZLIB compression the records were written with.
    options = tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB)
    reader = tf.TFRecordReader(options=options)

    # This is where the magic happens: read up to 1024 serialized
    # records in a single op instead of one at a time.
    _, records = reader.read_up_to(filename_queue, num_tfrecords_at_once)

    # Batch the serialized records with enqueue_many=True, since
    # `records` is already a 1-D tensor of examples.
    batch_serialized_example = tf.train.shuffle_batch(
        [records],
        num_threads=num_threads,
        batch_size=batch_size,
        capacity=10 * batch_size,
        min_after_dequeue=2 * batch_size,
        enqueue_many=True)

    # Parse the whole batch in one op instead of example by example.
    parsed = tf.parse_example(
        batch_serialized_example,
        features=whatever_features_you_have)
    # Use parsed['feature_name'] etc. below
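
From there, the serialized batch can be decoded back into dense tensors. A hypothetical sketch that continues from the block above, assuming each record stores the two images and the flow field as raw byte strings under the feature names used in the question ('image_a', 'image_b', 'flow'); the actual names and dtypes must match however the TFRecords were written:

    # Hypothetical feature spec -- adjust to the real serialization.
    features = {
        'image_a': tf.FixedLenFeature([], tf.string),
        'image_b': tf.FixedLenFeature([], tf.string),
        'flow': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_example(batch_serialized_example, features=features)

    # Decode the raw bytes and restore the shapes from the question.
    image_a = tf.reshape(
        tf.decode_raw(parsed['image_a'], tf.uint8), [-1, 384, 512, 3])
    image_b = tf.reshape(
        tf.decode_raw(parsed['image_b'], tf.uint8), [-1, 384, 512, 3])
    flow = tf.reshape(
        tf.decode_raw(parsed['flow'], tf.float32), [-1, 384, 512, 2])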
