TensorFlow: input pipeline is very slow / does not scale

I'm trying to set up a TensorFlow input pipeline to feed images into AlexNet for feature extraction (no training is involved, which is part of the point). Since AlexNet is relatively small, the pipeline has to deliver input data at a high rate to achieve acceptable performance (~1000 images per second).

My images are 400x300 JPEGs, about 24 KB each on average.

Unfortunately, it seems that the TensorFlow input pipeline can't keep up with a GTX 1080 running AlexNet.

My input pipeline is simple: load the file, decode the JPEG, resize the image, and batch it.

I created a small test to show the problem:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import time
import glob
import os

IMAGE_DIR = 'images'
EPOCHS = 1


def main():
    print('batch_size\tnum_threads\tms/image')

    for batch_size in [16, 32, 64, 128]:
        for num_threads in [1, 2, 4, 8]:
            run(batch_size, num_threads)


def run(batch_size, num_threads):
    filenames = glob.glob(os.path.join(IMAGE_DIR, '*.jpg'))
    (filename,) = tf.train.slice_input_producer(
        [filenames],
        capacity=2 * batch_size * num_threads,
        num_epochs=EPOCHS)

    raw = tf.read_file(filename)
    decoded = tf.image.decode_jpeg(raw, channels=3)
    resized = tf.image.resize_images(decoded, [227, 227])

    batch = tf.train.batch(
        [resized],
        batch_size=batch_size,
        num_threads=num_threads,
        capacity=2 * batch_size * num_threads,
        enqueue_many=False)  # single examples are enqueued, not pre-built batches

    init_op = tf.group(
        tf.global_variables_initializer(),
        tf.local_variables_initializer())

    with tf.Session() as sess:
        sess.run(init_op)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        t = time.time()
        try:
            while not coord.should_stop():
                sess.run(batch)
        except tf.errors.OutOfRangeError:
            pass
        finally:
            coord.request_stop()

        tpe = (time.time() - t) / (len(filenames) * EPOCHS) * 1000
        print('{: <11}\t{: <10}\t{: <8}'
              .format(batch_size, num_threads, tpe))

        coord.join(threads)


if __name__ == "__main__":
    main()


Running this on a MacBook Pro (Early 2015, 2.9 GHz Intel Core i5) gives the following results:

batch_size      num_threads     ms/image
16              1               4.81571793556
16              2               3.00584602356
16              4               2.94281005859
16              8               2.94555711746
32              1               3.51123785973
32              2               1.82255005836
32              4               1.85884213448
32              8               1.88741898537
64              1               2.9537730217
64              2               1.58108997345
64              4               1.57125210762
64              8               1.57615303993
128             1               2.71797513962
128             2               1.67120599747
128             4               1.6521999836
128             8               1.6885869503


This shows poor overall performance, far from the target of ~1 ms per image. Moreover, it does not scale beyond two threads, which is expected in this case since the machine has only a dual-core processor.

Running the same benchmark on a 2.5 GHz AMD Opteron 6180 SE with 24 cores gives the following:

batch_size      num_threads     ms/image
16              1               13.983194828
16              2               6.80965399742
16              4               6.67097783089
16              8               6.63090395927
32              1               12.0395629406
32              2               5.72535085678
32              4               4.94155502319
32              8               4.99696803093
64              1               10.9073989391
64              2               4.96317911148
64              4               3.76832485199
64              8               3.82816386223
128             1               10.2617599964
128             2               5.20488095284
128             4               3.16122984886
128             8               3.51550602913


Single-threaded and overall performance are very poor here too, and it doesn't scale beyond 2-4 threads.

In neither case is the system IO- or CPU-bound. On both systems, loading and resizing the images with OpenCV gives much better numbers (~0.86 ms/image on the MacBook, where it is CPU bound, and down to ~0.22 ms/image on the server, where it is IO bound).

What is going on with TensorFlow here? How can I speed this up?

I already tried batching the images by hand and feeding the batches in with enqueue_many, which made things even worse. I also tried adding a short sleep before starting the loop to make sure the queues are filled, but no luck.

Any help is greatly appreciated.
