Is using a batch size that is a "power of 2" faster in TensorFlow?

I read somewhere that if you choose a batch size that is a power of 2, training will be faster. What is this rule? Does it apply to other applications as well? Can you point me to a reference?

+5




3 answers


Algorithmically speaking, using larger mini-batches lets you reduce the variance of your stochastic gradient updates (by averaging the gradients within the mini-batch), and this in turn allows you to take larger step sizes, which means the optimization algorithm makes progress faster.

However, the total amount of work performed (in terms of the number of gradient computations) to reach a given accuracy in the objective stays the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step sizes that are n times larger, meaning one step gets you roughly as close to the target accuracy as n SGD steps with a mini-batch size of 1.
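As a rough illustration of this variance argument (a generic NumPy sketch of my own, not part of the original answer), averaging n noisy per-example gradients shrinks the variance of the update by roughly a factor of n:

```python
import numpy as np

rng = np.random.default_rng(0)

true_grad = 1.0        # pretend the true gradient is a single scalar
noise_std = 2.0        # per-example gradient noise
n_examples = 100_000

# Per-example "stochastic gradients": the true gradient plus noise.
per_example = true_grad + rng.normal(0.0, noise_std, size=n_examples)

for batch_size in (1, 16, 256):
    usable = n_examples // batch_size * batch_size
    # Average the per-example gradients within mini-batches of this size.
    batch_grads = per_example[:usable].reshape(-1, batch_size).mean(axis=1)
    # The empirical variance shrinks roughly like noise_std**2 / batch_size.
    print(f"batch size {batch_size:4d}: update variance ~ {batch_grads.var():.4f}")
```

Note that nothing in this argument singles out powers of 2; it only concerns the size of the mini-batch.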



Regarding TensorFlow specifically, I found no evidence for the claim, and this question was closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132

Note that resizing images to power-of-two dimensions does make sense (since pooling is usually done over 2x2 windows), but that is a quite different issue.
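For illustration only, here is a minimal NumPy sketch (not from the answer) of why even, power-of-two image dimensions play nicely with 2x2 pooling:

```python
import numpy as np

def max_pool_2x2(image: np.ndarray) -> np.ndarray:
    """Naive 2x2 max pooling via reshape; requires even height and width."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

pooled = max_pool_2x2(np.arange(64.0).reshape(8, 8))
print(pooled.shape)  # (4, 4): a power-of-two image halves cleanly
# An odd size such as 9x9 would need cropping or padding first,
# which is why power-of-two image dimensions are convenient here.
```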

+3




I heard that too. Here is a white paper on CIFAR-10 training, where some Intel researchers state:

In general, the performance of processors is better if the batch size is a power of 2.



(See: https://software.intel.com/en-us/articles/cifar-10-classification-using-intel-optimization-for-tensorflow.)

However, it is not clear how big the advantage is, since the authors do not provide any data on training time :/

+1




The idea comes from aligning the computations (C) with the physical processors (PP) of the GPU.

Since the number of PP is often a power of 2, using a number of C that is not a power of 2 degrades performance.

You can think of the mapping of C onto PP as a pile of slices of size PP. Say you have 16 PP. You can map 16 C onto them: 1 C is mapped to 1 PP. You can map 32 C onto them: 2 slices of 16 C, with 1 PP responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. It is often called data parallelism: all the PP do the same thing at the same time, but on different data.
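To make the alignment argument concrete, here is a small illustrative sketch (my own, assuming a hypothetical GPU with 16 physical processors) that counts how many full slices of PP a given batch size occupies:

```python
import math

PHYSICAL_PROCESSORS = 16  # hypothetical number of PP on the GPU

def slices_needed(batch_size: int) -> int:
    """Number of full waves of PP needed to process the whole batch."""
    return math.ceil(batch_size / PHYSICAL_PROCESSORS)

for c in (16, 17, 32, 33, 64):
    print(f"batch size {c:3d} -> {slices_needed(c)} slice(s) of {PHYSICAL_PROCESSORS} PP")
# A batch of 17 already occupies 2 slices, i.e. the same hardware
# occupancy as a batch of 32, so the extra capacity is wasted.
```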

+1








