Is using a batch size that is a power of 2 faster in TensorFlow?
Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by averaging the gradients within the mini-batch), and this in turn allows you to take larger step sizes, which means the optimization algorithm makes faster progress.
However, the amount of work performed (in terms of the number of gradient computations) to reach a given accuracy in the objective will be the same: with a mini-batch of size n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step sizes that are n times larger, meaning one such step brings you roughly as close to the optimum as n SGD steps with mini-batch size 1.
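The variance-reduction claim above is easy to check numerically. Here is a minimal sketch (my own illustration, not from the original post) that simulates noisy per-example gradients and shows that averaging n of them cuts the variance by roughly a factor of n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy per-example gradients: a true gradient of 1.0
# plus unit-variance noise (illustrative numbers only).
true_grad = 1.0
samples = true_grad + rng.standard_normal(100_000)

# Variance of single-example SGD update directions.
var_single = samples.var()

# Variance of mini-batch update directions with n = 16:
# average disjoint groups of 16 per-example gradients.
n = 16
batched = samples.reshape(-1, n).mean(axis=1)
var_batch = batched.var()

# The ratio is close to n, matching the factor-of-n claim.
print(var_single / var_batch)
```

This is exactly why larger step sizes become safe as the batch grows: the update direction fluctuates n times less around the true gradient.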
Regarding TensorFlow, I found no evidence for this claim; see this question, which was closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132
Note that resizing an image to dimensions that are powers of 2 makes sense (since pooling is usually done on 2×2 windows), but that is a quite different matter.
The idea comes from aligning the computations (C) onto the physical processors (PP) of the GPU.

Since the number of PP is often a power of 2, using a number of C different from a power of 2 leads to poor performance. You can see the mapping of C onto PP as a pile of slices of size equal to the number of PP. Say you have 16 PP. You can map 16 C onto them: 1 C maps onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, with 1 PP responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time, but on different data.
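A small sketch of the slicing argument (my own illustration; the 16-PP count is a hypothetical figure, as in the answer above). Each "wave" runs one slice of computations across all processors, so a count of C that does not fill its last wave leaves processors idle:

```python
import math

def simd_waves(c, pp=16):
    """Number of passes ('waves') needed to run c computations on pp processors."""
    return math.ceil(c / pp)

def utilization(c, pp=16):
    """Fraction of processor slots doing useful work across all waves."""
    return c / (simd_waves(c, pp) * pp)

# With 16 PP, powers of 2 fill every wave completely:
print(utilization(16))  # 1.0 (one full wave)
print(utilization(32))  # 1.0 (two full waves of 16)

# 17 computations need a second wave for a single leftover item,
# so nearly half the processor slots are wasted.
print(utilization(17))
```

This is why the answer expects counts that are not powers of 2 (more precisely, not multiples of the processor count) to degrade performance: the last, partially filled slice still costs a full wave.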