Where are the workers and parameter servers in distributed TensorFlow?

This post stated that:

Also, there is no built-in distinction between worker and ps devices - it's just a convention that variables get assigned to ps devices and ops get assigned to worker devices.

Another post stated that:

TL;DR: TensorFlow doesn't know anything about "parameter servers"; instead, it supports running graphs on multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run train_op they cause work to happen on the "/job:ps" devices, which updates the shared variables.
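
For concreteness, here is a minimal sketch of the convention those quotes describe, using the classic tf.train.replica_device_setter pattern (the cluster addresses, job sizes and tensor shapes below are made up for illustration):

import tensorflow as tf

# Hypothetical two-process cluster: one parameter server, one worker.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Each process starts a server for its own job; this is the worker process.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter implements the convention from the quotes:
# variables go to /job:ps, everything else stays on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(tf.zeros([10, 1]))           # placed on /job:ps/task:0
    x = tf.placeholder(tf.float32, [None, 10])   # placed on the worker
    y = tf.matmul(x, w)                          # placed on the worker

# A real worker would then run training steps via tf.Session(server.target).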

Questions:

  • Are the variables in ps placed on the CPU or the GPU? Also, is there any performance gain if "/job:ps" is on a CPU or a GPU?
  • Do the lower-level libraries decide where to place a variable or operation?


1 answer


Are the variables in ps on the CPU or GPU? Also, is there any performance gain if "/job:ps" is on a CPU or GPU?

You can pin the ps job to either of them (with exceptions, see below), but pinning it to a GPU is not practical. ps is really a storage of parameters and of the ops that update them. A CPU device can have much more memory (i.e. main RAM) than a GPU, and it is fast enough to update the parameters as the gradients come in. In most cases, matrix multiplications, convolutions and other expensive ops are performed by the workers, hence placing a worker on a GPU makes sense. Placing ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
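
As a sketch of that placement (the task indices and tensor shapes here are arbitrary, and it assumes the worker machine actually has a GPU), you can pin the parameters to the ps job's CPU and the expensive math to a worker GPU explicitly:

import tensorflow as tf

# Parameters (and their update ops) live on the parameter server's CPU ...
with tf.device("/job:ps/task:0/cpu:0"):
    weights = tf.Variable(tf.random_normal([1024, 1024]))

# ... while the expensive ops run on a worker GPU.
with tf.device("/job:worker/task:0/gpu:0"):
    activations = tf.matmul(tf.random_normal([128, 1024]), weights)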

But: TensorFlow does not currently have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:

import tensorflow as tf

with tf.device("/gpu:0"):
  i = tf.Variable(3)          # int32 variable: no GPU kernel for it

with tf.Session() as sess:
  sess.run(i.initializer)     # Fails!


with the following message:

Could not satisfy explicit device specification '/device:GPU:0' 
because no supported kernel for GPU devices is available.


This is a case where there is no choice of device for a parameter, and therefore no choice for the parameter server: CPU only.
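
As a small illustration of the fallback (same toy snippet as above; allow_soft_placement is a standard tf.ConfigProto option): letting TensorFlow soften the explicit device request makes the variable land on the CPU and the snippet run.

import tensorflow as tf

with tf.device("/gpu:0"):
    i = tf.Variable(3)   # int32 variable: no GPU kernel

# allow_soft_placement lets TensorFlow fall back to the CPU
# instead of failing on the explicit /gpu:0 request.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(i.initializer)   # succeeds; the variable ends up on the CPU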

Do the lower-level libraries decide where to place a variable or operation?

If I'm reading the question right, the node placement rules are fairly simple, roughly:

  • if the user pinned a node to a device with tf.device, the placer puts it there;
  • nodes that must be colocated (e.g. a variable and its initializer) are placed together;
  • otherwise a node defaults to GPU #0 if it has a GPU kernel and a GPU is available, and to the CPU otherwise.

The TensorFlow white paper also describes a dynamic placer, which is more sophisticated, but it is not part of the open-source version of TensorFlow right now.
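
To see what the placer actually decided for each node, you can ask the session to log device placement (a standard tf.ConfigProto option):

import tensorflow as tf

v = tf.Variable(tf.zeros([4]))   # no explicit device: the placer decides
s = tf.reduce_sum(v)

# log_device_placement prints the chosen device for every op when the
# session starts running.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(v.initializer)
    print(sess.run(s))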
