Where are the workers and parameter servers in distributed TensorFlow?
This post stated that:
Also, there is no built-in distinction between worker and ps devices - it's just a convention for placing variables on ps devices and assigning ops to worker devices.
This post stated that:
TL;DR: TensorFlow doesn't know anything about "parameter servers"; instead, it supports running graphs on multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op, work happens on the "/job:ps" devices that updates the shared variables.
Questions:
- Are the variables in ps stored on the CPU or the GPU? Is there any performance gain from placing "/job:ps" on a CPU rather than a GPU?
- Do the lower-level libraries decide where to place a variable or operation?
Are the variables in ps on the CPU or GPU? Also, is there any performance gain if "/job:ps" is on a CPU or GPU?
You can pin the ps job to either of them (with exceptions, see below), but pinning it to a GPU is not practical. ps is really a storage of parameters and of the ops that update them. A CPU can have much more memory (i.e., main RAM) than a GPU, and it is fast enough to update the parameters as the gradients come in. In most cases, the matrix multiplications, convolutions, and other expensive ops are performed by the workers, hence placing a worker on a GPU makes sense. Placing ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
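The convention described above is usually wired up with tf.train.replica_device_setter, which sends variables to the ps job and everything else to a worker. Below is a minimal sketch, assuming TensorFlow is installed (it uses the tf.compat.v1 graph API, which also ships with TF2); the cluster addresses are placeholders, and no server has to be running just to build the graph and inspect the placement:

```python
import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style graph API (also available in TF2)
tf1.disable_eager_execution()

# Hypothetical two-process cluster: one ps task and one worker task.
cluster = tf1.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# replica_device_setter routes variable ops to /job:ps (on its CPU) and
# all other ops to the worker, which may sit on a GPU.
with tf1.device(tf1.train.replica_device_setter(
        cluster=cluster,
        ps_device="/job:ps/task:0/cpu:0",
        worker_device="/job:worker/task:0/gpu:0")):
    weights = tf1.get_variable("weights", shape=[784, 10])  # -> ps, CPU
    logits = tf1.matmul(tf1.zeros([1, 784]), weights)       # -> worker, GPU

print(weights.device)  # contains "/job:ps/task:0"
print(logits.device)   # contains "/job:worker/task:0"
```

The device strings are assigned at graph-construction time, so you can check where each node landed before ever starting a session.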
But: TensorFlow does not currently have a GPU kernel for integer variables, so the following code fails when TensorFlow tries to place the variable i on GPU #0:
import tensorflow as tf

with tf.device("/gpu:0"):
    i = tf.Variable(3)

with tf.Session() as sess:
    sess.run(i.initializer)  # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
So in this case there is no choice of device for the parameter, and therefore none for the parameter server: CPU only.
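If you do not want a hard failure when a kernel is missing, the session can be told to fall back to the CPU instead. A minimal sketch, assuming TensorFlow is installed (written against the tf.compat.v1 API so it also runs under TF2):

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

with tf1.device("/gpu:0"):
    f = tf1.get_variable("f", initializer=3.0)  # float32: has a GPU kernel
    i = tf1.get_variable("i", initializer=3)    # int32: CPU kernel only

# allow_soft_placement tells TensorFlow to fall back to the CPU whenever
# the requested device has no kernel for an op, instead of raising the
# "Could not satisfy explicit device specification" error.
config = tf1.ConfigProto(allow_soft_placement=True)
with tf1.Session(config=config) as sess:
    sess.run(tf1.global_variables_initializer())
    result = sess.run([f, i])

print(result)
```

With soft placement enabled, the float variable stays on the GPU (when one exists) while the integer variable is quietly moved to the CPU.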
Do the lower-level libraries decide where to place a variable or operation?
If I understand the question correctly, the node placement rules are pretty simple:
- If a node was already placed on a device in a previous run of the graph, it stays on that device.
- Otherwise, if the user pinned the node to a device via tf.device, the placer puts it on that device.
- Otherwise, it defaults to GPU #0, or to the CPU if no GPU is available.
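The second and third rules can be seen directly on a graph node's device attribute. A small sketch, assuming TensorFlow is installed (tf.compat.v1 API):

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# Rule 2: the user pins a node with tf.device, so the placer must honor it.
with tf1.device("/cpu:0"):
    pinned = tf1.constant(1.0, name="pinned")

# Rule 3: no constraint. The placer will pick GPU #0 if one exists,
# otherwise the CPU; until the session invokes the placer, the node's
# device attribute is simply empty.
unpinned = tf1.constant(2.0, name="unpinned")

print(pinned.device)    # canonicalized to "/device:CPU:0"
print(unpinned.device)  # "" -- decided later by the placer
```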
The TensorFlow white paper also describes dynamic placement, which is more sophisticated, but it is not part of the open-source version of TensorFlow right now.