Does it matter where I submit Hadoop jobs from?

Does it have any measurable impact on resources whether I submit a bunch of Hadoop jobs from different client servers or all from one? I wouldn't have thought so, since all the work is done in the cluster. Is that correct?

+3




3 answers


It doesn't matter where you submit your jobs from. The client itself doesn't do much; it uses the RPC protocol to communicate with the cluster services and then just sits around until the job is done.
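To make the "client just sits around" point concrete, here is a minimal sketch of a standard MapReduce driver; the class name, argument handling, and reliance on the identity mapper/reducer defaults are illustrative assumptions, not taken from the original post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitFromAnywhere {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / mapred-site.xml / yarn-site.xml from the classpath,
        // so any machine with the client configuration can submit.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "submit-from-anywhere");
        job.setJarByClass(SubmitFromAnywhere.class);
        // Mapper/reducer settings omitted; the defaults give an identity pass-through.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job over RPC and then just polls the
        // cluster for progress; the actual work runs on the cluster, not here.
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}
```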



What matters more is which scheduler you use to allocate resources; that will probably make the most significant difference, since it decides which resources get allocated to which job. Read more about job scheduling here.

+3




The only thing that is resource-intensive for a client submitting a job to a Hadoop cluster is calculating the input splits. When the inputs are huge, or when too many jobs are submitted from the same client, the input-split computation may slow down job submission slightly.
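As an illustration of where that client-side cost sits, here is a sketch that asks an InputFormat for its splits directly, which is essentially the call the job client makes at submission time; the input path and class name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import java.util.List;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-count");
        FileInputFormat.addInputPath(job, new Path("/data/big-input")); // hypothetical HDFS path

        // Computing splits means listing the input files and their block
        // locations, i.e. a series of NameNode round-trips, all performed
        // on the submitting machine before the job is handed to the cluster.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Number of input splits: " + splits.size());
    }
}
```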



I can't remember the exact Hadoop version or setting name, but there is a configurable option to move the computation of input splits from the client submitting the job onto the Hadoop cluster itself.

+6




I don't think you can move the input-split computation to the JobTracker in the "classic" (MRv1) framework. In YARN, you can move it with

"yarn.app.mapreduce.am.compute-splits-into-cluster"

My guess is that the Hadoop developers didn't want to overload the JobTracker with creating input splits, similar to the design decision of not putting too much work on the NameNode in HDFS.

In YARN, each job gets its own ApplicationMaster, so there is no need to worry about overloading a SPOF/bottleneck master like the JobTracker.

In relation to the original question, the job client has to go to the NameNode to get the block locations (I can see parts of the code in the input-split related classes calling the NameNode for some metadata ... not sure whether this happens during input-split creation or on the TaskTracker node). This could become a problem if you are submitting many jobs from the same client node. If you are using YARN, there would be a slight performance gain if all of this communication happened inside the cluster.
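For illustration, here is a sketch of the kind of NameNode metadata lookup being described, using the public FileSystem API rather than the internal classes mentioned above; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big-input/part-00000")); // hypothetical file

        // Each call here is an RPC to the NameNode; many such calls from a
        // single client node add up when lots of jobs are submitted from it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println(b); // offset, length, and the hosts holding each block
        }
    }
}
```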

It would be worth checking how Oozie handles this.

Hope this helps! Arun

+1


