Hadoop: how are reducer nodes selected?
I just started learning Hadoop and don't understand how a datanode becomes a reducer node. My understanding so far:
- Once a map task finishes, the contents of its sort buffer are sorted, partitioned by reducer, and spilled to local disk.
- The jobtracker is then notified about the spilled partitions.
- After that, the reducers begin requesting the data for their specific partition.
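The partitioning step in the list above can be sketched as follows. The modulo-hash formula mirrors Hadoop's default HashPartitioner; the class and key names here are purely illustrative:

```java
// Sketch of the map-side partitioning step: every key/value pair is
// assigned to one of numReduceTasks partitions. Each reducer later
// fetches exactly its own partition from every map output.
public class PartitionSketch {

    // Mask the sign bit so the result is always in [0, numReduceTasks)
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> partition "
                + getPartition(key, numReduceTasks));
        }
    }
}
```

Note that this decides which *partition* a key lands in, not which physical node runs the reducer for that partition, which is what the question below is about.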
But how does the jobtracker decide which node becomes a reducer node? I am reading Hadoop: The Definitive Guide, but this step is not covered in the book.
Thanks, Bruckwald
To a great extent it is first-come, first-served. Tasks are assigned via heartbeats: when a tasktracker pings the jobtracker to report that it is alive, the response may contain a new task to run:
List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
if (tasks == null) {
  tasks = taskScheduler.assignTasks(taskTrackerStatus);
}
if (tasks != null) {
  for (Task task : tasks) {
    expireLaunchingTasks.addNewTask(task.getTaskID());
    LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
    actions.add(new LaunchTaskAction(task));
  }
}
Here is the corresponding Jobtracker source code. So besides the tasktracker simply being first in line, the task scheduler also checks resource conditions (for example, whether there is a free slot, or whether a node is overloaded).
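To make the flow concrete, here is a minimal, hypothetical sketch of first-come, first-served assignment over heartbeats (this is not the Hadoop API; the class and method names are made up for illustration): the first tracker to heartbeat in with a free slot receives the pending reduce task.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical sketch of heartbeat-driven task assignment (not Hadoop API):
// trackers heartbeat in; the first one with a free reduce slot gets the task.
public class HeartbeatSketch {

    static class TrackerStatus {
        final String name;
        final int freeReduceSlots;

        TrackerStatus(String name, int freeReduceSlots) {
            this.name = name;
            this.freeReduceSlots = freeReduceSlots;
        }
    }

    // Response to one heartbeat: hand out at most one pending reduce task,
    // and only if the tracker has a free slot.
    static Optional<String> heartbeat(TrackerStatus status,
                                      Deque<String> pendingReduces) {
        if (status.freeReduceSlots > 0 && !pendingReduces.isEmpty()) {
            return Optional.of(pendingReduces.poll());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Deque<String> pending = new ArrayDeque<>();
        pending.add("attempt_r_000000");
        // A busy tracker heartbeats first: no free slot, so no task.
        System.out.println(heartbeat(new TrackerStatus("tt1", 0), pending));
        // The next tracker with a free slot becomes the reducer node.
        System.out.println(heartbeat(new TrackerStatus("tt2", 2), pending));
    }
}
```

The real scheduler does more (capacity, padding, job priority), as the code below shows.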
The relevant code can be found here (which is not particularly interesting):
//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity =
    Math.min((int) Math.ceil(reduceLoadFactor * trackerReduceCapacity),
             trackerReduceCapacity);
final int availableReduceSlots =
    Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
  exceededReducePadding = exceededPadding(false, clusterStatus,
                                          trackerReduceCapacity);
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() != JobStatus.RUNNING ||
          job.numReduceTasks == 0) {
        continue;
      }
      Task t = job.obtainNewReduceTask(taskTracker, numTaskTrackers,
          taskTrackerManager.getNumberOfUniqueHosts());
      if (t != null) {
        assignedTasks.add(t);
        break;
      }
      // Don't assign reduce tasks to the hilt!
      // Leave some free slots in the cluster for future task-failures,
      // speculative tasks etc. beyond the highest priority job
      if (exceededReducePadding) {
        break;
      }
    }
  }
}
Basically, the first tasktracker whose heartbeat reaches the jobtracker and that has enough free slots will get the reduce task.