Why does spilling happen in Hadoop?

I am very new to Hadoop and still in the learning phase.

One thing I noticed in the Shuffle and Sort phase is that a spill is triggered whenever the MapOutputBuffer reaches 80% capacity (I think this threshold can be customized as well; see the sketch below).
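For reference, I believe the relevant knobs are mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent (please correct me if those property names are wrong). A minimal driver sketch of how I would set them, assuming the standard Hadoop 2.x Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Size of the in-memory map output buffer, in MB (default 100).
            conf.setInt("mapreduce.task.io.sort.mb", 256);
            // Fraction of the buffer that triggers a background spill (default 0.80).
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
            Job job = Job.getInstance(conf, "spill-config-example");
            // ... set mapper, reducer, input and output paths as usual,
            // then submit with job.waitForCompletion(true)
        }
    }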

Now why is the spill phase needed?

Is it because the MapOutputBuffer is a circular buffer, and if we don't empty it, it could cause data overwrites and memory leaks?

1 answer


I wrote a nice article that covers this topic: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/

Generally:



  • A spill occurs when there is not enough memory in the buffer to fit all of the map output. The amount of memory available for it is set by mapreduce.task.io.sort.mb.

  • Spilling starts when 80% of the buffer space is occupied, because the spill is performed on a separate thread so as not to interfere with the mapper. If the buffer reaches 100% utilization, the mapper thread has to stop and wait for the spill thread to free up space. The 80% threshold is chosen to avoid this.
  • The spill occurs at least once, when the mapper finishes, because the mapper's output must be sorted and saved to disk for the reducer processes to read it. It makes no sense to invent a separate mechanism for the final "save to disk", because in general it performs the same task. (A small simulation of the mechanism is sketched after this list.)
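To make the mechanism concrete, here is a minimal, self-contained simulation in plain Java. The numbers are hypothetical and this is not actual Hadoop internals, just an illustration of the 80% trigger and the final spill at the end of the map task:

    public class SpillSimulation {
        public static void main(String[] args) {
            // Hypothetical numbers: a 100-unit buffer, spill threshold at 80%,
            // and a mapper that emits one unit per record.
            final int bufferCapacity = 100;
            final int spillThreshold = (int) (bufferCapacity * 0.80);

            int occupied = 0;
            int spills = 0;

            for (int record = 1; record <= 500; record++) {
                occupied++; // mapper writes one record into the buffer

                if (occupied >= spillThreshold) {
                    // In real Hadoop the spill runs on a background thread while the
                    // mapper keeps writing into the remaining 20%; if the buffer fills
                    // to 100% before the spill finishes, the mapper has to block.
                    // Here we simply model the spill as an instantaneous drain.
                    spills++;
                    occupied = 0;
                }
            }

            // One final spill flushes whatever is left when the mapper finishes,
            // because the sorted output must be on disk for the reducers to fetch.
            if (occupied > 0) {
                spills++;
            }
            System.out.println("Total spills for 500 records: " + spills);
        }
    }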