Why is Hadoop considered I/O intensive?

I have read some literature on Hadoop Map/Reduce, and there seems to be a recurring theme: Hadoop jobs are I/O intensive (example: sorting with Map/Reduce).

What makes these jobs I/O intensive, given that Hadoop pushes computation towards the data? For example, why is sorting in Hadoop I/O intensive?

My intuition: it seems that after the map phase, the intermediate pairs are sent to the reducers. Is this the reason for the huge I/O?





1 answer


Hadoop is used to perform computations over large amounts of data. Your jobs can be bound by IO (I/O intensive, as you call it), CPU, or network resources. In the classic Hadoop case you perform local computations over a huge amount of input and return a relatively small result set, which typically makes the job more I/O bound than CPU or network bound, but it depends heavily on the job itself. Here are some examples:

  • IO intensive job. You read a lot of data on the map side, but the output of your map tasks is not that large. Examples: counting the number of lines in input text, computing the sum over some column of an RCFile, getting the result of a Hive query over a single table with a group by on a column of relatively low cardinality. In such a job, most of the work is simply reading the data and doing some light processing on it.
  • CPU intensive job. When you need to perform some complex computation on the map or reduce side. For example, you are doing some kind of NLP (natural language processing) such as tokenization, part-of-speech tagging, stemming, and so on. In addition, if you store data in highly compressed formats, decompression can become the bottleneck of the process (here is an example from Facebook, where they were looking for a balance between CPU and IO).
  • Network intensive job. Usually, if you see high network utilization on a cluster, it means that someone has missed the point and implemented a job that transfers a lot of data over the network. In the wordcount example, imagine processing 1 PB of input with just a mapper and a reducer, without a combiner. The amount of data moved between the map and reduce tasks would then be even larger than the input dataset, and all of it would be sent over the network. It could also mean that you are not compressing intermediate data (mapred.compress.map.output and mapred.map.output.compression.codec) and the raw map output is sent over the network (see the sketch after this list).
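
To make the wordcount and compression points concrete, here is a minimal sketch against the Hadoop 2.x mapreduce API: a plain word count with a combiner to shrink the shuffle, and with intermediate map-output compression turned on via the two properties named above. The class names are mine, and Snappy is assumed to be available on the cluster (fall back to DefaultCodec otherwise).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountCompressedShuffle {

        // Map side: read a lot of text, emit (word, 1) pairs.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce (and combine) side: sum the counts per word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output before it crosses the network.
            // These are the property names mentioned above; newer Hadoop releases
            // alias them to mapreduce.map.output.compress(.codec).
            // Snappy is assumed to be installed; DefaultCodec works everywhere.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                    "org.apache.hadoop.io.compress.SnappyCodec");

            Job job = Job.getInstance(conf, "wordcount with compressed shuffle");
            job.setJarByClass(WordCountCompressedShuffle.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // combiner shrinks shuffle volume
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Without the combiner, every (word, 1) pair from 1 PB of input would travel over the network; with the combiner and compressed map output, the shuffled volume drops dramatically.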


You can refer to this guide for initial cluster setup. So why is sorting I/O intensive? First, you read the data from disk. Then, when sorting, the amount of data produced by the mappers is the same amount that was read, so it most likely will not fit into memory and has to be spilled to local disk. It is then transferred to the reducers and spilled to disk again. Then it is processed by the reducers and written to disk once more. The CPU needed for the sort itself is relatively small, especially if the sort key is a number that can easily be parsed from the input.
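
For illustration, here is a minimal "sort by key" sketch, again assuming the Hadoop 2.x mapreduce API: the job uses the framework's identity Mapper and Reducer, so essentially all of the work is reading the input, spilling sorted map output to local disk, shuffling it to the reducers, and writing the sorted result back to HDFS. The class name and the sort-buffer value are only illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SimpleSort {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Bigger in-memory sort buffer => fewer spills to local disk on the map side.
            // 100 MB is just an illustrative value, not a recommendation.
            conf.setInt("mapreduce.task.io.sort.mb", 100);

            Job job = Job.getInstance(conf, "sort by key");
            job.setJarByClass(SimpleSort.class);
            // Each input line is split into a (key, value) pair of Text objects.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            // Identity mapper and reducer: the only real "computation" is the
            // shuffle sort that the framework performs between the two phases.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Every record in this job is read from disk, spilled during the map-side sort, shuffled, merged on the reduce side, and written back out, which is exactly the behaviour described above.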









