What are the possible causes of imbalance of files stored on HDFS?

Sometimes blocks of data end up stored unevenly across the DataNodes. Under the default HDFS block placement policy, the first replica is stored on the writer node (i.e. the client node, if it runs a DataNode), the second replica on a node in a remote rack, and the third on a different node in that same remote rack. What can cause data blocks to become unbalanced across the DataNodes under this placement policy? One possible reason is that if only a few nodes do the writing, those nodes end up holding one replica of every block they write. Are there any other reasons?
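
As a minimal illustration (not from the original question), a Java sketch along the following lines uses the standard FileSystem API to print which hosts and racks hold each block of a given file, which makes the placement described above visible; the file path is passed as an argument and the class name is arbitrary.

    // Prints the DataNode hosts and rack paths holding each block of a file.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockPlacement {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path(args[0]);           // e.g. /user/alice/data.csv (placeholder)
                FileStatus status = fs.getFileStatus(file);
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (int i = 0; i < blocks.length; i++) {
                    System.out.println("block " + i
                        + "  hosts=" + String.join(",", blocks[i].getHosts())
                        + "  racks=" + String.join(",", blocks[i].getTopologyPaths()));
                }
            }
        }
    }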



1 answer


Some potential causes of data imbalance include (a sketch for measuring the resulting skew follows the list):

  • If some DataNodes are unavailable for a while (not accepting requests/writes), the cluster can end up unbalanced.
  • TaskTrackers are not collocated with DataNodes evenly across the cluster's nodes. If data is written through MapReduce in that situation, the cluster can become unbalanced, because the nodes hosting both a TaskTracker and a DataNode are preferred for the first replica.
  • The same applies to HBase RegionServers.
  • Massive deletion of data can leave the cluster unbalanced, depending on where the deleted blocks were located.
  • Adding new DataNodes does not automatically rebalance existing blocks across the cluster.
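
As a rough sketch of how to spot the resulting skew (assuming the HDFS client libraries are on the classpath; the class and variable names are mine, not from the answer), the following Java program prints each live DataNode's DFS-used percentage and its deviation from the average per-node utilization:

    // Reports per-DataNode utilization so uneven block placement stands out.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DataNodeSkewReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                if (!(fs instanceof DistributedFileSystem)) {
                    throw new IllegalStateException("default filesystem is not HDFS");
                }
                DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats(); // live DataNodes
                if (nodes.length == 0) {
                    return;
                }
                double[] usedPct = new double[nodes.length];
                double sum = 0;
                for (int i = 0; i < nodes.length; i++) {
                    long cap = nodes[i].getCapacity();
                    usedPct[i] = cap == 0 ? 0 : 100.0 * nodes[i].getDfsUsed() / cap;
                    sum += usedPct[i];
                }
                double avg = sum / nodes.length;   // average of the per-node utilization percentages
                for (int i = 0; i < nodes.length; i++) {
                    System.out.printf("%-30s used=%5.1f%%  deviation=%+6.1f%%%n",
                        nodes[i].getHostName(), usedPct[i], usedPct[i] - avg);
                }
            }
        }
    }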


The "hdfs balancer" command allows administrators to balance the cluster. Additionally, https://issues.apache.org/jira/browse/HDFS-1804 has added a new block retention policy that takes into account the free space remaining on the volume.
