Storing a file on Hadoop when not all of its replicas can be stored in the cluster

Can someone tell me what happens if my Hadoop cluster (replication factor = 3) is left with only 15 GB of space and I try to save a 6 GB file?

hdfs dfs -put 6gbfile.txt /some/path/on/hadoop

Would the put operation fail (perhaps with a cluster-full error), or would it keep two copies of the 6 GB file, mark the blocks it cannot store as under-replicated, and thereby take up the entire 15 GB of remaining space?
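For reference, one way to check how much space the cluster actually has left before attempting the upload (assuming a standard HDFS installation with the usual shell tools) is:

hdfs dfs -df -h /        # overall filesystem capacity, used and remaining space
hdfs dfsadmin -report    # per-DataNode capacity and usage report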



2 answers


You should be able to store the file.

HDFS will try to place as many replicas as it can. When it cannot place all of them, it logs a warning but does not abort the write. As a result, you end up with under-replicated blocks.
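If you want to confirm this after the put finishes, a quick way (using the path from the question; the exact summary format may differ between Hadoop versions) is to run fsck, which reports under-replicated blocks in its summary:

hdfs fsck /some/path/on/hadoop -files -blocks -locations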



The warning you will see in the NameNode log:

WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas



When you run the put command:

The dfs utility acts as an HDFS client here.

The client first contacts the NameNode. The NameNode tells the client where to write the blocks and stores the metadata for this file. It is then the client's responsibility to split the data into blocks according to the specified configuration.
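As a side note, "the specified configuration" here is mostly the block size, which the client can override per command. A sketch of an explicit put (the 128 MB value is just an example, and also the default in recent Hadoop versions):

hdfs dfs -D dfs.blocksize=134217728 -put 6gbfile.txt /some/path/on/hadoop   # split into 128 MB blocks on the client side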



The client then opens direct connections to the different DataNodes where it has to write the blocks, as directed by the NameNode's response.

Only the first copy of the data is written by the client to a DataNode; the subsequent replicas are created by the DataNodes passing the data on to one another, following instructions from the NameNode.

So you should be able to fit a 6 GB file into 15 GB, because the initial copies are written first; the problem only arises later, once the replication process starts creating the additional copies.
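If the cluster is going to stay this small, one way to avoid ending up with under-replicated blocks in the first place (a sketch, not part of the answer above; the target path is the one from the question) is to request fewer replicas, either at write time or afterwards:

hdfs dfs -D dfs.replication=2 -put 6gbfile.txt /some/path/on/hadoop   # ask for only 2 replicas for this file
hdfs dfs -setrep -w 2 /some/path/on/hadoop                            # or lower the replication factor afterwards

With two replicas, the 6 GB file needs 12 GB, which fits in the 15 GB that is free.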







