HDFS behavior with a large number of small files and a 128 MB block size
I have many (up to hundreds of thousands) small files, each 10-100 KB. I have an HDFS block size of 128 MB. I have a replication factor of 1.
Are there any disadvantages to allocating an HDFS block for a small file?
I've seen some rather conflicting answers:
- One answer says that even the smallest file takes up a whole block
- Another answer says that HDFS is smart enough, and a small file takes up small_file_size + 300 bytes of metadata
I did a test like in this answer and it proves that the second option is correct - HDFS doesn't allocate the entire block for small files.
But what about batch reading 10,000 small files from HDFS? Will it slow down due to 10,000 blocks and metadata? Is there a reason to keep multiple small files in one block?
Update: my use case
I have only one use case, involving between 1,000 and 500,000 small files. I compute these files once, save them, and then read them all at once.
1) As far as I understand, the NameNode namespace problem is not a problem for me. 500,000 is the absolute maximum; I will never have more. If each small file takes 150 bytes in the NN, then the absolute maximum for me is 71.52 MB, which is acceptable.
2) Does Apache Spark fix the MapReduce problem? Will sequence files or HAR files help me solve it? As I understand it, Spark should not depend on Hadoop MR, but it is still too slow. 490 files take 38 seconds to read, 3420 files take 266 seconds.
sparkSession
.read()
.parquet(pathsToSmallFilesCollection)
.as(Encoders.kryo(SmallFileWrapper.class))
.coalesce(numPartitions);
As you have already noticed, an HDFS file does not take up more space than it needs, but there are other drawbacks to having small files in an HDFS cluster. Let's first go through the problems without taking batching into account:
- NameNode (NN) memory consumption. I don't know about Hadoop 3 (which is currently under development), but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it won't replace or enhance the primary NN). The NN is responsible for keeping the file system structure in memory and on disk, and it has limited resources. Each file system object maintained by the NN is estimated to take about 150 bytes (see this blog post). More files = more RAM consumed by the NN.
- MapReduce paradigm (and as far as I know, Spark suffers from the same symptoms). In Hadoop, Mappers are allocated per split (by default, one split per block), which means that for every small file you have, a new Mapper has to be started to process its contents. The problem is that for small files it actually takes Hadoop much longer to start the Mapper than to process the file's contents. Basically, your system will be doing the unnecessary work of starting and stopping Mappers instead of actually processing the data. This is why Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 files of 1 MB each (with the same block size).
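The two costs above can be put into rough numbers using the figures from the question. The sketch below is only back-of-the-envelope arithmetic: the class and method names are hypothetical, and the 150 bytes per object is the rule of thumb mentioned above, not a measured value. The `objectsPerFile` parameter is an assumption added here because the rule of thumb is usually applied to files and blocks separately, so a single-block file may count as two objects, which would be consistent with the "300 bytes of metadata" figure in the question.

```java
class SmallFileCosts {
    // Rough NameNode heap estimate: every namespace object (file, block) is
    // assumed to cost a fixed number of bytes (150 is only a rule of thumb).
    static long nameNodeBytes(long files, int objectsPerFile, long bytesPerObject) {
        return files * objectsPerFile * bytesPerObject;
    }

    // Average wall-clock cost per file for a measured batch read.
    static double millisPerFile(double totalSeconds, long files) {
        return totalSeconds * 1000.0 / files;
    }

    public static void main(String[] args) {
        double mib = 1024.0 * 1024.0;
        // 500,000 files is the maximum stated in the question.
        System.out.printf("file objects only:    %.1f MiB%n",
                nameNodeBytes(500_000L, 1, 150L) / mib);
        System.out.printf("file + block objects: %.1f MiB%n",
                nameNodeBytes(500_000L, 2, 150L) / mib);
        // Timings quoted in the question.
        System.out.printf("490 files:  %.1f ms/file%n", millisPerFile(38.0, 490));
        System.out.printf("3420 files: %.1f ms/file%n", millisPerFile(266.0, 3420));
    }
}
```

With one object per file the estimate is about 71.5 MiB, and with two it is about 143 MiB, so the NN side stays modest either way. The two timings both work out to roughly 78 ms per file, which suggests that the read time is dominated by a fixed per-file cost rather than by the amount of data.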
Now, if we are talking about batching, there are several options available: HAR, Sequence File, Avro schemas, etc. The precise answers to your questions depend on the use case. Let's say you don't want to merge files; in that case you can use HAR files (or any other solution with efficient archiving and indexing). The NN problem is then solved, but the number of Mappers will still equal the number of splits. If merging files into larger ones is an option, you can use Sequence File, which basically combines small files into larger ones, solving both problems to some extent. In both scenarios, however, you cannot update or delete information directly as you could with individual small files, so more sophisticated mechanisms are required to manage these structures.
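To make the "combine small files into a larger one" idea concrete, here is a minimal sketch that uses only the Java standard library. It is not the SequenceFile format and does not use the Hadoop API. It only illustrates the same key/value layout (file name as key, file contents as value) that one would typically write with Hadoop's `SequenceFile.Writer` using `Text` keys and `BytesWritable` values. The class name `SmallFilePacker` and the record layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class SmallFilePacker {
    // Write every (name, contents) pair into one container as:
    // UTF-encoded name, 4-byte length, raw bytes.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read the records back in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                files.put(name, contents);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("part-0001.bin", "first small file".getBytes(StandardCharsets.UTF_8));
        files.put("part-0002.bin", "second small file".getBytes(StandardCharsets.UTF_8));
        byte[] container = pack(files);
        System.out.println(unpack(container).keySet() + " in " + container.length + " bytes");
    }
}
```

The sketch also shows the trade-off mentioned above: one container means one namespace entry and one sequential read, but changing or removing a single record means rewriting the container.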
In general, if the main reason for maintaining a large number of small files is to get fast reads, I would suggest looking at other systems, such as HBase, that were built for fast data access rather than batch processing.