Are multiple files stored in the same block?

When I store many small files in HDFS, will they be stored in one block?

In my opinion, these small files should be stored in a single block as per this discussion: HDFS block size with actual file size

+1




4 answers


Quoting from Hadoop: The Definitive Guide:

HDFS does not store small files efficiently, as each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.



Conclusion: Each file will be saved in a separate block.
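The memory pressure described in the quote can be sketched with a rough back-of-the-envelope calculation. The ~150 bytes per namenode object is a commonly cited rule of thumb, not a figure from this answer; the numbers here are purely illustrative:

```python
# Rough illustration of namenode metadata cost (assumed rule-of-thumb
# constant, not an exact HDFS figure): each file and each block object
# costs on the order of 150 bytes of namenode heap.
NAMENODE_BYTES_PER_OBJECT = 150  # assumption for sizing purposes

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate namenode heap consumed by file + block objects."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * NAMENODE_BYTES_PER_OBJECT

# 1,000,000 small files (one block each) vs. the same data packed into
# 1,000 large files of 1,000 blocks each: each small file still gets its
# own block, so the metadata footprint is roughly doubled.
small = namenode_heap_bytes(1_000_000, blocks_per_file=1)
large = namenode_heap_bytes(1_000, blocks_per_file=1000)
print(small, large)  # small files cost ~300 MB of heap vs ~150 MB
```

This is why HAR files help: packing many small files into one archive reduces the number of per-file objects the namenode must track.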

+8




Below is what Hadoop: The Definitive Guide states:

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.



For example, if you have a 30 MB file and your block size is 64 MB, the file is stored in one block logically, but on the physical file system HDFS uses only 30 MB to store it. The remaining 34 MB of the block's nominal size is not consumed on disk.
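The arithmetic in that example can be captured in a small sketch (illustrative only; it ignores replication):

```python
import math

def hdfs_usage(file_size_mb, block_size_mb=64):
    """For a file smaller than a block: disk usage equals the file's
    actual size, while the file still occupies whole blocks logically
    (i.e., in namenode metadata)."""
    logical_blocks = max(1, math.ceil(file_size_mb / block_size_mb))
    disk_used_mb = file_size_mb  # actual bytes written, per replica
    unused_in_last_block_mb = logical_blocks * block_size_mb - file_size_mb
    return logical_blocks, disk_used_mb, unused_in_last_block_mb

print(hdfs_usage(30))  # (1, 30, 34): one logical block, 30 MB on disk
```

So the "waste" from small files is metadata overhead on the namenode, not disk space on the datanodes.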

+1




Each block belongs to only one file. You can verify this yourself: 1. Use fsck to get block information for a file:

hadoop fsck /gavial/data/OB/AIR/PM25/201709/01/15_00.json -files -blocks

      

The output looks like this:

    /gavial/data/OB/AIR/PM25/201709/01/15_00.json 521340 bytes, 1 block(s):  OK
0. BP-1004679263-192.168.130.151-1485326068364:blk_1074920015_1179253 len=521340 repl=3

Status: HEALTHY
 Total size:    521340 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):  1 (avg. block size 521340 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)

      

The block id is blk_1074920015.

      

2. Use the fsck command with that block id to display the block's status:

hdfs fsck -blockId blk_1074920015

Block Id: blk_1074920015
Block belongs to: /gavial/data/OB/AIR/PM25/201709/01/15_00.json
No. of Expected Replica: 3
No. of live Replica: 3
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommission Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-5/default-rack is HEALTHY
Block replica on datanode/rack: datanode-1/default-rack is HEALTHY

      

Obviously, the block belongs to only one file.
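If you need to script this check, the block id can be pulled out of the fsck output with a simple regex. This is a sketch against the sample output shown above; the output format may vary between Hadoop versions:

```python
import re

# Sample `hadoop fsck ... -files -blocks` output from above (abridged).
fsck_output = (
    "/gavial/data/OB/AIR/PM25/201709/01/15_00.json 521340 bytes, 1 block(s):  OK\n"
    "0. BP-1004679263-192.168.130.151-1485326068364:blk_1074920015_1179253 "
    "len=521340 repl=3\n"
)

# Block names look like blk_<id>_<generation stamp>; the bare id that
# `hdfs fsck -blockId` expects drops the trailing generation stamp.
block_ids = re.findall(r"(blk_\d+)_\d+", fsck_output)
print(block_ids)  # ['blk_1074920015']
```

Each extracted id can then be fed to `hdfs fsck -blockId` as in step 2 to see which single file the block belongs to.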

0




Yes. When you store a large number of small files, they are stored in one block until the block runs out of space. But the inefficiency arises because an indexing entry (filename, block, offset) is created in the namenode for each of these small files. This eats into the memory reserved for metadata in the namenode when we have many small files instead of a few very large files.

-1








