HBase cluster with damaged region file on HDFS

We have an HBase cluster: more than 30 nodes, 48 tables, 40+ TB at the HDFS level, replication factor 2. Due to disk failures on two nodes, we have a corrupted file on HDFS.

Current state of HDFS

Excerpt of the hdfs fsck / output that shows the corrupted HBase region file:

/user/hbase/table_foo_bar/295cff9c67379c1204a6ddd15808af0b/n/ae0fdf7d0fa24ad1914ca934d3493e56: 
 CORRUPT blockpool BP-323062689-192.168.12.45-1357244568924 block blk_9209554458788732793
/user/hbase/table_foo_bar/295cff9c67379c1204a6ddd15808af0b/n/ae0fdf7d0fa24ad1914ca934d3493e56:
 MISSING 1 blocks of total size 134217728 B

  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         134217728 B
  CORRUPT BLOCKS:       1

The filesystem under path '/' is CORRUPT
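
On a cluster with many tables the fsck report can be long, so it may help to reduce it to the affected paths. The following is a minimal sketch, assuming the report looks like the excerpt above (a path line followed by a CORRUPT or MISSING line, or both on one line); the function name corrupt_paths is made up for this example.

```shell
#!/bin/sh
# Print the distinct HDFS paths that an fsck report flags as corrupt or missing.
# Assumption: the report format matches the excerpt above; other Hadoop
# versions may format these lines differently.
corrupt_paths() {
    awk '
        /^\//      { path = $0; sub(/:.*$/, "", path) }  # path line: keep text before the colon
        /^[ \t]*$/ { path = "" }                          # a blank line ends the current entry
        /CORRUPT blockpool|MISSING [0-9]+ blocks/ { if (path != "") print path }
    ' | sort -u
}

# Usage on the cluster (not run here):
#   hdfs fsck / | corrupt_paths
```

The summary lines (CORRUPT FILES, MISSING BLOCKS, and so on) are deliberately not matched, so only per-file entries produce output.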

      

The lost data cannot be recovered (the disks are broken).

Current state of HBase

According to HBase, on the other hand, everything is fine and dandy; hbase hbck reports:

Version: 0.94.6-cdh4.4.0
...
 table_foo_bar is okay.
   Number of regions: 1425
   Deployed on:  ....
...
0 inconsistencies detected.
Status: OK   

      

Also, it seems that we can still query data from the blocks of the corrupted region file that were not lost (as far as I was able to check, based on the start and end row keys of the region).
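
To work out which table, region, and column family a corrupted file belongs to, the path itself can be taken apart. This is a small sketch, assuming the directory layout visible in the fsck output above (root directory, then table, encoded region name, column family, store file); the function name describe_hfile is hypothetical.

```shell
#!/bin/sh
# Split a store file path into its HBase components.
# Assumption: the path ends in <table>/<encoded region>/<family>/<store file>,
# as in the fsck output of this 0.94 cluster; other versions lay files out differently.
describe_hfile() {
    path=$1
    hfile=${path##*/};  rest=${path%/*}    # last component: the store file
    family=${rest##*/}; rest=${rest%/*}    # then the column family
    region=${rest##*/}; rest=${rest%/*}    # then the encoded region name
    table=${rest##*/}                      # then the table
    printf 'table=%s region=%s family=%s hfile=%s\n' "$table" "$region" "$family" "$hfile"
}

describe_hfile /user/hbase/table_foo_bar/295cff9c67379c1204a6ddd15808af0b/n/ae0fdf7d0fa24ad1914ca934d3493e56
```

The encoded region name can then be matched against the region list of the table to find the start and end row keys.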

Next steps

  • Since the data of the lost file block cannot be recovered, it seems the only option is to delete the complete corrupted file (using hadoop fs -rm or hadoop fsck -delete /). This will "fix" the corruption at the HDFS level.
  • However, I am afraid that deleting the HDFS file will introduce corruption at the HBase level, as a complete region file will disappear.
  • I've considered hadoop fsck -move / to move the corrupted file to /lost+found and see how HBase takes that, but moving a file to /lost+found does not seem to be reversible, so I am hesitant about it.
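
One way to get a reversible variant of the third option might be a plain rename into a sideline directory outside the HBase root, since a rename in HDFS only changes NameNode metadata and can be undone with a second rename. This is only a sketch under assumptions: the sideline directory /sideline and the function name are invented for illustration, the mkdir -p flag is assumed to be available in the installed Hadoop version, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) the commands that would move a file into a sideline directory.
# SIDELINE_ROOT and sideline_commands are hypothetical names for this sketch.
SIDELINE_ROOT=${SIDELINE_ROOT:-/sideline}

sideline_commands() {
    src=$1
    dest=$SIDELINE_ROOT$src                          # keep the original path below the sideline root
    printf 'hdfs dfs -mkdir -p %s\n' "${dest%/*}"    # parent directory of the destination
    printf 'hdfs dfs -mv %s %s\n' "$src" "$dest"     # the rename itself
}

# Usage: review the printed commands, then run them by hand.
#   sideline_commands /user/hbase/some_table/some_region/n/some_file
```

Swapping the two arguments of the printed hdfs dfs -mv command moves the file back to where it was.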

Concrete questions:

Should I just delete the file? (Losing the data corresponding to this region file is acceptable for us.) What bad things happen when you manually delete an HBase region file in HDFS? Does it just remove data, or does it introduce ugly metadata corruption in HBase that needs to be taken care of as well?

Or can we actually leave the situation as it is, which seems to be working for the moment (HBase does not complain about, or even see, the corruption)?



3 answers


We had a similar situation: 5 missing blocks and 5 corrupted files for an HBase table.
HBase version: 0.94.15
distro: CDH 4.7
OS: CentOS 6.4

Recovery instructions:

  • switch to the hbase user: su hbase

  • hbase hbck -details

    to understand the scale of the problem.
  • hbase hbck -fix

    to try to recover from region inconsistencies.
  • hbase hbck -repair

    tried to auto-repair, but actually increased the number of inconsistencies by 1.
  • hbase hbck -fixMeta -fixAssignments

  • hbase hbck -repair

    this time the tables were repaired.
  • hbase hbck -details

    to confirm the fix.
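
When these steps are scripted, it can be useful to read the inconsistency count from the hbck summary before moving on. Here is a minimal sketch, assuming the summary line has the form "N inconsistencies detected." as in the hbck output quoted in the question; the function name hbck_inconsistencies is made up for this example.

```shell
#!/bin/sh
# Read hbck output on stdin and print the reported number of inconsistencies.
# Prints nothing and returns non-zero if no summary line is found.
# Assumption: the summary line looks like "0 inconsistencies detected."
hbck_inconsistencies() {
    count=$(sed -n 's/^\([0-9][0-9]*\) inconsistencies detected.*$/\1/p' | tail -n 1)
    [ -n "$count" ] && printf '%s\n' "$count"
}

# Usage on the cluster (not run here):
#   n=$(hbase hbck | hbck_inconsistencies) && [ "$n" -eq 0 ] && echo "HBase reports no inconsistencies"
```

Treating a missing summary line as a failure avoids mistaking a crashed or truncated hbck run for a clean one.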

At this point HBase was healthy: it had added an extra region and no longer referenced the corrupted files. However, HDFS still had 5 corrupted files. Since HBase no longer referenced them, we removed them:



  • switch to the hdfs user: su hdfs

  • hdfs fsck /

    to understand the scope of the problem.
  • hdfs fsck / -delete

    to delete only the corrupted files.
  • hdfs fsck /

    to confirm that the file system is healthy.

NOTE: it is important to stop the stack completely so that the caches are reset
(stop all Thrift, HBase, ZooKeeper, and HDFS services, then start them again in the reverse order).

[1] Cloudera page for hbck command:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbck_poller.html



FYI: I decided to bite the bullet and manually remove the corrupted file from HDFS with:

hdfs dfs -rm /user/hbase/table_foo_bar/295cff9c67379c1204a6dd....

      

(hdfs fsck -move did not work for me; I don't know why.)

After that I checked the health of HBase with hbck, and no inconsistencies were found:



$ hbase hbck
...
0 inconsistencies detected.
Status: OK

      

So, in our case, manually deleting the region file did not corrupt HBase, if I understand correctly, which is nice but confusing. (I hope this will not backfire, with corruption showing up at a later point in time.)

issue closed

Your mileage may vary.



If inconsistencies are found at the region level, use the -fix argument to direct hbck to try to fix them. The following sequence of steps is performed:

$ ./bin/hbase hbck -fix

      

-fix includes:

  • A standard check for inconsistencies is performed.
  • If required, repairs are made to the tables.
  • If required, repairs are made to the regions. Regions are closed during repair.

So, before running -fix, you may want to fix individual region-level inconsistencies separately, using these options:

-fixAssignments (equivalent to the 0.90 -fix option) repairs unassigned, incorrectly assigned, or multiply assigned regions.

-fixMeta removes meta rows when the corresponding regions are not present in HDFS, and adds new meta rows if the regions are present in HDFS but not in META.

-fix includes {-fixAssignments and -fixMeta}

 $ ./bin/hbase hbck -fixAssignments
 $ ./bin/hbase hbck -fixAssignments -fixMeta

      

There are several classes of table integrity problems that are considered low-risk repairs. The first two are degenerate regions (startkey == endkey) and backwards regions (startkey > endkey). These are handled automatically by sidelining the data to a temporary directory (/hbck/xxxx). The third low-risk class is HDFS region holes. These can be repaired using:

-fixHdfsHoles to create new empty regions on the file system. If holes are detected, you can use -fixHdfsHoles and should include -fixMeta and -fixAssignments to make the new region consistent.

 $ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles

      

-repairHoles includes {-fixAssignments -fixMeta -fixHdfsHoles}

 $ ./bin/hbase hbck -repairHoles

      
